Re: Speech Synthesis Manager Latency
- Subject: Re: Speech Synthesis Manager Latency
- From: Victor Tsaran <email@hidden>
- Date: Thu, 10 Apr 2014 23:16:43 -0700
Hi Bryan,
Did you try conducting similar experiments with the Carbon speech API? I know Apple doesn’t recommend it much these days, but it may be a great test to try.
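Something like the following might do as a quick A/B test. It's a rough, untested sketch over ctypes, and it assumes the old one-shot SpeakString() call is still exported from the ApplicationServices framework; SpeakString conveniently interrupts any speech already in progress, which matches your test:

import ctypes

# Load the Carbon Speech Synthesis Manager (it lives in ApplicationServices).
ApplicationServices = ctypes.cdll.LoadLibrary(
    "/System/Library/Frameworks/ApplicationServices.framework/ApplicationServices")

def speak(text):
    # SpeakString() expects a Str255 Pascal string: a length byte followed
    # by at most 255 characters.
    data = text[:255]
    ApplicationServices.SpeakString(chr(len(data)) + data)

# Same test loop as Bryan's script: say "T" each time enter is pressed.
while True:
    raw_input()
    speak("T")
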
On Apr 7, 2014, at 4:19 PM, Bryan Smart <email@hidden> wrote:
> I’ve made some observations recently regarding the Speech Synthesis Manager and latency. I’m not sure if they indicate a bug, or if they are design limitations. Hopefully someone here can provide feedback.
>
> I’ve noticed that any attempt to speak text begins only after a significant and variable delay of about 200-300 ms. When run, the following simple Python script will say the letter “T” each time the enter key is pressed.
>
> import Cocoa
>
> synth = Cocoa.NSSpeechSynthesizer.alloc().init()
> synth.setRate_(500)
>
> while True:
>     raw_input()
>     synth.startSpeakingString_("T")
>
> Try pressing enter, waiting about a second, and pressing it again. Observe the short lag each time between when you press enter and when you hear speech. Then press enter rapidly and repeatedly, so as to interrupt the speech. Notice the long and variable lag before the old speech is silenced, and the long and variable stretch of silence before the new speech begins.
>
> I know that some text-to-speech engines, particularly those with large sample sets, cannot always respond quickly to requests for speech. However, I tried the above script with both the Nuance compact voices (normally used on embedded systems) and Fred, and all of them exhibited the same delay before speaking. I’m running this on a fast i7-based Mac. Fred originally ran on 30 MHz 68030 Macs and slower machines, so I can’t believe this voice imposes any meaningful processing load today. In every case, I thought something must be wrong for it to take so long first to stop speaking and then to begin speaking the next phrase.
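>
> For reference, this is roughly how I pointed the script at a particular voice. The Fred identifier below should be the standard built-in one; the Nuance identifiers vary per voice, so treat that part as a placeholder for whatever voices you have installed:
>
> import Cocoa
>
> synth = Cocoa.NSSpeechSynthesizer.alloc().init()
> # Built-in Fred voice; substitute the identifier of any installed voice,
> # such as one of the Nuance compact voices, to compare.
> synth.setVoice_("com.apple.speech.synthesis.voice.Fred")
> synth.setRate_(500)
> synth.startSpeakingString_("T")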
>
> I later tried a custom synth, outputting audio through various means, including to a file, and as best I could tell, the lag happened before the actual synth started speaking. It seemed to me that something in the Speech Synthesis Manager was taking too long to queue up speech, and that some long-running process was taking place whenever new speech silenced old speech. These delays didn’t seem to involve the actual synthesizers themselves.
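>
> One way to put rough numbers on that part of the delay is to time how long the calls themselves block. A quick sketch along these lines only captures the time spent inside stopSpeaking and startSpeakingString_, not any additional latency in the audio path, so it is a partial picture at best:
>
> import time
> import Cocoa
>
> synth = Cocoa.NSSpeechSynthesizer.alloc().init()
> synth.setRate_(500)
>
> while True:
>     raw_input()
>     start = time.time()
>     synth.stopSpeaking()              # silence whatever is still speaking
>     stopped = time.time()
>     synth.startSpeakingString_("T")
>     queued = time.time()
>     print "stop: %.1f ms, queue: %.1f ms" % (
>         (stopped - start) * 1000, (queued - stopped) * 1000)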
>
> Here is another example, so that you can get a feel for the difference between how speech currently reacts, and how it should react. When you run the following Python script, it will also say “T” when you press enter, but it does this by playing a sound file, instead of synthesizing the speech.
>
> Before running the script, create the sound file by entering this in a terminal:
>
> say -o /tmp/t.aiff "[[rate 500]] T"
>
> Then, run this script:
>
> import Cocoa
>
> sound = Cocoa.NSSound.alloc().initWithContentsOfFile_byReference_(
>     "/tmp/t.aiff", True)
>
> while True:
>     raw_input()
>     sound.stop()      # cut off any playback still in progress
>     sound.play()
>
> Notice that when you press enter with this script, you hear “T” almost instantaneously. If you rapidly and repeatedly press enter, you hear the new speech smoothly interrupting the old speech.
>
> Obviously, the second example, playing from a file, has less overhead than synthesizing speech. However, speech on other platforms behaves far more like the second example than the first. SAPI on Windows, for example, does not exhibit this lag. That surprised me, since Windows speech has to cope with the poor state of Windows audio hardware compared to the Mac. When outputting PCM data through the Windows multimedia APIs, most built-in audio hardware cannot achieve output latency below 20-30 ms or so, and yet SAPI voices respond quickly. Core Audio on the Mac provides extremely low output latency even for built-in sound hardware, so I expected extremely snappy speech, given that capability.
>
> My conclusion is that something in the way the Speech Synthesis Manager prepares to begin speaking new text, and in the way it silences existing speech, causes both operations to take significantly longer than would reasonably be expected.
>
> Why does this matter?
>
> Because of this limitation, every Mac application that uses text-to-speech feels more sluggish than it should. I’ve often wondered why the “say” command in the terminal and the AppleScript “say” command feel jerky when interrupting speech, and now the cause is clear. It also explains why VoiceOver never feels anywhere near as responsive as a Windows or Linux screen reader when arrowing through documents character by character, arrowing through lists, and so on. Further, the long and variable lag when interrupting speech explains why, in VoiceOver, quickly and repeatedly tapping CTRL+Option+RightArrow to skim through a document has a jerky feel. VoiceOver isn’t slow; it depends on the Speech Manager for speech output and synchronization, and the Speech Manager is slow to start speaking and jerky when interrupted.
>
> However, is this a bug, or a design limitation? Is the Speech Manager inherently laggy because of some other requirement it has, or because of the way it outputs audio? If this is a bug and it can be fixed, doing so would positively impact a huge number of speech users’ daily experiences on the Mac.
>
> Bryan
>
>