Re: Realtime AEC + VAD
- Subject: Re: Realtime AEC + VAD
- From: π via Coreaudio-api <email@hidden>
- Date: Thu, 17 Oct 2024 07:04:16 +0100
Thank you for the replies. I'm glad to see that this mailing list is still
alive, despite the dwindling traffic these last few years.
Can I not encapsulate a VPIO unit, and control its input/output audio
streams by implementing input/render callbacks, or by making connections?
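To make the question concrete, here's the shape of what I have in mind (a
minimal sketch; error handling omitted, the callback bodies are
placeholders, and makeVPIO/micCallback/spkCallback are my own names, not
anything from the SDK):

    #include <AudioToolbox/AudioToolbox.h>

    // Mic side: pull echo-cancelled input from bus 1, then feed it onward.
    static OSStatus micCallback(void *refCon, AudioUnitRenderActionFlags *flags,
                                const AudioTimeStamp *ts, UInt32 bus,
                                UInt32 frames, AudioBufferList *ioData) {
        AudioUnit vpio = (AudioUnit)refCon;
        // Allocate or reuse an AudioBufferList, then:
        //   AudioUnitRender(vpio, flags, ts, 1, frames, bufferList);
        // ...hand bufferList to the VAD / websocket uplink...
        return noErr;
    }

    // Speaker side: fill ioData with the AI speech arriving over the websocket.
    static OSStatus spkCallback(void *refCon, AudioUnitRenderActionFlags *flags,
                                const AudioTimeStamp *ts, UInt32 bus,
                                UInt32 frames, AudioBufferList *ioData) {
        // ...copy decoded websocket audio into ioData...
        return noErr;
    }

    static AudioUnit makeVPIO(void) {
        AudioComponentDescription d = { kAudioUnitType_Output,
            kAudioUnitSubType_VoiceProcessingIO, kAudioUnitManufacturer_Apple, 0, 0 };
        AudioUnit vpio = NULL;
        AudioComponentInstanceNew(AudioComponentFindNext(NULL, &d), &vpio);

        AURenderCallbackStruct in = { micCallback, vpio };
        AudioUnitSetProperty(vpio, kAudioOutputUnitProperty_SetInputCallback,
                             kAudioUnitScope_Global, 1, &in, sizeof(in));   // bus 1: mic
        AURenderCallbackStruct out = { spkCallback, vpio };
        AudioUnitSetProperty(vpio, kAudioUnitProperty_SetRenderCallback,
                             kAudioUnitScope_Input, 0, &out, sizeof(out));  // bus 0: speaker

        AudioUnitInitialize(vpio);
        AudioOutputUnitStart(vpio);
        return vpio;
    }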
I'm veering towards the manual approach: just use an AUHAL unit
(kAudioUnitSubType_HALOutput, misnamed, since it handles input too) on macOS,
or a RemoteIO unit on the mobile platforms, to access the raw I/O buffers,
and write my own pipeline.
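The extra dance on macOS, as far as I can tell, is enabling input on bus 1,
disabling output on bus 0, and binding the unit to a device. Roughly (a
sketch, again with error handling omitted; makeInputHAL is my name, and I
grab the default input device purely for illustration):

    #include <AudioToolbox/AudioToolbox.h>
    #include <CoreAudio/CoreAudio.h>

    static AudioUnit makeInputHAL(void) {
        AudioComponentDescription d = { kAudioUnitType_Output,
            kAudioUnitSubType_HALOutput, kAudioUnitManufacturer_Apple, 0, 0 };
        AudioUnit hal = NULL;
        AudioComponentInstanceNew(AudioComponentFindNext(NULL, &d), &hal);

        UInt32 on = 1, off = 0;
        AudioUnitSetProperty(hal, kAudioOutputUnitProperty_EnableIO,
                             kAudioUnitScope_Input, 1, &on, sizeof(on));    // bus 1: mic in
        AudioUnitSetProperty(hal, kAudioOutputUnitProperty_EnableIO,
                             kAudioUnitScope_Output, 0, &off, sizeof(off)); // bus 0: silent

        // Bind the unit to the default input device.
        AudioObjectPropertyAddress addr = { kAudioHardwarePropertyDefaultInputDevice,
            kAudioObjectPropertyScopeGlobal, kAudioObjectPropertyElementMain };
        AudioDeviceID dev = kAudioObjectUnknown;
        UInt32 size = sizeof(dev);
        AudioObjectGetPropertyData(kAudioObjectSystemObject, &addr, 0, NULL, &size, &dev);
        AudioUnitSetProperty(hal, kAudioOutputUnitProperty_CurrentDevice,
                             kAudioUnitScope_Global, 0, &dev, sizeof(dev));
        return hal;
    }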
Would it be a good idea to use https://github.com/apple/AudioUnitSDK to
wrap this? My hunch is to minimize the layers/complexity and NOT use this
framework.
And for the AEC/VAD, can anyone offer a perspective? Arshia? The two
obvious candidates I see are WebRTC and Speex. GPT-4o reckons WebRTC will be
the most advanced and best-performing solution, with the downside that it's a
big project (and maybe a more complicated build process), while Speex is
more lightweight and will probably do the job well enough for my purposes.
And as both are open source, I may have the option of pulling out the
minimal-dependency files and building just those.
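If Speex wins, my reading of the speexdsp headers is that its echo canceller
and preprocessor chain exactly the way I want: the VAD runs on the post-AEC
samples. A sketch (frame and tail lengths are guesses on my part, and I've
not compiled this):

    #include <speex/speex_echo.h>
    #include <speex/speex_preprocess.h>

    enum { RATE = 16000, FRAME = RATE / 100, TAIL = RATE / 10 };  /* 10 ms frames, ~100 ms tail */

    int main(void) {
        SpeexEchoState       *aec = speex_echo_state_init(FRAME, TAIL);
        SpeexPreprocessState *pp  = speex_preprocess_state_init(FRAME, RATE);
        int rate = RATE, vad = 1;
        speex_echo_ctl(aec, SPEEX_ECHO_SET_SAMPLING_RATE, &rate);
        speex_preprocess_ctl(pp, SPEEX_PREPROCESS_SET_VAD, &vad);
        speex_preprocess_ctl(pp, SPEEX_PREPROCESS_SET_ECHO_STATE, aec);

        /* Per frame: mic capture + loudspeaker reference in, clean audio out. */
        spx_int16_t mic[FRAME] = {0}, spk[FRAME] = {0}, clean[FRAME];
        speex_echo_cancellation(aec, mic, spk, clean);
        int speaking = speex_preprocess_run(pp, clean);   /* 1 = voice detected */
        (void)speaking;

        speex_preprocess_state_destroy(pp);
        speex_echo_state_destroy(aec);
        return 0;
    }

The WebRTC equivalent would be its AudioProcessing module (AEC3 plus its own
VAD): the same chaining idea, but a much bigger build.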
The last question is regarding system-wide audio output. It's easy for me
to get the audio output stream for MY app (it just comes in over the
websocket), but I may wish to toggle whether my AEC cancels out any
output audio generated by other processes on my Mac. E.g. if I am
watching a YouTube video, maybe I want my AI to listen to that, and maybe I
want it subtracted. So do I have the option to listen to SYSTEM-level audio
output (so as to feed it into my AEC implementation)? It must be possible on
macOS, as apps like Soundflower or BlackHole are able to do it. But on
mobile I'm not so sure. My memory of iPhone audio dev (circa 2008) is that
it was impossible to access this. But there's now some mention of v3
Audio Units being able to process inter-app audio.
π
On Wed, 16 Oct 2024 at 19:35, Arshia Cont via Coreaudio-api <
email@hidden> wrote:
> Hi π,
>
> From my experience that's not possible. VPIO is an option on the lower-level
> I/O device, and so is VAD. You don't have much control over their
> internals, routing, and wiring! Also, from our experience, VPIO behaves
> differently on different devices. On some iPads we saw “gating”
> instead of actual echo removal (be aware of that!). In the end, for a
> similar use case, we ended up doing our own AEC and activity detection.
>
> Cheers,
>
> Arshia Cont
> metronautapp.com
>
>
>
> On 15 Oct 2024, at 18:08, π via Coreaudio-api <
> email@hidden> wrote:
>
> Dear Audio Engineers,
>
> I'm writing an app to interact with OpenAI's 'realtime' API (bidirectional
> realtime audio over a websocket, with the AI server-side).
>
> To do this, I need to be careful that the AI-speak doesn't make its way
> out of the speakers, back in through the mic, and back to their server (else
> it starts to talk to itself, and gets very confused).
>
> So I need AEC, which I've actually got working
> using kAudioUnitSubType_VoiceProcessingIO,
> with kAUVoiceIOProperty_BypassVoiceProcessing set to false
> via AudioUnitSetProperty().
>
> Now I also wish to detect when the speaker (me) is speaking or not
> speaking, which I've also managed to do
> via kAudioDevicePropertyVoiceActivityDetectionEnable.
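>
> For reference, the two calls are condensed below (a sketch; vpio is my
> configured VPIO unit, dev the input AudioDeviceID, and the state property
> at the end is my reading of the newer CoreAudio headers):
>
>     UInt32 bypass = 0;  // 0 = leave the voice processing (AEC) enabled
>     AudioUnitSetProperty(vpio, kAUVoiceIOProperty_BypassVoiceProcessing,
>                          kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));
>
>     UInt32 enable = 1;  // ask the input device to report voice activity
>     AudioObjectPropertyAddress a = {
>         kAudioDevicePropertyVoiceActivityDetectionEnable,
>         kAudioDevicePropertyScopeInput, kAudioObjectPropertyElementMain };
>     AudioObjectSetPropertyData(dev, &a, 0, NULL, sizeof(enable), &enable);
>     // then watch kAudioDevicePropertyVoiceActivityDetectionState via
>     // AudioObjectAddPropertyListener() for speech-started / speech-stopped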
>
> But getting them to play together is another matter, and I'm struggling
> hard here.
>
> I've rigged up a simple test (
> https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), where a
> 440Hz sinewave is generated in the render-callback, and mic-input is
> recorded to file in the input-callback.
>
> So the AEC works delightfully, subtracting the sinewave and recording my
> voice.
> And if I turn the sine-wave amplitude down to 0, the VAD correctly
> triggers the speech-started and speech-stopped events.
>
> But if I turn up the sine-wave, it messes up the VAD.
>
> Presumably the VAD is operating on the pre-echo-cancelled audio, which is
> most undesirable.
>
> How can I progress here?
>
> My thought was to create an audio pipeline, using AUGraph, but my efforts
> have thus far been unsuccessful, and I lack confidence that I'm even
> pushing in the right direction.
>
> My thought was to have an IO unit that interfaces with the hardware
> (mic/spkr), which plugs into an AEC unit, which plugs into a VAD unit.
>
> But I can't see how to set this up.
>
> On iOS there's a RemoteIO unit to deal with the hardware, but I can't see
> any such unit on macOS. It seems the VoiceProcessing unit wants to do that
> itself.
>
> And then I wonder: could I make a second VoiceProcessing unit, and have
> vp1_aec send its bus[1] (mic) output scope into vp2_vad's bus[1] input scope?
>
> Can I do this kind of work by routing audio, or do I need to get my hands
> dirty with input/render callbacks?
>
> It feels like I'm going hard against the grain if I am faffing with these
> callbacks.
>
> If there's anyone out there that would care to offer me some guidance
> here, I am most grateful!
>
> π
>
> PS Is it not a serious problem that VAD can't operate on post-AEC input?
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Coreaudio-api mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden