Re: Realtime AEC + VAD
- Subject: Re: Realtime AEC + VAD
- From: Arshia Cont via Coreaudio-api <email@hidden>
- Date: Thu, 17 Oct 2024 10:40:10 +0200
Using VPIO, you won’t be able to access the audio before processing. It’s
implemented internally in the RemoteIO unit. That actually makes sense: AEC
needs system-level input AND output, and that’s exactly what RemoteIO with VPIO
enabled is doing, but without exposing what the signal looked like beforehand.
Last time I looked at WebRTC, it was using VPIO on Apple platforms! :) It’s
actually worth looking at their code, since they went through all the
possibilities (and bugs!!!) in this area.
We ended up implementing our own AEC, geared towards music (not speech). We did
it with two AudioUnits: one at the input level (which does the actual work) and
another at the output level, which is basically a passthrough that provides
output buffers to the first (and that’s dangerous). If I had to redo everything
I would avoid custom AudioUnits and use AVAudioEngine callbacks directly!!! AUs
in AVAudioEngine are a nightmare without documentation!
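If it helps, here’s roughly what I mean by the AVAudioEngine route, as a sketch
(the tap is standard AVAudioEngine API; myAEC and referenceRing are placeholders
for your own echo canceller and far-end reference buffer):

    import AVFoundation

    let engine = AVAudioEngine()
    let player = AVAudioPlayerNode()
    engine.attach(player)
    engine.connect(player, to: engine.mainMixerNode, format: nil)

    // Tap the microphone. Your own AEC consumes these near-end buffers
    // together with the far-end reference you keep of whatever you schedule
    // on the player node.
    let inputFormat = engine.inputNode.outputFormat(forBus: 0)
    engine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: inputFormat) { buffer, time in
        // myAEC.process(nearEnd: buffer, farEnd: referenceRing.read())
    }

    try? engine.start()
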
Speex can be a good starting point for you. A simple search on speech AEC also
reveals a bunch of more recent neural-net approaches that are open source, with
good performance on paper, that are worth looking at.
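For what it’s worth, the SpeexDSP echo-canceller API is small. A rough sketch of
the call pattern, written as Swift against the C API (the CSpeexDSP module name
is my assumption; expose the library however suits your build):

    import CSpeexDSP   // assumption: speexdsp exposed to Swift via a module map

    let frameSize: Int32 = 256       // samples per block at 16 kHz
    let filterLength: Int32 = 4096   // echo tail to model (~250 ms here)

    // One echo state for the whole session.
    let echo = speex_echo_state_init(frameSize, filterLength)
    var rate: Int32 = 16000
    speex_echo_ctl(echo, Int32(SPEEX_ECHO_SET_SAMPLING_RATE), &rate)

    // Per block: near-end mic samples, the far-end samples you actually
    // played, and an output buffer, all Int16 and all frameSize long.
    var mic = [Int16](repeating: 0, count: Int(frameSize))
    var speaker = [Int16](repeating: 0, count: Int(frameSize))
    var cleaned = [Int16](repeating: 0, count: Int(frameSize))
    speex_echo_cancellation(echo, &mic, &speaker, &cleaned)

    speex_echo_state_destroy(echo)
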
Note: an important part of AEC is detecting the output-to-input latency. Even
when it’s small it strongly affects results, and when your user switches to
Bluetooth or AirPlay speakers it becomes even more important!
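On iOS at least, AVAudioSession gives you a first approximation of that delay
(a sketch; in practice you still want to refine it from the measured signal,
especially after a route change):

    import AVFoundation

    let session = AVAudioSession.sharedInstance()

    // Rough render-to-capture delay: output latency + input latency + one IO
    // buffer. Only a starting point for the AEC's delay estimate; Bluetooth
    // routes in particular need refinement from the signal itself.
    let estimatedDelaySeconds = session.outputLatency
                              + session.inputLatency
                              + session.ioBufferDuration

    // Re-estimate whenever the route changes (speaker -> AirPods, AirPlay, ...).
    _ = NotificationCenter.default.addObserver(
        forName: AVAudioSession.routeChangeNotification,
        object: session, queue: .main) { _ in
        // recompute the delay estimate and let the echo canceller re-adapt
    }
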
Arshia Cont
metronautapp.com
> On 17 Oct 2024, at 08:04, π via Coreaudio-api <email@hidden>
> wrote:
>
> Thank you for the replies. I'm glad to see that this mailing list is still
> alive, despite the dwindling traffic these last few years.
>
> Can I not encapsulate a VPIO unit, and control the input/output audio-streams
> by implementing input/render callbacks, or making connections?
>
> I'm veering towards this approach of manual implementation: just use a
> HALOutput unit (misnamed, since it handles input as well) on macOS or a
> RemoteIO unit on the mobile platforms to access the raw I/O buffers, and
> write my own pipeline.
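> Something like this is what I have in mind for the macOS half (just a sketch
> of the usual AUHAL setup, error handling omitted):
>
>     import AudioToolbox
>
>     // AUHAL ("HALOutput") handles both input and output despite the name.
>     var desc = AudioComponentDescription(
>         componentType: kAudioUnitType_Output,
>         componentSubType: kAudioUnitSubType_HALOutput,
>         componentManufacturer: kAudioUnitManufacturer_Apple,
>         componentFlags: 0,
>         componentFlagsMask: 0)
>
>     var unit: AudioUnit?
>     if let component = AudioComponentFindNext(nil, &desc) {
>         AudioComponentInstanceNew(component, &unit)
>     }
>
>     // Enable input on bus 1 and disable output on bus 0, so this instance
>     // acts purely as a microphone source feeding my own pipeline.
>     var on: UInt32 = 1
>     var off: UInt32 = 0
>     AudioUnitSetProperty(unit!, kAudioOutputUnitProperty_EnableIO,
>                          kAudioUnitScope_Input, 1, &on,
>                          UInt32(MemoryLayout<UInt32>.size))
>     AudioUnitSetProperty(unit!, kAudioOutputUnitProperty_EnableIO,
>                          kAudioUnitScope_Output, 0, &off,
>                          UInt32(MemoryLayout<UInt32>.size))
>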
>
> Would it be a good idea to use https://github.com/apple/AudioUnitSDK to wrap
> this? My hunch is to minimize the layers/complexity and NOT use this
> framework.
>
> And for the AEC/VAD, can anyone offer a perspective? Arshia? The two obvious
> candidates I see are WebRTC and Speex. GPT-4o reckons WebRTC will be the
> most advanced / best-performing solution, with the downside that it's a big
> project (and maybe a more complicated build process), while Speex is more
> lightweight and will probably do the job well enough for my purposes.
>
> And as both are open-source, I may have the option of pulling out the
> minimal-dependency files and building just those.
>
> The last question is regarding system-wide audio output. It's easy for me to
> get the audio output stream for MY app (it just comes in over the websocket),
> but I may wish to toggle whether I want my AEC to be cancelling out any
> output audio generated by other processes on my Mac. E.g. if I am watching a
> YouTube video, maybe I want my AI to listen to that, and maybe I want it
> subtracted. So do I have the option to listen to SYSTEM-level audio output
> (so as to feed it into my AEC impl)? It must be possible on macOS, as apps
> like Soundflower or BlackHole are able to do it. But on mobile, I'm not so
> sure. My memory of iPhone audio dev (~2008) is that it was impossible to
> access this. But there's now some mention of v3 audio units being able to
> process inter-app audio.
>
> π
>
> On Wed, 16 Oct 2024 at 19:35, Arshia Cont via Coreaudio-api
> <email@hidden <mailto:email@hidden>> wrote:
>> Hi π,
>>
>> From my experience, that’s not possible. VPIO is an option on the lower-level
>> IO device; so is VAD. You don’t have much control over their internals,
>> routing and wiring! Also, from our experience, VPIO behaves differently on
>> different devices. On some iPads we saw “gating” instead of actual echo
>> removal (be aware of that!). In the end, for a similar use case, we ended up
>> doing our own AEC and Activity Detection.
>>
>> Cheers,
>>
>> Arshia Cont
>> metronautapp.com <http://metronautapp.com/>
>>
>>
>>
>>> On 15 Oct 2024, at 18:08, π via Coreaudio-api
>>> <email@hidden <mailto:email@hidden>>
>>> wrote:
>>>
>>> Dear Audio Engineers,
>>>
>>> I'm writing an app to interact with OpenAI's 'realtime' API (bidirectional
>>> realtime audio over websocket with AI serverside).
>>>
>>> To do this, I need to be careful that the AI's speech doesn't make its way
>>> out of the speakers, back in through the mic, and back to their server (else
>>> it starts to talk to itself, and gets very confused).
>>>
>>> So I need AEC, which I've actually got working, using
>>> kAudioUnitSubType_VoiceProcessingIO with
>>> kAUVoiceIOProperty_BypassVoiceProcessing set to false via
>>> AudioUnitSetProperty.
>>>
>>> Now I also wish to detect when the speaker (me) is speaking or not
>>> speaking, which I've also managed to do via
>>> kAudioDevicePropertyVoiceActivityDetectionEnable.
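>>>
>>> For concreteness, the two pieces individually look roughly like this (a
>>> sketch, error checks dropped; vpioUnit and inputDeviceID stand in for the
>>> unit and default input device set up elsewhere):
>>>
>>>     import AudioToolbox
>>>     import CoreAudio
>>>
>>>     func configure(vpioUnit: AudioUnit, inputDeviceID: AudioObjectID) {
>>>         // AEC: keep voice processing active (bypass = 0).
>>>         var bypass: UInt32 = 0
>>>         AudioUnitSetProperty(vpioUnit,
>>>                              kAUVoiceIOProperty_BypassVoiceProcessing,
>>>                              kAudioUnitScope_Global, 0, &bypass,
>>>                              UInt32(MemoryLayout<UInt32>.size))
>>>
>>>         // VAD: enable voice-activity detection on the input device
>>>         // (macOS 14+ property).
>>>         var enable: UInt32 = 1
>>>         var addr = AudioObjectPropertyAddress(
>>>             mSelector: kAudioDevicePropertyVoiceActivityDetectionEnable,
>>>             mScope: kAudioDevicePropertyScopeInput,
>>>             mElement: kAudioObjectPropertyElementMain)
>>>         AudioObjectSetPropertyData(inputDeviceID, &addr, 0, nil,
>>>                                    UInt32(MemoryLayout<UInt32>.size), &enable)
>>>     }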
>>>
>>> But getting them to play together is another matter, and I'm struggling
>>> hard here.
>>>
>>> I've rigged up a simple test
>>> (https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), where a
>>> 440Hz sinewave is generated in the render-callback, and mic-input is
>>> recorded to file in the input-callback.
>>>
>>> So the AEC works delightfully, subtracting the sinewave and recording my
>>> voice.
>>> And if I turn the sine-wave amplitude down to 0, the VAD correctly triggers
>>> the speech-started and speech-stopped events.
>>>
>>> But if I turn up the sine-wave, it messes up the VAD.
>>>
>>> Presumably the VAD is working over the pre-EchoCancelled audio, which is
>>> most undesirable.
>>>
>>> How can I progress here?
>>>
>>> My thought was to create an audio pipeline, using AUGraph, but my efforts
>>> have thus far been unsuccessful, and I lack confidence that I'm even
>>> pushing in the right direction.
>>>
>>> My thought was to have an IO unit that interfaces with the hardware
>>> (mic/spkr), which plugs into an AEC unit, which plugs into a VAD unit.
>>>
>>> But I can't see how to set this up.
>>>
>>> On iOS there's a RemoteIO unit to deal with the hardware, but I can't see
>>> any such unit on macOS. It seems the VoiceProcessing unit wants to do that
>>> itself.
>>>
>>> And then I wonder: Could I make a second VoiceProcessing unit, and have
>>> vp1_aec split send its bus[1(mic)].outputScope to vp2_vad.bus[1].inputScope?
>>>
>>> Can I do this kind of work by routing audio, or do I need to get my hands
>>> dirty with input/render callbacks?
>>>
>>> It feels like I'm going hard against the grain if I am faffing with these
>>> callbacks.
>>>
>>> If there's anyone out there that would care to offer me some guidance here,
>>> I am most grateful!
>>>
>>> π
>>>
>>> PS Is it not a serious problem that VAD can't operate on post-AEC input?