Re: Realtime AEC + VAD
- Subject: Re: Realtime AEC + VAD
- From: π via Coreaudio-api <email@hidden>
- Date: Fri, 18 Oct 2024 15:09:52 +0100
Yikes!
Well, the purpose of my project was to investigate the possibilities of
using OpenAI's realtime API on Apple tech, and I'm indeed discovering the
gotchas.
So, IIUC:
- on macOS I can get beneath the AudioUnit level and go straight to
AudioDevice; down to the wire, so to speak. And roll my own AEC & VAD.
Alternatively I can use the VoiceProcessingIO AudioUnit, which gives me AEC &
VAD, tho' they don't play nice together; but if I roll my own VAD (using the
WebRTC code) I'm good to go.
- on iOS I can't get at the AudioDevice, but I still have the
VoiceProcessingIO technique available as above. Alternatively I could use the
RemoteIO AudioUnit and roll my own AEC+VAD, but then I'm not getting system
audio-out. Hum ho. Liveable-withable.
- on watchOS, we don't have AudioDevice OR the VoiceProcessingIO AudioUnit,
but we DO still have the RemoteIO AudioUnit.
So if I want something that works on all three platforms, I kinda need to
roll my own AEC+VAD; the sketch below shows the per-platform unit selection
I'm imagining.
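Untested, and just the component description, with availability assumptions
as per the list above:

    #include <AudioToolbox/AudioToolbox.h>
    #include <TargetConditionals.h>

    // Pick the IO AudioUnit subtype available on each platform.
    static AudioComponentDescription ioUnitDescription() {
        AudioComponentDescription desc = {};
        desc.componentType = kAudioUnitType_Output;
    #if TARGET_OS_WATCH
        // watchOS: no VoiceProcessingIO, so RemoteIO + my own AEC & VAD.
        desc.componentSubType = kAudioUnitSubType_RemoteIO;
    #elif TARGET_OS_IPHONE
        // iOS: VoiceProcessingIO for AEC; my own VAD alongside it.
        desc.componentSubType = kAudioUnitSubType_VoiceProcessingIO;
    #else
        // macOS: VoiceProcessingIO too (or drop to the AudioDevice/HAL level).
        desc.componentSubType = kAudioUnitSubType_VoiceProcessingIO;
    #endif
        desc.componentManufacturer = kAudioUnitManufacturer_Apple;
        return desc;
    }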
I'm struggling really hard to extract AEC3 out of WebRTC. Whereas the VAD was
pretty straightforward, AEC3 drags in a dependency on something called
Abseil.
It seems AEC is far from a "Solved Problem". I see Microsoft recently (2023)
issued a challenge inviting novel AEC solutions (presumably the current AI
boom is gonna shake loose some new approaches):
https://www.microsoft.com/en-us/research/academic-program/acoustic-echo-cancellation-challenge-icassp-2023/
But as an outsider I don't get to see the submissions; e.g. the winning
non-Microsoft entry is behind a paywall at
https://ieeexplore.ieee.org/document/10096411 (though it's maybe the same as
https://arxiv.org/pdf/2303.06828).
I wonder whether "cheating" buys much; i.e. emitting a periodic sweep/chirp
from the speakers to estimate the impulse-response of the acoustic
environment, in order to deduce an inverse-IR. Then I think the AEC is just
applying that, possibly together with some delay to compensate for I/O
latency.
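For the delay part at least, a crude cross-correlation of the reference
(speaker) signal against the mic signal would give the bulk latency. A
minimal sketch, illustrative only; a real AEC has to track a time-varying
echo path, not just a fixed lag:

    #include <cstddef>
    #include <vector>

    // Brute-force cross-correlation: find the lag (in samples) at which
    // the mic signal best matches the reference played to the speaker.
    static size_t estimateDelaySamples(const std::vector<float>& ref,
                                       const std::vector<float>& mic,
                                       size_t maxLag) {
        size_t bestLag = 0;
        double bestScore = 0.0;
        for (size_t lag = 0; lag < maxLag; ++lag) {
            double score = 0.0;
            for (size_t n = 0; n + lag < mic.size() && n < ref.size(); ++n)
                score += ref[n] * mic[n + lag];
            if (score > bestScore) { bestScore = score; bestLag = lag; }
        }
        return bestLag;
    }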
Does anyone have an intuition whether it's even sensible to be considering
realtime AEC on watchOS? Just from a performance PoV it might rinse out the
battery really fast.
π
On Fri, 18 Oct 2024 at 11:55, Tamás Zahola via Coreaudio-api <
email@hidden> wrote:
> Hold on a sec, how are you planning to use the AudioDevice VAD on watchOS?
> It's a macOS-only API: not available on watchOS, nor on iOS.
>
> Now, considering what Julian wrote, I think your problem might be that
> you're using the VPIO unit in conjunction with the AudioDevice VAD. If what
> Julian wrote is true, that the AudioDevice already applies echo cancellation
> when VAD is enabled, then what could be happening is that your output signal
> is subtracted *twice* from the input: first by the echo canceller of the
> AudioDevice, and then by the VPIO unit. So in effect the VPIO unit ends up
> re-adding the echo with inverted phase (mic + echo becomes just mic after
> the first canceller, then mic - echo after the second).
>
> I would recommend trying just the AudioDevice directly, without the VPIO
> unit.
>
> Obviously, this is all macOS-only. On iOS (and I guess watchOS) you only
> have AudioUnits, so you must use your own VAD.
>
> Regards,
> Tamás Zahola
>
> On 2024. Oct 18., at 12:30, π via Coreaudio-api <
> email@hidden> wrote:
>
>
> Thanks for the pointer Tamás!
>
> Pulling out VAD from WebRTC worked a treat.
>
> I started with https://github.com/daanzu/py-webrtcvad-wheels and knocked
> together a hello.cpp and CMakeLists.txt (
> https://gist.github.com/p-i-/598da13d2a1a1e2a6ec978e15fa7d892)
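>
> For anyone following along, the extracted API boils down to this. The
> signatures below are as in the webrtc_vad.h I pulled; they've shifted a
> little between WebRTC checkouts, so treat it as a sketch. Creating and
> freeing per call is wasteful and only for illustration; in real code the
> VadInst lives for the session:
>
>     #include <cstdint>
>     #include "webrtc_vad.h"  // from webrtc/common_audio/vad
>
>     // Classify one 10ms frame (160 samples at 16kHz) as speech or not.
>     bool isSpeech(const int16_t* frame, int sampleRateHz, size_t samples) {
>         VadInst* vad = WebRtcVad_Create();
>         WebRtcVad_Init(vad);
>         WebRtcVad_set_mode(vad, 2);  // aggressiveness 0..3
>         int r = WebRtcVad_Process(vad, sampleRateHz, frame, samples);
>         WebRtcVad_Free(vad);
>         return r == 1;  // 1 = speech, 0 = non-speech, -1 = error
>     }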
>
> I have to say, it feels hella awkward that I cannot control the pipeline
> and use native AudioUnits for this kind of work.
>
> Surely it is a mistake on Apple's part to put VAD before AEC, if this is
> really what they're doing... it's gonna trigger the VAD callback on
> incoming/remote audio, rather than on user speech.
>
> For a low-power usage scenario (say WatchOS), I really want to be
> dynamically rerouting -- if there's no audio being sent thru the speaker, I
> don't want AEC eating CPU cycles, but I DO want VAD detecting user-speech
> onset. And if audio IS being sent thru the speaker, I want AEC to be
> subtracting it, and VAD to be operating on this "cleaned" mic-input. I'd
> love it if VoiceProcessingIO unit took care of all of this.
>
> I haven't yet managed to scientifically determine exactly what
> VoiceProcessingIO unit is actually doing, but if I engage its AEC and VAD
> and play a sine-wave, it disturbs the VAD callbacks, yet successfully
> subtracts the sinewave from mic-audio. So I strongly suspect they have
> these two subcomponents wired up in the wrong order.
>
> If this is indeed the case, is there any likelihood of a future fix? Do
> Apple Core Audio devs listen in on this list?
>
> π
>
> On Thu, 17 Oct 2024 at 10:24, Tamás Zahola via Coreaudio-api <
> email@hidden> wrote:
>
>> You can extract the VAD algorithm from WebRTC by starting at this file:
>> https://chromium.googlesource.com/external/webrtc/stable/src/+/master/common_audio/vad/vad_core.h
>>
>> You'll also need some stuff from the common_audio/signal_processing
>> folder, but otherwise it's self-contained.
>>
>> It's easy for me to get the audio-output-stream for MY app (it just comes
>> in over the websocket), but I may wish to toggle whether I want my AEC to
>> be cancelling out any output-audio generated by other processes on my mac.
>>
>>
>> From macOS Ventura onwards it is possible to capture system audio with
>> the ScreenCaptureKit framework, although your app will need extra privacy
>> permissions.
>>
>> It must be possible on macOS, as apps like SoundFlower or BlackHole are
>> able to do it.
>>
>>
>> BlackHole and SoundFlower are using an older technique, where they
>> install a virtual loopback audio device on the system (you can see it
>> listed in Audio MIDI Settings as e.g. "BlackHole 2 ch"), and change the
>> system's default output device to that, then capture from the input port of
>> this loopback device. But this requires installing the virtual device
>> in /Library/Audio/Plug-Ins/HAL, which requires admin privileges.
>>
>> But mobile, I'm not so sure. My memory of iPhone audio dev (~2008) is
>> that it was impossible to access this. But there's now some mention of v3
>> audio-units being able to process inter-app audio.
>>
>>
>> On iOS you must use the voice-processing I/O unit. Normal apps cannot
>> capture the system audio output. Technically there is a way to do it with
>> the ReplayKit framework, but it's a pain in the ass to use, and the primary
>> purpose of that framework is capturing screen content, not audio. If you
>> try e.g. Facebook Messenger on iOS, and initiate screen-sharing in a video
>> call, that's going to use ReplayKit.
>>
>> Regards,
>> Tamás Zahola
>>
>> On 17 Oct 2024, at 08:04, π via Coreaudio-api <
>> email@hidden> wrote:
>>
>> Thank you for the replies. I am glad to see that this mailing-list is
>> still alive, despite the dwindling traffic these last few years.
>>
>> Can I not encapsulate a VPIO unit, and control the input/output
>> audio-streams by implementing input/render callbacks, or making connections?
>>
>> I'm veering towards this approach of manual implementation: just use a
>> HALInput unit (misnamed, as it's actually I/O) on macOS, or a RemoteIO unit
>> on the mobile platforms, to access the raw I/O buffers, and write my own
>> pipeline; roughly as sketched below.
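>>
>> (Sketch only; error handling elided, and MyInputProc is a stand-in for my
>> input callback:)
>>
>>     AudioComponentDescription desc = {};
>>     desc.componentType = kAudioUnitType_Output;
>>     desc.componentSubType = kAudioUnitSubType_HALOutput;  // RemoteIO on mobile
>>     desc.componentManufacturer = kAudioUnitManufacturer_Apple;
>>
>>     AudioComponent comp = AudioComponentFindNext(NULL, &desc);
>>     AudioUnit unit;
>>     AudioComponentInstanceNew(comp, &unit);
>>
>>     // Enable input on bus 1; disable output on bus 0 (capture-only).
>>     UInt32 one = 1, zero = 0;
>>     AudioUnitSetProperty(unit, kAudioOutputUnitProperty_EnableIO,
>>                          kAudioUnitScope_Input, 1, &one, sizeof(one));
>>     AudioUnitSetProperty(unit, kAudioOutputUnitProperty_EnableIO,
>>                          kAudioUnitScope_Output, 0, &zero, sizeof(zero));
>>
>>     AURenderCallbackStruct cb = { MyInputProc, NULL };
>>     AudioUnitSetProperty(unit, kAudioOutputUnitProperty_SetInputCallback,
>>                          kAudioUnitScope_Global, 0, &cb, sizeof(cb));
>>
>>     AudioUnitInitialize(unit);
>>     AudioOutputUnitStart(unit);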
>>
>> Would it be a good idea to use https://github.com/apple/AudioUnitSDK to
>> wrap this? My hunch is to minimize the layers/complexity and NOT use this
>> framework.
>>
>> And for the AEC/VAD, can anyone offer a perspective? Arshia? The two
>> obvious candidates I see are WebRTC and Speex. GPT-4o reckons WebRTC will
>> be the more advanced / better-performing solution, with the downside that
>> it's a big project (and maybe a more complicated build process), while
>> Speex is more lightweight and will probably do the job well enough for my
>> purposes.
>>
>> And as both are open-source, I may have the option of pulling out the
>> minimal-dependency files and building just those.
>>
>> The last question is regarding system-wide audio output. It's easy for me
>> to get the audio-output-stream for MY app (it just comes in over the
>> websocket), but I may wish to toggle whether I want my AEC to be cancelling
>> out any output-audio generated by other processes on my Mac; e.g. if I am
>> watching a YouTube video, maybe I want my AI to listen to that, and maybe I
>> want it subtracted. So do I have the option to listen to SYSTEM-level audio
>> output (so as to feed it into my AEC impl)? It must be possible on macOS,
>> as apps like SoundFlower or BlackHole are able to do it. But mobile, I'm
>> not so sure. My memory of iPhone audio dev (~2008) is that it was
>> impossible to access this. But there's now some mention of v3 audio-units
>> being able to process inter-app audio.
>>
>> π
>>
>> On Wed, 16 Oct 2024 at 19:35, Arshia Cont via Coreaudio-api <
>> email@hidden> wrote:
>>
>>> Hi π,
>>>
>>> From my experience that’s not possible. VPIO is an option for the lower
>>> level IO device; so is VAD. You don’t have much control over their
>>> internals, routing, and wiring! Also, from our experience, VPIO has
>>> different behaviour on different devices. On some iPads we saw “gating”
>>> instead of actually removing echo (be aware of that!). In the end for a
>>> similar use-case we ended up doing our own AEC and Activity Detection.
>>>
>>> Cheers,
>>>
>>> Arshia Cont
>>> metronautapp.com
>>>
>>>
>>>
>>> On 15 Oct 2024, at 18:08, π via Coreaudio-api <
>>> email@hidden> wrote:
>>>
>>> Dear Audio Engineers,
>>>
>>> I'm writing an app to interact with OpenAI's 'realtime' API
>>> (bidirectional realtime audio over websocket with AI serverside).
>>>
>>> To do this, I need to be careful that the AI-speak doesn't make its way
>>> out of the speakers, back in thru the mic, and back to their server (else
>>> it starts to talk to itself, and gets very confused).
>>>
>>> So I need AEC, which I've actually got working, using
>>> kAudioUnitSubType_VoiceProcessingIO and setting
>>> kAUVoiceIOProperty_BypassVoiceProcessing to false via AudioUnitSetProperty.
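>>> i.e. something along these lines (sketch; vpioUnit is the instantiated
>>> VoiceProcessingIO unit):
>>>
>>>     // Ensure voice processing (incl. AEC) is NOT bypassed:
>>>     UInt32 bypass = 0;
>>>     AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_BypassVoiceProcessing,
>>>                          kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));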
>>>
>>> Now I also wish to detect when the speaker (me) is speaking or not
>>> speaking, which I've also managed to do
>>> via kAudioDevicePropertyVoiceActivityDetectionEnable.
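>>> (For reference, that looks something like this: a sketch, with deviceID
>>> the input AudioDevice and MyVADListener my listener proc; macOS 14+ I
>>> believe.)
>>>
>>>     AudioObjectPropertyAddress enableAddr = {
>>>         kAudioDevicePropertyVoiceActivityDetectionEnable,
>>>         kAudioDevicePropertyScopeInput,
>>>         kAudioObjectPropertyElementMain };
>>>     UInt32 on = 1;
>>>     AudioObjectSetPropertyData(deviceID, &enableAddr, 0, NULL,
>>>                                sizeof(on), &on);
>>>
>>>     AudioObjectPropertyAddress stateAddr = {
>>>         kAudioDevicePropertyVoiceActivityDetectionState,
>>>         kAudioDevicePropertyScopeInput,
>>>         kAudioObjectPropertyElementMain };
>>>     AudioObjectAddPropertyListener(deviceID, &stateAddr,
>>>                                    MyVADListener, NULL);
>>>     // In MyVADListener, AudioObjectGetPropertyData on stateAddr
>>>     // yields the current 0/1 speech state.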
>>>
>>> But getting them to play together is another matter, and I'm struggling
>>> hard here.
>>>
>>> I've rigged up a simple test (
>>> https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), where a
>>> 440Hz sinewave is generated in the render-callback, and mic-input is
>>> recorded to file in the input-callback.
>>>
>>> So the AEC works delightfully, subtracting the sinewave and recording my
>>> voice.
>>> And if I turn the sine-wave amplitude down to 0, the VAD correctly
>>> triggers the speech-started and speech-stopped events.
>>>
>>> But if I turn up the sine-wave, it messes up the VAD.
>>>
>>> Presumably the VAD is operating on the pre-echo-cancelled audio, which is
>>> most undesirable.
>>>
>>> How can I progress here?
>>>
>>> My thought was to create an audio pipeline, using AUGraph, but my
>>> efforts have thus far been unsuccessful, and I lack confidence that I'm
>>> even pushing in the right direction.
>>>
>>> My thought was to have an IO unit that interfaces with the hardware
>>> (mic/spkr), which plugs into an AEC unit, which plugs into a VAD unit.
>>>
>>> But I can't see how to set this up.
>>>
>>> On iOS there's a RemoteIO unit to deal with the hardware, but I can't
>>> see any such unit on macOS. It seems the VoiceProcessing unit wants to do
>>> that itself.
>>>
>>> And then I wonder: could I make a second VoiceProcessing unit, and have
>>> vp1_aec send its bus[1 (mic)] outputScope to vp2_vad's bus[1] inputScope?
>>>
>>> Can I do this kind of work by routing audio, or do I need to get my
>>> hands dirty with input/render callbacks?
>>>
>>> It feels like I'm going hard against the grain if I am faffing with
>>> these callbacks.
>>>
>>> If there's anyone out there that would care to offer me some guidance
>>> here, I am most grateful!
>>>
>>> π
>>>
>>> PS Is it not a serious problem that VAD can't operate on post-AEC input?