Here at Speechmatics, audio is the lifeblood of everything we do, from training our models right through to crafting effective demos of our technology. One of the best examples of this is our Portal translation demo, which allows the user to see their speech translated into a number of languages in realtime. However, accessing media devices through the browser isn't straightforward. Browsers require the user to explicitly permit access to the media device, and to make things even more complicated, each browser engine has its own quirks that have to be handled. In this article, I'll walk through how we were able to provide a consistent and straightforward microphone access experience for our demos across all the major browsers and devices.
What Are We Going To Build?
In order to demonstrate how to use the browser MediaDevices API, we're going to build a simple Next.js demo app. In this web app, I'll show how we can take microphone inputs, but also how to make sure that a good user experience is built around asking for and granting microphone permissions.
I've chosen Next.js because it's the main recommended runtime for React, it's fairly easy to get up and running, and it's also the front end framework of choice for our own internal sites. If you're using another front-end development framework, such as Svelte or Angular, don't worry - many of the concepts covered in this article still apply.
Once we're done, our sample web app will be able to do the following:
- Start recording the mic and pipe the audio input into the Speechmatics realtime client
- Prompt the user to give our web app access to the microphone
- Print out the realtime transcription
- Close the WebSocket connection and stop recording from the audio device
The complete code for this blog can be found on our GitHub. Let's start by talking about how we can gain access to the user's microphone.
What Exactly Is The MediaDevices API?
Browsers are wonderful things. Whenever you access a webpage, the browser downloads all the HTML and CSS and renders it onto your screen in a matter of milliseconds. It also downloads and runs any JavaScript associated with the webpage on your browser using the browser's JavaScript engine.
One of the nice things about browsers is that they know the web is a scary place. You never quite know what the website you're accessing might contain - it could even be malicious code that wants to access files on your computer. Because of this, browser engines are designed to provide a sandbox environment. They heavily restrict the kinds of things that JavaScript can do in order to secure the user's computer.
For the most part, this is great - we don't want pesky hackers taking control of our local PC. But sometimes, a website might have legitimate reasons to want to access the underlying local machine. For example, I may want to upload a file from my file system. Or, as in our case, I may want to access the computer's microphone inputs in order to allow the user to record some audio.
In order to facilitate this conditional access to the underlying computer, browsers expose a set of common APIs that require user interaction in order to be accessed.
The MediaDevices API is one such example. It provides secure access to the computer's media inputs, including audio, camera and screen devices.
We'll be using this API, and in particular the navigator.mediaDevices.getUserMedia() function call, to get a list of the user's microphones and to access the input in the form of a MediaStream object.
One more thing. It would be lovely if everyone exposed the same API all the time, wouldn't it? Unfortunately, they don't. Different browsers are built on different engines. For example, Chrome and Edge are both based on Chromium, whereas Firefox is built on Mozilla's own Gecko engine, parts of which are written in Rust 🦀. Because of these differences, their APIs have some divergent behaviour and they don't all support the same API calls. We'll get round to how that's going to impact us in a bit.
Accessing Microphones Through The Browser
When accessing microphones through the browser, there are a number of key behaviours we want to cover:
- Prompt the user for permission to access the microphone if we don't have it. This should only happen on a user action - either trying to select a microphone or opening the stream with the default microphone.
- Automatically enumerate all the audio devices if permission is already given
- Allow the user to select which microphone they want
- Open the stream for a given device, with a fallback to the default input device
- Automatically update the device list when the connected devices change
- Make it clear to the user when they have denied audio permissions
Let's start with the code for getting user permissions.
Getting User Permissions
In order to prompt the user for permissions, we use the MediaDevices API that we discussed before. At its most basic, we can call that API with the following code:
await navigator.mediaDevices
.getUserMedia({ audio: true, video: false })
.then((stream) => {
return true;
})
// If there is an error, we can't get access to the mic
.catch((err) => {
throw new Error('Unexpected error getting microphone access');
});
We call getUserMedia, which tells the browser to ask the user for permission to access the microphone. We return true if our call to the function is successful. In the case that the user rejects permission, or something unexpected goes wrong, we throw an error.
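Before moving on, it's worth knowing that the rejection reason tells you why access failed: getUserMedia rejects with a DOMException whose name is NotAllowedError when permission is refused, or NotFoundError when no matching input device exists. Here's a rough sketch of how you might branch on this (the logging is purely illustrative):
await navigator.mediaDevices
  .getUserMedia({ audio: true, video: false })
  .then((stream) => {
    return true;
  })
  .catch((err) => {
    // The user (or a browser policy) explicitly refused access
    if (err instanceof DOMException && err.name === 'NotAllowedError') {
      console.warn('Microphone access was denied');
      return false;
    }
    // No audio input device was found on this machine
    if (err instanceof DOMException && err.name === 'NotFoundError') {
      console.warn('No microphone available');
      return false;
    }
    throw new Error('Unexpected error getting microphone access');
  });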
This is very basic, but it's not perfect - we can do better! In fact, many modern browsers have a direct Permissions API that can be used to check the permissions the user has given the current web page. Let's make a function to wrap the Permissions API so we can ask whether the user has given permissions already. You'll see why this is useful in a minute.
// getPermissions is used to access the permissions API
// This API is not fully supported in all browsers so we first check the availability of the API
async function getPermissions() {
  if (navigator?.permissions) {
return (
navigator.permissions
// @ts-ignore - ignore because microphone is not in the enum of name for all browsers
?.query({ name: 'microphone' })
.then((result) => result.state)
.catch((err) => {
return 'prompt';
})
);
}
return 'prompt';
}
This API can return three possible results: granted, prompt or denied. If permissions are granted, we don't need to ask for them again, which can save us an extra code call. prompt is the default state, where the browser will ask the user for permission. However, the most useful state here is denied. If the result is denied, that means the user has explicitly blocked this web page from accessing the microphone. It's useful to know this because we can then show a message telling the user they won't be able to use the web page without altering the permissions in their browser settings. I won't go into that here, but it's worth keeping in mind.
Finally, note that this Permissions API does not have universal support, so we check that it is defined before calling it.
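One nicety the Permissions API also gives us: the PermissionStatus object returned by query() fires a change event, so you could keep the UI in sync if the user later unblocks the microphone in their browser settings. A rough sketch, reusing the same availability check:
if (navigator?.permissions) {
  navigator.permissions
    // @ts-ignore - microphone is not in the PermissionName enum for all browsers
    .query({ name: 'microphone' })
    .then((status) => {
      // Fires whenever the state flips between granted, prompt and denied
      status.addEventListener('change', () => {
        console.log('Microphone permission is now', status.state);
      });
    })
    .catch(() => {
      // Browsers that can't query the microphone permission end up here
    });
}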
Now we can ask for and query user permissions. This is useful code, but we also need to discuss how and when to run it. There are two situations we might want to check - when the user selects their microphone input, and when the user starts the transcription session. Let's start with the former.
Enumerating Devices
In the previous section, we laid out the code we need to get permission to access audio input devices. Now it's time to use that code to enumerate audio devices for the user and allow them to select their desired device.
The browser provides us with an API to enumerate devices, so let's use it:
// now we have permissions, we attempt to get the audio devices
return navigator.mediaDevices
.enumerateDevices()
.then(async (devices: MediaDeviceInfo[]) => {
return devices.filter((device: MediaDeviceInfo) => {
return device.kind == 'audioinput';
});
});
The above code is great and should work on most browsers. But it's a bit basic. It doesn't cover all the user experience bases that we discussed earlier. We can get permissions and access media devices, but we're not letting the user know they're blocked, and we're not automatically updating the list when the devices change. Let's start by refactoring our code a bit and moving it into a class so we have nice reusable chunks:
// This is a class so that we can extend EventTarget, used as a singleton: in the browser, the active devices are external to React and can be managed app-wide.
class AudioDevices extends EventTarget {
private busy = false;
private _denied = false;
private _devices: MediaDeviceInfo[] = [];
get denied() {
return this._denied;
}
set denied(denied) {
if (denied !== this._denied) {
this._denied = denied;
this.dispatchEvent(new Event('changeDenied'));
}
}
get devices() {
return this._devices;
}
set devices(devices) {
if (devices !== this._devices) {
this._devices = devices;
this.dispatchEvent(new Event('changeDevices'));
}
}
constructor() {
super();
if (typeof window !== 'undefined') {
this.updateDeviceList();
// We don't need to unsubscribe as this class is a singleton
navigator.mediaDevices.addEventListener('devicechange', () => {
this.updateDeviceList();
});
}
}
// A wrapped getUserMedia that manages denied and device state
public getUserMedia = async (constraints: MediaStreamConstraints) => {
    let stream: MediaStream | undefined;
try {
stream = await navigator.mediaDevices.getUserMedia(constraints);
this.denied = false;
} catch (ex) {
this.denied = true;
}
this.updateDeviceList();
return stream;
};
// getDevices is used to prompt the user to give permission to audio inputs
public getDevices = async () => {
// We first check if the system is busy - we don't want to prompt for permissions if the user is already prompted for permissions
if (!this.busy) {
this.busy = true;
await this.promptAudioInputs();
this.busy = false;
} else {
console.warn('getDevices already in progress');
}
};
// updateDeviceList is used to handle device enumeration once permissions have been given
private updateDeviceList = async () => {
const devices: MediaDeviceInfo[] =
await navigator.mediaDevices.enumerateDevices();
const filtered = devices.filter((device: MediaDeviceInfo) => {
return (
device.kind === 'audioinput' &&
device.deviceId !== '' &&
device.label !== ''
);
});
this.devices = filtered;
};
private promptAudioInputs = async () => {
const permissions = await getPermissions();
if (permissions === 'denied') {
this.denied = true;
return;
}
// If permissions are prompt, we need to call getUserMedia to ask the user for permission
if (permissions === 'prompt') {
await this.getUserMedia({
audio: true,
video: false,
});
} else {
this.updateDeviceList();
}
};
}
const audioDevices = new AudioDevices();
// getPermissions is used to access the permissions API
// This API is not fully supported in all browsers so we first check the availability of the API
async function getPermissions() {
if (navigator?.permissions) {
return (
navigator.permissions
// @ts-ignore - ignore because microphone is not in the enum of name for all browsers
?.query({ name: 'microphone' })
.then((result) => result.state)
.catch((err) => {
return 'prompt';
})
);
}
return 'prompt';
}
There's not too much we've added here; we've just spruced up our previous code by using a class and adding some event handling. By moving it into a class, we're able to keep track of important state information, including whether permissions are blocked and whether the permissions flow is busy. This is useful as we don't want to re-prompt for permissions if the user is already mid-flow.
You'll notice we've added an event listener to the mediaDevices API for the devicechange event. This is a useful event that is triggered whenever the list of media devices on the underlying hardware changes. That allows us to keep track of changes in the underlying device list and update our own state in the UI without the need for user input.
You'll also notice we added a couple of event dispatchers - changeDevices and changeDenied. What are these for? Well, this is React, so we need a way for the UI to react! These event dispatchers are going to be useful when implementing hooks.
Hook, Line and Syncer
It's tempting when dealing with async data fetching of this sort to reach for the well-known useEffect hook. I certainly did, until one of my colleagues clued me into a better option. You see, useEffect can be an error-prone choice in this case. In order for it to trigger, one of its dependencies needs to change, which in practice usually means it is retriggered by user actions. The situation we're actually faced with is a need to sync to an external but local store of information that is only loosely coupled to user activity. Step forward, useSyncExternalStore.
The useSyncExternalStore hook is the React team's recommended way to handle exactly this kind of scenario. But, amazingly, it's not very widely known, even amongst the React dev community. That includes yours truly.
useSyncExternalStore accepts two required arguments. The first is a subscribe function that registers a callback with the store and returns a function to unsubscribe it. The second returns a snapshot of the store's current state.
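In outline, a call to the hook looks something like this - the optional third argument supplies the snapshot to use during server rendering, which matters in a Next.js app:
const value = useSyncExternalStore(
  subscribe, // (onStoreChange) => returns an unsubscribe function
  getSnapshot, // returns the store's current value on the client
  getServerSnapshot, // optional: value to use while rendering on the server
);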
In our case, the custom class we have created is our store, and there are two different aspects of this store we want to subscribe to. First, the devices list, and second, whether we are blocked. Finally, we also want a hook that will allow the user to manually prompt a fetch of the underlying devices, with a possible permissions check. So let's add this code in:
// Here we subscribe to the device state browser event
// When devices change, the getDevices callback is invoked
function subscribeDevices(callback) {
audioDevices.addEventListener('changeDevices', callback);
return () => {
audioDevices.removeEventListener('changeDevices', callback);
};
}
const getDevices = () => audioDevices.devices;
export function useAudioDevices() {
return useSyncExternalStore(subscribeDevices, getDevices, getDevices);
}
// Here we subscribe to the user's provided permissions
// When the permission state changes, the useAudioDevices hook is called
function subscribeDenied(callback) {
audioDevices.addEventListener('changeDenied', callback);
return () => {
audioDevices.removeEventListener('changeDenied', callback);
};
}
const getDenied = () => audioDevices.denied;
export function useAudioDenied() {
return useSyncExternalStore(subscribeDenied, getDenied, getDenied);
}
export function useRequestDevices() {
return useCallback(() => audioDevices.getDevices(), []);
}
We create three hooks, two of which are subscribed to the underlying store and the final of which just returns a callback that can manually prompt for devices. Let's have a look at how this can be used on the front end:
export default function OuterComponent() {
...
// Get devices using our custom hook
const devices = useAudioDevices();
const denied = useAudioDenied();
const requestDevices = useRequestDevices();
// useEffect listens for changes in devices
// It sets a default deviceId if no valid deviceId is already set
useEffect(() => {
if (
devices.length &&
!devices.some((item) => item.deviceId === audioDeviceId)
)
setAudioDeviceId(devices[0].deviceId);
if (denied) setSessionState('blocked');
}, [devices, denied]);
...
return (
<div>
<MicSelect
disabled={!['configure', 'blocked'].includes(sessionState)}
onClick={requestDevices}
value={audioDeviceId}
options={devices.map((item) => {
return { value: item.deviceId, label: item.label };
})}
onChange={(e) => {
if (sessionState === 'configure') {
setAudioDeviceId(e.target.value);
} else if (sessionState === 'blocked') {
setSessionState('configure');
setAudioDeviceId(e.target.value);
} else {
console.warn('Unexpected mic change during state:', sessionState);
}
}}
/>
</div>
)
}
Our outer component calls our custom hooks. We then have a simple select component that allows us to select a microphone. The code for that can be found in the GitHub repo. The placeholder has a default value of "Default Audio Input". This is useful as it allows us to let the user start transcribing without selecting a device explicitly.
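For reference, here's a stripped-down sketch of what a component like MicSelect might look like - the real one in the repo adds styling and types, but the important parts are the placeholder option and the onClick used to trigger the permissions flow:
function MicSelect({ value, options, onChange, onClick, disabled }) {
  return (
    <select value={value} onChange={onChange} onClick={onClick} disabled={disabled}>
      <option value=''>Default Audio Input</option>
      {options.map((option) => (
        <option key={option.value} value={option.value}>
          {option.label}
        </option>
      ))}
    </select>
  );
}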
Okay, so we now have user permissions, we have a list of audio devices, and we're able to set the device ID. Next, we just need a way to capture the audio data and send it through the WebSocket. Before we get to that, let's briefly talk a bit more about browser APIs, and specifically about a few APIs that are going to help us capture audio input.
Script Processor, Audio Worklets and Media Recorder
We've now got access to the user's microphone - great! But we can't stop there. The stream represents our data input, and in order to send it through our WebSocket we need some way to get the streamed data as an output we can pass to another function call.
Reading data from a stream and passing its output to another function call is not a simple task. The stream of data we are processing is continuously outputting new data for us to read. If we were to naively handle this audio synchronously then the other processes in our web app would completely stop because the thread would be blocked. That's why modern browsers provide us with ways to handle this audio processing so that our main script thread won't be blocked.
If you've worked with modern browser audio APIs before, you may have heard of a few things: ScriptProcessorNode, AudioWorklet and MediaRecorder. I was a bit confused by what each of these were and their use cases, so I thought I'd share what I've learned here for the sake of clarification:
ScriptProcessorNode
ScriptProcessorNode is a way for web developers to handle audio processing. It is effectively an all-in-one audio processing solution.
Instantiating a ScriptProcessorNode creates an input buffer and an output buffer.
Whenever the input buffer reaches capacity, the audioprocess event is triggered, and a user-supplied audio processing callback is run.
Each event has an input buffer that can be read from and an output buffer that can be written to. Let's look at a quick example:
const playButton = document.querySelector("button");
// Create an AudioContext for processing
const audioCtx = new AudioContext();
// Create a ScriptProcessorNode with a bufferSize of 4096 and a single input and output channel
const scriptNode = audioCtx.createScriptProcessor(4096, 1, 1);
let source = null;
let outputAudio = [];
// Give the node a function to process audio events
// Each time the input buffer fills up, append its contents to our output array
scriptNode.onaudioprocess = (audioProcessingEvent) => {
  const inputData = audioProcessingEvent.inputBuffer.getChannelData(0);
  outputAudio = [...outputAudio, ...inputData];
};
// Wire up the play button: get the media stream source from our mic
// and connect it to the processing node
playButton.onclick = () => {
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    source = audioCtx.createMediaStreamSource(stream);
    source.connect(scriptNode);
    scriptNode.connect(audioCtx.destination);
  });
};
// When we're finished, disconnect everything and stop the mic tracks
function stopCapture() {
  if (source) {
    source.disconnect(scriptNode);
    scriptNode.disconnect(audioCtx.destination);
    source.mediaStream.getTracks().forEach((track) => track.stop());
  }
}
You can see from this sample that getting access to data is relatively straightforward. And the ScriptProcessorNode API has been around for a long time, so it has excellent cross-browser support.
You might be wondering, why not stop there? It works on every browser, it does what we need and it's been around for ages. Unfortunately, there are a few very important limitations. The first is that this audio processing happens asynchronously. In order to process the incoming audio I need to wait for my buffer to fill up, which means that I will always have at least a fixed amount of latency equivalent to the size of the buffer.
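To put a number on that: with the 4096-sample buffer used above and a 48kHz sample rate, each buffer takes 4096 / 48000 ≈ 85ms to fill, so every chunk of audio is delivered at least that late.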
The second problem relates to what I mentioned earlier about blocking the main code execution. ScriptProcessorNode executes on the same thread as the rest of the UI. This means if I'm doing heavy work rendering the UI and running JavaScript logic, whilst also doing heavy work processing audio, I'll end up introducing lag in one or more of these areas. This could result in glitchy graphics or audio instabilities such as stutter. Indeed, maybe you've experienced such online audio stutter yourself.
In order to address these issues, a new API was developed that moves audio processing off the main thread, in a similar spirit to Web Workers.
AudioWorklet
The AudioWorklet system was developed as a solution to the problems of ScriptProcessorNode. It is part of the most recent generation of Web Audio APIs. It allows developers to define an extension of the AudioWorkletProcessor class in a separate file. This new class is then used to process audio on a dedicated audio rendering thread rather than the main UI execution thread. This both avoids the need to wait for a full buffer before processing, and makes sure audio work doesn't block the main thread and cause glitchy behaviour.
As an aside, the general AudioWorklet framework also includes a whole range of useful new systems, like params that can be updated on the fly, oscillators, delays and other interesting audio processing tools. Let's take a quick look at the use of AudioWorklet. Note that this example is taken straight from the Mozilla docs.
First, the new processor needs to be defined in a separate script file. Here we create a RandomNoiseProcessor which generates random noise.
// random-noise-processor.js
class RandomNoiseProcessor extends AudioWorkletProcessor {
process(inputs, outputs, parameters) {
const output = outputs[0];
output.forEach((channel) => {
for (let i = 0; i < channel.length; i++) {
channel[i] = Math.random() * 2 - 1;
}
});
return true;
}
}
registerProcessor("random-noise-processor", RandomNoiseProcessor);
Then we load this processor script into our main script and send its output to the speaker, which is defined by the audioContext.destination.
const audioContext = new AudioContext();
await audioContext.audioWorklet.addModule("random-noise-processor.js");
const randomNoiseNode = new AudioWorkletNode(
audioContext,
"random-noise-processor",
);
randomNoiseNode.connect(audioContext.destination);
You can read more about AudioWorklets, how to work with them, and the problems they address here.
AudioWorklets have good cross-browser support, but they do have a few limitations. First, they can only be used in secure contexts - that is, HTTPS websites - though http://localhost is regarded as secure in order to support development work.
Second, AudioWorklets do not on their own provide a solution for capturing the audio output as an array. This means that they can't be used to capture the audio and save it in a file or send it via an internet connection. In order to do that, we need the MediaRecorder API.
MediaRecorder
MediaRecorder, part of the MediaStream Recording API, is used to get access to the audio within a stream in the form of a Blob. This Blob can then be used to store the stream data to disk, or to transmit the media data via TCP/UDP. In other words, if AudioWorklet replaces the data manipulation part of ScriptProcessorNode, then MediaRecorder replaces its capturing functionality.
Let's look at a quick example of how we might use MediaRecorder.
// Assuming we have some kind of media stream available, with options specifying its format
const mediaRecorder = new MediaRecorder(stream, options);
const recordedChunks = [];
mediaRecorder.ondataavailable = handleDataAvailable;
mediaRecorder.start();
function handleDataAvailable(event) {
console.log("data-available");
if (event.data.size > 0) {
recordedChunks.push(event.data);
console.log(recordedChunks);
} else {
// โฆ
}
}
The MediaRecorder provides us with a simple callback to get access to the media chunks, which we can then process any way we like - in this case, we're just adding them to an array. Unfortunately, there's a catch. There's always a catch.
Although it seems that between AudioWorklet and MediaRecorder we have all the functionality of ScriptProcessorNode in a more performant package, the MediaRecorder API has not yet been universally adopted across browsers. In particular, Opera, Firefox and some other browsers for Android may not support it. And like AudioWorklet, it's only available in secure contexts. You can keep up to date on support for MediaRecorder at caniuse.com.
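If you do go down the MediaRecorder route, it's worth feature-detecting it at runtime before committing. A minimal check might look something like this (the function name is mine, and the fallback is whichever capture path you prefer):
function hasUsableMediaRecorder() {
  // MediaRecorder is missing entirely in some browsers
  if (typeof window.MediaRecorder === 'undefined') {
    return false;
  }
  // Even where it exists, container/codec support varies, so check before use
  return (
    MediaRecorder.isTypeSupported('audio/webm') ||
    MediaRecorder.isTypeSupported('audio/mp4')
  );
}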
You may now be asking yourself, which one should I use?
Which One Should I Use?
Well, it depends. AudioWorklet and MediaRecorder offer excellent performance compared to ScriptProcessorNode. What's more, ScriptProcessorNode is officially deprecated whereas the others are not, so you can expect AudioWorklet and MediaRecorder to develop and improve over time.
However, if you're developing an app that you know will be used on the built-in Android Browser, then MediaRecorder in particular may simply be a no-go for you. Since many people would expect at least some of their users to be on Android, this has meant that in practice, many, many people still use the ScriptProcessorNode approach. It's very stable, has excellent cross-browser support, and does not require a secure context.
As long as the performance demands of your web app are low, then it may well make sense to use this approach. It's also worth noting that if your performance demands are very high, then you may well want to explore using WASM to handle audio processing. Although it still ultimately has to use the browser APIs, the fact that audio data processing can be handled in a lower-level language opens up the possibility of much greater performance optimisation.
We at Speechmatics chose the ScriptProcessorNode approach because of the pitfalls listed above. For this article, however, I thought I'd use the MediaRecorder API, partly because it's less commonly used and therefore harder to find examples of in the wild.
Transcribing The Microphone Input Using The Speechmatics SDK
Now that we've got a better understanding of the various audio capture APIs available, let's actually input some audio to the Speechmatics WebSocket.
To do this, I'm going to create a new class. A class is useful here because there are some values I need to keep in order to call cleanup methods on them later. Let's have a look at the initialisation logic that we want:
// AudioRecorder is a class that wraps the MediaRecorder and device stream items
// It also provides methods for starting and stopping recording
export class AudioRecorder {
stream: MediaStream;
recorder: MediaRecorder;
audioContext: AudioContext;
mediaStreamSource: MediaStreamAudioSourceNode;
dataHandlerCallback?: (data: Float32Array | Buffer) => void;
// The data handler callback is called when audio data is available
// It is used to send data to the websocket
constructor(dataHandlerCallback: (data: Float32Array | Buffer) => void) {
this.dataHandlerCallback = dataHandlerCallback;
}
...
We create a class with a data handler callback passed into the constructor. This callback will be responsible for sending audio data to the WebSocket. The remaining properties just hold references for the duration of the session so that we can perform cleanup at the end.
Now let's look at what we need to do to start the recording:
export class AudioRecorder {
...
async startRecording(deviceId: string) {
const AudioContext = globalThis.window?.AudioContext;
this.audioContext = new AudioContext({ sampleRate: SAMPLE_RATE_48K });
// We first check mic permissions in case they are explicitly denied
if ((await getPermissions()) === "denied") {
throw new Error("Microphone permission denied.");
}
// Here we set the sample rate and the deviceId that the user has selected
// If deviceId isn't set, then we get a default device (which is expected behaviour)
let audio: MediaTrackConstraintSet = {
sampleRate: SAMPLE_RATE_48K,
deviceId,
};
// Now we open the stream
let stream = await navigator.mediaDevices.getUserMedia({ audio });
// Store the stream so we can close it on session end
this.stream = stream;
// Instantiate the MediaRecorder instance
this.recorder = new MediaRecorder(stream);
// This is the event listening function that gets called when data is available
this.recorder.ondataavailable = (ev: BlobEvent) => {
ev.data
.arrayBuffer()
.then((data) => this.dataHandlerCallback?.(Buffer.from(data)));
};
// Start recording from the device
// The number passed in indicates how frequently in milliseconds ondataavailable will be called
this.recorder.start(500);
// return the sample rate
return { sampleRate: this.audioContext.sampleRate };
}
...
What we're doing here consists of 3 small steps. We first open the stream using the getUserMedia call. Conveniently, if the user hasn't already given microphone permissions, this will request them for us and allow us to use a default device.
We then pass the stream into a MediaRecorder constructor to create a MediaRecorder instance, and assign our newly created stream and recorder to the class properties to persist them.
From this instance, we then create the ondataavailable callback function, where the BlobEvent is converted into a simple audio buffer to make it compatible with the server API. Finally, we call the recorder.start() method with a value of 500. This value sets the interval in milliseconds at which the ondataavailable callback is invoked with the current audio buffer.
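One detail worth flagging in that first step: passing deviceId as a bare value is a soft constraint, so if the selected device has disappeared the browser quietly falls back to another input. If you would rather fail loudly, you can make the constraint exact, at the cost of handling the resulting error yourself - a sketch:
// With an exact constraint, an unknown deviceId rejects with an
// OverconstrainedError instead of silently using a different microphone
const audio: MediaTrackConstraintSet = deviceId
  ? { sampleRate: SAMPLE_RATE_48K, deviceId: { exact: deviceId } }
  : { sampleRate: SAMPLE_RATE_48K };
const stream = await navigator.mediaDevices.getUserMedia({ audio });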
If we were to call this function, then we'd be recording audio data from the user's microphone! The only remaining question is what we do when we're finished. Let's define a few simple methods to handle cleanup.
export class AudioRecorder {
...
// stopRecording is called when the session ends
// It shuts down the stream and recorder and sets all properties to null
async stopRecording() {
this.mediaStreamSource?.disconnect();
this.recorder?.stop();
this.stopStream();
this.resetRecordingProperties();
}
// stopStream stops all tracks in the stream
private stopStream() {
this.stream?.getTracks().forEach((track) => track.stop()); //stop each one
}
// resetRecordingProperties makes sure we have a clean slate for the next session startup
private resetRecordingProperties() {
this.stream = null;
this.mediaStreamSource = null;
}
...
To stop recording, we disconnect the media stream source, stop the recorder and shut down the stream. We also set the stream and stream source to null, ready to be reset the next time around.
Great - we now have a working audio recorder! The final step is to memoise this class in the main component of our app and to start recording based on the user's input. In the simplest terms, it might look something like this:
export default function Main({ jwt }: MainProps) {
const [transcription, setTranscription] = useState<
RealtimeRecognitionResult[]
>([]);
const [audioDeviceId, setAudioDeviceId] = useState<string>("");
const [sessionState, setSessionState] = useState<SessionState>("configure");
const rtSessionRef = useRef(new RealtimeSession(jwt));
// Get devices using our custom hook
const devices = useAudioDevices();
const denied = useAudioDenied();
const requestDevices = useRequestDevices();
// useEffect listens for changes in devices
// It sets a default deviceId if no valid deviceId is already set
useEffect(() => {
if (
devices.length &&
!devices.some((item) => item.deviceId === audioDeviceId)
)
setAudioDeviceId(devices[0].deviceId);
if (denied) setSessionState('blocked');
}, [devices, denied]);
// sendAudio is used as a wrapper for the WebSocket to check the socket is finished init-ing before sending data
const sendAudio = (data: Float32Array | Buffer) => {
if (
rtSessionRef.current.rtSocketHandler &&
rtSessionRef.current.isConnected
) {
rtSessionRef.current.sendAudio(data);
}
};
// Memoise AudioRecorder so it doesn't get recreated on re-render
const audioRecorder = useMemo(() => new AudioRecorder(sendAudio), []);
// Attach our event listeners to the realtime session
rtSessionRef.current.addListener("AddTranscript", (res) => {
setTranscription([...transcription, ...res.results]);
});
// start audio recording once the web socket is connected
rtSessionRef.current.addListener("RecognitionStarted", async () => {
setSessionState("running");
});
rtSessionRef.current.addListener("EndOfTranscript", async () => {
setSessionState("configure");
await audioRecorder.stopRecording();
});
rtSessionRef.current.addListener("Error", async () => {
setSessionState("error");
await audioRecorder.stopRecording();
});
// Call the start method on click to start the WebSocket
const startTranscription = async () => {
setSessionState("starting");
await audioRecorder.startRecording(audioDeviceId)
.then(async () => {
setTranscription([]);
await rtSessionRef.current.start({
transcription_config: { max_delay: 2, language: "en" },
audio_format: {
type: "file",
},
});
}).catch(err => setSessionState("blocked"))
};
// Stop the transcription on click to end the recording
const stopTranscription = async () => {
await audioRecorder.stopRecording();
await rtSessionRef.current.stop();
};
  return (
<div>
<div className='flex-row'>
<p>Select Microphone</p>
{sessionState === 'blocked' && (
<p className='warning-text'>Microphone permission is blocked</p>
)}
</div>
<MicSelect
disabled={!['configure', 'blocked'].includes(sessionState)}
onClick={requestDevices}
value={audioDeviceId}
options={devices.map((item) => {
return { value: item.deviceId, label: item.label };
})}
onChange={(e) => {
if (sessionState === 'configure') {
setAudioDeviceId(e.target.value);
} else if (sessionState === 'blocked') {
setSessionState('configure');
setAudioDeviceId(e.target.value);
} else {
console.warn('Unexpected mic change during state:', sessionState);
}
}}
/>
<TranscriptionButton
sessionState={sessionState}
stopTranscription={stopTranscription}
startTranscription={startTranscription}
/>
{sessionState === 'error' && (
<p className='warning-text'>Session encountered an error</p>
)}
{['starting', 'running', 'configure', 'blocked'].includes(
sessionState,
) && <p>State: {sessionState}</p>}
<p>
{transcription.map(
(item, index) =>
(index && !['.', ','].includes(item.alternatives[0].content)
? ' '
: '') + item.alternatives[0].content,
)}
</p>
</div>
);
}
We can then pass the stop and start functions to a button which will be able to perform these operations for the user. Congrats, that's pretty much it!
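The TranscriptionButton itself can stay very simple - something along these lines, although the version in the repo handles the intermediate states a little more gracefully:
function TranscriptionButton({ sessionState, startTranscription, stopTranscription }) {
  const running = ['starting', 'running'].includes(sessionState);
  return (
    <button onClick={running ? stopTranscription : startTranscription}>
      {running ? 'Stop Transcribing' : 'Start Transcribing'}
    </button>
  );
}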
Conclusion
In this article, we've seen how we can use browser APIs to ask the user for permission to access audio devices, to enumerate those devices, and to capture their output and feed it into a realtime system of our choice - in this case, the Speechmatics API!
There's a lot more we could talk about here. We could go in-depth on error handling and the UX that surrounds it. I've also skipped a lot of detail around the Speechmatics API, as that wasn't the point of this article.
Hopefully, you should now have a good understanding of how to work with browser audio APIs. And remember, if you want to see the complete working example, including error handling and all the custom styles and components, you can see it on our GitHub.
Finally, if you do have any questions, don't hesitate to reach out to someone on the Speechmatics team on Twitter/X, GitHub, Stack Overflow or LinkedIn.