Adding Sidetone to Skype
- Posted: Jun 09, 2009 at 11:44 AM
Ever use a headset with Skype and get frustrated that it was too quiet? This article shows how to add the sound of your own voice to the headset, so you don't end up exhausted from shouting to be heard.
This article will discuss how to build a software tool that makes it easier to talk into Skype with a headset.
Recently my family began experimenting with Skype, but we found that talking over Skype can be exhausting with headphones. After doing a little research, I learned that the problem was that we only heard the other party. You'd think that this is a good thing, but we have a social brain that works hard to gauge our behavior and adjust.
The answer is to feed a little bit of the microphone back into the earpiece, so that our brain knows how loud we're talking. This is called side tone in the telecommunication industry. Without it, we talk louder and louder until we're sure that we'll be heard.
I couldn't find a Skype plug-in to do this… but I had just read a Coding4Fun article, by Mark Heath, about adding audio effects to Skype. I decided I was going to write a tool to add this feedback.
First, let's take a look at how to use the application. Once it is running, there will be a microphone icon in the lower right hand corner of the screen:
Figure 1: Icon in the system notification tray
Clicking on it brings up the application control window:
Figure 2: The application's controls
Let's look at the side tone controls:
You can do a “sound check” to see if the feedback is working by clicking on the Sound Check button, and adjusting the volume. I found that the volume setting that works best in a conversation is much, much lower than what works in a sound check.
Next, let's look at the AGC (“Automatic Gain Control”) section. When making a call, the software can automatically adjust how loud you sound to the other party:
The Automatic Gain Control has three sliders:
I originally started this project by creating an effect for the Skype Voice Changer. My plan was to open a WaveStream, begin playing it on the headphones and copy the microphone stream to it.
I quickly found that this was not the way to add feedback. The underlying “WaveOut” system had a huge latency: everything I heard in the headphones was at least a second behind what I was saying. This made it even harder to talk than before, and I had to abandon it.
While researching the problem I found a DirectSound code sample that I could modify into doing what I wanted. (This initial prototype didn't coordinate with Skype – it attached to the microphone and copied the sounds to the output, at a lower volume. But it was always on.) The sound in the headphones is still slightly behind the microphone, but the delay is barely perceptible, and we'll muffle the sound a bit more to make it less distinguishable.
From here on I shall describe the major – or technically interesting – components of the program. We'll look at:
Note: I won't be describing how to connect to Skype. Mark Heath's description is very good.
The DirectSound code is in AudioLoop.cs. The module sets DirectSound to capture sound from the default recording device at 16 bits per sample, at either 16000 or 44100 samples per second. The capture and playback buffers are configured in “looping” mode to act as circular buffers. The capture buffer eventually overwrites samples, and we'll lose them if we don't act fast enough; if we don't update the playback buffer, it will repeat the same sound over and over.
The StartMicrophone() procedure sets up the capture and playback buffers. Then it creates a thread to do the work. The thread is at a high priority so that if the OS has a choice between (say) email or processing sound, it does the sound.
The StopMicrophone() procedure stops the worker thread and cleans up the resources.
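The start/stop pattern described above can be sketched roughly as follows. This is an illustration only, not the code from AudioLoop.cs: the class name `AudioWorker`, the `ProcessOneWindow` placeholder, and the choice of `ThreadPriority.Highest` are assumptions, and the DirectSound buffer setup is left out.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: a high-priority worker thread with a clean shutdown.
public sealed class AudioWorker
{
    private volatile bool _runnable;   // checked by the loop, cleared by Stop()
    private Thread _thread;
    private int _windowsProcessed;

    public int WindowsProcessed { get { return _windowsProcessed; } }
    public bool IsRunning { get { return _thread != null && _thread.IsAlive; } }

    public void Start()
    {
        if (_thread != null) return;               // already started
        _runnable = true;
        _thread = new Thread(Loop);
        _thread.IsBackground = true;               // don't keep the process alive
        _thread.Priority = ThreadPriority.Highest; // assumed priority level
        _thread.Start();
    }

    public void Stop()
    {
        if (_thread == null) return;
        _runnable = false;                         // ask the loop to exit
        _thread.Join();                            // wait until it has exited
        _thread = null;
    }

    private void Loop()
    {
        while (_runnable)
            ProcessOneWindow();
    }

    // Placeholder for "wait for a sample window, then process it".
    private void ProcessOneWindow()
    {
        Interlocked.Increment(ref _windowsProcessed);
        Thread.Sleep(1);
    }
}
```

Calling `Start()` and later `Stop()` mirrors the StartMicrophone()/StopMicrophone() pair; the `Join()` makes sure the loop has finished before any buffers would be released.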
The software processes a fixed number of samples at a time, called the “sample window.” The buffers are several times the size of the sample window, so that the system can keep capturing and playing while the software is processing them. The sound processing loop is the heart of the application:
C#
while (Runnable)
{
    for (int I = 0, offset = 0; Runnable && I < _bufferPositions; I++, offset += bufferSize)
    {
        // Wait for the sample areas to be ready
        notificationEvent[I].WaitOne(Timeout.Infinite, true);

        // Get the sound samples
        byte[] buffer = (byte[]) captureBuffer.Read(offset, typeof (byte), LockFlag.None, bufferSize);

        // Convert samples to 16bit PCM
        for (int L = buffer.Length, J = 10, K = 0; K < L; K += 2)
            PCM16Buffer[J++] = (Int16) ((UInt16) buffer[K] | (((UInt16) buffer[K + 1]) << 8));

        // Play them out to the ear, if applicable
        if (null != playbackBuffer)
        {
            // Perform a low pass filter to "muffle" the sound
            Butterworth(PCM16Buffer, 10, LPSample, Coefs);

            // put the muffled sample into the output buffer
            // -- The lock flag seems to work, but others may work too
            playbackBuffer.Write(Idx, LPSample, LockFlag.None);
            Idx += buffer.Length;
            if (Idx >= 4*bufferSize)
                Idx -= 4*bufferSize;
            if (!playing)
            {
                playbackBuffer.Volume = _Volume;
                playbackBuffer.Play(0, BufferPlayFlags.Looping);
                playing = true;
            }
        }

        // Process the sound and deliver it to Skype
        if (null != outStream)
        {
            int L = AGC.Process(PCM16Buffer, 10, buffer);
            if (0 != L)
                outStream.BeginSend(buffer, 0, L, SocketFlags.None, SendCallback, null);
            // Note: could send out pink noise if L == 0
        }

        // Move the sliding window of the previous 10 samples into the start
        // of the PCM16Buffer
        for (int K = 0, J = PCM16Buffer.Length - 10; K < 10; K++, J++)
            PCM16Buffer[K] = PCM16Buffer[J];
    }
}
Note the “for” loop at the bottom of the code. This preserves the last 10 incoming samples at the start of the buffer. This is needed to make the sound processing smooth, and will be discussed a bit later.
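The byte-to-sample conversion in the loop above can be looked at in isolation. The sketch below is a standalone illustration, assuming little-endian 16-bit samples as in the loop; the class name `Pcm16` and its two helpers are hypothetical and are not part of the project.

```csharp
using System;

// Hypothetical helpers for converting between raw bytes and 16-bit PCM samples.
public static class Pcm16
{
    // Decode little-endian byte pairs into signed 16-bit samples.
    public static short[] FromBytes(byte[] bytes)
    {
        short[] samples = new short[bytes.Length / 2];
        for (int i = 0; i < samples.Length; i++)
        {
            int lo = bytes[2 * i];
            int hi = bytes[2 * i + 1];
            // The high byte carries the sign bit, so the cast must wrap, not throw
            samples[i] = unchecked((short)(lo | (hi << 8)));
        }
        return samples;
    }

    // Encode signed 16-bit samples as little-endian byte pairs.
    public static byte[] ToBytes(short[] samples)
    {
        byte[] bytes = new byte[samples.Length * 2];
        for (int i = 0; i < samples.Length; i++)
        {
            bytes[2 * i] = (byte)(samples[i] & 0xFF);            // low byte
            bytes[2 * i + 1] = (byte)((samples[i] >> 8) & 0xFF); // high byte
        }
        return bytes;
    }
}
```

Round-tripping a buffer through `FromBytes` and `ToBytes` should give back the original bytes, which is a quick way to check the byte order.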
How big should the sample window be? This is a bit of a trade-off between responsiveness and design complexity.
I chose a window big enough to hold 10 milliseconds of sound. Since the ear is sensitive to delays of even 30 milliseconds, I cut this down so that a delay wouldn't be perceptible. (When I tried a 50 millisecond window, my voice came out of the headphones sounding like an echo... and I found myself talking slower and slower.) The sample window could be made smaller, but I am sure that there is a point where the OS won't schedule the audio loop to wake and run more frequently. And, as the sample window gets smaller, the processing may drop in quality, because it doesn't have enough to work with.
The capture buffer is 8 times the size of the sample window. This ratio is arbitrary, but I wanted the buffer to be about an order of magnitude larger. My rationale is that if the processing falls behind, the sound – for the Skype call – won't be dropped. I feel that it is more important to preserve sound quality for the other party than to preserve the quality of feedback.
The playback buffer is four times the sample window. I wanted it small, so that if the processing fell behind, the replaying of a sound will seem to be a continuation of a current sound.
When writing the sound to the playback buffer, we have to track where in the buffer to put the samples. I tried to use GetCurrentPosition() to find where to write next in the playback buffer; this created terrible sound. Instead, the software uses a local variable to track where to write next.
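To make the sizes concrete, here is a small sketch of the arithmetic behind the sample window and the wrapping write position. It assumes a single (mono) channel, which the article does not state, and the names `BufferMath`, `WindowBytes`, and `Advance` are made up for this example.

```csharp
using System;

// Hypothetical helpers for sample-window sizes and a wrapping write offset.
public static class BufferMath
{
    public const int BytesPerSample = 2;       // 16-bit samples, mono assumed
    public const int WindowMilliseconds = 10;  // sample window length

    // Bytes in one sample window at the given sample rate.
    public static int WindowBytes(int sampleRate)
    {
        return sampleRate * WindowMilliseconds / 1000 * BytesPerSample;
    }

    // Advance a write offset through a circular buffer, wrapping at the end.
    public static int Advance(int offset, int written, int bufferBytes)
    {
        offset += written;
        if (offset >= bufferBytes)
            offset -= bufferBytes;
        return offset;
    }
}
```

Under these assumptions, a 16000 samples-per-second stream gives a 320-byte window, a 2560-byte capture buffer (8 windows), and a 1280-byte playback buffer (4 windows). After four writes of one window each, `Advance` brings the playback offset back to zero.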
How do we keep in sync with the sound capture – how do we know when a sample buffer is ready?
The application gives a table of buffer indices and WaitHandles to DirectSound. When the capture buffer's write index reaches one of those indices, it signals the corresponding WaitHandle. The worker thread cycles through the WaitHandles, performing a WaitOne() on each, one at a time. As a convenience, we use a specific kind of WaitHandle called AutoResetEvent. This type of WaitHandle sets itself back to a “wait” state once WaitOne() returns.
If the thread has gotten behind, the WaitOne will return immediately, the loop processes the sample, and begins to catch up with the work.
We must use a separate AutoResetEvent for each of the 8 capture windows. The AutoResetEvent doesn't tell us if it was signaled multiple times. If only one AutoResetEvent handle were used, the thread wouldn't know that two (or more) sample windows were ready. Instead, it would process just one, falling further behind and adding latency. This would happen randomly over time, and be hard to test consistently.
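The "lost signal" behaviour is easy to reproduce without any audio code. The sketch below is a standalone demonstration, with the hypothetical names `AutoResetDemo`, `WakeupsFromSharedEvent`, and `WakeupsFromEventPerWindow`; it signals events up front and then counts how many waits succeed.

```csharp
using System;
using System.Threading;

// Hypothetical demo: one shared AutoResetEvent versus one event per window.
public static class AutoResetDemo
{
    // Signal a single shared event several times, then count successful waits.
    public static int WakeupsFromSharedEvent(int signals)
    {
        using (var shared = new AutoResetEvent(false))
        {
            for (int i = 0; i < signals; i++)
                shared.Set();                  // repeated Set() calls collapse into one
            int wakeups = 0;
            while (shared.WaitOne(0))          // zero timeout: poll without blocking
                wakeups++;
            return wakeups;
        }
    }

    // Signal one event per window, then count successful waits.
    public static int WakeupsFromEventPerWindow(int windows)
    {
        var events = new AutoResetEvent[windows];
        for (int i = 0; i < windows; i++)
            events[i] = new AutoResetEvent(false);
        foreach (var e in events)
            e.Set();
        int wakeups = 0;
        foreach (var e in events)
        {
            if (e.WaitOne(0))
                wakeups++;
            e.Dispose();
        }
        return wakeups;
    }
}
```

With three pending signals, the shared event yields a single wakeup, while three separate events yield three. That difference is why the capture loop needs one event per window.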
This project came together so quickly and easily – once I found the right approach – that I couldn't resist getting fancy. I added a low-pass filter to muffle the feedback a little. And I added automatic gain control, as an experimental option.
For both of these I used a filtering algorithm called “IIR” (this stands for Infinite Impulse Response – but that term is a confusing mouthful, so let's just call it IIR). IIR is a special purpose virtual machine. Low-pass filters, high-pass filters, combinations of those filters, and even equalizers can be specified, and very specific design techniques (acting like a compiler) convert them into an IIR implementation.
(You could, instead, “compile” the filters to be the resistor values to use in a hardware circuit. That's programming in solder!)
The machine code for these IIR virtual machines is just two lists of coefficients, called A & B. The software emulator looks like the following bit of code:
C#
Out[0] = Sample[0];
Out[1] = Sample[1];
for (int Idx = 2; Idx < N; Idx++)
{
    Out[Idx] = B0 * Sample[Idx]
             + B1 * Sample[Idx - 1]
             + B2 * Sample[Idx - 2]
             // … more like this …
             // Next, the feed back
             - A1 * Out[Idx - 1]
             - A2 * Out[Idx - 2]
             // … more like this …
             ;
}
IIRs are easy to implement, and take less CPU power than other methods. But sometimes they sound poor; if they sound too bad, you'll want to use a different technique. I found that the low-pass filters in this project work well for some microphones, and add a slight crackle to others.
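For readers who want to run the "emulator" idea directly, here is a small general-purpose version that takes the A and B lists as arrays. It is a sketch, not the project's code: the class name `IirFilter` is hypothetical, it assumes A[0] is 1 (and ignores it), and it starts from a zeroed history instead of copying the first samples through as the outline above does.

```csharp
using System;

// Hypothetical direct-form IIR filter driven by two coefficient lists.
// b = feed-forward coefficients; a = feedback coefficients (a[0] assumed to be 1).
public sealed class IirFilter
{
    private readonly double[] _b, _a;
    private readonly double[] _x;   // _x[i] = input from i steps ago
    private readonly double[] _y;   // _y[i] = output from i steps ago (i >= 1)

    public IirFilter(double[] b, double[] a)
    {
        if (b.Length == 0 || a.Length == 0)
            throw new ArgumentException("Coefficient lists must not be empty.");
        _b = (double[])b.Clone();
        _a = (double[])a.Clone();
        _x = new double[b.Length];
        _y = new double[a.Length];
    }

    // Filter one sample; the history is kept between calls.
    public double Step(double sample)
    {
        // Shift the input history and insert the new sample
        for (int i = _x.Length - 1; i > 0; i--)
            _x[i] = _x[i - 1];
        _x[0] = sample;

        double acc = 0.0;
        for (int i = 0; i < _b.Length; i++)
            acc += _b[i] * _x[i];           // feed-forward part
        for (int i = 1; i < _a.Length; i++)
            acc -= _a[i] * _y[i];           // feedback part

        // Shift the output history and record the new output
        for (int i = _y.Length - 1; i > 1; i--)
            _y[i] = _y[i - 1];
        if (_y.Length > 1)
            _y[1] = acc;
        return acc;
    }
}
```

For example, `new IirFilter(new[] { 0.5 }, new[] { 1.0, -0.5 })` is a simple smoother: feeding it a constant 1 gives 0.5, 0.75, 0.875, and so on, creeping toward 1.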
For the low pass filter to create the muffling, I used a Butterworth filter, using the code below. It takes a buffer of signed, 16-bit samples, and then converts the 16-bit values into a byte array suitable for the sound buffer.
The filter code is a bit different from the example code in the previous section. Most of the differences are for speed.
One difference is not for speed: the filter preserves its state between calls. This trick keeps the filter smooth, and it is needed because the sample window is so small. If we didn't preserve the state, the filter would be starting and stopping so frequently that it would add distracting clicks to the output sound. The filter's performance would also be weakened, because the sample window isn't big enough to hold sounds lower than (about) 200 Hz. With these values preserved, the filter isn't starting and stopping, and doesn't really know about the sample window. All of the IIR filters in this program use similar techniques.
C#
// Output history, preserved across calls so the filter never restarts
static double O_1 = 0.0, O_2 = 0.0;

static void Butterworth(Int16[] InBuffer, int Ofs, byte[] OutBuffer, double[] Coefs)
{
    double C0 = Coefs[0], C1 = Coefs[1], C2 = Coefs[2], C3 = Coefs[3];
    // Seed the input history from the samples saved just ahead of Ofs
    double I_1 = InBuffer[Ofs - 1], I_2 = InBuffer[Ofs - 2];
    for (int L = InBuffer.Length, J = 0, I = Ofs; I < L; I++)
    {
        double I_0 = InBuffer[I];

        // Filter the samples
        double A = (I_0 + I_2) * C0 + I_1 * C1;
        I_2 = I_1;
        I_1 = I_0;
        A = A - O_1 * C2 - O_2 * C3;
        O_2 = O_1;
        O_1 = A;

        // Convert it back to 16 bit, clamping to the valid range
        // (the "else if" keeps the low clamp from being overwritten)
        Int16 S;
        if (A < -32767) S = -32767;
        else if (A > 32767) S = 32767;
        else S = (Int16) A;

        // Store it
        OutBuffer[J++] = (byte)(S & 0xFF);
        OutBuffer[J++] = (byte)(((UInt16)S >> 8) & 0xFF);
    }
}
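The article does not show where the four values in Coefs come from. One common way to get them, offered here as an assumption and not as the author's method, is the second-order low-pass from the widely used "Audio EQ Cookbook" formulas with Q = 1/√2, which gives a Butterworth response. The sketch below returns the values in the same order that Butterworth() reads them; the class name `ButterworthDesign` and the cutoff parameter are hypothetical.

```csharp
using System;

// Hypothetical coefficient design for a 2nd-order Butterworth low-pass filter.
public static class ButterworthDesign
{
    // Returns { C0, C1, C2, C3 } for the recurrence
    //   out = (in0 + in2) * C0 + in1 * C1 - out1 * C2 - out2 * C3
    public static double[] LowPass(double cutoffHz, double sampleRate)
    {
        double w0 = 2.0 * Math.PI * cutoffHz / sampleRate;
        double cosW = Math.Cos(w0);
        // alpha = sin(w0) / (2Q), with Q = 1/sqrt(2) for a Butterworth response
        double alpha = Math.Sin(w0) / Math.Sqrt(2.0);
        double a0 = 1.0 + alpha;   // normalizing factor

        return new double[]
        {
            (1.0 - cosW) / 2.0 / a0,  // C0: weight of the current and oldest input
            (1.0 - cosW) / a0,        // C1: weight of the middle input
            -2.0 * cosW / a0,         // C2: weight of the previous output
            (1.0 - alpha) / a0        // C3: weight of the output before that
        };
    }
}
```

A quick sanity check on any such set of coefficients: a constant input should pass through unchanged, so `(2*C0 + C1) / (1 + C2 + C3)` should come out as 1.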
Automatic Gain Control
I decided next to tackle a problem where my wife's voice did not carry well on calls. This happens a lot to her with cell phones – and answering machines. I was pretty sure that the problem was poor automatic-gain-control (AGC). The typical amplifier in a headset (and in Skype) estimates how loud our voice is, then increases – or decreases – the volume to a reasonable level. It was deciding that my wife's voice was background noise, and cutting her off.
I chose to write an alternate gain control that amplified the sound and passed it to Skype. That way we'd have four to choose from: the one built into the microphone, the sound card's, mine, and Skype's. (To be fair, these automatic gain controls work well in most cases.)
The main portion of the gain control is implemented in the file GainControl.cs. The control algorithm is:
The portion of code that calculates the gain looks like (CutOff_dB, LowGain_dB, and TgtGain_dB are the three slider values):
C#
if (!AutoGain)
    Gain = 1.0;
else
{
    double MaxGain;
    double dB = Analyze(InBuffer, Ofs, out MaxGain);
    if (dB < CutOff_dB)
        Gain = 0.0;
    else if (dB < LowGain_dB + 4.0)
    {
        Gain = Math.Exp((LowGain_dB - dB) * db2log);
    }
    else
    {
        Gain = Math.Exp((TgtGain_dB - dB) * db2log);
    }
    Gain = (0.4 * Gain + 0.6 * PrevGain);
    if (Gain > MaxGain)
        Gain = MaxGain;
    PrevGain = Gain;
}

// Skip further process if there is silence
if (0.0 == Gain)
{
    return 0;
}
If you look at the code, you'll see that we don't compare directly with LowGain_dB; rather, we compare the estimated volume with LowGain_dB + 4. This gives a little “hysteresis”: if we raise our voice momentarily, the software won't suddenly make it the highest possible volume. Instead, the software lowers the volume a little bit.
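The gain formula is easier to follow with numbers. The sketch below isolates the decibel-to-gain conversion and the smoothing step. The value of `db2log` is not shown in the article; here it is assumed to be ln(10)/20, the usual factor for turning a decibel difference into an amplitude multiplier. The names `GainMath`, `GainFor`, and `Smooth` are hypothetical.

```csharp
using System;

// Hypothetical helpers mirroring the gain calculation.
public static class GainMath
{
    // Assumed value of db2log: decibels -> natural-log exponent for amplitude.
    public static readonly double Db2Log = Math.Log(10.0) / 20.0;

    // Amplitude multiplier that moves a measured level to a target level.
    public static double GainFor(double targetDb, double measuredDb)
    {
        return Math.Exp((targetDb - measuredDb) * Db2Log);
    }

    // Blend the new gain with the previous one (weights taken from the code above).
    public static double Smooth(double gain, double prevGain)
    {
        return 0.4 * gain + 0.6 * prevGain;
    }
}
```

Under that assumption, a sound measured 20 dB below the target gets a gain of 10, and a sound already at the target gets a gain of 1. The smoothing step then moves only 40% of the way toward the new gain on each window, which keeps the volume from jumping.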
When the software changes the sample rate, it basically needs to know how many input samples to skip. At the start of a call, the software computes this, calling it InInc:
C#
// Calculate how we resample to 16 kHz
InInc = (int)(1024.0 * SampleRate / 16000.0);
The process of applying the gain adjustment, performing a low pass filter and re-sampling is below:
C#
int NumSamples = InBuffer.Length;
int End = OutBuffer.Length;
int NextIdxForOut = -InInc;
int OutIdx = 0;
for (int I = Ofs; OutIdx < End && I < NumSamples; I++)
{
    // Retrieve the sample
    Int16 S = InBuffer[I];
    double I_0 = S;

    // Apply Gain
    I_0 *= Gain;

    // 8khz low pass filter
    if (DoLP)
    {
        // Simple Butterworth 8KHz low-pass filter
        double A = (I_0 + DS_I_2) * LP[0] + DS_I_1 * LP[1];
        DS_I_2 = DS_I_1;
        DS_I_1 = I_0;
        I_0 = A - DS_O_1 * LP[2] - DS_O_2 * LP[3];
        DS_O_2 = DS_O_1;
        DS_O_1 = I_0;

        // Change sample rate
        int Tmp = NextIdxForOut / 1024;
        if (I < Tmp)
            continue;
        NextIdxForOut += InInc;
    }

    // Convert it back to 16 bit
    S = (Int16)I_0;

    // Store it
    OutBuffer[OutIdx++] = (byte)(S & 0xFF);
    OutBuffer[OutIdx++] = (byte)(((UInt16)S >> 8) & 0xFF);
}
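The sample-skipping part of the loop above relies on fixed-point arithmetic: InInc is the input step per output sample, scaled by 1024 so that fractional steps survive in an integer. The sketch below shows only that idea, stripped of gain and filtering. It is a simplified illustration with hypothetical names (`Resample`, `DecimateTo16k`), and it starts counting at zero instead of at -InInc.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of fixed-point sample skipping (no filtering).
public static class Resample
{
    public static short[] DecimateTo16k(short[] input, int sampleRate)
    {
        // Input samples per output sample, scaled by 1024
        int inInc = (int)(1024.0 * sampleRate / 16000.0);
        long next = 0;                        // scaled index of the next sample to keep
        var output = new List<short>();
        for (int i = 0; i < input.Length; i++)
        {
            if (i < next / 1024)
                continue;                     // not yet at the next kept position
            output.Add(input[i]);
            next += inInc;                    // step ahead by one output sample
        }
        return output.ToArray();
    }
}
```

At 16000 samples per second every sample is kept; at 32000 every second one is kept; at 44100 the step is about 2.76, so the loop alternates between skipping one and two samples. In a real signal path the low-pass filter has to run first, as it does in the loop above, or the dropped samples would cause aliasing.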
Estimating Volume
How loud “it should be” is controlled by a slider on the screen. The software estimates how loud the sound is by using an algorithm devised by David Robinson that takes into account how it sounds to a person. This way we can increase the gain on hard to hear sounds, and reduce the gain on sounds that a person is very sensitive to.
The loudness estimator, implemented in the Analysis procedure in GainAnalysis.cs, uses the following algorithm:
The code to “normalize” the sound into how a person hears it is below. Along the way it computes the square of the samples (used in step 2). Like the earlier IIR filters, these filters preserve their variables across calls. The first IIR is a yulewalk filter, but it preserves its old intermediate values in an array. Like the trick in AudioLoop, where we copy the last 10 samples into the start of the current buffer, the analysis procedure copies the last 10 intermediate values into the start of the YuleTmp array.
The output of the yulewalk filter is fed into a 150Hz high-pass filter. It is essentially the same as the low-pass filter described earlier.
C#
for (int L = Samples.Length, N = Ofs; N < L; N++)
{
    int _V = Samples[N];
    double V = _V;

    // Track the largest absolute sample value
    if (_V > MaxSample) MaxSample = _V;
    if (-_V > MaxSample) MaxSample = -_V;

    // Perform yulewalk filter
    double S = V * YuleCoefs[0];
    for (int J = N - 1, I = 1; I < 11; I++, J--)
        S += Samples[J] * YuleCoefs[I];
    for (int J = N - 1, I = 11; I < 21; I++, J--)
        S -= YuleTmp[J] * YuleCoefs[I];

    // Store for the feedback into the next stage of the yule walk
    YuleTmp[N] = S;

    // Perform butterworth high-pass filter stage, using S as an input
    double Accum = (S + GA_I_2) * HPCoefs[0] + GA_I_1 * HPCoefs[1];
    GA_I_2 = GA_I_1;
    GA_I_1 = S;
    Accum = Accum - GA_O_1 * HPCoefs[2] - GA_O_2 * HPCoefs[3];
    GA_O_2 = GA_O_1;
    GA_O_1 = Accum;

    // The square of the filtered results
    Sum += Accum * Accum;
}

// Copy the intermediate yulewalk state for the next call
// (this is needed since we are looking at a fairly small time window)
// J must advance with I, otherwise one value is copied ten times
for (int I = 0, J = YuleTmp.Length - 10; I < 10; I++, J++)
    YuleTmp[I] = YuleTmp[J];
The mean-squared is computed:
C#
// The mean square of the filtered results
double MS = Sum / NumSamples;
Tracking the last 750ms of samples is a simple matter of putting it into a circular queue:
C#
MSQueue[QIdx++] = MS;
if (QIdx >= MSQueue.Length)
    QIdx = 0;
Next is the code that finds the value 95% of the way into the sorted buffer, moving further up the buffer while that value is effectively zero (below 400). It is a straightforward copy-the-array, sort it, and fetch:
C#
Array.Copy(MSQueue, SortedQ, MSQueue.Length);
Array.Sort(SortedQ);

// Return the 95%
double X = SortedQ[Q95Idx];
for (int I = Q95Idx + 1; X < 400.0 && I < SortedQ.Length; I++)
    X = SortedQ[I];
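Putting the circular queue and the percentile lookup together gives something like the sketch below. It is a self-contained illustration, not the code from GainAnalysis.cs: the class name `LoudnessHistory` is hypothetical, and the way the 95% index is derived (queue length × 95 / 100) is an assumption, since the article does not show how Q95Idx is computed.

```csharp
using System;

// Hypothetical sketch: a circular queue of mean-square values with a 95% lookup.
public sealed class LoudnessHistory
{
    private readonly double[] _queue;    // most recent mean-square values
    private readonly double[] _sorted;   // scratch copy used for sorting
    private readonly int _q95Idx;        // assumed: 95% of the way into the queue
    private int _next;                   // where the next value is written

    public LoudnessHistory(int windows)
    {
        _queue = new double[windows];
        _sorted = new double[windows];
        _q95Idx = windows * 95 / 100;
    }

    // Record one window's mean square, overwriting the oldest entry.
    public void Add(double meanSquare)
    {
        _queue[_next++] = meanSquare;
        if (_next >= _queue.Length)
            _next = 0;
    }

    // Value 95% of the way into the sorted history, skipping near-zero entries.
    public double Percentile95()
    {
        Array.Copy(_queue, _sorted, _queue.Length);
        Array.Sort(_sorted);
        double x = _sorted[_q95Idx];
        for (int i = _q95Idx + 1; x < 400.0 && i < _sorted.Length; i++)
            x = _sorted[i];
        return x;
    }
}
```

With a 10 millisecond sample window, 750 ms of history would be 75 entries, for example `new LoudnessHistory(75)`. Sorting a copy keeps the queue itself in arrival order, so the oldest value is always the next one overwritten.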
Next, override this value if the current sample window is very, very quiet – that is, the user stopped talking. (If we don't do this, we'll amplify background noise between words)
C#
if (MS < X * 0.40 && MS < 12800.0)
    X = MS;
Finally, convert the result into decibels (or a reasonable approximation of a decibel)
C#
return 10.0 * Math.Log10(X * 0.5 + double.Epsilon);
Note: The logarithm function expects a positive, non-zero floating point number. However, the value we pass to it can be zero, and for zero Math.Log10 returns negative infinity, which is a bad value for the gain calculation. The simplest thing to do is check whether the value we are passing is “zero” and not call the logarithm. However, I learned a long time ago to just add “epsilon” to whatever we pass. This can really improve performance on number crunching.
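The effect of the epsilon trick can be checked in a few lines. The sketch below is a simplified stand-in for the return statement above (it leaves out the 0.5 factor), and the names `SafeLog` and `ToDecibels` are hypothetical.

```csharp
using System;

// Hypothetical helper showing the "add epsilon" guard around Log10.
public static class SafeLog
{
    // Power in decibels, guarded against taking the logarithm of zero.
    public static double ToDecibels(double meanSquare)
    {
        // double.Epsilon is the smallest positive double, so it changes
        // nothing for ordinary values but keeps a zero input finite
        return 10.0 * Math.Log10(meanSquare + double.Epsilon);
    }
}
```

`Math.Log10(0.0)` is negative infinity, while `SafeLog.ToDecibels(0.0)` is a very large negative but finite number. That value falls below any cutoff slider setting, so silence is still treated as silence, and no branch is needed.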
This concludes how to add a little bit of feedback and fancy amplification to your Skype phone calls.
If you want to try this out, the download link for the source code is at the top of the article!
If you'd like to experiment further, here are ideas of what can be done:
Randall Maas writes firmware for medical devices, and consults in embedded software. Before that he did a lot of other things… like everyone else in the software industry. You can contact him at randym@acm.org.
Randall,
Thanks for preparing this work; looks very promising - I really dislike the absence of sidetones in Skype and because I instinctively end up talking extra loud on my headset (to hear myself) it bothers folks around me.
However, I'm having trouble with the code - specifically, the microphone initialization is failing - and sst aborts. Much of the time there actually is no mic on my system - it is present only when I actually plug in the headset - that sst would fail under those conditions I understand. -- But even when my headset/mic are plugged in sst fails to init the microphone object.
Presumably this is due to the transient nature of my usage - do you have any idea why the initialization would fail even when a headset/mic IS plugged in?