Loading user information from Channel 9

Something went wrong getting user information from Channel 9

Latest Achievement:

Loading user information from MSDN

Something went wrong getting user information from MSDN

Visual Studio Achievements

Latest Achievement:

Loading Visual Studio Achievements

Something went wrong getting the Visual Studio Achievements


We've all cringed as a hopelessly out of tune contestant appears on the latest episode of “American Idol.” Occasionally, there's a contestant who manages to be pitch perfect all the way through—right until they flub the final note. And in the cutthroat world of televised auditions, sing one slightly flat note and you're out.

So what takes care of a bad-pitch day? Autotune—an effect that corrects the pitch of your voice so you'll never again sing out of tune. And now, with the power of modern microprocessors, autotune is possible in real-time, allowing singers to benefit from its almost magical powers during live concerts.

The company most famous for its autotune effect is Antares. Antares Auto-Tune currently retails for $249, and a stripped down version is available for $100. In addition to simply improving the pitch of a dodgy singer, autotune can be used to create unique robotic sounding vocal effects, a technique massively popular in recent years thanks to its use by artists such as T-Pain and the group behind the “Auto-Tune the News” YouTube videos. In 1998, when the effect was first used on Cher's “Believe” single, the producer used such extreme settings that instead of subtly adjusting the pitch, autotune “snapped” instantaneously to the nearest “correct” note.

Here is a nerdy example of what Autotune can do.

How does Autotune work?

An autotune effect has two parts. The first is pitch detection, which calculates the dominant frequency of the incoming signal, and is the reason autotune is normally used on monophonic audio sources (i.e. playing one note at a time, not whole chords). So, if your guitar is out of tune, you're out of luck (Celemony's Melodyne product, however, features some incredible capabilities for pitch-shifting polyphonic audio).

The second stage is pitch shifting, or “correcting” a given note. However, the bigger the pitch shift required, the more artificial the end result will be, and it is worth noting that absolutely perfect pitch is not always desirable. Sometimes the blended notes resulting from vibrato, for example, are an important part of the performance, and eliminating them would be detrimental.

Creating a .NET Autotune Algorithm

For this project, we will be creating an autotune effect for .NET. Serious recording enthusiasts should just go out and buy a decent autotune effect, but to have a little fun, we'll see if we can make an autotune that can give us a poor man's Cher effect (or T-Pain effect, if you prefer).

To get started, I searched to see if there were some pre-existing open source autotune implementations, which brought me to awesomebox, a project created by Ravi Parikh and Keegan Poppen while they were students at Stanford University. They kindly gave me permission to make use of their code, which uses an auto-correlator for pitch detection and an open source pitch-shifting algorithm from audio DSP expert Stephan M. Bernsee.

Porting C++ to C#

Although porting from C/C++ to C# is not exactly fun, there is enough similarity in the syntax that it is possible to complete without too many changes. You need to remember, however, that a long in C is an int in C# (i.e. 32 bits long not 64).

Additionally, the C# compiler is fussier than C/C++ when it comes to casting between floats, doubles, and ints. Putting the “f” suffix on numeric literals sorts out most of these compiler errors.

Pointers can be a pain. I tend to replace them with integer variables used to index into an array. You can of course use unsafe code, but that limits your options if you plan to port to Silverlight or Windows Phone 7 at a later date, neither of which allow unsafe code or interop into unmanaged code. The necessary mathematical functions are available in the System.Math class.

To see an example, compare this pitch shifting C++ source file with my C# conversion of it.

Capturing Audio with NAudio

Interop wrappers for the NAudio Windows WaveIn APIs capture the audio. Here is the code used to start recording:


waveIn = new WaveIn();
waveIn.DeviceNumber = recordingDevice;
waveIn.DataAvailable += waveIn_DataAvailable;
waveIn.RecordingStopped += new EventHandler(waveIn_RecordingStopped);
waveIn.WaveFormat = recordingFormat;


waveIn = New WaveIn
waveIn.DeviceNumber = recordingDevice
AddHandler waveIn.DataAvailable, AddressOf waveIn_DataAvailable
AddHandler waveIn.RecordingStopped, AddressOf waveIn_RecordingStopped
waveIn.WaveFormat = _recordingFormat

The steps are:

  1. Create a new WaveIn device
  2. [Optional] set up the device number (0 for default recording device)
  3. Add a handler to the DataAvailable event—this is where we will receive the raw audio data
  4. Add a handler for the RecordingStopped event. This allows us to close the temporary WAV file we created
  5. Set up the recording format. For this project we are going to record in mono (i.e. one channel), 16 bit, 44.1kHz audio—the default setting for most microphones
  6. Call the StartRecording method

Whenever the soundcard reports a new buffer of recorded audio, we receive it in the DataAvailable event handler:


void waveIn_DataAvailable(object sender, WaveInEventArgs e)
    byte[] buffer = e.Buffer;
    int bytesRecorded = e.BytesRecorded;
    WriteToFile(buffer, bytesRecorded);

    for (int index = 0; index < e.BytesRecorded; index += 2)
        short sample = (short)((buffer[index + 1] << 8) |
                                buffer[index + 0]);
        float sample32 = sample / 32768f;


Private Sub waveIn_DataAvailable(ByVal sender As Object, ByVal e As WaveInEventArgs)
    Dim buffer() = e.Buffer
    Dim bytesRecorded = e.BytesRecorded
    WriteToFile(buffer, bytesRecorded)

    For index = 0 To e.BytesRecorded - 1 Step 2
        Dim sample = CShort(buffer(index + 1)) << 8 Or CShort(buffer(index + 0))
        Dim sample32 = sample / 32768.0F
    Next index
End Sub

The WaveInEventArgs contains the number of bytes recorded (e.BytesRecorded) and a pointer to the buffer containing those bytes (e.Buffer). The handler does two things with the recorded data. First, it calls WriteToFile, which uses the WaveFileWriter class from NAudio to write the data to disk:


// before we start recording, set up a WaveFileWriter...
writer = new WaveFileWriter(waveFileName, recordingFormat);

// ... every block we receive we write it to the WaveFileWriter:
writer.WriteData(buffer, 0, bytesRecorded);

// ... and when recording stops we must call Dispose to finalize the
// .WAV file properly


writer = New WaveFileWriter(waveFileName, _recordingFormat)
writer.WriteData(buffer, 0, bytesRecorded)

Converting Audio to Floating Point and Back Again

Once recording has completed, we have a WAV file on which to perform our autotune effect. However, our WAV file consists of 16 bit samples (i.e. System.Int16 aka short). In other words, we have a sequence of byte pairs, each of which represent a number in the range -32768 to 32767. For the digital signal processing we will be performing, it's best to have a sequence of floating point numbers (System.Single or float) in the range -1.0f to 1.0f. This is a common requirement, so NAudio provides a utility class to convert audio from short to float called Wave16ToFloatProvider. Here's the code that takes a WAV file and implements the autotune algorithm on it:


public static void ApplyAutoTune(string fileToProcess, string tempFile, AutoTuneSettings autotuneSettings)
    using (WaveFileReader reader = new WaveFileReader(fileToProcess))
        IWaveProvider stream32 = new Wave16toFloatProvider(reader);
        IWaveProvider streamEffect = new AutoTuneWaveProvider(stream32, autotuneSettings);
        IWaveProvider stream16 = new WaveFloatTo16Provider(streamEffect);
        using (WaveFileWriter converted = new WaveFileWriter(tempFile, stream16.WaveFormat))
            // buffer length needs to be a power of 2 for FFT to work nicely
            // however, make the buffer too long and pitches aren't detected fast enough
            // successful buffer sizes: 8192, 4096, 2048, 1024
            // (some pitch detection algorithms need at least 2048)
            byte[] buffer = new byte[8192]; 
            int bytesRead;
                bytesRead = stream16.Read(buffer, 0, buffer.Length);
                converted.WriteData(buffer, 0, bytesRead);
            } while (bytesRead != 0 && converted.Length < reader.Length);


Public Shared Sub ApplyAutoTune(ByVal fileToProcess As String,
    ByVal tempFile As String,
    ByVal autotuneSettings As AutoTuneSettings)
    Using reader As New WaveFileReader(fileToProcess)
        Dim stream32 As IWaveProvider = New Wave16ToFloatProvider(reader)
        Dim streamEffect As IWaveProvider = New AutoTuneWaveProvider(stream32, autotuneSettings)
        Dim stream16 As IWaveProvider = New WaveFloatTo16Provider(streamEffect)
        Using converted As New WaveFileWriter(tempFile, stream16.WaveFormat)
            ' buffer length needs to be a power of 2 for FFT to work nicely
            ' however, make the buffer too long and pitches aren't detected fast enough
            ' successful buffer sizes: 8192, 4096, 2048, 1024
            ' (some pitch detection algorithms need at least 2048)
            Dim buffer(8191) As Byte
            Dim bytesRead As Integer
                bytesRead = stream16.Read(buffer, 0, buffer.Length)
                converted.WriteData(buffer, 0, bytesRead)
            Loop While bytesRead <> 0 AndAlso converted.Length < reader.Length
        End Using
    End Using
End Sub

Here's how it works:

  1. First we use a WaveFileReader to open the file that we've just created, which contains 16 bit audio samples
  2. Then we use the Wave16ToFloatProvider to perform the conversion to floating point samples
  3. Next we pass it through our autotune effect (the AutotuneWaveProvider). We'll explain how this works this later
  4. Then we use the WaveFloatTo16Provider to convert back to 16 bit samples ready for saving to WAV (we could save a 32 bit WAV, but it would be rather wasteful of disk space)
  5. Having set up the audio pipeline, we can read from the WaveFloatTo16Provider and pull audio right through from the WAV file. We need to read in block sizes that are a power of 2, since we are passing the data through FFTs. If we want to read arbitrary block sizes, we need to introduce another element into our pipeline to buffer up enough data to pass through FFTs
  6. Finally, we use the WaveFileWriter to write the data we have read into a WAV file

The AutoTuneWaveProvider

As we saw in the last code snippet, the AutoTuneWaveProvider is the piece in our audio pipeline that actually performs the autotune effect. It implements the NAudio IWaveProvider interface, which allows it to be used in the pipeline for real-time playback if necessary, even though our example code is not doing this (see the section on performance later). Here's the AutoTuneWaveProvider constructor:


public AutoTuneWaveProvider(IWaveProvider source, AutoTuneSettings autoTuneSettings)
    this.autoTuneSettings = autoTuneSettings;
    if (source.WaveFormat.SampleRate != 44100)
        throw new ArgumentException("AutoTune only works at 44.1kHz");
    if (source.WaveFormat.Encoding != WaveFormatEncoding.IeeeFloat)
        throw new ArgumentException("AutoTune only works on IEEE floating point audio data");
    if (source.WaveFormat.Channels != 1)
        throw new ArgumentException("AutoTune only works on mono input sources");

    this.source = source;
    this.pitchDetector = new AutoCorrelator(source.WaveFormat.SampleRate);
    this.pitchShifter = new SmbPitchShifter(Settings);
    this.waveBuffer = new WaveBuffer(8192);


Public Sub New(ByVal source As IWaveProvider, ByVal autoTuneSettings As AutoTuneSettings)
    Me.autoTuneSettings = autoTuneSettings
    If source.WaveFormat.SampleRate <> 44100 Then
        Throw New ArgumentException("AutoTune only works at 44.1kHz")
    End If
    If source.WaveFormat.Encoding <> WaveFormatEncoding.IeeeFloat Then
        Throw New ArgumentException("AutoTune only works on IEEE floating point audio data")
    End If
    If source.WaveFormat.Channels <> 1 Then
        Throw New ArgumentException("AutoTune only works on mono input sources")
    End If

    Me.source = source
    Me.pitchDetector = New AutoCorrelator(source.WaveFormat.SampleRate)
    ' alternative pitch detector:
    ' Me.pitchDetector = New FftPitchDetector(source.WaveFormat.SampleRate)
    Me.pitchShifter = New SmbPitchShifter(Settings, source.WaveFormat.SampleRate)
    Me.waveBuffer = New WaveBuffer(8192)
End Sub

Some points to notice:

  1. We pass in a source IWaveProvider—this is where the data will be coming from
  2. We check that the source is in the right format—floating point mono input.
  3. We also pass in an AutoTuneSettings object. This not only encapsulates the settings for autotune, it is important if you want to adjust the settings in real-time while the effect is running
  4. We then create the two key components of our autotune effect: a pitch detector (which uses an autocorrelator), and a pitch shifter
  5. Finally we create a buffer to use for audio processing. This can be a byte[] array, but we use WaveBuffer from NAudio because it uses a clever trick that allows us to cast a byte[] into a float[] without using unsafe code or having to copy all of the data

The key method on any implementation of IWaveProvider is its Read method. This is where the audio consumer, usually the sound card or a WaveFileWriter, asks for data. The data must be supplied as a byte array, and if at all possible you should return exactly the number of bytes you were asked for (if you can't, an extra layer of buffering is usually required, or audio playback will be choppy). Here's our implementation of the Read method:


public int Read(byte[] buffer, int offset, int count)
    if (waveBuffer == null || waveBuffer.MaxSize < count)
        waveBuffer = new WaveBuffer(count);

    int bytesRead = source.Read(waveBuffer, 0, count);

    // the last bit sometimes needs to be rounded up:
    if (bytesRead > 0) bytesRead = count;

    int frames = bytesRead / sizeof(float); 
    float pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames);
    // an attempt to make it less "warbly" by holding onto the pitch 
    // for at least one more buffer
    if (pitch == 0f && release < maxHold)
        pitch = previousPitch;
        this.previousPitch = pitch;
        release = 0;
    WaveBuffer outBuffer = new WaveBuffer(buffer);

    pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, 0.0f, outBuffer.FloatBuffer, frames);

    return frames * 4;


Public Function Read(ByVal buffer() As Byte, ByVal offset As Integer,
                        ByVal count As Integer) As Integer Implements NAudio.Wave.IWaveProvider.Read
    If waveBuffer Is Nothing OrElse waveBuffer.MaxSize < count Then
        waveBuffer = New WaveBuffer(count)
    End If

    Dim bytesRead = source.Read(waveBuffer, 0, count)
    'Debug.Assert(bytesRead = count)

    ' the last bit sometimes needs to be rounded up:
    If bytesRead > 0 Then
        bytesRead = count
    End If

    Dim frames = bytesRead \ Len(New Single) ' MRH: was count
    Dim pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames)

    ' MRH: an attempt to make it less "warbly" by holding onto the pitch for at least one more buffer
    If pitch = 0.0F AndAlso release < maxHold Then
        pitch = previousPitch
        release += 1
        Me.previousPitch = pitch
        release = 0
    End If

    Dim midiNoteNumber = 40
    Dim targetPitch = CSng(8.175 * Math.Pow(1.05946309, midiNoteNumber))

    Dim outBuffer As New WaveBuffer(buffer)

    pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, targetPitch, outBuffer.FloatBuffer, frames)

    Return frames * 4
End Function

Here's what's going on

  1. First we need to read from our source (in our case, a WAV file converted to floating point samples)
  2. If we get less data than we were expecting, we know that means we're at the end of the file, so we'll just pretend we got a full buffer
  3. We then work out how many audio ‘frames' are present (which is the same as the number of samples since this is mono audio). It's floating point audio, so frames equal bytes divided by four
  4. We then pass the data through our pitch detector algorithm (see below)
  5. Next we use some experimental code to stabilize pitch detection by reporting the previous frequency when no pitch is picked up
  6. Finally we pass the data into our pitch shifter, including details of the detected pitch

Now that we've seen the big picture of the AutotuneWaveProvider, let's drill down into its two main components—the pitch detector and pitch shifter.

Pitch Detection with Autocorrelation

The pitch detection part of autotune is vital to getting good results. If it can't accurately detect the input pitch, it will incorrectly calculate how much the pitch needs to be adjusted. However, high quality pitch detection is quite difficult to get right. First of all, the microphone may well pick up background noise. Second, when you sing a into a microphone, the signal consists not only of a single frequency, but also “harmonics” at different frequencies.

The good news is that we need to detect only the primary pitch.

The awesomebox algorithm makes use of “autocorrelation” for its pitch detection, but I made a few small tweaks to how the algorithm is implemented in an attempt to improve its accuracy. Autocorrelation has the advantage of being a relatively quick process. The basic principle is that if a signal is periodic, it will “correlate” well with itself when shifted forward (or backwards) one cycle.

Let's say we are looking to see if the note “Middle C” is being sung. The frequency of Middle C is around 262Hz. If we are sampling at 44.1kHz (which is standard for CD quality audio), then we will expect the signal to repeat at approximately every 168 samples (44100/262). Accordingly, for every sample in the buffer, we calculate the sum of squares of that sample and the sample 168 samples previous. We do this for every possible offset that measures a frequency in the range we want to detect (I am using 85Hz to 300Hz, which is adequate for pitch detecting vocals). The offset with the highest score is the most likely frequency.

Let's have a look at the code for an autocorrelation algorithm, starting with the constructor for the AutoCorrelator class:


public AutoCorrelator(int sampleRate)
    this.sampleRate = (float)sampleRate;
    int minFreq = 85;
    int maxFreq = 255;

    this.maxOffset = sampleRate / minFreq;
    this.minOffset = sampleRate / maxFreq;


Public Sub New(ByVal sampleRate As Integer)
    Me.sampleRate = CSng(sampleRate)
    Dim minFreq = 85
    Dim maxFreq = 255

    Me.maxOffset = sampleRate \ minFreq
    Me.minOffset = sampleRate \ maxFreq
End Sub

First of all, we pre-calculate some values based on the minimum and maximum frequencies we are looking for. Remember that lower frequencies are harder to detect than higher frequencies, so don't set minFreq too low. MaxOffset and MinOffset are the maximum and minimum backwards distances we will be seeking while looking for a match.


public float DetectPitch(float[] buffer, int frames)
    if (prevBuffer == null)
        prevBuffer = new float[frames];

    float maxCorr = 0;
    int maxLag = 0;

    // starting with low frequencies, working to higher
    for (int lag = maxOffset; lag >= minOffset; lag--)
        float corr = 0; //  sum of squares
        for (int i = 0; i < frames; i++)
            int oldIndex = i - lag;
            float sample = ((oldIndex < 0) ? prevBuffer[frames + 
            corr += (sample * buffer[i]);
        if (corr > maxCorr)
            maxCorr = corr;
            maxLag = lag;

    for (int n = 0; n < frames; n++)
        prevBuffer[n] = buffer[n]; 
    float noiseThreshold = frames / 1000f;

    if (maxCorr < noiseThreshold || maxLag == 0) return 0.0f;
    return this.sampleRate / maxLag;


Public Function DetectPitch(ByVal buffer() As Single, ByVal frames As Integer) As Single Implements IPitchDetector.DetectPitch
    If prevBuffer Is Nothing Then
        prevBuffer = New Single(frames - 1){}
    End If
    Dim secCor As Single = 0
    Dim secLag = 0

    Dim maxCorr As Single = 0
    Dim maxLag = 0

    ' starting with low frequencies, working to higher
    For lag = maxOffset To minOffset Step -1
        Dim corr As Single = 0 ' this is calculated as the sum of squares
        For i = 0 To frames - 1
            Dim oldIndex = i - lag
            Dim sample = (If(oldIndex < 0, prevBuffer(frames + oldIndex), buffer(oldIndex)))
            corr += (sample * buffer(i))
        Next i
        If corr > maxCorr Then
            maxCorr = corr
            maxLag = lag
        End If
        If corr >= 0.9 * maxCorr Then
            secCor = corr
            secLag = lag
        End If
    Next lag
    For n = 0 To frames - 1
        prevBuffer(n) = buffer(n)
    Next n
    Dim noiseThreshold = frames / 1000.0F
    'Debug.WriteLine(String.Format("Max Corr: {0} ({1}), Sec Corr: {2} ({3})", Me.sampleRate / maxLag, maxCorr, Me.sampleRate / secLag, secCor))
    If maxCorr < noiseThreshold OrElse maxLag = 0 Then
        Return 0.0F
    End If
    'Return 44100.0f / secLag '--works better for singing
    Return Me.sampleRate / maxLag
End Function

A few things to notice:

  1. Notice that the audio comes in as an array of floating point numbers. NAudio performs this conversion from 16 bit audio for us by using Wave16ToFloatProvider
  2. We store the previous buffer. This allows us to look backwards for correlation
  3. We then work through each and every possible integer offset within our range, and calculate a correlation value
  4. The correlation is calculated as the sum of squares
  5. If it is the largest so far, we store the “lag” value (i.e. number of samples back that we correlated with)
  6. Notice that we return 0 (i.e. no frequency detected) if we don't find a strong frequency. This noise threshold may need to be tweaked depending on your input
  7. Finally, we convert into a frequency with the formula sampleRate / maxLag

I wrote some unit tests to measure the accuracy of detection with sine waves (which admittedly are the easiest to detect). Here are the results for audio sampled at 44.1kHz:

Test Frequency

Detected Pitch





































Notice that the detected frequencies from the final two tests are actually half the correct amount. This doesn't actually matter for our purposes, since this just means the frequency has been detected as one octave below the correct note.

To improve on the accuracy of the autocorrelator's results, there are a couple of things you can do:

  • Run a band-pass filter before-hand, to remove any frequencies outside of the desired range
  • Combine the results with those obtained from a different technique, such as counting zero-crossings

Pitch Detection with the Fast Fourier Transform

I decided to implement an alternative pitch detection algorithm to see if I could get better results. A different approach is to use the Fast Fourier Transform, which converts signals from the “time domain” into the “frequency domain.”

The basic approach is to take a block of samples (which must be a power of 2 – e.g. 1024), and run the FFT on them. The FFT takes complex numbers as inputs, which for audio signals are entirely real. The implementation I am using expects real and complex parts interleaved for the input buffer. Here's our code setting up fftBuffer with interleaved samples:


private float[] fftBuffer;
private float[] prevBuffer;

public float DetectPitch(float[] buffer, int inFrames)
    Func<int, int, float> window = HammingWindow;
    if (prevBuffer == null)
        prevBuffer = new float[inFrames];
    // double frames since we are combining present and previous buffers
    int frames = inFrames * 2;
    if (fftBuffer == null)
        fftBuffer = new float[frames * 2]; // times 2 because it is complex input
    for (int n = 0; n < frames; n++)
        if (n < inFrames)
            fftBuffer[n * 2] = prevBuffer[n] * window(n, frames);
            fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
            fftBuffer[n * 2] = buffer[n-inFrames] * window(n, frames);
            fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer


Private fftBuffer() As Single
Private prevBuffer() As Single

Public Function DetectPitch(ByVal buffer() As Single,
                            ByVal inFrames As Integer) As Single Implements IPitchDetector.DetectPitch
    Dim window As Func(Of Integer, Integer, Single) = AddressOf HammingWindow
    If prevBuffer Is Nothing Then
        prevBuffer = New Single(inFrames - 1) {}
    End If

    ' double frames since we are combining present and previous buffers
    Dim frames = inFrames * 2
    If fftBuffer Is Nothing Then
        fftBuffer = New Single(frames * 2 - 1) {} ' times 2 because it is complex input
    End If

    For n = 0 To frames - 1
        If n < inFrames Then
            fftBuffer(n * 2) = prevBuffer(n) * window(n, frames)
            fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer
            fftBuffer(n * 2) = buffer(n - inFrames) * window(n, frames)
            fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer
        End If
    Next n

Notice that we prepend the previous buffer we were passed. This is a common way of increasing the accuracy and resolution of an FFT by using overlapping windows, and can be further extended to store three previous buffers, allowing us to have 75% overlapping windows instead of just the 50% that we have in this example.

For better peak frequency detection, the signal that is passed into the FFT is best pre-processed with a “windowing” function. There are several to choose from, each with its own strengths and weaknesses. I used the Hamming window, which is a fairly common choice:


private float HammingWindow(int n, int N) 
    return 0.54f - 0.46f * (float)Math.Cos((2 * Math.PI * n) / (N - 1));


Private Function HammingWindow(ByVal n As Integer, ByVal _N As Integer) As Single
    Return 0.54F - 0.46F * CSng(Math.Cos((2 * Math.PI * n) / (_N - 1)))
End Function

The next step is to pass on our interleaved buffer to the FFT algorithm. I am using Stephan Bernsee's here, though there is an alternative implementation in NAudio that I could have used. Since the same function can be used for an inverse FFT, the -1 parameter means (rather counter-intuitively), do a forwards FFT. It processes the data in place, which is fine since we don't need to keep the contents of the input buffer:


// assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1);


' assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1)

Once we have completed the FFT, we are ready to interpret its output. The output of the FFT consists of complex numbers (again real followed by imaginary in our buffer), which represent frequency “bins.”

We start off by calculating the bin size and working out which bins correspond to the range of frequencies we are interested in detecting:


float binSize = sampleRate / frames;
int minBin = (int)(85 / binSize);
int maxBin = (int)(300 / binSize);


Dim binSize = sampleRate / frames
Dim minBin = CInt(Fix(85 / binSize))
Dim maxBin = CInt(Fix(300 / binSize))

For example, if our sample rate is 44.1kHz and we analyse a block of 1024 samples, then each bin represents 43Hz, which is hardly the granularity we are looking for. To increase resolution, our options are to either sample at a higher rate or analyse a bigger chunk. Our approach is to use overlapping blocks of 8192 samples, as we read 4096 samples each time. This means we have a resolution of around 5Hz, which is much more acceptable.

Now we can calculate the magnitude or “intensity” for each frequency by calculating the sum of squares (strictly we should then take the square root, but we don't need to since we are just looking for the largest value):


float maxIntensity = 0f;
int maxBinIndex = 0;

for (int bin = minBin; bin <= maxBin; bin++)
    float real = fftBuffer[bin * 2];
    float imaginary = fftBuffer[bin * 2 + 1];
    float intensity = real * real + imaginary * imaginary;
    if (intensity > maxIntensity)
        maxIntensity = intensity;
        maxBinIndex = bin;


Dim maxIntensity = 0.0F
Dim maxBinIndex = 0
For bin = minBin To maxBin
    Dim real = fftBuffer(bin * 2)
    Dim imaginary = fftBuffer(bin * 2 + 1)
    Dim intensity = real * real + imaginary * imaginary
    If intensity > maxIntensity Then
        maxIntensity = intensity
        maxBinIndex = bin
    End If
Next bin

Since we have identified the bin with the maximum intensity, we can calculate the detected frequency:


return binSize * maxBinIndex;


Return binSize * maxBinIndex

I don't currently specify a minimum threshold for maxIntensity, but perhaps if it were very low, the FFT pitch detector would return zero to indicate no pitch detected instead of returning an answer that is probably not accurate.

Let's have a look at how the FFT pitch detector does:

Test Frequency

Detected Pitch





































As can be seen, it correctly picks out the primary frequencies of the higher notes, but overall it doesn't get that much closer than the autocorrelator, so I've left that as the default algorithm. You, however, can swap in the FFT detector in the code if it works better for the material you are auto-tuning.

There are ways of using the phase information from the FFT output to increase the accuracy of pitch detection even further, but I have left that as an exercise for the reader!

Pitch Shifting

The next step is to determine how much we will shift the pitch. The simplest way to do this is to look for the musical pitch that is closest to the detected pitch. Then, the amount of the shift by is simply the ratio of those two notes.

There are, however, some additional considerations. First, we may want to select a subset of musical notes that are acceptable. For example, only notes in the key of C#, or maybe F# minor pentatonic. This may require a slightly more radical adjustment.

Second, depending on the effect we are after, we may not want to instantaneously jump to the new frequency. The code I am using utilizes a fairly rudimentary “attack” time parameter, allowing you to gradually move to the new frequency.

The actual DSP for the pitch-shifting effect is more or less untouched from Stephan Bernsee's code, and this is because it works really well. The Bernsee's code makes use of the Fast Fourier Transform, plus a bunch of clever mathematics, which I almost understand, but not quite well enough to try and explain here! You're better off reading an article in which the man himself explains how it works.

The class that manages the pitch-shifting algorithm is called SmbPitchShifer and inherits from a PitchShifter base class. This does the bulk of its work in the ShiftPitch function:


public void ShiftPitch(float[] inputBuff, float inputPitch,
                       float targetPitch, float[] outputBuff, int nFrames)
     detectedPitch = inputPitch;


Public Sub ShiftPitch(ByVal inputBuff() As Single, ByVal inputPitch As Single,
                    ByVal targetPitch As Single, ByVal outputBuff() As Single, ByVal nFrames As Integer)
    detectedPitch = inputPitch

The inputPitch parameter is set to the frequency detected by the PitchDetector. The targetPitch parameter is currently unused, but will be used to specify the target pitch in real-time when accepting input from, say, a MIDI keyboard. In any case, we call UpdateSettings in order to see if any of the autotune algorithm settings have changed since last time.

Next we calculate the amount we need to shift the pitch shift. A shift factor of 1 means no change. We don't allow the shift factor to go above 2 or below 0.5, since these figures represent a whole octave change:


float shiftFactor = 1.0f;

if (inputPitch > 0)
    shiftFactor = snapFactor(inputPitch);

if (shiftFactor > 2.0) shiftFactor = 2.0f;
if (shiftFactor < 0.5) shiftFactor = 0.5f;


Dim shiftFactor = 1.0F

If inputPitch > 0 Then
   shiftFactor = snapFactor(inputPitch)
   shiftFactor += addVibrato(nFrames)
End If

If shiftFactor > 2.0 Then
   shiftFactor = 2.0F
End If
If shiftFactor < 0.5 Then
   shiftFactor = 0.5F
End If

The decision of what the target note is takes place in the snapFactor function:


protected float snapFactor(float freq)
    float previousFrequency = 0.0f;
    float correctedFrequency = 0.0f;
    int previousNote = 0;
    int correctedNote = 0;
    for (int i = 1; i < 120; i++)
        bool endLoop = false;
        foreach (int note in this.settings.AutoPitches)
            if (i % 12 == note)
                previousFrequency = correctedFrequency;
                previousNote = correctedNote;
                correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)i));
                correctedNote = i;
                if (correctedFrequency > freq) { endLoop = true; }
        if (endLoop)
    if (correctedFrequency == 0.0) { return 1.0f; }
    int destinationNote = 0;
    double destinationFrequency = 0.0;
    // decide whether we are shifting up or down
    if (correctedFrequency - freq > freq - previousFrequency)
        destinationNote = previousNote;
        destinationFrequency = previousFrequency;
        destinationNote = correctedNote;
        destinationFrequency = correctedFrequency;
    if (destinationNote != currPitch)
        numElapsed = 0;
        currPitch = destinationNote;
    if (attack > numElapsed)
        double n = (destinationFrequency - freq) / attack * numElapsed;
        destinationFrequency = freq + n;
    return (float)(destinationFrequency / freq);


Protected Function snapFactor(ByVal freq As Single) As Single
    Dim previousFrequency = 0.0F
    Dim correctedFrequency = 0.0F
    Dim previousNote = 0
    Dim correctedNote = 0
    For i = 1 To 119
        Dim endLoop = False
        For Each note As Integer In Me.settings.AutoPitches
            If i Mod 12 = note Then
                previousFrequency = correctedFrequency
                previousNote = correctedNote
                correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i)))
                correctedNote = i
                If correctedFrequency > freq Then
                    endLoop = True
                End If
                Exit For
            End If
        Next note
        If endLoop Then
            Exit For
        End If
    Next i
    If correctedFrequency = 0.0 Then
        Return 1.0f
    End If
    Dim destinationNote = 0
    Dim destinationFrequency = 0.0
    ' decide whether we are shifting up or down
    If correctedFrequency - freq > freq - previousFrequency Then
        destinationNote = previousNote
        destinationFrequency = previousFrequency
        destinationNote = correctedNote
        destinationFrequency = correctedFrequency
    End If
    If destinationNote <> currPitch Then
        numElapsed = 0
        currPitch = destinationNote
    End If
    If attack > numElapsed Then
        Dim n = (destinationFrequency - freq) / attack * numElapsed
        destinationFrequency = freq + n
    End If
    numElapsed += 1
    Return CSng(destinationFrequency / freq)
End Function

The way this function works is that it runs through the MIDI notes 0-120 and, if that note is selected as one of the valid pitches we support, we remember the “corrected frequency,” which can be calculated from the MIDI note number with the following formula:


correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)midiNoteNumber));


correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i)))

Obviously, a pitch is likely to fall somewhere in between two valid notes, so we choose the which pitch to correct by determining which one is closest to the detected frequency.

The snapFactor function is also responsible for implementing the attack time parameter. This allows the destinationFrequency to be slowly moved to the target note over the duration of the attack period. Having calculated our shift factor, we are now ready to pass our data on to the actual pitch-shifting algorithm:


int fftFrameSize = 2048;
int osamp = 8; // 32 is best quality
SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, this.sampleRate, inputBuff, outputBuff);


Dim fftFrameSize = 2048
Dim osamp = 8 ' 32 is best quality
SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, Me.sampleRate, inputBuff, outputBuff)

The final thing we do in the ShiftPitch function is keep a record of the pitch shifts we have made. These are stored in a queue (maximum of 5000 entries) and are very useful for diagnosing what is going on if you are not getting the results you wanted from the algorithm:


shiftedPitch = inputPitch * shiftFactor;
updateShifts(detectedPitch, shiftedPitch, this.currPitch);


shiftedPitch = inputPitch * shiftFactor
updateShifts(detectedPitch, shiftedPitch, Me.currPitch)


Performance, as you might expect in a managed application that has not been extensively optimized, was not good. Using my laptop, I could autotune one minute of audio in about 90 seconds. Obviously, that rules out real-time autotuning. I decided to profile the application to see if there were any quick ways I could improve things.

The profiling tools in Visual Studio revealed that 20% of the time was spent on pitch detection and 80% pitch shifting. Unfortunately, there were not too many options available for optimisation, since further investigation pointed to calls to Math.Sin taking the bulk of the time. Possibly creating lookup tables could save a bit more time.

Fortunately, we have another option for speeding things up. The pitch-shifting algorithm takes an “oversampling” parameter, which by default is set to 32, the highest value. However, we can trade off speed for quality. Setting it to 16 meant that I could autotune a minute of audio in 55ms (on my 2.4GHz Core2Duo laptop) – realtime but only just. Setting it to 8 reduced that down to 36s. The results still sounded reasonable, so I have left it set at 8 in the code.

An alternative way of speeding it up though would be to swap in a different pitch-shifting algorithm. You could start by trying out one I created as part of the Skype Voice Changer project previously featured on Coding4Fun, which is also able to operate in real-time (although I haven't done any quality comparisons).

Creating a Test GUI

Rather than starting from scratch, I decided to build upon .NET Voice Recorder, a WPF application I created for a previous Coding4Fun article. This takes advantage of the NAudio .NET audio library for audio recording and playback capabilities. The GUI has three screens. On the first is the input device used for recording. The second records a short voice clip. And the third allows you to edit a small portion of saved audio.

Here's a screenshot of the second screen showing a recording in progress:


And here's the screen that allows you to trim the recording, preview it, and save it as WAV:


As you can see, I have added a new button allowing access to the autotune effect settings. On this screen, you can select which notes are valid, and you can also adjust the “attack time” if you prefer to not go for the robotic effect. I've included a drop-down menu that automatically selects the appropriate notes from various keys.


When you click “Apply,” the autotune effect is applied (while you wait on a background thread) and then you are returned to the screen, allowing you to play back your recording and see how it sounds. If you'd like, you can then go back and change the autotune settings (or turn it off).

MVVM Light

The original VoiceRecorder application used a MVVM (model-view-viewmodel) architecture for binding data to each view. I have updated it to make use of Laurent Bugnion's excellent MVVM Light library. This removes the need for my own RelayCommand and ViewModelBase classes, and also enables me to replace my ViewManager with a more extensible framework using the event aggregator (“Messenger”) that is included with MVVM light. This allows me to quickly navigate from one view to another by sending out a message on the event aggregator:


private void NavigateToSaveView()
   Messenger.Default.Send(new NavigateMessage(SaveViewModel.ViewName, this.voiceRecorderState));


Private Sub NavigateToSaveView()
    Messenger.Default.Send(New NavigateMessage(SaveViewModel.ViewName, Me.voiceRecorderState))
End Sub

Getting the Best out of Autotune

Unfortunately, Autotune is an effect that doesn't always produce the desired result . Obviously, if you want great autotune, you're best off buying a commercial implementation, but here are a few tips for getting the most out of an autotune algorithm:

  • Get a good quality recording. Avoid background noise, hum, and too quiet or too loud (distorted) recordings. If you need to sing against a backing track, play it in headphones so the microphone doesn't pick it up.
  • Know what key you are singing in. This is where the test application won't help you out, since if you don't know what key you are singing in, you can hardly expect to be able to select the appropriate key from the list. You might even be singing in between two keys! If you have an in-tune musical instrument at hand, play a note to give yourself a starting pitch.
  • Choose your scale. The easiest option is to just go with a chromatic scale, which means all 12 notes are valid. However, you can try to make the autotune force you into a monotone. Pentatonic scales are useful for instant gratification. They have five notes in them, and so long as you stick to those five, almost anything you sing will fit in with a backing track in your chosen key.
  • Adjust the attack time. An attack time of zero is great for the robot effect. A longer attack time will smooth out transitions.
  • Why is it “warbling”? With this autotune algorithm, it is quite common to get a “warbling” effect. This is because it is either changing its mind about what note to pitch shift to (because the pitch detector is not providing stable pitch detection), or because the pitch detector couldn't determine a pitch at all, so your voice isn't being pitch shifted. You can play around with the release slider (and may need to modify the release algorithm) if you want to eliminate warbling.

Taking it further

.NET Voice Recorder is open source and hosted on CodePlex in a Mercurial repository. So what you waiting for? Make a fork and have a go at improving it:

  • Improve the pitch detector algorithm. I have already given some suggestions for how this can be done. One idea that might be worth experimenting with is making it “hold” the detected frequency for a short period until a strong, new frequency is detected.
  • Display the detected pitches. The autotune effect stores data of the detected pitches as well as the pitches it attempts to convert to. You might display this information to the user, perhaps underneath the waveform, so they can see what it is detecting. (It currently outputs this information with Debug.WriteLine, which is useful for debugging purposes).
  • Suggest an appropriate scale? Instead of leaving it to the user to select what key to snap the notes to, how about auto-selecting the detected notes?
  • Allow direct input of desired pitch. Instead of letting the autotune effect try to work out what note it should be targeting at any given time, it is far more effective to let the user input what note they would like to shift to. This could be done by entering the note using a MIDI keyboard in real-time, or by drawing the notes in, perhaps on a “piano-roll” control. Or, users could implement a simple, domain-specific language to specify the desired note. For example:
    • 0:00.0 C#
    • 0:01.5 E
    • 0:02.7 G#
  • Port it to Windows Phone 7. This would be quite a cool effect to have on your phone. Of course, you might need to optimize performance a bit more to save battery life. And you'll want to put your graphic designer hat on to give it a more beautiful look than it currently has.

About the Author

Mark Heath is the author of several open source .NET applications and libraries, including NAudio and the Skype Voice Changer. He works for NICE Systems, developing applications that search, display, and play back vast amounts of multimedia data. He has a blog, Sound Code, and you can follow him on his sporadically updated Twitter account.


Follow the discussion

  • Oops, something didn't work.

    Getting subscription
    Subscribe to this conversation
  • EricEric

    That's cool!  Smiley  

  • John DyerJohn Dyer

    Great work Mark. Now I just have to find a use for it.

  • Lauri KreutzwaldLauri Kreutzwald

    Nice work Mark.

  • RaviRavi

    Whoa awesome! Great to see Awesomebox being used in the wild Smiley

  • I have also implemented a pitch detection algorithm which is used to display a realtime pitch graph in a VST plugin (although VST uses an unmanaged API, my plugin is written in C# and uses reverse P/Invoke). This plugin is used as a visual guide to train a singer to sing in key (or for vocal exercises), and can also display a "grade" based on previously entered notes (or via a MIDI clip) that has to be hit throughout the song. My algorithm is loosely based on auto-correlation, but it is heavily modified to solve the following two problems with it:

    1. Auto-correlation is slow:
      Since you need to do a compare of all samples with all of the shifted samples, and repeat that once for every single frequency you want to detect, this quickly becomes slow.

      The way I solved this is by first down-sampling to reduce the overall number of samples to work with (with a filter to prevent anti-alias noise - also conveniently removes frequencies I don't care about), and then to have 3 passes, one each with low, medium, and high resolution.

      In the first pass, I only test a total of 5 samples that are spaced out within the two sample windows. You can quickly tell if there just isn't going to be correlation if the 5 samples are all different from the shifted window's equally spaced out samples. In the second pass, there are more samples and I use the detected frequency as a starting point for the final pass, and in the last step I use more samples.

    2. Auto-correlation is inaccurate, especially at higher frequencies:
      As you have clearly seen in your results, the higher the frequency, the less accurate it becomes. This is because a higher frequency waveform has fewer total samples per cycle, and you are stepping in whole sample values, so the frequency steps become courser the higher it is.

      To solve this, for the 3rd pass in my algorithm (at which point the search is centered around the frequency that was detected in the second pass), I use interpolation in order to compare samples that are not limited to whole numbers. So sample 0 will be at position 0.0, sample 1 will be at position 0.674, or whatever. This allows me to space the sample steps so that it is exactly at the frequency I want to detect during that pass, as opposed to being quantized into ever-more courser frequency steps. I use a 4-point, 3rd-order Hermite interpolator.

      Each high resolution pass is 1.005 times the previous frequency, so I don't use linear frequency steps. Also, once I found the two passes with the highest correlation, I interpolate between those two, so the final detected frequency is even higher resolution than the 3'rd pass' step size.

    This results in a very fast and accurate pitch detection algorithm. From my tests, the accuracy is within 0.1% of the input frequency, which IIRC, is about 50 times higher than what humans can distinguish.

  • marymary

    can you please give the code for pitch detection?  i have implemented fft algoritm, and jus need to detect the pitched at a high frequency. How can i choose the amplitude threshold to get the corect peaks? If the amplitue thresh is to high i get 0 peaks, if its; too low i  get incorect peaks

  • marymary

    @BitFlipper:do u have the pitch detection code? I implement FFT and need pitch detection algorithm. I need to detect all peaks from the spectrum. All  peaks at 18000hz frequency for example. I also work with amplitude threshold..but it is;s too high it doesn;t show me the peaks, if it;s too low it dowsnt show corectly

  • @mary:

    Unfortunately right now my code isn't quite fit for public release. I think there are some dependencies on parts of other code that I don't want to post and which isn't really pitch-related. For instance I have a DSP class that the pitch-correction algorithm uses, but most of that code is unrelated to it.

    If I have some time available I will clean it up and post it. Most likely before the end of the weekend.

  • OK I had some time to clean up my code. I am working on creating a CodePlex project in order to publish it. I also first want to create some sort of sample app in order to demonstrate the code in use (even though you need just three lines of code to instantiate and get your first pitch results back). Hopefully I will be done with it by this weekend.

  • @Christian Louboutin Bottes:

    Wow is that spam or some weird C9 bug? I can't tell.

  • It's Spam that's been through an auto-tune like comment modifier to make it look like a C9 bug.

  • TaniaTania

    Hi Mark,
    "Autotune.NET" is really very different topic. I am very excited to read this blog. Very interesting and useful.After reading your blog only i came to know about this technology in dotnet. Thanks for giving your knowledge to us. Your code also works well.

  • OK, I finally had some time to isolate my pitch tracker class and create a CodePlex project. I would be interested to find out whether you can use my pitch tracker in your project, and what the results are.

    From my tests my algorithm has an error of less that 0.02% over the frequency range of 55Hz to 1.5kHz. Accuracy is unaffected by amplitude, frequency or complexity of the waveform.

    Please let me know if you end up trying it out.

  • hit BitFlipper, looks interesting. Fancy submitting a patch to the VoiceRecorder project (http://voicerecorder.codeplex.com/) that uses your algorithm?

  • EugeneEugene

    Hi Mark, thanks for the code, it is rather interesting, but I can not understand one thing - in the code
    public float DetectPitch(float[] buffer, int frames){ if (prevBuffer == null) { prevBuffer = new float[frames]; } float maxCorr = 0; int maxLag = 0; for (int lag = maxOffset; lag >= minOffset; lag--) { float corr = 0; // sum of squares for (int i = 0; i < frames; i++) { int oldIndex = i - lag; float sample = ((oldIndex < 0) ? prevBuffer[frames +oldIndex]:prevBuffer[oldIndex]);  corr += (sample * buffer[i]); } ..........................................
    Here we first initialize the array -  prevBuffer = new float[frames] - and evidentially that all its members have zero as value. 
    And  then we have "sample = prevBuffer[frames +oldIndex] or sample=prevBuffer[oldIndex]", but as the prevBuffer has only zero values so sample will always have the same zero value.
    Could you explain this thing or maybe I am wrong?

  • WondeWonde

    Hey. It is nice lib. But I have a qns. How to record a voice for longer time ? say an hour or more.

  • @Wonde, to record for that length of time I would recommend storing the saved audio in a WAV file rather than the current implementation which keeps it in memory. Use the WaveFileWriter class.

  • @Eugene - look at the end of the function - the contents of the current buffer are copied across into prevBuffer

  • @Eric:Hi, it is a very nice introduction for recording voice on .net or from microphone. I am looking for a streaming audio recorder to record voice or sound that can be passed through on my computer audio device.

  • Asad SiddiqiAsad Siddiqi

    Hi there,
    I want to know if its possible (or is there an algorithm I could use ) to convert a byte array sound data and change the pitch, frequency ... I did it with SoundEffectInstance in XNA but there is no way I could save the stream that was passed in. I would greatly appreciate any help. This is some code that i did (I want to change the pitch of bStream). Thank you very much:

    if (states == PlayerStates.Ready || states == PlayerStates.Stopped)

    byte []bStream = stream.ToArray();

    sound = new SoundEffect(bStream, microphone.SampleRate, AudioChannels.Mono);
    SoundEffectInstance soundInstance = sound.CreateInstance();
    soundInstance.Pitch += 1;
    textBlock1.Text = "Now Playing";
    states = PlayerStates.Playing;

Remove this comment

Remove this thread


Comments closed

Comments have been closed since this content was published more than 30 days ago, but if you'd like to continue the conversation, please create a new thread in our Forums, or Contact Us and let us know.