Autotune.NET
- Posted: Jan 05, 2011 at 2:19 PM
- 58,256 Views
- 20 Comments
Loading User Information from Channel 9
Something went wrong getting user information from Channel 9
Loading User Information from MSDN
Something went wrong getting user information from MSDN
Loading Visual Studio Achievements
Something went wrong getting the Visual Studio Achievements
We've all cringed as a hopelessly out of tune contestant appears on the latest episode of “American Idol.” Occasionally, there's a contestant who manages to be pitch perfect all the way through—right until they flub the final note. And in the cutthroat world of televised auditions, sing one slightly flat note and you're out.
So what takes care of a bad-pitch day? Autotune—an effect that corrects the pitch of your voice so you'll never again sing out of tune. And now, with the power of modern microprocessors, autotune is possible in real-time, allowing singers to benefit from its almost magical powers during live concerts.
The company most famous for its autotune effect is Antares. Antares Auto-Tune currently retails for $249, and a stripped down version is available for $100. In addition to simply improving the pitch of a dodgy singer, autotune can be used to create unique robotic sounding vocal effects, a technique massively popular in recent years thanks to its use by artists such as T-Pain and the group behind the “Auto-Tune the News” YouTube videos. In 1998, when the effect was first used on Cher's “Believe” single, the producer used such extreme settings that instead of subtly adjusting the pitch, autotune “snapped” instantaneously to the nearest “correct” note.
Here is a nerdy example of what Autotune can do.
An autotune effect has two parts. The first is pitch detection, which calculates the dominant frequency of the incoming signal, and is the reason autotune is normally used on monophonic audio sources (i.e. playing one note at a time, not whole chords). So, if your guitar is out of tune, you're out of luck (Celemony's Melodyne product, however, features some incredible capabilities for pitch-shifting polyphonic audio).
The second stage is pitch shifting, or “correcting” a given note. However, the bigger the pitch shift required, the more artificial the end result will be, and it is worth noting that absolutely perfect pitch is not always desirable. Sometimes the blended notes resulting from vibrato, for example, are an important part of the performance, and eliminating them would be detrimental.
For this project, we will be creating an autotune effect for .NET. Serious recording enthusiasts should just go out and buy a decent autotune effect, but to have a little fun, we'll see if we can make an autotune that can give us a poor man's Cher effect (or T-Pain effect, if you prefer).
To get started, I searched to see if there were some pre-existing open source autotune implementations, which brought me to awesomebox, a project created by Ravi Parikh and Keegan Poppen while they were students at Stanford University. They kindly gave me permission to make use of their code, which uses an auto-correlator for pitch detection and an open source pitch-shifting algorithm from audio DSP expert Stephan M. Bernsee.
Although porting from C/C++ to C# is not exactly fun, there is enough similarity in the syntax that it is possible to complete without too many changes. You need to remember, however, that a long in C is an int in C# (i.e. 32 bits long not 64).
Additionally, the C# compiler is fussier than C/C++ when it comes to casting between floats, doubles, and ints. Putting the “f” suffix on numeric literals sorts out most of these compiler errors.
Pointers can be a pain. I tend to replace them with integer variables used to index into an array. You can of course use unsafe code, but that limits your options if you plan to port to Silverlight or Windows Phone 7 at a later date, neither of which allow unsafe code or interop into unmanaged code. The necessary mathematical functions are available in the System.Math class.
To see an example, compare this pitch shifting C++ source file with my C# conversion of it.
Interop wrappers for the NAudio Windows WaveIn APIs capture the audio. Here is the code used to start recording:
c#:
waveIn = new WaveIn(); waveIn.DeviceNumber = recordingDevice; waveIn.DataAvailable += waveIn_DataAvailable; waveIn.RecordingStopped += new EventHandler(waveIn_RecordingStopped); waveIn.WaveFormat = recordingFormat; waveIn.StartRecording();
VB.Net:
waveIn = New WaveIn waveIn.DeviceNumber = recordingDevice AddHandler waveIn.DataAvailable, AddressOf waveIn_DataAvailable AddHandler waveIn.RecordingStopped, AddressOf waveIn_RecordingStopped waveIn.WaveFormat = _recordingFormat waveIn.StartRecording()
The steps are:
Whenever the soundcard reports a new buffer of recorded audio, we receive it in the DataAvailable event handler:
c#:
void waveIn_DataAvailable(object sender, WaveInEventArgs e) { byte[] buffer = e.Buffer; int bytesRecorded = e.BytesRecorded; WriteToFile(buffer, bytesRecorded); for (int index = 0; index < e.BytesRecorded; index += 2) { short sample = (short)((buffer[index + 1] << 8) | buffer[index + 0]); float sample32 = sample / 32768f; sampleAggregator.Add(sample32); } }
VB.Net
Private Sub waveIn_DataAvailable(ByVal sender As Object, ByVal e As WaveInEventArgs) Dim buffer() = e.Buffer Dim bytesRecorded = e.BytesRecorded WriteToFile(buffer, bytesRecorded) For index = 0 To e.BytesRecorded - 1 Step 2 Dim sample = CShort(buffer(index + 1)) << 8 Or CShort(buffer(index + 0)) Dim sample32 = sample / 32768.0F _sampleAggregator.Add(sample32) Next index End Sub
The WaveInEventArgs contains the number of bytes recorded (e.BytesRecorded) and a pointer to the buffer containing those bytes (e.Buffer). The handler does two things with the recorded data. First, it calls WriteToFile, which uses the WaveFileWriter class from NAudio to write the data to disk:
c#:
// before we start recording, set up a WaveFileWriter... writer = new WaveFileWriter(waveFileName, recordingFormat); // ... every block we receive we write it to the WaveFileWriter: writer.WriteData(buffer, 0, bytesRecorded); // ... and when recording stops we must call Dispose to finalize the // .WAV file properly writer.Dispose()
VB.Net:
writer = New WaveFileWriter(waveFileName, _recordingFormat)
writer.WriteData(buffer, 0, bytesRecorded)
writer.Dispose()
Once recording has completed, we have a WAV file on which to perform our autotune effect. However, our WAV file consists of 16 bit samples (i.e. System.Int16 aka short). In other words, we have a sequence of byte pairs, each of which represent a number in the range -32768 to 32767. For the digital signal processing we will be performing, it's best to have a sequence of floating point numbers (System.Single or float) in the range -1.0f to 1.0f. This is a common requirement, so NAudio provides a utility class to convert audio from short to float called Wave16ToFloatProvider. Here's the code that takes a WAV file and implements the autotune algorithm on it:
c#:
public static void ApplyAutoTune(string fileToProcess, string tempFile, AutoTuneSettings autotuneSettings) { using (WaveFileReader reader = new WaveFileReader(fileToProcess)) { IWaveProvider stream32 = new Wave16toFloatProvider(reader); IWaveProvider streamEffect = new AutoTuneWaveProvider(stream32, autotuneSettings); IWaveProvider stream16 = new WaveFloatTo16Provider(streamEffect); using (WaveFileWriter converted = new WaveFileWriter(tempFile, stream16.WaveFormat)) { // buffer length needs to be a power of 2 for FFT to work nicely // however, make the buffer too long and pitches aren't detected fast enough // successful buffer sizes: 8192, 4096, 2048, 1024 // (some pitch detection algorithms need at least 2048) byte[] buffer = new byte[8192]; int bytesRead; do { bytesRead = stream16.Read(buffer, 0, buffer.Length); converted.WriteData(buffer, 0, bytesRead); } while (bytesRead != 0 && converted.Length < reader.Length); } } }
VB.Net
Public Shared Sub ApplyAutoTune(ByVal fileToProcess As String, ByVal tempFile As String, ByVal autotuneSettings As AutoTuneSettings) Using reader As New WaveFileReader(fileToProcess) Dim stream32 As IWaveProvider = New Wave16ToFloatProvider(reader) Dim streamEffect As IWaveProvider = New AutoTuneWaveProvider(stream32, autotuneSettings) Dim stream16 As IWaveProvider = New WaveFloatTo16Provider(streamEffect) Using converted As New WaveFileWriter(tempFile, stream16.WaveFormat) ' buffer length needs to be a power of 2 for FFT to work nicely ' however, make the buffer too long and pitches aren't detected fast enough ' successful buffer sizes: 8192, 4096, 2048, 1024 ' (some pitch detection algorithms need at least 2048) Dim buffer(8191) As Byte Dim bytesRead As Integer Do bytesRead = stream16.Read(buffer, 0, buffer.Length) converted.WriteData(buffer, 0, bytesRead) Loop While bytesRead <> 0 AndAlso converted.Length < reader.Length End Using End Using End Sub
Here's how it works:
As we saw in the last code snippet, the AutoTuneWaveProvider is the piece in our audio pipeline that actually performs the autotune effect. It implements the NAudio IWaveProvider interface, which allows it to be used in the pipeline for real-time playback if necessary, even though our example code is not doing this (see the section on performance later). Here's the AutoTuneWaveProvider constructor:
c#:
public AutoTuneWaveProvider(IWaveProvider source, AutoTuneSettings autoTuneSettings) { this.autoTuneSettings = autoTuneSettings; if (source.WaveFormat.SampleRate != 44100) throw new ArgumentException("AutoTune only works at 44.1kHz"); if (source.WaveFormat.Encoding != WaveFormatEncoding.IeeeFloat) throw new ArgumentException("AutoTune only works on IEEE floating point audio data"); if (source.WaveFormat.Channels != 1) throw new ArgumentException("AutoTune only works on mono input sources"); this.source = source; this.pitchDetector = new AutoCorrelator(source.WaveFormat.SampleRate); this.pitchShifter = new SmbPitchShifter(Settings); this.waveBuffer = new WaveBuffer(8192); }
VB.Net
Public Sub New(ByVal source As IWaveProvider, ByVal autoTuneSettings As AutoTuneSettings)
Me.autoTuneSettings = autoTuneSettings
If source.WaveFormat.SampleRate <> 44100 Then
Throw New ArgumentException("AutoTune only works at 44.1kHz")
End If
If source.WaveFormat.Encoding <> WaveFormatEncoding.IeeeFloat Then
Throw New ArgumentException("AutoTune only works on IEEE floating point audio data")
End If
If source.WaveFormat.Channels <> 1 Then
Throw New ArgumentException("AutoTune only works on mono input sources")
End If
Me.source = source
Me.pitchDetector = New AutoCorrelator(source.WaveFormat.SampleRate)
' alternative pitch detector:
' Me.pitchDetector = New FftPitchDetector(source.WaveFormat.SampleRate)
Me.pitchShifter = New SmbPitchShifter(Settings, source.WaveFormat.SampleRate)
Me.waveBuffer = New WaveBuffer(8192)
End Sub
Some points to notice:
The key method on any implementation of IWaveProvider is its Read method. This is where the audio consumer, usually the sound card or a WaveFileWriter, asks for data. The data must be supplied as a byte array, and if at all possible you should return exactly the number of bytes you were asked for (if you can't, an extra layer of buffering is usually required, or audio playback will be choppy). Here's our implementation of the Read method:
c#:
public int Read(byte[] buffer, int offset, int count) { if (waveBuffer == null || waveBuffer.MaxSize < count) { waveBuffer = new WaveBuffer(count); } int bytesRead = source.Read(waveBuffer, 0, count); // the last bit sometimes needs to be rounded up: if (bytesRead > 0) bytesRead = count; int frames = bytesRead / sizeof(float); float pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames); // an attempt to make it less "warbly" by holding onto the pitch // for at least one more buffer if (pitch == 0f && release < maxHold) { pitch = previousPitch; release++; } else { this.previousPitch = pitch; release = 0; } WaveBuffer outBuffer = new WaveBuffer(buffer); pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, 0.0f, outBuffer.FloatBuffer, frames); return frames * 4; }
VB.Net:
Public Function Read(ByVal buffer() As Byte, ByVal offset As Integer,
ByVal count As Integer) As Integer Implements NAudio.Wave.IWaveProvider.Read
If waveBuffer Is Nothing OrElse waveBuffer.MaxSize < count Then
waveBuffer = New WaveBuffer(count)
End If
Dim bytesRead = source.Read(waveBuffer, 0, count)
'Debug.Assert(bytesRead = count)
' the last bit sometimes needs to be rounded up:
If bytesRead > 0 Then
bytesRead = count
End If
'pitchsource->getPitches()
Dim frames = bytesRead \ Len(New Single) ' MRH: was count
Dim pitch = pitchDetector.DetectPitch(waveBuffer.FloatBuffer, frames)
' MRH: an attempt to make it less "warbly" by holding onto the pitch for at least one more buffer
If pitch = 0.0F AndAlso release < maxHold Then
pitch = previousPitch
release += 1
Else
Me.previousPitch = pitch
release = 0
End If
Dim midiNoteNumber = 40
Dim targetPitch = CSng(8.175 * Math.Pow(1.05946309, midiNoteNumber))
Dim outBuffer As New WaveBuffer(buffer)
pitchShifter.ShiftPitch(waveBuffer.FloatBuffer, pitch, targetPitch, outBuffer.FloatBuffer, frames)
Return frames * 4
End Function
Here's what's going on
Now that we've seen the big picture of the AutotuneWaveProvider, let's drill down into its two main components—the pitch detector and pitch shifter.
The pitch detection part of autotune is vital to getting good results. If it can't accurately detect the input pitch, it will incorrectly calculate how much the pitch needs to be adjusted. However, high quality pitch detection is quite difficult to get right. First of all, the microphone may well pick up background noise. Second, when you sing a into a microphone, the signal consists not only of a single frequency, but also “harmonics” at different frequencies.
The good news is that we need to detect only the primary pitch.
The awesomebox algorithm makes use of “autocorrelation” for its pitch detection, but I made a few small tweaks to how the algorithm is implemented in an attempt to improve its accuracy. Autocorrelation has the advantage of being a relatively quick process. The basic principle is that if a signal is periodic, it will “correlate” well with itself when shifted forward (or backwards) one cycle.
Let's say we are looking to see if the note “Middle C” is being sung. The frequency of Middle C is around 262Hz. If we are sampling at 44.1kHz (which is standard for CD quality audio), then we will expect the signal to repeat at approximately every 168 samples (44100/262). Accordingly, for every sample in the buffer, we calculate the sum of squares of that sample and the sample 168 samples previous. We do this for every possible offset that measures a frequency in the range we want to detect (I am using 85Hz to 300Hz, which is adequate for pitch detecting vocals). The offset with the highest score is the most likely frequency.
Let's have a look at the code for an autocorrelation algorithm, starting with the constructor for the AutoCorrelator class:
c#:
public AutoCorrelator(int sampleRate) { this.sampleRate = (float)sampleRate; int minFreq = 85; int maxFreq = 255; this.maxOffset = sampleRate / minFreq; this.minOffset = sampleRate / maxFreq; }
VB.Net
Public Sub New(ByVal sampleRate As Integer) Me.sampleRate = CSng(sampleRate) Dim minFreq = 85 Dim maxFreq = 255 Me.maxOffset = sampleRate \ minFreq Me.minOffset = sampleRate \ maxFreq End Sub
First of all, we pre-calculate some values based on the minimum and maximum frequencies we are looking for. Remember that lower frequencies are harder to detect than higher frequencies, so don't set minFreq too low. MaxOffset and MinOffset are the maximum and minimum backwards distances we will be seeking while looking for a match.
c#:
public float DetectPitch(float[] buffer, int frames) { if (prevBuffer == null) { prevBuffer = new float[frames]; } float maxCorr = 0; int maxLag = 0; // starting with low frequencies, working to higher for (int lag = maxOffset; lag >= minOffset; lag--) { float corr = 0; // sum of squares for (int i = 0; i < frames; i++) { int oldIndex = i - lag; float sample = ((oldIndex < 0) ? prevBuffer[frames + corr += (sample * buffer[i]); } if (corr > maxCorr) { maxCorr = corr; maxLag = lag; } } for (int n = 0; n < frames; n++) { prevBuffer[n] = buffer[n]; } float noiseThreshold = frames / 1000f; if (maxCorr < noiseThreshold || maxLag == 0) return 0.0f; return this.sampleRate / maxLag; }
VB.Net
Public Function DetectPitch(ByVal buffer() As Single, ByVal frames As Integer) As Single Implements IPitchDetector.DetectPitch If prevBuffer Is Nothing Then prevBuffer = New Single(frames - 1){} End If Dim secCor As Single = 0 Dim secLag = 0 Dim maxCorr As Single = 0 Dim maxLag = 0 ' starting with low frequencies, working to higher For lag = maxOffset To minOffset Step -1 Dim corr As Single = 0 ' this is calculated as the sum of squares For i = 0 To frames - 1 Dim oldIndex = i - lag Dim sample = (If(oldIndex < 0, prevBuffer(frames + oldIndex), buffer(oldIndex))) corr += (sample * buffer(i)) Next i If corr > maxCorr Then maxCorr = corr maxLag = lag End If If corr >= 0.9 * maxCorr Then secCor = corr secLag = lag End If Next lag For n = 0 To frames - 1 prevBuffer(n) = buffer(n) Next n Dim noiseThreshold = frames / 1000.0F 'Debug.WriteLine(String.Format("Max Corr: {0} ({1}), Sec Corr: {2} ({3})", Me.sampleRate / maxLag, maxCorr, Me.sampleRate / secLag, secCor)) If maxCorr < noiseThreshold OrElse maxLag = 0 Then Return 0.0F End If 'Return 44100.0f / secLag '--works better for singing Return Me.sampleRate / maxLag End Function
A few things to notice:
I wrote some unit tests to measure the accuracy of detection with sine waves (which admittedly are the easiest to detect). Here are the results for audio sampled at 44.1kHz:
|
Test Frequency |
Detected Pitch |
|
109.99Hz |
108.35Hz |
|
116.53Hz |
118.23Hz |
|
123.46Hz |
123.18Hz |
|
130.80Hz |
129.71Hz |
|
138.58Hz |
140.00Hz |
|
146.82Hz |
148.48Hz |
|
155.55Hz |
154.74Hz |
|
164.80Hz |
163.33Hz |
|
174.60Hz |
172.94Hz |
|
184.98Hz |
183.75Hz |
|
195.98Hz |
194.27Hz |
|
207.63Hz |
206.07Hz |
|
219.98Hz |
219.40Hz |
|
233.06Hz |
234.57Hz |
|
246.92Hz |
247.75Hz |
|
261.60Hz |
256.40Hz |
|
277.16Hz |
139.56Hz |
|
293.64Hz |
146.03Hz |
Notice that the detected frequencies from the final two tests are actually half the correct amount. This doesn't actually matter for our purposes, since this just means the frequency has been detected as one octave below the correct note.
To improve on the accuracy of the autocorrelator's results, there are a couple of things you can do:
I decided to implement an alternative pitch detection algorithm to see if I could get better results. A different approach is to use the Fast Fourier Transform, which converts signals from the “time domain” into the “frequency domain.”
The basic approach is to take a block of samples (which must be a power of 2 – e.g. 1024), and run the FFT on them. The FFT takes complex numbers as inputs, which for audio signals are entirely real. The implementation I am using expects real and complex parts interleaved for the input buffer. Here's our code setting up fftBuffer with interleaved samples:
c#:
private float[] fftBuffer; private float[] prevBuffer; public float DetectPitch(float[] buffer, int inFrames) { Func<int, int, float> window = HammingWindow; if (prevBuffer == null) { prevBuffer = new float[inFrames]; } // double frames since we are combining present and previous buffers int frames = inFrames * 2; if (fftBuffer == null) { fftBuffer = new float[frames * 2]; // times 2 because it is complex input } for (int n = 0; n < frames; n++) { if (n < inFrames) { fftBuffer[n * 2] = prevBuffer[n] * window(n, frames); fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer } else { fftBuffer[n * 2] = buffer[n-inFrames] * window(n, frames); fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer } }
VB.Net
Private fftBuffer() As Single Private prevBuffer() As Single Public Function DetectPitch(ByVal buffer() As Single, ByVal inFrames As Integer) As Single Implements IPitchDetector.DetectPitch Dim window As Func(Of Integer, Integer, Single) = AddressOf HammingWindow If prevBuffer Is Nothing Then prevBuffer = New Single(inFrames - 1) {} End If ' double frames since we are combining present and previous buffers Dim frames = inFrames * 2 If fftBuffer Is Nothing Then fftBuffer = New Single(frames * 2 - 1) {} ' times 2 because it is complex input End If For n = 0 To frames - 1 If n < inFrames Then fftBuffer(n * 2) = prevBuffer(n) * window(n, frames) fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer Else fftBuffer(n * 2) = buffer(n - inFrames) * window(n, frames) fftBuffer(n * 2 + 1) = 0 ' need to clear out as fft modifies buffer End If Next n
Notice that we prepend the previous buffer we were passed. This is a common way of increasing the accuracy and resolution of an FFT by using overlapping windows, and can be further extended to store three previous buffers, allowing us to have 75% overlapping windows instead of just the 50% that we have in this example.
For better peak frequency detection, the signal that is passed into the FFT is best pre-processed with a “windowing” function. There are several to choose from, each with its own strengths and weaknesses. I used the Hamming window, which is a fairly common choice:
c#:
private float HammingWindow(int n, int N) { return 0.54f - 0.46f * (float)Math.Cos((2 * Math.PI * n) / (N - 1)); }
VB.Net
Private Function HammingWindow(ByVal n As Integer, ByVal _N As Integer) As Single Return 0.54F - 0.46F * CSng(Math.Cos((2 * Math.PI * n) / (_N - 1))) End Function
The next step is to pass on our interleaved buffer to the FFT algorithm. I am using Stephan Bernsee's here, though there is an alternative implementation in NAudio that I could have used. Since the same function can be used for an inverse FFT, the -1 parameter means (rather counter-intuitively), do a forwards FFT. It processes the data in place, which is fine since we don't need to keep the contents of the input buffer:
c#:
// assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1);
VB.Net
' assuming frames is a power of 2
SmbPitchShift.smbFft(fftBuffer, frames, -1)
Once we have completed the FFT, we are ready to interpret its output. The output of the FFT consists of complex numbers (again real followed by imaginary in our buffer), which represent frequency “bins.”
We start off by calculating the bin size and working out which bins correspond to the range of frequencies we are interested in detecting:
c#:
float binSize = sampleRate / frames; int minBin = (int)(85 / binSize); int maxBin = (int)(300 / binSize);
VB.Net
Dim binSize = sampleRate / frames Dim minBin = CInt(Fix(85 / binSize)) Dim maxBin = CInt(Fix(300 / binSize))
For example, if our sample rate is 44.1kHz and we analyse a block of 1024 samples, then each bin represents 43Hz, which is hardly the granularity we are looking for. To increase resolution, our options are to either sample at a higher rate or analyse a bigger chunk. Our approach is to use overlapping blocks of 8192 samples, as we read 4096 samples each time. This means we have a resolution of around 5Hz, which is much more acceptable.
Now we can calculate the magnitude or “intensity” for each frequency by calculating the sum of squares (strictly we should then take the square root, but we don't need to since we are just looking for the largest value):
c#:
float maxIntensity = 0f; int maxBinIndex = 0; for (int bin = minBin; bin <= maxBin; bin++) { float real = fftBuffer[bin * 2]; float imaginary = fftBuffer[bin * 2 + 1]; float intensity = real * real + imaginary * imaginary; if (intensity > maxIntensity) { maxIntensity = intensity; maxBinIndex = bin; } }
VB.Net
Dim maxIntensity = 0.0F Dim maxBinIndex = 0 For bin = minBin To maxBin Dim real = fftBuffer(bin * 2) Dim imaginary = fftBuffer(bin * 2 + 1) Dim intensity = real * real + imaginary * imaginary If intensity > maxIntensity Then maxIntensity = intensity maxBinIndex = bin End If Next bin
Since we have identified the bin with the maximum intensity, we can calculate the detected frequency:
c#:
return binSize * maxBinIndex;
VB.Net
Return binSize * maxBinIndex
I don't currently specify a minimum threshold for maxIntensity, but perhaps if it were very low, the FFT pitch detector would return zero to indicate no pitch detected instead of returning an answer that is probably not accurate.
Let's have a look at how the FFT pitch detector does:
|
Test Frequency |
Detected Pitch |
|
109.99Hz |
107.67Hz |
|
116.53Hz |
118.43Hz |
|
123.46Hz |
123.82Hz |
|
130.80Hz |
129.20Hz |
|
138.58Hz |
139.97Hz |
|
146.82Hz |
145.35Hz |
|
155.55Hz |
156.12Hz |
|
164.80Hz |
166.88Hz |
|
174.60Hz |
172.27Hz |
|
184.98Hz |
183.03Hz |
|
195.98Hz |
193.80Hz |
|
207.63Hz |
209.95Hz |
|
219.98Hz |
220.72Hz |
|
233.06Hz |
231.48Hz |
|
246.92Hz |
247.63Hz |
|
261.60Hz |
263.78Hz |
|
277.16Hz |
274.55Hz |
|
293.64Hz |
296.08Hz |
As can be seen, it correctly picks out the primary frequencies of the higher notes, but overall it doesn't get that much closer than the autocorrelator, so I've left that as the default algorithm. You, however, can swap in the FFT detector in the code if it works better for the material you are auto-tuning.
There are ways of using the phase information from the FFT output to increase the accuracy of pitch detection even further, but I have left that as an exercise for the reader!
The next step is to determine how much we will shift the pitch. The simplest way to do this is to look for the musical pitch that is closest to the detected pitch. Then, the amount of the shift by is simply the ratio of those two notes.
There are, however, some additional considerations. First, we may want to select a subset of musical notes that are acceptable. For example, only notes in the key of C#, or maybe F# minor pentatonic. This may require a slightly more radical adjustment.
Second, depending on the effect we are after, we may not want to instantaneously jump to the new frequency. The code I am using utilizes a fairly rudimentary “attack” time parameter, allowing you to gradually move to the new frequency.
The actual DSP for the pitch-shifting effect is more or less untouched from Stephan Bernsee's code, and this is because it works really well. The Bernsee's code makes use of the Fast Fourier Transform, plus a bunch of clever mathematics, which I almost understand, but not quite well enough to try and explain here! You're better off reading an article in which the man himself explains how it works.
The class that manages the pitch-shifting algorithm is called SmbPitchShifer and inherits from a PitchShifter base class. This does the bulk of its work in the ShiftPitch function:
c#:
public void ShiftPitch(float[] inputBuff, float inputPitch, float targetPitch, float[] outputBuff, int nFrames) { UpdateSettings(); detectedPitch = inputPitch;
VB.Net
Public Sub ShiftPitch(ByVal inputBuff() As Single, ByVal inputPitch As Single, ByVal targetPitch As Single, ByVal outputBuff() As Single, ByVal nFrames As Integer) UpdateSettings() detectedPitch = inputPitch
The inputPitch parameter is set to the frequency detected by the PitchDetector. The targetPitch parameter is currently unused, but will be used to specify the target pitch in real-time when accepting input from, say, a MIDI keyboard. In any case, we call UpdateSettings in order to see if any of the autotune algorithm settings have changed since last time.
Next we calculate the amount we need to shift the pitch shift. A shift factor of 1 means no change. We don't allow the shift factor to go above 2 or below 0.5, since these figures represent a whole octave change:
c#:
float shiftFactor = 1.0f; if (inputPitch > 0) { shiftFactor = snapFactor(inputPitch); } if (shiftFactor > 2.0) shiftFactor = 2.0f; if (shiftFactor < 0.5) shiftFactor = 0.5f;
VB.Net
Dim shiftFactor = 1.0F If inputPitch > 0 Then shiftFactor = snapFactor(inputPitch) shiftFactor += addVibrato(nFrames) End If If shiftFactor > 2.0 Then shiftFactor = 2.0F End If If shiftFactor < 0.5 Then shiftFactor = 0.5F End If
The decision of what the target note is takes place in the snapFactor function:
c#:
protected float snapFactor(float freq) { float previousFrequency = 0.0f; float correctedFrequency = 0.0f; int previousNote = 0; int correctedNote = 0; for (int i = 1; i < 120; i++) { bool endLoop = false; foreach (int note in this.settings.AutoPitches) { if (i % 12 == note) { previousFrequency = correctedFrequency; previousNote = correctedNote; correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)i)); correctedNote = i; if (correctedFrequency > freq) { endLoop = true; } break; } } if (endLoop) { break; } } if (correctedFrequency == 0.0) { return 1.0f; } int destinationNote = 0; double destinationFrequency = 0.0; // decide whether we are shifting up or down if (correctedFrequency - freq > freq - previousFrequency) { destinationNote = previousNote; destinationFrequency = previousFrequency; } else { destinationNote = correctedNote; destinationFrequency = correctedFrequency; } if (destinationNote != currPitch) { numElapsed = 0; currPitch = destinationNote; } if (attack > numElapsed) { double n = (destinationFrequency - freq) / attack * numElapsed; destinationFrequency = freq + n; } numElapsed++; return (float)(destinationFrequency / freq); }
VB.Net:
Protected Function snapFactor(ByVal freq As Single) As Single Dim previousFrequency = 0.0F Dim correctedFrequency = 0.0F Dim previousNote = 0 Dim correctedNote = 0 For i = 1 To 119 Dim endLoop = False For Each note As Integer In Me.settings.AutoPitches If i Mod 12 = note Then previousFrequency = correctedFrequency previousNote = correctedNote correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i))) correctedNote = i If correctedFrequency > freq Then endLoop = True End If Exit For End If Next note If endLoop Then Exit For End If Next i If correctedFrequency = 0.0 Then Return 1.0f End If Dim destinationNote = 0 Dim destinationFrequency = 0.0 ' decide whether we are shifting up or down If correctedFrequency - freq > freq - previousFrequency Then destinationNote = previousNote destinationFrequency = previousFrequency Else destinationNote = correctedNote destinationFrequency = correctedFrequency End If If destinationNote <> currPitch Then numElapsed = 0 currPitch = destinationNote End If If attack > numElapsed Then Dim n = (destinationFrequency - freq) / attack * numElapsed destinationFrequency = freq + n End If numElapsed += 1 Return CSng(destinationFrequency / freq) End Function
The way this function works is that it runs through the MIDI notes 0-120 and, if that note is selected as one of the valid pitches we support, we remember the “corrected frequency,” which can be calculated from the MIDI note number with the following formula:
c#:
correctedFrequency = (float)(8.175 * Math.Pow(1.05946309, (float)midiNoteNumber));
VB.Net
correctedFrequency = CSng(8.175 * Math.Pow(1.05946309, CSng(i)))
Obviously, a pitch is likely to fall somewhere in between two valid notes, so we choose the which pitch to correct by determining which one is closest to the detected frequency.
The snapFactor function is also responsible for implementing the attack time parameter. This allows the destinationFrequency to be slowly moved to the target note over the duration of the attack period. Having calculated our shift factor, we are now ready to pass our data on to the actual pitch-shifting algorithm:
c#:
int fftFrameSize = 2048; int osamp = 8; // 32 is best quality SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, this.sampleRate, inputBuff, outputBuff);
VB.Net
Dim fftFrameSize = 2048 Dim osamp = 8 ' 32 is best quality SmbPitchShift.smbPitchShift(shiftFactor, nFrames, fftFrameSize, osamp, Me.sampleRate, inputBuff, outputBuff)
The final thing we do in the ShiftPitch function is keep a record of the pitch shifts we have made. These are stored in a queue (maximum of 5000 entries) and are very useful for diagnosing what is going on if you are not getting the results you wanted from the algorithm:
c#:
shiftedPitch = inputPitch * shiftFactor;
updateShifts(detectedPitch, shiftedPitch, this.currPitch);
VB.Net
shiftedPitch = inputPitch * shiftFactor
updateShifts(detectedPitch, shiftedPitch, Me.currPitch)
Performance, as you might expect in a managed application that has not been extensively optimized, was not good. Using my laptop, I could autotune one minute of audio in about 90 seconds. Obviously, that rules out real-time autotuning. I decided to profile the application to see if there were any quick ways I could improve things.
The profiling tools in Visual Studio revealed that 20% of the time was spent on pitch detection and 80% pitch shifting. Unfortunately, there were not too many options available for optimisation, since further investigation pointed to calls to Math.Sin taking the bulk of the time. Possibly creating lookup tables could save a bit more time.
Fortunately, we have another option for speeding things up. The pitch-shifting algorithm takes an “oversampling” parameter, which by default is set to 32, the highest value. However, we can trade off speed for quality. Setting it to 16 meant that I could autotune a minute of audio in 55ms (on my 2.4GHz Core2Duo laptop) – realtime but only just. Setting it to 8 reduced that down to 36s. The results still sounded reasonable, so I have left it set at 8 in the code.
An alternative way of speeding it up though would be to swap in a different pitch-shifting algorithm. You could start by trying out one I created as part of the Skype Voice Changer project previously featured on Coding4Fun, which is also able to operate in real-time (although I haven't done any quality comparisons).
Rather than starting from scratch, I decided to build upon .NET Voice Recorder, a WPF application I created for a previous Coding4Fun article. This takes advantage of the NAudio .NET audio library for audio recording and playback capabilities. The GUI has three screens. On the first is the input device used for recording. The second records a short voice clip. And the third allows you to edit a small portion of saved audio.
Here's a screenshot of the second screen showing a recording in progress:
And here's the screen that allows you to trim the recording, preview it, and save it as WAV:
As you can see, I have added a new button allowing access to the autotune effect settings. On this screen, you can select which notes are valid, and you can also adjust the “attack time” if you prefer to not go for the robotic effect. I've included a drop-down menu that automatically selects the appropriate notes from various keys.
When you click “Apply,” the autotune effect is applied (while you wait on a background thread) and then you are returned to the screen, allowing you to play back your recording and see how it sounds. If you'd like, you can then go back and change the autotune settings (or turn it off).
The original VoiceRecorder application used a MVVM (model-view-viewmodel) architecture for binding data to each view. I have updated it to make use of Laurent Bugnion's excellent MVVM Light library. This removes the need for my own RelayCommand and ViewModelBase classes, and also enables me to replace my ViewManager with a more extensible framework using the event aggregator (“Messenger”) that is included with MVVM light. This allows me to quickly navigate from one view to another by sending out a message on the event aggregator:
c#:
private void NavigateToSaveView() { Messenger.Default.Send(new NavigateMessage(SaveViewModel.ViewName, this.voiceRecorderState)); }
VB.Net
Private Sub NavigateToSaveView() Messenger.Default.Send(New NavigateMessage(SaveViewModel.ViewName, Me.voiceRecorderState)) End Sub
Unfortunately, Autotune is an effect that doesn't always produce the desired result . Obviously, if you want great autotune, you're best off buying a commercial implementation, but here are a few tips for getting the most out of an autotune algorithm:
.NET Voice Recorder is open source and hosted on CodePlex in a Mercurial repository. So what you waiting for? Make a fork and have a go at improving it:
Mark Heath is the author of several open source .NET applications and libraries, including NAudio and the Skype Voice Changer. He works for NICE Systems, developing applications that search, display, and play back vast amounts of multimedia data. He has a blog, Sound Code, and you can follow him on his sporadically updated Twitter account.
Comments have been closed since this content was published more than 30 days ago, but if you'd like to continue the conversation,
please create a new thread in our Forums,
or
Contact Us and let us know.
Follow the Discussion
Oops, something didn't work.
What does this mean?
Following an item on Channel 9 allows you to watch for new content and comments that you are interested in. You need to be signed in to Channel 9 to use this feature.What does this mean?
Following an item on Channel 9 allows you to watch for new content and comments that you are interested in and view them all on your notifications page.sign up for email notifications?
That's cool!
Great work Mark. Now I just have to find a use for it.
Nice work Mark.
Whoa awesome! Great to see Awesomebox being used in the wild
I have also implemented a pitch detection algorithm which is used to display a realtime pitch graph in a VST plugin (although VST uses an unmanaged API, my plugin is written in C# and uses reverse P/Invoke). This plugin is used as a visual guide to train a singer to sing in key (or for vocal exercises), and can also display a "grade" based on previously entered notes (or via a MIDI clip) that has to be hit throughout the song. My algorithm is loosely based on auto-correlation, but it is heavily modified to solve the following two problems with it:
Since you need to do a compare of all samples with all of the shifted samples, and repeat that once for every single frequency you want to detect, this quickly becomes slow.
The way I solved this is by first down-sampling to reduce the overall number of samples to work with (with a filter to prevent anti-alias noise - also conveniently removes frequencies I don't care about), and then to have 3 passes, one each with low, medium, and high resolution.
In the first pass, I only test a total of 5 samples that are spaced out within the two sample windows. You can quickly tell if there just isn't going to be correlation if the 5 samples are all different from the shifted window's equally spaced out samples. In the second pass, there are more samples and I use the detected frequency as a starting point for the final pass, and in the last step I use more samples.
As you have clearly seen in your results, the higher the frequency, the less accurate it becomes. This is because a higher frequency waveform has fewer total samples per cycle, and you are stepping in whole sample values, so the frequency steps become courser the higher it is.
To solve this, for the 3rd pass in my algorithm (at which point the search is centered around the frequency that was detected in the second pass), I use interpolation in order to compare samples that are not limited to whole numbers. So sample 0 will be at position 0.0, sample 1 will be at position 0.674, or whatever. This allows me to space the sample steps so that it is exactly at the frequency I want to detect during that pass, as opposed to being quantized into ever-more courser frequency steps. I use a 4-point, 3rd-order Hermite interpolator.
Each high resolution pass is 1.005 times the previous frequency, so I don't use linear frequency steps. Also, once I found the two passes with the highest correlation, I interpolate between those two, so the final detected frequency is even higher resolution than the 3'rd pass' step size.
This results in a very fast and accurate pitch detection algorithm. From my tests, the accuracy is within 0.1% of the input frequency, which IIRC, is about 50 times higher than what humans can distinguish.
can you please give the code for pitch detection? i have implemented fft algoritm, and jus need to detect the pitched at a high frequency. How can i choose the amplitude threshold to get the corect peaks? If the amplitue thresh is to high i get 0 peaks, if its; too low i get incorect peaks
@BitFlipper:do u have the pitch detection code? I implement FFT and need pitch detection algorithm. I need to detect all peaks from the spectrum. All peaks at 18000hz frequency for example. I also work with amplitude threshold..but it is;s too high it doesn;t show me the peaks, if it;s too low it dowsnt show corectly
@mary:
Unfortunately right now my code isn't quite fit for public release. I think there are some dependencies on parts of other code that I don't want to post and which isn't really pitch-related. For instance I have a DSP class that the pitch-correction algorithm uses, but most of that code is unrelated to it.
If I have some time available I will clean it up and post it. Most likely before the end of the weekend.
OK I had some time to clean up my code. I am working on creating a CodePlex project in order to publish it. I also first want to create some sort of sample app in order to demonstrate the code in use (even though you need just three lines of code to instantiate and get your first pitch results back). Hopefully I will be done with it by this weekend.
@Christian Louboutin Bottes:
Wow is that spam or some weird C9 bug? I can't tell.
It's Spam that's been through an auto-tune like comment modifier to make it look like a C9 bug.
Hi Mark,
"Autotune.NET" is really very different topic. I am very excited to read this blog. Very interesting and useful.After reading your blog only i came to know about this technology in dotnet. Thanks for giving your knowledge to us. Your code also works well.
http://gloriatech.com/microsoft-net-development-services.aspx
OK, I finally had some time to isolate my pitch tracker class and create a CodePlex project. I would be interested to find out whether you can use my pitch tracker in your project, and what the results are.
From my tests my algorithm has an error of less that 0.02% over the frequency range of 55Hz to 1.5kHz. Accuracy is unaffected by amplitude, frequency or complexity of the waveform.
Please let me know if you end up trying it out.
hit BitFlipper, looks interesting. Fancy submitting a patch to the VoiceRecorder project (http://voicerecorder.codeplex.com/) that uses your algorithm?
Hi Mark, thanks for the code, it is rather interesting, but I can not understand one thing - in the code
public float DetectPitch(float[] buffer, int frames){ if (prevBuffer == null) { prevBuffer = new float[frames]; } float maxCorr = 0; int maxLag = 0; for (int lag = maxOffset; lag >= minOffset; lag--) { float corr = 0; // sum of squares for (int i = 0; i < frames; i++) { int oldIndex = i - lag; float sample = ((oldIndex < 0) ? prevBuffer[frames +oldIndex]:prevBuffer[oldIndex]); corr += (sample * buffer[i]); } ..........................................
Here we first initialize the array - prevBuffer = new float[frames] - and evidentially that all its members have zero as value.
And then we have "sample = prevBuffer[frames +oldIndex] or sample=prevBuffer[oldIndex]", but as the prevBuffer has only zero values so sample will always have the same zero value.
Could you explain this thing or maybe I am wrong?
Hey. It is nice lib. But I have a qns. How to record a voice for longer time ? say an hour or more.
@Wonde, to record for that length of time I would recommend storing the saved audio in a WAV file rather than the current implementation which keeps it in memory. Use the WaveFileWriter class.
@Eugene - look at the end of the function - the contents of the current buffer are copied across into prevBuffer
@Eric:Hi, it is a very nice introduction for recording voice on .net or from microphone. I am looking for a streaming audio recorder to record voice or sound that can be passed through on my computer audio device.
Hi there,
I want to know if its possible (or is there an algorithm I could use ) to convert a byte array sound data and change the pitch, frequency ... I did it with SoundEffectInstance in XNA but there is no way I could save the stream that was passed in. I would greatly appreciate any help. This is some code that i did (I want to change the pitch of bStream). Thank you very much:
if (states == PlayerStates.Ready || states == PlayerStates.Stopped)
{
InitTimer();
byte []bStream = stream.ToArray();
sound = new SoundEffect(bStream, microphone.SampleRate, AudioChannels.Mono);
SoundEffectInstance soundInstance = sound.CreateInstance();
soundInstance.Pitch += 1;
soundInstance.Play();
textBlock1.Text = "Now Playing";
states = PlayerStates.Playing;
}
Remove this comment
Remove this thread
close