Noise Cancellation Using Fourier Transforms and Spectral Subtraction
During my sophomore year at Hopkins, I took a (somewhat difficult) class called Signals & Systems for ECE majors. Most of the class consisted of paper-and-pencil problems dealing with convolutions and Fourier transforms. However, at the end of the semester, we were assigned an interesting noise cancellation project to do in MATLAB.
We had to design a system that takes a noisy speech recording as input and outputs a denoised, better-quality signal.
Basically, we were given audio files like the one below and we were told to fix them in MATLAB.
Noisy Audio File
Note: This is a very different problem from other noise cancellation applications like active noise-cancelling headphones. Active noise-cancelling headphones use a microphone (sometimes several) that picks up the ambient noise outside the headphones. The headphones' electronics then invert the ambient noise signal (multiply it by -1) to create a signal that is 180 degrees out of phase with the ambient noise. By the principle of wave superposition, playing this out-of-phase signal through the speakers cancels out the ambient noise.
The problem at hand is not the same. If it were, this would be a very easy Physics 1 problem, we wouldn't need Fourier transforms, and the code would be like four lines long. Instead of getting two audio signals - one that is purely noise (like the microphone in the headphones would give us) and one that is noise plus speech/music - we were given a single audio file containing both the noise and the speech, so we needed a way to differentiate speech from noise before we could cancel out the noise frequencies.
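For contrast, if we did have a pure noise reference, the superposition trick really would be a few lines. A minimal sketch (in Python/NumPy, with a made-up ambient signal):

```python
import numpy as np

# Hypothetical ambient noise picked up by a headphone's outer microphone
t = np.linspace(0, 1, 8000, endpoint=False)
ambient = 0.1 * np.sin(2 * np.pi * 440 * t)

# Invert the signal (180 degrees out of phase) and superpose:
# destructive interference cancels the noise completely
anti_noise = -ambient
residual = ambient + anti_noise
```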
First, we take the input signal x(t) and split it into segments of length Tw. Then we need some method of determining whether a segment is just noise or noise plus speech. If the segment is just noise, call it nw(t) and compute the magnitude spectrum |Nw(ω)| of its Fourier transform. We use each new noise segment to update the running estimate of |Nw(ω)|, so the quality of the denoising improves as more audio is processed. If the segment is speech plus noise, call it sw(t) and compute both the magnitude |Sw(ω)| and the phase ∠Sw(ω) of its Fourier transform. Then subtract the noise magnitude spectrum from the speech magnitude spectrum to estimate the magnitude spectrum |Yw(ω)| of the denoised signal (we will come up with a better subtraction scheme later). Finally, combine the phase spectrum ∠Sw(ω) with the magnitude spectrum |Yw(ω)| and compute the inverse Fourier transform to get a denoised audio segment in the time domain, yw(t).
However, instead of simply subtracting the noise magnitude spectrum from the speech magnitude spectrum, we can weight the subtraction with two parameters (alpha and beta) to make the denoised audio sound better. By tuning alpha and beta, we can get rid of a whistling artifact (known as musical noise) that occurs if we simply subtract noise from speech (i.e. if alpha were 1 and beta were 0).
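One common way to implement the weighted subtraction is to let alpha scale how aggressively the noise is removed and to use beta as a spectral floor so magnitudes never go negative. This is my sketch in Python/NumPy, not the original MATLAB, and the exact expression in the assignment may differ:

```python
import numpy as np

def spectral_subtract(mag_speech, mag_noise, alpha=2.0, beta=0.08):
    """Weighted spectral subtraction with a spectral floor.

    alpha scales how much of the noise estimate is subtracted;
    beta keeps a small noise floor, which suppresses musical noise.
    """
    return np.maximum(mag_speech - alpha * mag_noise, beta * mag_noise)

# Plain subtraction (alpha=1, beta=0) clips bins to zero, causing musical noise
mag_s = np.array([5.0, 1.0, 3.0])
mag_n = np.array([1.0, 1.0, 1.0])
print(spectral_subtract(mag_s, mag_n, alpha=1.0, beta=0.0))  # [4. 0. 2.]
```

With beta > 0, bins that would have clipped to zero instead retain a small fraction of the noise floor, which sounds less artificial than isolated spectral holes.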
To simplify this project, we were allowed to assume that the audio files we were given were just noise for the first 1 second of each recording. If we weren't allowed to make this assumption, we would have to write a program that did a first pass over the audio clip to find a pause in speech and assume that the audio signal during the pause is just noise. Then the program would denoise the audio signal on the second pass.
Reading the audio file:
When calling the denoiser function, the user inputs the filename, the segment length Tw, and the weights alpha and beta. The program then reads in the audio file using audioread, storing the samples in a variable called X and the sample rate in another variable called Fs. After that, I initialized some other variables that will be used later on.
Note: In the original assignment for class, there were three parts of it which is why the function takes in a parameter called partNumber. Ignore that variable for now.
Recording the noise:
The program then enters a loop where it analyzes one segment of the audio signal at a time. We were given that the first second of each audio clip is pure noise. I decided that the best and simplest way to detect noise segments would be to look at the energy of each segment, and classify all segments below a certain energy as noise. The energy of a discrete signal is given by the simple formula below:
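E = Σₙ |x[n]|²

where x[n] are the samples of the segment and the sum runs over every sample in it.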
I decided to populate an array of the energies (called Ea) of the segments that fall within the first second of the audio clip. I also performed a fast Fourier transform on the first second of audio and averaged the magnitude spectra to create an estimate of the noise spectrum. From here, I needed to determine a threshold energy level to decide whether the segments the program analyzes later on are speech or noise. While I could just take the average or max of the noise energies, those options may distort the signal by classifying too much of it as noise. Instead, I created a percentile variable that can be fine-tuned, so that I can set the energy threshold at a specific percentile of the noise energies. I also calculate the average power of the noise, which is used for the signal-to-noise ratio (SNR) later on.
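The noise-profiling step can be sketched as follows. This is a Python/NumPy rendering of the approach, not the original MATLAB; the variable names and the stand-in recording are my own:

```python
import numpy as np

# Stand-in noisy recording: 2 seconds at 8 kHz, first second assumed pure noise
Fs, Tw = 8000, 0.05
rng = np.random.default_rng(0)
X = rng.normal(0, 0.05, Fs * 2)

seg_len = int(Tw * Fs)
first_second = X[:Fs]
segments = first_second[: Fs // seg_len * seg_len].reshape(-1, seg_len)

Ea = np.sum(segments ** 2, axis=1)                             # energy per noise segment
noise_mag = np.abs(np.fft.fft(segments, axis=1)).mean(axis=0)  # averaged noise spectrum

percentile = 0.9
threshold = np.quantile(Ea, percentile)  # energy threshold for noise vs. speech
avg_noise_power = Ea.mean() / Tw         # average noise power, used for the SNR
```

Using a tunable percentile of the noise energies, rather than their mean or max, lets the threshold be adjusted per recording.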
Denoising the audio file in the Fourier domain:
After that, the program iterates through the remaining segments of audio (audio after one second). It checks if that segment’s energy is above or below the threshold energy of noise. If the segment’s energy is less than the threshold, the program assumes that the signal is noise and then its magnitude spectrum is added to the array that holds the magnitude spectra of all noise segments.
If the segment’s energy is greater than the threshold, the program assumes the segment is speech, performs a Fourier transform on it, and does a spectral subtraction: it subtracts the magnitude spectrum of the average noise estimate from the magnitude spectrum of the speech segment. The result of the spectral subtraction is the set of magnitudes of the Fourier coefficients of the denoised signal. The program also extracts the phase of the original speech segment from its Fourier coefficients.
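The per-segment logic can be sketched like this (again a Python/NumPy rendering of the described MATLAB approach; `noise_mag` and `threshold` are assumed to come from the noise-profiling step, and the names are mine):

```python
import numpy as np

def process_segment(seg, noise_mag, threshold, alpha=2.0, beta=0.08):
    energy = np.sum(seg ** 2)
    spectrum = np.fft.fft(seg)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    if energy < threshold:
        # Noise segment: hand back its magnitude spectrum so the caller
        # can fold it into the running noise estimate
        return ("noise", mag)
    # Speech segment: spectral subtraction on magnitudes, original phase kept
    denoised_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
    return ("speech", denoised_mag * np.exp(1j * phase))
```

A loud segment comes back tagged as speech with its phase preserved, while a low-energy segment is tagged as noise so its spectrum can refresh the noise estimate.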
Going back into the time domain:
After this, I set up the denoised Fourier series by applying the phase of the original signal to the denoised magnitude spectrum. The last step is an inverse Fourier transform to bring the signal back into the time domain. After the program reconstructs the denoised signal, it loops back through it and uses the original noise threshold to classify the noise and speech segments of the denoised signal. The program then stores their energies to compute the average noise and signal energies, and from those the signal-to-noise ratio (SNR) (by way of average power, which is Eavg/Tw).
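The reconstruction amounts to an inverse FFT per segment followed by concatenation, and the Tw factor cancels in the SNR. A sketch under the same assumptions (names are mine, not from denoiser.m):

```python
import numpy as np

def reconstruct(denoised_spectra):
    """Inverse-FFT each denoised spectrum and stitch the segments together."""
    parts = [np.fft.ifft(spec).real for spec in denoised_spectra]
    return np.concatenate(parts)

def snr_from_energies(signal_energies, noise_energies, Tw):
    # Average power is energy / Tw; Tw cancels, leaving a ratio of mean energies
    p_signal = np.mean(signal_energies) / Tw
    p_noise = np.mean(noise_energies) / Tw
    return p_signal / p_noise
```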
Plotting the signals:
Then, the program subplots the original audio signal, the magnitude of the original audio signal Fourier series coefficients, the denoised audio signal, and the magnitude of the denoised audio signal's Fourier coefficients. The actual plots are in the results section below.
Outputting the denoised audio:
Then we write the denoised signal to a WAV file. This WAV file’s name is the name of the original file followed by “_DENOISED.” We find the original file’s name by calling the function fileparts.
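In Python terms (the original uses MATLAB's fileparts), the renaming step looks like:

```python
import os

def denoised_name(filename):
    # Split "speech.wav" into ("speech", ".wav") and rejoin with the suffix
    root, ext = os.path.splitext(filename)
    return root + "_DENOISED" + ext

print(denoised_name("speech_white.wav"))  # speech_white_DENOISED.wav
```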
Optimizing alpha, beta, percentile, and Tw:
The goal of this project is to lower the energy of the noise in the denoised signal. To do this, we can plot how the different variables (alpha, beta, Tw, and percentile) affect the average energy of the noise (the first second of the denoised signal).
While I could have just chosen random values for Tw, alpha, and beta and then reported back the best ones, I figured the better option was to write a short optimization script that iterates each audio signal through a constrained range of values for Tw, alpha, and beta and saves the signal-to-noise ratio returned by denoiser.m. I then take the parameters that gave the largest SNR. This is not a perfect method, though, because an excessively high beta increases the total power of the signal without actually producing a better-sounding result. Constraining the parameters to reasonable values should make the maximum correspond to the ideal solution (without proper constraints, the best parameters found would only be a local maximum, because higher and higher betas just yield more and more signal power). Furthermore, the denoised signal almost always has very low noise, so the SNR is largely determined by signal strength, meaning it doesn’t swing much with different parameters. Still, this let me efficiently test many parameter options and narrow the range that I needed to test manually by exporting the files and listening to them.
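The sweep can be sketched as a brute-force grid search. Here the call into denoiser.m is replaced by a hypothetical stand-in scoring function, so the grid values and the "SNR" below are illustrative only:

```python
import itertools

def fake_snr(Tw, alpha, beta):
    # Hypothetical stand-in: real code would run the denoiser and measure SNR.
    # This toy score peaks at Tw=0.01, alpha=2, beta=0.08.
    return -(Tw - 0.01) ** 2 - (alpha - 2) ** 2 - (beta - 0.08) ** 2

Tws = [0.01, 0.04, 0.1]
alphas = [1.0, 2.0, 3.0]
betas = [0.0, 0.08, 0.2]

# Try every combination within the constrained ranges, keep the best score
best = max(itertools.product(Tws, alphas, betas),
           key=lambda params: fake_snr(*params))
print(best)  # (0.01, 2.0, 0.08)
```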
Choosing alpha and beta:
As a reminder, alpha and beta are the weights we give the magnitude of the coefficients in the Fourier domain during the spectral subtraction.
Also as a reminder, the signal-to-noise ratio is calculated as follows:
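SNR = P_signal / P_noise = (E_signal,avg / Tw) / (E_noise,avg / Tw) = E_signal,avg / E_noise,avg

where P denotes average power (energy divided by the segment length Tw, so the Tw factors cancel).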
To determine the correlation between alpha, beta, and the SNR of the denoised signal, I decided to 3D plot the SNR at different values of alpha and beta. Unlike the plots for percentile and Tw, I had to make a 3D plot because both alpha and beta affect the SNR (this is a multivariate function). As shown in the above plot, the higher the alpha and the lower the beta, the higher the SNR. The plot above is an analysis of a white noise signal, but the relationships I found are relevant for all the noise samples.
Of course, we would want a higher SNR as that indicates a stronger signal with less noise. Therefore, we should choose a low beta and a very high alpha (alpha > 1).
Choosing a Percentile:
As a reminder, the percentile threshold sets the energy level below which the program considers a segment of signal to be just noise. A higher percentile means we will classify more segments as pure noise, even if they contain some speech.
Graphing the average energy of the noise in the denoised signal against the noise energy threshold percentile shows that the noise energy decreases as the threshold rises. This makes logical sense: a higher threshold causes more of the signal to be pushed to near-zero magnitude, which lowers the energy. A higher energy threshold makes sense for signals where there is a large energy gap between all of the speech and all of the noise. However, in some cases, like when the noise is very loud or the speech is very quiet, a high threshold would get rid of a lot of the speech as well. So the percentile we choose for denoising has to be relative to the signal we input.
Choosing a Tw:
As a reminder, Tw represents the amount of time we use to segment the signals.
To figure out the optimal period with which to segment the signal, I plotted the average energy of the first second of the denoised signal versus the period length. I plotted the graph from 0 to 0.7 seconds, and we can see a fairly consistent upward trend, meaning the longer the period, the greater the energy of the noise. Since the goal of this project is to lower the average energy of the noise, I chose a small period length (0.01s) for denoising the signals.
We started with the following noisy audio that we wanted to denoise.
White Noise (White noise contains all frequencies with equal intensity):
White Noise - Noisy Audio
After putting it through my MATLAB program this is the result:
White Noise - Denoised Audio
This result uses a Tw of 0.04, a percentile of 0.9, an alpha of 2, and a beta of 0.08, which yields an SNR of 20.2086. However, one of the most important things when writing a program like this is to make sure it isn't overfit to the one audio track and noise type that I manually tuned the parameters on. Therefore, we should test other audio files with different types of noise using the same parameters to check whether they still sound good.
F16 Jet Noise (F16 jets produce noise around the 0.25-2kHz range):
F16 Jet Noise - Noisy Audio
F16 Jet Noise - Denoised Audio
Grey Noise - Noisy Audio
Grey Noise - Denoised Audio
Violet Noise - Noisy Audio
Violet Noise - Denoised Audio
Through these audio clips, we can hear that the denoising program is doing quite well with these parameters and multiple types of noise. Let's graph the signals to visually see the effect of this program.
With these graphs, we can easily see the effects of the denoising program. The original audio had a lot of white noise, which is why the magnitude of the signal in the time domain is non-zero even when there is no speech (i.e. during a pause between words). After denoising, the magnitude of the signal in the time domain when there is no speech is close to zero, meaning the program eliminated most of the noise.
Full Code - https://github.com/tchanxx/Projects/tree/master/Noise%20Cancellation