Music is ubiquitous. It embodies a large part of human emotion, behaviour, cognition, and culture. Unsurprisingly, online platforms now invest millions in research to improve the way users listen to music.
Today we'll learn how to extract features from music using Python. In particular: how can we recognize the components of music in a spectrogram? How do we interpret a spectrogram? And how can a spectrogram assist feature extraction for music?
In this tutorial, we will work with a Python package called librosa, a library primarily used to analyze audio signals.
When you listen to music, you feel the sound. Sound can be represented as an audio signal, and when you receive such a signal your emotions and brain activity are stimulated. A trained musician can discern the frequency, bandwidth, loudness (decibels), formants, and vibration of the sound.
On a computer, these sounds are stored in many file formats (for example WAV, MP3, and FLAC) that software can read and analyze.
In Python, there are various audio processing libraries. Two common ones are librosa (analysis) and PyAudio (audio acquisition and playback). We will focus on librosa in this tutorial.
Librosa can analyze audio signals and is widely used to build music information retrieval (MIR) systems. You can read the documentation of librosa at https://librosa.org/doc/latest/index.html.
!pip install librosa
We now import the librosa library and use the IPython.display.Audio
functionality to play audio directly in a Jupyter Notebook. To allow loading mp3 files (see the mp3 support notes at https://github.com/librosa/librosa#audioread-and-mp3-support), we install ffmpeg in Anaconda with the conda install -c conda-forge ffmpeg
command.
!conda install -c conda-forge ffmpeg
# Import package
import librosa
import matplotlib.pyplot as plt
# Load the audio dataset with librosa.load
# librosa.load returns the audio as a floating-point time series plus its sampling rate (22050 Hz by default)
audio_time_series, sampling_rate = librosa.load('Mojito_full.mp3')
# Inspect audio_time_series and sampling_rate
print(audio_time_series, sampling_rate)
print(len(audio_time_series))
print(max(audio_time_series))
print(min(audio_time_series))
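As a quick sanity check, dividing the number of samples by the sampling rate gives the track duration in seconds; librosa's get_duration helper does the same thing. A small sketch using the variables loaded above:
# Duration in seconds = number of samples / samples per second
print(len(audio_time_series) / sampling_rate)
# Equivalent helper provided by librosa
print(librosa.get_duration(y=audio_time_series, sr=sampling_rate))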
import IPython.display as ipd
ipd.Audio('Mojito_full.mp3')
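Since we already have the decoded waveform in memory, IPython.display.Audio can also play the array directly if we pass its sampling rate:
# Play the in-memory waveform instead of the mp3 file
ipd.Audio(data=audio_time_series, rate=sampling_rate)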
The first analysis for this piece of music is a waveplot. A waveplot shows the positive and negative values of the time series over time.
We can use the waveplot function in the librosa.display module to plot the audio array and see the amplitude envelope of the song from beginning to end.
import librosa.display
plt.figure(figsize = (10,4))
librosa.display.waveplot(audio_time_series, sampling_rate)
How do we interpret this?!! The music starts at a low amplitude and gradually increases as it moves into a more dynamic chorus :)
The triangle shapes in the graph indicate a crescendo (mostly happening at transitions between sections).
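If we want to quantify this rising amplitude envelope rather than just eyeballing it, one option is a frame-wise RMS energy curve computed with librosa.feature.rms; a minimal sketch using the variables loaded earlier:
# Frame-wise root-mean-square (RMS) energy of the signal
rms = librosa.feature.rms(y=audio_time_series)[0]
plt.figure(figsize = (10,4))
plt.plot(rms)
plt.xlabel('Frame index')
plt.ylabel('RMS energy')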
A spectrogram, also called a sonograph, voiceprint, or voicegram, is a visual representation of sound or audio signals based on the spectrum of frequencies. Humans can hear sound frequencies roughly in the range of 20 Hz to 20,000 Hz.
A spectrogram shows how the intensity of sound at different frequencies changes over time, from the beginning of the song to its end.
# Perform the short-time Fourier transform (STFT) of the audio
X = librosa.stft(audio_time_series)
#print(X)
# Convert the amplitude spectrogram to decibels for plotting
Xdb = librosa.amplitude_to_db(abs(X))
# Create spectrogram
plt.figure(figsize = (15,6))
librosa.display.specshow(Xdb, sr = sampling_rate, x_axis = 'time', y_axis = 'hz')
plt.colorbar()
In the spectrogram, the vertical axis shows frequencies in Hz, from 0 up to about 11,000 Hz (half the 22,050 Hz sampling rate), and the horizontal axis shows the timestamps of the imported song. Since most of the energy is concentrated at the lower end of the y-axis, we can say that the music mostly lives in the low-frequency range. Listeners who have difficulty hearing or processing low frequencies may therefore find this song harder to enjoy.
Because most of the energy sits at the lower end of the y-axis, it helps to put the frequency axis on a logarithmic scale. We replot the spectrogram but change the y_axis
parameter to log.
# Create spectrogram
plt.figure(figsize = (15,6))
librosa.display.specshow(Xdb, sr = sampling_rate, x_axis = 'time', y_axis = 'log')
plt.colorbar()
Every audio signal contains various interpretable features. The process of extracting these features and using them for analysis is called feature extraction. Let's go through some important features in music.
# We can use waveplot again to figure out the number of times the audio passes through zero.
# This number is important in analyzing sound and is referred to as the zero-crossing rate.
plt.figure(figsize = (12,5))
librosa.display.waveplot(audio_time_series, sampling_rate)
# Zoom in to look into signals
plt.figure(figsize = (12, 4))
plt.plot(audio_time_series[100000:101000])
plt.grid()
# We can count the zero crossings between samples 100000 and 101000 using the zero_crossings function
zero_crossings = librosa.zero_crossings(audio_time_series[100000:101000], pad = False)
# zero_crossings is returned as an array of boolean values;
# we sum over the array to see how many True values there are
sum(zero_crossings)
# Zoom in to look into signals
plt.figure(figsize = (12, 4))
plt.plot(audio_time_series[4000000:4001000])
plt.grid()
# We can count the zero crossings between samples 4000000 and 4001000 using the zero_crossings function
zero_crossings = librosa.zero_crossings(audio_time_series[4000000:4001000], pad = False)
# zero_crossings is returned as an array of boolean values;
# we sum over the array to see how many True values there are
sum(zero_crossings)
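Rather than counting crossings in a single 1000-sample window, we can also compute the zero-crossing rate frame by frame across the whole track with librosa.feature.zero_crossing_rate; a minimal sketch using the variables loaded earlier:
# Frame-wise zero-crossing rate for the entire song
zcr = librosa.feature.zero_crossing_rate(audio_time_series)[0]
plt.figure(figsize = (12, 4))
plt.plot(zcr)
plt.xlabel('Frame index')
plt.ylabel('Zero-crossing rate')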
Chroma frequencies project the whole spectrum onto 12 bins representing the 12 pitch classes (C, C#, D, ..., B), which makes them useful for describing the harmonic content of music.
# Chroma frequencies
hop_length = 512
chromagram = librosa.feature.chroma_stft(audio_time_series, sr = sampling_rate,
hop_length = hop_length)
plt.figure(figsize = (15,6))
librosa.display.specshow(chromagram, x_axis = 'time', y_axis ='chroma',
hop_length = hop_length, cmap = 'coolwarm')
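One simple way to summarize the chromagram is to average it over time into a 12-dimensional vector showing which pitch classes dominate the song; a small sketch building on the chromagram computed above:
# Mean energy per pitch class across the whole song
mean_chroma = chromagram.mean(axis=1)
pitch_classes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
for name, value in zip(pitch_classes, mean_chroma):
    print(name, round(float(value), 3))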