Visualizing Music Using Spectrograms in Python

Music is ubiquitous. It embodies a large part of human emotion, behaviour, cognition, and culture. Unsurprisingly, online platforms now spend millions on research to improve how users listen to music.

Today we'll learn how to extract features from music using Python. In particular: how do we recognize the components of music in a spectrogram? How do we interpret a spectrogram? And how can a spectrogram assist feature extraction for music?

In this tutorial, we will work with a Python package called librosa, a module primarily used to analyze audio signals.

Sound Processing

When you listen to music, you feel the sound. Sound can be represented as an audio signal, and receiving such a signal stimulates emotion and brain activity. A trained musician can discern the frequency, bandwidth, decibel level, formants, and vibration of the sound.

On a computer, these sounds come in a number of file formats that software can read and analyze:

  • mp3
  • WMA
  • wav

Import libraries

In Python, there are various audio processing libraries. Two of the most common are Librosa and PyAudio. These libraries support audio acquisition and playback functionality. We will look at Librosa in this tutorial.

Librosa can analyze audio signals and use them to build music information retrieval systems. You can read the documentation of Librosa at https://librosa.org/doc/latest/index.html.

In [ ]:
!pip install librosa

Load library

We now import the Librosa library and use IPython.display.Audio to play audio directly in a Jupyter Notebook. To enable loading mp3 files (see the mp3 support discussion at https://github.com/librosa/librosa#audioread-and-mp3-support), we install ffmpeg into Anaconda with conda install -c conda-forge ffmpeg.

In [ ]:
!conda install -c conda-forge ffmpeg
In [2]:
# Import package
import librosa
import matplotlib.pyplot as plt
In [3]:
# Import the dataset - audio dataset
# librosa.load

audio_time_series, sampling_rate = librosa.load('Mojito_full.mp3')
C:\Users\sonso\Anaconda3\lib\site-packages\librosa\core\audio.py:162: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
In [4]:
# Inspect audio_time_series and sampling_rate
print(audio_time_series, sampling_rate)
[0. 0. 0. ... 0. 0. 0.] 22050
In [18]:
print(len(audio_time_series))
5128128
In [5]:
max(audio_time_series)
Out[5]:
0.88857234
In [6]:
min(audio_time_series)
Out[6]:
-0.792852
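The array length divided by the sampling rate gives the song's duration in seconds: 5128128 / 22050 ≈ 232.6 s, or just under four minutes. Librosa also provides librosa.get_duration for this; a quick check:

In [ ]:
# Duration in seconds = number of samples / samples per second
print(len(audio_time_series) / sampling_rate)
print(librosa.get_duration(y=audio_time_series, sr=sampling_rate))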

Playing Audio

In [7]:
import IPython.display as ipd
ipd.Audio('Mojito_full.mp3')
Out[7]:

Waveplot

The first analysis for this piece of music is a waveplot. A waveplot shows the positive and negative amplitude values of the audio signal over time.

We can use the waveplot function in the librosa.display module to plot the audio array and see the amplitude envelope of the song from beginning to end.

In [9]:
import librosa.display
plt.figure(figsize = (10,4))
librosa.display.waveplot(audio_time_series, sr = sampling_rate)
Out[9]:
<matplotlib.collections.PolyCollection at 0x160f458eac0>
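Note that waveplot was removed in librosa 0.10 in favor of librosa.display.waveshow. If the cell above raises an AttributeError on a newer installation, this equivalent call should work:

In [ ]:
# Same amplitude-envelope plot on librosa >= 0.10
plt.figure(figsize = (10,4))
librosa.display.waveshow(audio_time_series, sr = sampling_rate)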

How do we interpret this? The music starts at a low amplitude and gradually increases as it moves into a more dynamic chorus :)

The triangle-like shapes in the plot indicate crescendos, which mostly occur at transitions between sections.

Spectrogram

A spectrogram, also called a sonograph, voiceprint, or voicegram, is a visual representation of a sound or audio signal based on its spectrum of frequencies. Humans can detect sound frequencies ranging from roughly 20 Hz to 20,000 Hz.

A spectrogram shows how the intensity of the sound at each frequency changes over time, from the onset of the song to its end.
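Before plotting, it helps to know what the short-time Fourier transform (STFT) actually computes. By librosa's documented defaults, librosa.stft slices the signal into 2048-sample windows hopped every 512 samples, so each column of the result is a spectrum snapshot taken roughly every 23 ms. A minimal sketch of those numbers:

In [ ]:
n_fft = 2048        # window length: 2048 samples ~ 93 ms at 22050 Hz
hop_length = 512    # default hop = n_fft // 4 ~ 23 ms between frames

# The STFT has (1 + n_fft // 2) frequency rows and one column per frame
print(1 + n_fft // 2)                      # 1025 frequency bins
print(sampling_rate / n_fft)               # ~10.8 Hz of resolution per bin
print(hop_length / sampling_rate * 1000)   # ~23.2 ms per frame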

In [14]:
# Perform short-time Fourier Transform for the audio
X = librosa.stft(audio_time_series)
#print(X)
Xdb = librosa.amplitude_to_db(abs(X))

# Create spectrogram
plt.figure(figsize = (15,6))
librosa.display.specshow(Xdb, sr = sampling_rate, x_axis = 'time', y_axis = 'hz')
plt.colorbar()
Out[14]:
<matplotlib.colorbar.Colorbar at 0x160894d80a0>

In the spectrogram, the vertical axis shows frequency in Hz from 0 to about 10,000 Hz, and the horizontal axis shows the timestamp within the imported song. Since most of the sound energy is concentrated at the lower end of the y-axis, we can say the music mostly features the low-frequency range. Listeners who have difficulty hearing or processing low frequencies may find this song hard to listen to.

Because most of the energy sits at the lower end of the y-axis, a log scale on the frequency axis spreads it out more usefully. We replot the spectrogram, changing the y_axis parameter to 'log'.

In [15]:
# Create spectrogram
plt.figure(figsize = (15,6))
librosa.display.specshow(Xdb, sr = sampling_rate, x_axis = 'time', y_axis = 'log')
plt.colorbar()
Out[15]:
<matplotlib.colorbar.Colorbar at 0x1608950e4c0>
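One refinement worth knowing: amplitude_to_db accepts a ref argument, and passing ref=np.max expresses every value in dB relative to the loudest bin (so the peak sits at 0 dB), which often makes the color scale easier to read. A minimal variant:

In [ ]:
import numpy as np

# dB relative to the peak; the loudest bin maps to 0 dB
Xdb_rel = librosa.amplitude_to_db(abs(X), ref = np.max)

plt.figure(figsize = (15,6))
librosa.display.specshow(Xdb_rel, sr = sampling_rate, x_axis = 'time', y_axis = 'log')
plt.colorbar(format = '%+2.0f dB')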

Music Feature Extraction

Every audio signal contains various interpretable features, and the process of extracting these features and using them for analysis is called feature extraction. Let's go through some important features in music.

In [16]:
# We can use the waveplot again to figure out the number of times the signal passes through zero.
# This count is important in analyzing sound and is referred to as the zero-crossing rate.

plt.figure(figsize = (12,5))
librosa.display.waveplot(audio_time_series, sampling_rate)
Out[16]:
<matplotlib.collections.PolyCollection at 0x160f4413820>
In [20]:
# Zoom in to look into signals
plt.figure(figsize = (12, 4))
plt.plot(audio_time_series[100000:101000])
plt.grid()
In [23]:
# We can count zero crossings between samples 100000 and 101000 using the zero_crossings function
zero_crossings = librosa.zero_crossings(audio_time_series[100000:101000], pad = False)
# Zero crossings are returned as an array of boolean values
# We sum over the array to count how many True values there are
sum(zero_crossings)
Out[23]:
76
In [24]:
# Zoom in to look into signals
plt.figure(figsize = (12, 4))
plt.plot(audio_time_series[4000000:4001000])
plt.grid()
In [25]:
# We can count zero crossings between samples 4000000 and 4001000 using the zero_crossings function
zero_crossings = librosa.zero_crossings(audio_time_series[4000000:4001000], pad = False)
# Zero crossings are returned as an array of boolean values
# We sum over the array to count how many True values there are
sum(zero_crossings)
Out[25]:
83
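Rather than counting crossings in hand-picked 1000-sample windows, librosa can compute the zero-crossing rate frame by frame across the whole song with librosa.feature.zero_crossing_rate. A minimal sketch:

In [ ]:
# Frame-wise zero-crossing rate for the entire song
# (returns an array of shape (1, number_of_frames))
zcr = librosa.feature.zero_crossing_rate(audio_time_series,
                                         frame_length = 2048, hop_length = 512)
plt.figure(figsize = (12,4))
plt.plot(zcr[0])
plt.xlabel('Frame')
plt.ylabel('Zero-crossing rate')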

PCA / FPCA

Principal component analysis / functional principal component analysis

  • Can extract important features from a piece of music or human speech
  • Use a scree plot to show how much of the variance each component explains
  • The most important feature explains the most variance within your dataset
  • Choose features until the chosen set represents approximately 90-95% of the variability in your dataset (a minimal sketch follows this list)
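As promised, here is a minimal PCA sketch, assuming scikit-learn is installed. It treats each STFT frame (each column of Xdb from above) as one observation and plots the cumulative explained variance, scree-plot style, so we can read off how many components reach the 90-95% mark:

In [ ]:
import numpy as np
from sklearn.decomposition import PCA

frames = Xdb.T                 # shape: (number of frames, number of frequency bins)
pca = PCA(n_components = 50)
pca.fit(frames)

cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize = (8,4))
plt.plot(cumulative, marker = 'o')
plt.axhline(0.90, linestyle = '--')
plt.axhline(0.95, linestyle = '--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')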
In [26]:
# Chroma frequencies: project the spectrum onto the 12 pitch classes
# (C, C#, ..., B), which is useful for analyzing harmony

hop_length = 512
chromagram = librosa.feature.chroma_stft(audio_time_series, sr = sampling_rate,
                                         hop_length = hop_length)
plt.figure(figsize = (15,6))
librosa.display.specshow(chromagram, x_axis = 'time', y_axis ='chroma',
                        hop_length = hop_length, cmap = 'coolwarm')
Out[26]:
<matplotlib.collections.QuadMesh at 0x160f43df160>
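As a quick sanity check, we can average the chromagram over time to see which pitch class carries the most energy. This is only a rough proxy for the key of the song (a sketch; the row ordering follows librosa's chroma convention, C through B):

In [ ]:
import numpy as np

pitch_classes = ['C', 'C#', 'D', 'D#', 'E', 'F',
                 'F#', 'G', 'G#', 'A', 'A#', 'B']
mean_chroma = chromagram.mean(axis = 1)   # average energy per pitch class
print(pitch_classes[int(np.argmax(mean_chroma))])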

Future Direction

  • Classification of music genre
  • Classification of verse and chorus
  • YouTube / Bilibili / Spotify - recommender systems (combine algorithmic ratings with user ratings from data storage to recommend the next piece of music to listen to)
  • Generative Adversarial Networks (GANs): an AI that composes new songs after learning from a corpus of input music (learning features such as musical timbre, pitch/frequency, dynamic changes, instrumental restrictions, harmony types, rhythm types, etc.)