Audio Feature Extraction

Categories: TIL, Audio, librosa

Author: Stephen Barrie
Published: December 9, 2022

Audio files and concepts

In audio data analysis, we process and transform audio signals captured by digital devices. Depending on how they’re captured, they can come in many different formats such as wav, mp3, m4a, aiff, and flac.

Quoting Izotope.com, Waveform (wav) is one of the most popular digital audio formats. It is a lossless file format, meaning it stores the closest mathematical representation of the original audio with no noticeable quality loss. In mp3 or m4a (Apple's mp3 format), the data is compressed so it can be distributed more easily, at the cost of some quality. Most audio analysis libraries support wav file processing.
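Since most analysis libraries work with wav, it helps to know what a wav file actually contains. As a minimal sketch using only Python's standard-library wave module (the file name "demo.wav" and the tone parameters are throwaway assumptions), we can write a one-second 440 Hz tone and read its header back:

```python
import math
import struct
import wave

# Write a one-second 440 Hz mono sine tone as a 16-bit wav file.
# "demo.wav" is a throwaway name for this sketch.
sample_rate = 44100
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)))
    for n in range(sample_rate)
)
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)           # mono
    wf.setsampwidth(2)           # 2 bytes per sample = 16-bit
    wf.setframerate(sample_rate)
    wf.writeframes(frames)

# Read the header back: channels, sample width, frame rate, frame count
with wave.open("demo.wav", "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
    # → 1 2 44100 44100
```

These header fields are exactly the channels, sample width, frame rate, and frame count attributes we will pull out with pydub later on.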

As a form of wave, a sound/audio signal has these generic properties:

  • Frequency: occurrences of vibrations per unit of time
  • Amplitude: maximum displacement of a point on the wave from its equilibrium position; this determines the sound's intensity
  • Speed of sound: distance traveled per unit of time by a soundwave

The information extracted from audio files consists of transformations of the main properties above.
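As a minimal sketch of that idea (pure NumPy, with an illustrative synthetic tone rather than a real recording), frequency and amplitude can be recovered directly from a signal's samples:

```python
import numpy as np

# One second of a synthetic 440 Hz tone, amplitude 0.5, sampled at 22050 Hz
# (all values here are illustrative assumptions)
sr = 22050
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)

# Frequency: locate the dominant spectral peak with the FFT
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / sr)
peak_hz = freqs[np.argmax(spectrum)]

# Amplitude: maximum displacement from the equilibrium position
amp = np.max(np.abs(x))
print(int(peak_hz), round(float(amp), 2))   # → 440 0.5
```

Every feature in this post (spectrogram, RMS, ZCR, MFCCs, chroma, tempo) is some richer transformation of these same two quantities over time.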

Exploratory analysis on audio files

For this analysis, I’m going to compare two demo tracks that our band Thirteen-Seven produced.

The files will be analyzed mainly with these Python packages:

  • librosa for audio signal extraction and visualization
  • pydub for audio file manipulation
  • wave for reading wav files

General audio parameters

Just like how we usually start evaluating tabular data by getting a statistical summary (e.g. using the DataFrame.describe() method), in audio analysis we can start by getting a summary of the audio metadata. We can do so by using the AudioSegment class in pydub.

Below are some generic features that can be extracted:

  • Channels: number of channels; 1 for mono, 2 for stereo audio
  • Sample width: number of bytes per sample; 1 means 8-bit, 2 means 16-bit, 3 means 24-bit, 4 means 32-bit
  • Frame rate(sample rate): frequency of samples used (in Hertz)
  • Frame width: Number of bytes for each “frame”. One frame contains a sample for each channel.
  • Length: audio file length (in milliseconds)
  • Frame count: the number of frames from the sample
  • Intensity: loudness in dBFS (dB relative to the maximum possible loudness)
import numpy as np
import matplotlib.pyplot as plt
from pydub import AudioSegment
import librosa
import librosa.display
import IPython.display as ipd
# Load in the track and create widget to listen
all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')
ipd.Audio(all_the_excuses, rate=sr)
# Load the file as a pydub AudioSegment for metadata inspection
all_the_excuses = AudioSegment.from_file('Audio/all_the_excuses.wav')

# Print attributes
print(f"***All The Excuses - metadata***")
print(f"Channels:  {all_the_excuses.channels}")
print(f"Sample width: {all_the_excuses.sample_width}")
print(f"Frame rate (sample rate): {all_the_excuses.frame_rate}")
print(f"Frame width:  {all_the_excuses.frame_width}")
print(f"Length (ms): {len(all_the_excuses)}")
print(f"Frame count:  {all_the_excuses.frame_count()}")
print(f"Intensity: {all_the_excuses.dBFS}")
***All The Excuses - metadata***
Channels:  2
Sample width: 2
Frame rate (sample rate): 44100
Frame width:  4
Length (ms): 252891
Frame count:  11152512.0
Intensity: -10.902991191150802
# Load in the track and create widget to listen
all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')
ipd.Audio(all_or_nothing, rate=sr)
all_or_nothing = AudioSegment.from_file('Audio/all_or_nothing.wav')

# Print attributes
print(f"***All or Nothing - metadata***")
print(f"Channels:  {all_or_nothing.channels}")
print(f"Sample width: {all_or_nothing.sample_width}")
print(f"Frame rate (sample rate): {all_or_nothing.frame_rate}")
print(f"Frame width:  {all_or_nothing.frame_width}")
print(f"Length (ms): {len(all_or_nothing)}")
print(f"Frame count:  {all_or_nothing.frame_count()}")
print(f"Intensity: {all_or_nothing.dBFS}")
***All or Nothing - metadata***
Channels:  2
Sample width: 2
Frame rate (sample rate): 44100
Frame width:  4
Length (ms): 239099
Frame count:  10544256.0
Intensity: -10.614852809540894

Feature extraction

Numerous advanced features can be extracted and visualized using librosa to analyze audio characteristics.

Amplitude envelope

We can visualize the amplitude over time of an audio file to get an idea of the wave movement using librosa:

# Import required module
import librosa.display
import matplotlib.pyplot as plt


# Load in our track
all_the_excuses = 'Audio/all_the_excuses.wav'
x, sr = librosa.load(all_the_excuses, sr=None)
# Plot the signal
plt.figure(figsize=(15, 3))
plt.title("Thirteen-Seven | All The Excuses - waveplot")
librosa.display.waveshow(x, sr=sr)

# Load in our track
all_or_nothing = 'Audio/all_or_nothing.wav'
x, sr = librosa.load(all_or_nothing, sr=None)

# Plot the signal
plt.figure(figsize=(15, 3))
plt.title("Thirteen-Seven | All or Nothing - waveplot")
librosa.display.waveshow(x, sr=sr)

Spectrogram

The extracted audio features can be visualized on a spectrogram. Quoting Wikipedia, a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It is usually depicted as a heat map, with the intensity shown on varying color gradients.

import librosa.display

all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')

X = librosa.stft(all_the_excuses)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(15, 3))
plt.title('Thirteen-Seven | All The Excuses - spectrogram')
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

import librosa.display

all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')

X = librosa.stft(all_or_nothing)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(15, 3))
plt.title('Thirteen-Seven | All or Nothing - spectrogram')
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

The vertical axis shows frequency, the horizontal axis shows the time of the clip, and the color variation shows the intensity of the audio wave.

Root-mean-square (RMS)

The root-mean-square (RMS) measures the overall magnitude of the signal, which in layman's terms can be interpreted as the loudness or energy of the audio file.
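The computation itself is simple: per frame, RMS is the square root of the mean squared sample value. A sketch on a synthetic tone (the frame and hop lengths match librosa's common defaults; everything else is illustrative):

```python
import numpy as np

# An illustrative steady tone; a sine of amplitude A has RMS A / sqrt(2)
sr = 22050
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220 * t)

frame_length, hop_length = 2048, 512
rms = np.array([
    np.sqrt(np.mean(x[i:i + frame_length] ** 2))         # RMS of one frame
    for i in range(0, len(x) - frame_length + 1, hop_length)
])
print(round(float(rms.mean()), 3))   # ≈ 0.354, i.e. 0.5 / sqrt(2)
```

librosa.feature.rms performs essentially this windowed computation (with centered, padded frames), and can also take a precomputed magnitude spectrogram, which is the route used on the real tracks next.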

all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')

# Get RMS value from each frame's magnitude value
S, phase = librosa.magphase(librosa.stft(all_the_excuses))
rms = librosa.feature.rms(S=S)


# Plot the RMS energy
fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
times = librosa.times_like(rms)
ax[0].semilogy(times, rms[0], label='RMS Energy')
ax[0].set(xticks=[])
ax[0].legend()
ax[0].label_outer()
librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                         y_axis='log', x_axis='time', ax=ax[1])
ax[1].set(title='Thirteen-Seven | All The Excuses - log Power spectrogram')

all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')

# Get RMS value from each frame's magnitude value
S, phase = librosa.magphase(librosa.stft(all_or_nothing))
rms = librosa.feature.rms(S=S)


# Plot the RMS energy
fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
times = librosa.times_like(rms)
ax[0].semilogy(times, rms[0], label='RMS Energy')
ax[0].set(xticks=[])
ax[0].legend()
ax[0].label_outer()
librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                         y_axis='log', x_axis='time', ax=ax[1])
ax[1].set(title='Thirteen-Seven | All or Nothing - log Power spectrogram')

Here we can see the RMS values are consistently high (until the very end of the tracks) as this rock music is loud and intense throughout.

Zero crossing rate

Quoting Wikipedia, zero-crossing rate (ZCR) is the rate at which a signal changes from positive to zero to negative or from negative to zero to positive. Its value has been widely used in both speech recognition and music information retrieval, being a key feature to classify percussive sounds. Highly percussive sounds like rock, metal, emo, or punk music tend to have higher zero-crossing rate values.

We can get this value manually by zooming into a certain frame of the amplitude time series, counting how many times it crosses zero on the y-axis, and extrapolating for the whole audio. Alternatively, librosa provides functions to get the zero-crossing events and rate directly.
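The manual count is easy to sketch with NumPy on a toy signal (the sample values below are made up purely for illustration):

```python
import numpy as np

# A toy signal: the signs run + + - - + + -, so it crosses zero three times
x = np.array([0.5, 0.2, -0.1, -0.4, 0.3, 0.6, -0.2])

# A crossing occurs wherever two consecutive samples differ in sign
crossings = int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))
print(crossings)   # → 3
```

librosa.zero_crossings marks these same events as a boolean array, and librosa.feature.zero_crossing_rate reports the fraction of crossings per frame.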

all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')

zcrs = librosa.feature.zero_crossing_rate(all_the_excuses)
                     
print(f"Zero crossing rate: {sum(librosa.zero_crossings(all_the_excuses))}")
plt.figure(figsize=(15, 3))
plt.plot(zcrs[0])
plt.title('Thirteen-Seven | All The Excuses - zero-crossing rate (ZCR)')
Zero crossing rate: 706615

all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')

zcrs = librosa.feature.zero_crossing_rate(all_or_nothing)
                     
print(f"Zero crossing rate: {sum(librosa.zero_crossings(all_or_nothing))}")
plt.figure(figsize=(15, 3))
plt.plot(zcrs[0])
plt.title('Thirteen-Seven | All or Nothing - zero-crossing rate (ZCR)')
Zero crossing rate: 679083

Above are the total zero-crossing count and the frame-wise rate for each track. The zero-crossing rate is high, as expected for a highly percussive rock song.

Mel-Frequency Cepstral Coefficients (MFCCs)

Quoting Analytics Vidhya, humans do not perceive frequencies on a linear scale. We are better at detecting differences between lower frequencies than between higher ones, even when the gap is the same (e.g. 500 and 1,000 Hz vs 10,000 and 10,500 Hz). On the Mel scale, equal distances in pitch sound equally distant to the listener.
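This compression of high frequencies is easy to quantify. The sketch below uses the common HTK-style formula mel = 2595 · log10(1 + f/700); librosa's default is the slightly different Slaney variant, but the shape of the curve is the same:

```python
import math

def hz_to_mel(f):
    """HTK-style Hz-to-Mel conversion (an assumption; librosa defaults to Slaney)."""
    return 2595 * math.log10(1 + f / 700)

# The same 500 Hz gap spans far more Mels at the low end of the spectrum
low_gap = hz_to_mel(1000) - hz_to_mel(500)      # ≈ 393 Mels
high_gap = hz_to_mel(10500) - hz_to_mel(10000)  # ≈ 51 Mels
print(round(low_gap), round(high_gap))
```

In other words, 500 Hz of separation down low is perceptually several times wider than the same separation up high, which is exactly what the Mel scale encodes.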

Mel-Frequency Cepstral Coefficients (MFCCs) are a representation of the short-term power spectrum of a sound, based on a transformation to the Mel scale. They are commonly used in speech recognition, as people's voices occupy a certain range of frequencies that differs from person to person. Getting and displaying MFCCs is quite straightforward in librosa.

all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')
mfccs = librosa.feature.mfcc(y=all_the_excuses, sr=sr)

#Displaying  the MFCCs:
fig,ax = plt.subplots(figsize=(15, 3))
img = librosa.display.specshow(mfccs, sr=sr, x_axis='time')
fig.colorbar(img, ax=ax)
                     
ax.set(title='Thirteen-Seven | All The Excuses - Mel-Frequency Cepstral Coefficients (MFCCs)')

all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')
mfccs = librosa.feature.mfcc(y=all_or_nothing, sr=sr)

#Displaying  the MFCCs:
fig,ax = plt.subplots(figsize=(15, 3))
img = librosa.display.specshow(mfccs, sr=sr, x_axis='time')
fig.colorbar(img, ax=ax)
                     
ax.set(title='Thirteen-Seven | All or Nothing - Mel-Frequency Cepstral Coefficients (MFCCs)')

Chroma

We can use the chroma feature visualization to see how dominant each pitch class {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B} is in each sampled frame.

all_the_excuses, sr = librosa.load('Audio/all_the_excuses.wav')

hop_length = 512

chromagram = librosa.feature.chroma_stft(y=all_the_excuses, sr=sr, hop_length=hop_length)

plt.figure(figsize=(15, 5))
plt.title('Thirteen-Seven | All The Excuses - chromagram')
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')

all_or_nothing, sr = librosa.load('Audio/all_or_nothing.wav')

hop_length = 512

chromagram = librosa.feature.chroma_stft(y=all_or_nothing, sr=sr, hop_length=hop_length)

plt.figure(figsize=(15, 5))
plt.title('Thirteen-Seven | All or Nothing - chromagram')
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')

Tempogram

Tempo refers to the rate of the musical beat and is given by the reciprocal of the beat period. It is often measured in beats per minute (BPM) and can vary locally within a piece. The tempogram (FMP, p. 317) is therefore introduced as a feature matrix indicating the prevalence of a certain tempo at each moment in time.
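The idea behind a tempogram can be sketched without librosa: autocorrelate an onset-strength signal and read the tempo off the strongest lag. Here the onset envelope is a synthetic click train at a known 120 BPM (all parameters are illustrative assumptions):

```python
import numpy as np

sr_env = 100                                  # onset envelope frames per second
bpm_true = 120
period = int(round(sr_env * 60 / bpm_true))   # 50 frames between beats
env = np.zeros(30 * sr_env)
env[::period] = 1.0                           # one "onset" per beat

# Autocorrelation peaks at multiples of the beat period; a full tempogram
# is this computation repeated over a sliding window of the envelope
ac = np.correlate(env, env, mode="full")[len(env) - 1:]
lag = int(np.argmax(ac[20:200])) + 20         # search lags covering 30-300 BPM
print(round(60 * sr_env / lag))               # → 120
```

librosa's tempo estimator works on a real onset-strength envelope derived from the audio rather than an idealized click train, but the lag-to-BPM reasoning is the same.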

# Estimate the tempo:
tempo = librosa.beat.tempo(y=all_the_excuses, sr=sr)
tempo
array([172.265625])
# Visualize the tempo estimate on top of the input signal
T = len(all_the_excuses)/float(sr)
seconds_per_beat = 60.0/tempo[0]
beat_times = np.arange(0, T, seconds_per_beat)

librosa.display.waveshow(all_the_excuses, sr=sr)
plt.vlines(beat_times, -1, 1, color='r')
plt.title("Thirteen-Seven | All The Excuses - estimated tempo plot")
Text(0.5, 1.0, 'Thirteen-Seven | All The Excuses - estimated tempo plot')

# Listen to the input signal with a click track using the tempo estimate:
clicks = librosa.clicks(times=beat_times, sr=sr, length=len(all_the_excuses))
ipd.Audio(all_the_excuses + clicks, rate=sr)
# Estimate the tempo:
tempo = librosa.beat.tempo(y=all_or_nothing, sr=sr)
tempo
array([112.34714674])
# Visualize the tempo estimate on top of the input signal
T = len(all_or_nothing)/float(sr)
seconds_per_beat = 60.0/tempo[0]
beat_times = np.arange(0, T, seconds_per_beat)

librosa.display.waveshow(all_or_nothing, sr=sr)
plt.vlines(beat_times, -1, 1, color='r')
plt.title("Thirteen-Seven | All or Nothing - estimated tempo plot")
Text(0.5, 1.0, 'Thirteen-Seven | All or Nothing - estimated tempo plot')

# Listen to the input signal with a click track using the tempo estimate:
clicks = librosa.clicks(times=beat_times, sr=sr, length=len(all_or_nothing))
ipd.Audio(all_or_nothing + clicks, rate=sr)