torchaudio spectrum feature extraction

1. Read and save audio

In torchaudio, audio is loaded and saved with torchaudio.load() and torchaudio.save().

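import os

import torchaudio
from IPython import display

# load() returns the waveform as a (channels, samples) tensor plus the file's sample rate
data, sample = torchaudio.load(r"E:\pycharm\datas data set\test\audio\c6.flac")
print(data.shape, sample)
# display.Audio only renders by itself when it is the last expression in a cell
display.display(display.Audio(data.numpy(), rate=sample))
os.makedirs('./data', exist_ok=True)  # make sure the output directory exists
torchaudio.save('./data/audio.wav', src=data, sample_rate=sample)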
-------------------------------------------------------------------------
torch.Size([1, 32640]) 16000

Note that, unlike librosa.load(), torchaudio.load() has no sr parameter: it cannot resample on load and always returns the audio at the file's native sampling rate.
When saving, the format parameter sets the file format, the encoding parameter sets the encoding method, and the bits_per_sample parameter sets the number of bits per sample used for encoding.
format can take values such as "mp3", "flac", "vorbis", "sph", "amb", "amr-nb", "gsm".
Possible values for encoding (which of them apply depends on the chosen format):

  • "PCM_S": Signed integer linear PCM
  • "PCM_U": Unsigned integer linear PCM
  • "PCM_F": Floating point linear PCM
  • "FLAC": FLAC, Free Lossless Audio Codec
  • "ULAW": Mu-law
  • "ALAW": A-law
  • "MP3": MP3, MPEG-1 Audio Layer III
  • "VORBIS": OGG Vorbis
  • "AMR_NB": Adaptive Multi-Rate
  • "AMR_WB": Adaptive Multi-Rate Wideband
  • "OPUS": Opus
  • "GSM": GSM-FR
  • "UNKNOWN": None of the above

2. Extract features

  • Spectrogram
  • GriffinLim
  • Mel Filter Bank
  • MelSpectrogram
  • MFCC
  • Pitch
  • Kaldi Pitch (beta)

2.1 Short-time Fourier transform

Taking Spectrogram as an example, the following code computes short-time Fourier transform (STFT) features:

n_fft = 1024
win_length = None
hop_length = 512
# Short time Fourier transform
transform = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
)
spec = transform(data)
print(spec.shape, spec.dtype)
------------------------------------------------------------
torch.Size([1, 513, 64]) torch.float32

2.2 Conversion and use of PyTorch complex values

The parameter worth noting above is the exponent power. According to the official documentation, power is either a float greater than 0 or None: 1.0 returns the magnitude spectrogram, 2.0 returns the power spectrogram, and None returns the complex-valued spectrum.

# When power is None, the returned data type is complex64
transform_complex = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=None,
)
spec_complex = transform_complex(data)
print(spec_complex.shape, spec_complex.dtype)
----------------------------------------------------------
torch.Size([1, 513, 64]) torch.complex64

Most network layers do not accept complex tensors, so you can use torch.view_as_real() to convert the spectrum to a pseudo-complex representation: each complex number is split into its real and imaginary parts, which are stored in a new last dimension of size 2.

import torch

spec_real = torch.view_as_real(spec_complex)
print(spec_real.shape, spec_real.dtype)
----------------------------------------------------------------------------------
torch.Size([1, 513, 64, 2]) torch.float32
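
One possible way to use the pseudo-complex tensor, sketched under the assumption that the real and imaginary parts are fed to a network as two input channels. A random complex tensor stands in for the real spectrogram here:

```python
import torch

# Stand-in for a complex spectrogram of shape (channel, freq, time)
spec_complex = torch.randn(1, 513, 64, dtype=torch.complex64)

spec_real = torch.view_as_real(spec_complex)  # (1, 513, 64, 2), float32
real_part, imag_part = spec_real[..., 0], spec_real[..., 1]

# Move the real/imag axis to the front so it can act as a 2-channel input, e.g. for Conv2d
net_input = spec_real.squeeze(0).permute(2, 0, 1).unsqueeze(0)  # (batch, 2, freq, time)

# view_as_complex undoes the split (it needs a last dimension of size 2 with stride 1)
restored = torch.view_as_complex(spec_real)
print(net_input.shape, torch.equal(restored, spec_complex))
```

torch.view_as_real() returns a view rather than a copy, so the conversion in both directions costs no extra memory.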

2.3 Inverse transformation of Spectrogram

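API: InverseSpectrogram
InverseSpectrogram accepts only complex-valued spectra or, in the torchaudio version used here, pseudo-complex spectrum matrices whose last dimension is 2. Newer torchaudio releases expect a complex tensor, so pseudo-complex input may first need to be converted back with torch.view_as_complex().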

inverse_Spect = torchaudio.transforms.InverseSpectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
)
# original audio
print(data[0, :10])
# Inverse transformation of pseudo-complex matrix
data_hat = inverse_Spect(spec_real)
print(data_hat[0, :10])
# Inverse transformation of complex matrix
data_hat2 = inverse_Spect(spec_complex)
print(data_hat2[0, :10])
==========================================================================
tensor([ 0.0000e+00,  0.0000e+00,  0.0000e+00,  3.0518e-05,  3.0518e-05,
         3.0518e-05,  0.0000e+00, -6.1035e-05,  0.0000e+00,  0.0000e+00])
tensor([-2.2737e-13,  1.8190e-12,  2.0181e-12,  3.0518e-05,  3.0518e-05,
         3.0518e-05, -2.2714e-13, -6.1035e-05, -3.5279e-12, -2.7309e-12])
tensor([-2.2737e-13,  1.8190e-12,  2.0181e-12,  3.0518e-05,  3.0518e-05,
         3.0518e-05, -2.2714e-13, -6.1035e-05, -3.5279e-12, -2.7309e-12])

The reconstructed samples differ from the original by only about 1e-12 (visible where the original sample is exactly zero), so the error is negligible.
