torchaudio spectrum feature extraction

admin 25/10/2023 Speech Recognition

1. Read and save audio
2. Extract features

1. Read and save audio

In torchaudio, audio is loaded with torchaudio.load() and saved with torchaudio.save().

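import torchaudio
from IPython import display

# load() returns the waveform as a (channels, samples) tensor plus the sample rate
data, sample = torchaudio.load(r"E:\pycharm\datas data set\test\audio\c6.flac")
print(data.shape, sample)
# Play the audio inline (Jupyter/IPython only)
display.display(display.Audio(data.numpy(), rate=sample))
# Write the waveform back out as a WAV file
torchaudio.save('./data/audio.wav', src=data, sample_rate=sample)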
-------------------------------------------------------------------------
torch.Size([1, 32640]) 16000

Note that, unlike librosa.load(), torchaudio.load() has no sr parameter: it always returns the audio at the file's native sampling rate, so any resampling has to be done as a separate step.
When saving, the format parameter specifies the output format, the encoding parameter specifies the encoding, and the bits_per_sample parameter specifies the bit depth used for encoding.
format can take values such as "mp3", "flac", "vorbis", "sph", "amb", "amr-nb" and "gsm". Which of these are actually available depends on the installed torchaudio version and backend.
Encoding names used by torchaudio (when saving, only some of them, such as the PCM variants, ULAW and ALAW, are accepted, depending on the output format):

"PCM_S": Signed integer linear PCM
"PCM_U": Unsigned integer linear PCM
"PCM_F": Floating point linear PCM
"FLAC": Flac, Free Lossless Audio Codec
"ULAW": Mu-law
"ALAW": A-law
"MP3": MP3, MPEG-1 Audio Layer III
"VORBIS": OGG Vorbis
"AMR_NB": Adaptive Multi-Rate
"AMR_WB": Adaptive Multi-Rate Wideband
"OPUS": Opus
"GSM": GSM-FR
"UNKNOWN": None of the above

2. Extract features

torchaudio covers the following feature-related transforms and functions:

Spectrogram
GriffinLim
Mel Filter Bank
MelSpectrogram
MFCC
Pitch
Kaldi Pitch (beta)

2.1 Short-time Fourier transform

Taking Spectrogram as an example, the following code extracts short-time Fourier transform (STFT) features:

n_fft = 1024
win_length = None  # None means the window length defaults to n_fft
hop_length = 512
# Short-time Fourier transform (power=2.0 gives the power spectrogram)
transform = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
)
spec = transform(data)
print(spec.shape, spec.dtype)
------------------------------------------------------------
torch.Size([1, 513, 64]) torch.float32

2.2 Conversion and use of PyTorch complex values

The parameter worth noting above is power, the exponent applied to the magnitude spectrogram. According to the official documentation, power must be a float greater than 0, or None. For example, 1.0 returns the magnitude spectrogram, 2.0 returns the power spectrogram, and None returns the complex-valued spectrum.

# When power is None, the returned data type is complex64
transform_complex = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=None,
)
spec_complex = transform_complex(data)
print(spec_complex.shape, spec_complex.dtype)
----------------------------------------------------------
torch.Size([1, 513, 64]) torch.complex64

Since many network layers do not accept complex tensors, you can use torch.view_as_real() to convert them to a pseudo-complex representation: each complex number is split into its real and imaginary parts, which are stored in a new last dimension of size 2.

import torch

spec_real = torch.view_as_real(spec_complex)
print(spec_real.shape, spec_real.dtype)
----------------------------------------------------------------------------------
torch.Size([1, 513, 64, 2]) torch.float32
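The conversion is lossless and reversible with torch.view_as_complex(). The sketch below uses a random complex tensor in place of a real spectrogram; the final channel-first layout is only an assumed example of what a convolutional model might expect, not something prescribed by torchaudio.

```python
import torch

# Random complex tensor with the same shape as the spectrogram above
spec_complex = torch.randn(1, 513, 64, dtype=torch.complex64)

# Pseudo-complex view: last dimension holds (real, imaginary)
spec_real = torch.view_as_real(spec_complex)
real_part = spec_real[..., 0]
imag_part = spec_real[..., 1]

# Round trip back to a complex tensor
restored = torch.view_as_complex(spec_real)

# Assumed layout for a 2D CNN: real and imaginary parts as two input channels
as_channels = spec_real.squeeze(0).permute(2, 0, 1)

print(spec_real.shape, restored.dtype, as_channels.shape)
```

Both view functions return views that share memory with the original tensor, so no data is copied.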

2.3 Inverse transformation of Spectrogram

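API: InverseSpectrogram
InverseSpectrogram operates on complex-valued spectrograms. Older torchaudio releases also accepted a pseudo-complex real tensor whose last dimension is 2; newer releases expect a complex dtype, so such a tensor should first be converted back with torch.view_as_complex().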

inverse_Spect = torchaudio.transforms.InverseSpectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
)
# Original audio
print(data[0, :10])
# Inverse transform of the pseudo-complex matrix; view_as_complex() turns it
# back into a complex tensor, which recent torchaudio releases expect
data_hat = inverse_Spect(torch.view_as_complex(spec_real))
print(data_hat[0, :10])
# Inverse transform of the complex matrix
data_hat2 = inverse_Spect(spec_complex)
print(data_hat2[0, :10])
==========================================================================
tensor([ 0.0000e+00,  0.0000e+00,  0.0000e+00,  3.0518e-05,  3.0518e-05,
         3.0518e-05,  0.0000e+00, -6.1035e-05,  0.0000e+00,  0.0000e+00])
tensor([-2.2737e-13,  1.8190e-12,  2.0181e-12,  3.0518e-05,  3.0518e-05,
         3.0518e-05, -2.2714e-13, -6.1035e-05, -3.5279e-12, -2.7309e-12])
tensor([-2.2737e-13,  1.8190e-12,  2.0181e-12,  3.0518e-05,  3.0518e-05,
         3.0518e-05, -2.2714e-13, -6.1035e-05, -3.5279e-12, -2.7309e-12])

The reconstructed samples match the original audio, with deviations on the order of 1e-12, which is negligible at float32 precision.
