Open access peer-reviewed chapter

Methods for Speech Signal Structuring and Extracting Features

Written By

Eugene Fedorov, Tetyana Utkina and Tetiana Neskorodieva

Submitted: 11 March 2022 Reviewed: 23 March 2022 Published: 16 June 2022

DOI: 10.5772/intechopen.104634

From the Edited Volume

Computational Semantics

Edited by George Dekoulis and Jainath Yadav



The preliminary stage of biometric identification is speech signal structuring and feature extraction. For calculating the fundamental tone, the following methods are considered and numerically investigated: the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method, the simplified inverse filter transformation (SIFT) method, the method based on wavelet analysis, the method based on cepstral analysis, and the harmonic product spectrum (HPS) method. For extracting speech signal features, the following methods are considered and numerically investigated: the digital bandpass filter bank, spectral analysis, homomorphic processing, and linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. The largest identification probability (0.98) and the smallest number of coefficients (4) are provided by coding a vocal sound of speech from the TIMIT database based on PRC.


  • speech recognition
  • speech signal structuring and extracting features
  • the digital bandpass filters bank
  • spectral analysis
  • homomorphic processing
  • linear predictive coding

1. Introduction

The following features are most often extracted from a speech signal [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: power features (energy of spectral bands); cepstrum; linear prediction parameters; fundamental tone and formants; mel-frequency cepstral coefficients (MFCC); bark-frequency cepstral coefficients (BFCC); parameters of perceptual linear prediction; parameters of reconsidered perceptual linear prediction.

The following methods are usually used to extract speech signal features [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: digital bandpass filter bank; spectral analysis (Fourier transform, wavelet transform); homomorphic processing; linear predictive coding; MFCC method; BFCC method; perceptual linear prediction; reconsidered perceptual linear prediction.


2. Calculation methods of the fundamental tone

The fundamental tone is calculated by methods based on the analysis of the following signal representations [3]: amplitude-time; spectral (amplitude-frequency); cepstral (amplitude-quefrency); wavelet-spectral (amplitude-time-frequency).

2.1 ACF method

The autocorrelation function (ACF) method searches for the maximum value of the autocorrelation function [3]:

1. For the chosen signal frame of length ΔN, the autocorrelation function R(k) is calculated


2. The value of k at which the autocorrelation function R(k) is maximal is determined; it corresponds to the period extracted from the speech signal


The period of the fundamental tone is defined in the form


where n1 is the minimum length of the fundamental tone period, n1 = inf T_OT, and n2 is the maximum length of the fundamental tone period, n2 = sup T_OT.
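
As a rough illustration of the two steps above, the following Python sketch computes an unnormalized autocorrelation over the admissible lags and returns the lag of its maximum. The function name `acf_pitch_period`, the unnormalized sum, and the synthetic 100 Hz test tone are assumptions made for this example; they are not taken from the chapter's formulas.

```python
import numpy as np

def acf_pitch_period(frame, n1, n2):
    """Return the lag in [n1, n2] that maximizes the frame autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    dn = len(frame)
    lags = np.arange(n1, n2 + 1)
    # Assumed form: R(k) = sum_{m=0}^{dn-1-k} s(m) * s(m + k)
    r = np.array([np.dot(frame[:dn - k], frame[k:]) for k in lags])
    return int(lags[np.argmax(r)])

# Synthetic check: a 100 Hz tone sampled at 8000 Hz repeats every 80 samples.
fs = 8000
frame = np.sin(2 * np.pi * 100 * np.arange(512) / fs)
period = acf_pitch_period(frame, n1=20, n2=160)  # search range 50..400 Hz
f0 = fs / period
```

The bounds n1 and n2 play the role of the admissible period range: lags outside it are never evaluated.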

2.2 AMDF method

The average magnitude difference function (AMDF) method searches for the minimum value of the average magnitude difference [3], which is faster than searching for the maximum value of the autocorrelation function.

  1. For the chosen signal frame of length ΔN, the average magnitude difference function v(k) is calculated


  2. The value of k at which the average magnitude difference function v(k) is minimal is determined; it corresponds to the period extracted from the speech signal


The period of the fundamental tone is defined in the form


where n1 is the minimum length of the fundamental tone period, n1 = inf T_OT, and n2 is the maximum length of the fundamental tone period, n2 = sup T_OT.
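
The AMDF search can be sketched in the same way. In the sketch below the difference is averaged over the overlapping part of the frame, which is one common normalization and an assumption here; `amdf_pitch_period` is a hypothetical helper name.

```python
import numpy as np

def amdf_pitch_period(frame, n1, n2):
    """Return the lag in [n1, n2] that minimizes the average magnitude difference."""
    frame = np.asarray(frame, dtype=float)
    dn = len(frame)
    lags = np.arange(n1, n2 + 1)
    # Assumed form: v(k) = mean over m of |s(m) - s(m + k)| on the overlap
    v = np.array([np.mean(np.abs(frame[:dn - k] - frame[k:])) for k in lags])
    return int(lags[np.argmin(v)])

# Synthetic check: a 100 Hz tone sampled at 8000 Hz repeats every 80 samples.
fs = 8000
frame = np.sin(2 * np.pi * 100 * np.arange(512) / fs)
period = amdf_pitch_period(frame, n1=40, n2=120)
```

Because v(k) is also close to zero at integer multiples of the true period, the search range in this example is kept narrow enough to contain a single period.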

2.3 SIFT method

The simplified inverse filter transformation (SIFT) method searches for the maximum value of the autocorrelation function of the linear prediction error of the decimated signal [4]:

  1. For the chosen signal frame of length ΔN, the frequency range containing the fundamental tone frequency is extracted by means of an elliptic LPF with cutoff frequency fcut = 1000 Hz. Instead of the elliptic LPF used in [4], the following sequence of calculations is proposed:

    • DFT (discrete Fourier transform)


    • extraction of the lower frequencies


      where fd is the sampling frequency;

    • calculation of the inverse DFT


  2. The sampling frequency is reduced to f1d = 2000 Hz by decimating the signal, i.e. intermediate samples of the signal are removed


    where Δn = fd/f1d is the decimation coefficient and fd is the sampling frequency.

  3. The differences between adjacent samples of the decimated signal are calculated


  4. The autocorrelation function is calculated


    where w(m) is the Hamming window, p is the order of linear prediction, ceil(f1d/1000) ≤ p ≤ 5 + ceil(f1d/1000), and ceil(f) is the function that rounds f up to the next integer.

  5. The LPC coefficients a_j are calculated according to the Durbin procedure.

  6. The linear prediction error is calculated by means of the LPC coefficients


    where e(n) is the prediction error.

  7. The autocorrelation function of the linear prediction error is calculated


    where w(m) is the Hamming window.

  8. The value of k at which the autocorrelation function r(k) is maximal is determined; it corresponds to the period extracted from the speech signal


where n1 is the minimum length of the fundamental tone period, n1 = inf T_OT, and n2 is the maximum length of the fundamental tone period, n2 = sup T_OT.

Thus, the length of the fundamental tone period is determined in the form


where γ is the threshold value.
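
Steps 1 and 2 of the procedure above (low-pass filtering through the DFT, then decimation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `lowpass_and_decimate` is hypothetical, and rounding the decimation coefficient fd/f1d to an integer is an assumption needed when the ratio is not whole (for fd = 22050 Hz it is about 11.03).

```python
import numpy as np

def lowpass_and_decimate(frame, fd, fcut=1000.0, f1d=2000.0):
    """Zero all DFT bins above fcut, invert the DFT, then keep every step-th sample."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fd)
    spectrum[freqs > fcut] = 0.0                  # extraction of the lower frequencies
    filtered = np.fft.irfft(spectrum, n=len(frame))
    step = int(round(fd / f1d))                   # assumption: integer decimation step
    return filtered[::step], fd / step

# Bin-centred test tones for a 512-sample frame at 22050 Hz.
fd = 22050
m = np.arange(512)
low_tone = np.sin(2 * np.pi * 5 * m / 512)        # about 215 Hz, kept
high_tone = np.sin(2 * np.pi * 100 * m / 512)     # about 4307 Hz, removed
dec_low, new_rate = lowpass_and_decimate(low_tone, fd)
dec_high, _ = lowpass_and_decimate(high_tone, fd)
```

The remaining steps (differencing, Durbin recursion, and the autocorrelation of the prediction error) operate on the decimated output.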

Example 1

Figure 1 shows the source signal, Figure 2 the noisy signal (additive white Gaussian noise with mean 0 and variance 0.001 is added), and Figure 3 the filtered signal, with M = 1.

Figure 1.

Initial signal.

Figure 2.

The filtered signal.

Figure 3.

The decimated signal.

The signal is a frame of the sound “A” of length ΔN = 512 with sampling frequency fd = 22050 Hz, 8 bits, mono. Figures 1–6 show the initial signal (Figure 1), the filtered signal (Figure 2), the decimated signal (Figure 3), the signal in the form of the weighted difference (Figure 4), the prediction error (Figure 5), and the autocorrelation function of the prediction error with the found maximum and the admissible boundaries marked (Figure 6).

Figure 4.

A signal in the form of the weighted difference.

Figure 5.

Prediction error.

Figure 6.

Autocorrelation function of the prediction error.

2.4 Method based on wavelet analysis

This method calculates the distance between adjacent minima of the wavelet coefficients.

At the first stage, the continuous wavelet transform is calculated; it is approximated according to the rectangle formula in the form


where μ is the decomposition level at which a smooth sinusoid is reached, N is the signal length, and Δt is the quantization step.

For the Morlet wavelet


Since the sequence d_μ(l) represents a smooth sinusoid, there is no need to use the autocorrelation function or the average magnitude difference function, both of which have considerable computational complexity. Instead, at the second stage, two consecutive maxima are found in the sequence d_μ(l), and the difference between them is calculated in the form


The period of the fundamental tone is defined in the form


where n1 is the minimum length of the fundamental tone period, n1 = inf T_OT, and n2 is the maximum length of the fundamental tone period, n2 = sup T_OT.
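
The two stages can be illustrated with a short Python sketch. Several choices here are assumptions rather than the chapter's definitions: the real Morlet form exp(−t²/2)·cos(ω0·t) with ω0 = 5, an explicit scale parameter in seconds instead of the decomposition level μ, and the helper names `morlet_response` and `period_from_maxima`.

```python
import numpy as np

def morlet_response(signal, fs, scale, omega0=5.0, half_width=4.0):
    """Wavelet coefficients at one scale, approximated by a rectangle-rule sum."""
    half = int(np.ceil(half_width * scale * fs))
    t = np.arange(-half, half + 1) / fs
    # Assumed real Morlet wavelet, sampled on the signal grid
    kernel = np.exp(-0.5 * (t / scale) ** 2) * np.cos(omega0 * t / scale)
    kernel *= (1.0 / fs) / np.sqrt(scale)         # dt / sqrt(a) weighting
    # "valid" keeps only positions where the wavelet lies fully inside the signal
    return np.convolve(signal, kernel, mode="valid")

def period_from_maxima(d):
    """Distance in samples between the first two local maxima (0 if fewer than two)."""
    inner = d[1:-1]
    peaks = np.flatnonzero((inner > d[:-2]) & (inner > d[2:])) + 1
    if len(peaks) < 2:
        return 0
    return int(peaks[1] - peaks[0])

# Synthetic check: a 100 Hz tone at 8000 Hz; the scale is tuned so that the
# wavelet's centre frequency omega0 / (2*pi*scale) equals 100 Hz.
fs = 8000
signal = np.sin(2 * np.pi * 100 * np.arange(2048) / fs)
scale = 5.0 / (2 * np.pi * 100)
d = morlet_response(signal, fs, scale)
period = period_from_maxima(d)
```

For a real recording, the scale would have to be chosen so that the response is a smooth sinusoid, which corresponds to the choice of decomposition level described in the text.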

Example 2

Figure 7 shows the sound “A”, and Figure 8 shows the sound “A” at decomposition level μ = 50.

Figure 7.

Sound “A” for wavelet analysis.

Figure 8.

A sound “A” at the 50th level of decomposition (frequency range is 51–250 Hz).

2.5 Method based on the cepstral analysis

This method searches for the maximum value of the cepstrum [3].

  1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT


  2. The cepstrum is calculated using the inverse DFT


  3. The value of n at which the cepstrum s(n) is maximal is determined; it corresponds to the period extracted from the speech signal


where n1 is the minimum length of the fundamental tone period, n1 = inf T_OT, and n2 is the maximum length of the fundamental tone period, n2 = sup T_OT.

The period of the fundamental tone is defined in the form


where γ is the threshold value.
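
A minimal Python sketch of the three steps follows. It assumes the real cepstrum of the log power spectrum, a small constant `eps` to avoid taking the logarithm of zero, and the reading that a peak below the threshold γ yields no period (returned as 0). None of these details are confirmed by the text, and `cepstral_pitch_period` is a hypothetical name.

```python
import numpy as np

def cepstral_pitch_period(frame, n1, n2, gamma=0.0, eps=1e-12):
    """Return the quefrency in [n1, n2] with the largest cepstral value, or 0."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.fft(frame)                              # step 1: DFT
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    cepstrum = np.fft.ifft(log_power).real                    # step 2: inverse DFT
    n_best = n1 + int(np.argmax(cepstrum[n1:n2 + 1]))         # step 3: peak search
    # Assumed thresholding rule: reject weak peaks
    return n_best if cepstrum[n_best] >= gamma else 0

# Synthetic check: an impulse train with a period of 64 samples.
frame = np.zeros(512)
frame[::64] = 1.0
period = cepstral_pitch_period(frame, n1=20, n2=100)
rejected = cepstral_pitch_period(frame, n1=20, n2=100, gamma=10.0)
```

The threshold is what makes this method sensitive to tuning: too low and unvoiced frames produce spurious periods, too high and voiced frames are rejected.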

Example 3

The signal is a frame of the sound “A” of length ΔN = 512 with sampling frequency fd = 22050 Hz, 8 bits, mono. Figure 9 shows the initial signal, and Figure 10 shows the cepstrum of the signal.

Figure 9.

Initial signal for cepstrum analysis.

Figure 10.

Cepstrum of a sound “A”.

2.6 HPS method

The harmonic product spectrum (HPS) method searches for the maximum value of the product of harmonics of the decimated power spectrum [3].

  1. For the chosen signal frame of length ΔN, the spectrum is calculated using the DFT


  2. The power spectrum of the signal is calculated


  3. The power spectrum of the signal is decimated Z times, i.e. intermediate frequencies of the power spectrum are removed


    where ⌊·⌋ denotes the integer part of a number.

  4. The product of harmonics of the decimated power spectrum is calculated


  5. The value of k at which the product of harmonics of the decimated power spectrum is maximal is determined; it corresponds to the period extracted from the speech signal


The frequency of the fundamental tone is determined in the form


where k1 is the minimum frequency of the fundamental tone, k1 = inf F_OT, and k2 is the maximum frequency of the fundamental tone, k2 = sup F_OT.
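
The HPS steps can be sketched in Python as below. The conversion of the frequency bounds to bin indices, the default Z = 3, and the name `hps_pitch` are assumptions made for this illustration.

```python
import numpy as np

def hps_pitch(frame, fd, f_min, f_max, Z=3):
    """Estimate the fundamental frequency in Hz from the harmonic product spectrum."""
    frame = np.asarray(frame, dtype=float)
    dn = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2       # steps 1-2: power spectrum
    n_bins = len(power) // Z                      # bins for which Z*k stays in range
    product = np.ones(n_bins)
    for z in range(1, Z + 1):                     # steps 3-4: decimate and multiply
        product *= power[::z][:n_bins]
    k1 = max(1, int(np.ceil(f_min * dn / fd)))
    k2 = min(n_bins - 1, int(np.floor(f_max * dn / fd)))
    k = k1 + int(np.argmax(product[k1:k2 + 1]))   # step 5: peak search
    return k * fd / dn

# Synthetic check: three harmonics of DFT bin 8, i.e. 8 * 8000 / 512 = 125 Hz.
fs = 8000
m = np.arange(512)
frame = sum(np.sin(2 * np.pi * 8 * h * m / 512) / h for h in (1, 2, 3))
f0 = hps_pitch(frame, fs, f_min=50, f_max=400)
```

The number of decimations Z must not exceed the number of harmonics actually present in the band, otherwise the product at the true fundamental collapses.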

The SIFT, ACF, AMDF, and cepstral-analysis methods depend on the noise level.

The HPS and wavelet-analysis methods are resistant to noise.

The SIFT and cepstral-analysis methods require a threshold to be set.

The wavelet-analysis method requires the decomposition level to be set.

The HPS method requires the number of decimations to be set.


3. Calculation method of linear prediction parameters

The linear predictive coding method uses an amplifier and a digital filter (Figure 11).

Figure 11.

The block diagram of the simplified model of signal formation.

Thus, the speech signal can be represented as the output of a linear system with time-varying parameters excited by quasiperiodic impulses or random noise.

The transfer function H(z) of a linear system with variable parameters is considered as the ratio of the output signal spectrum S(z) to the input signal spectrum U(z)


where A(z) is the inverse filter for the system H(z), G is the gain coefficient, and p is the prediction order (filter order).

The input signal u(n) is represented by a pulse sequence and noise. The model has the following parameters: the gain coefficient G and the digital filter coefficients a_k. All these parameters change slowly in time and can be estimated frame by frame.

In this method, linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), and log area ratio (LAR) coefficients are used as features [3].

  1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which falls off steeply at high frequencies, is equalized by means of an LPF


    where α is the filtration parameter, 0 < α < 1.

  2. For the n-th frame, the autocorrelation function R_n(k) is calculated


    where w(m) is the Hamming window, p is the order of linear prediction, ceil(fd/1000) ≤ p ≤ 5 + ceil(fd/1000), and ceil(f) is the function that rounds f up to the next integer.

  3. For the n-th frame, the linear prediction coefficients (LPC) a_nj and the reflection coefficients (RC) k_ni are calculated according to the Durbin procedure.

  4. For the n-th frame, the gain coefficient G_n is calculated


  5. For the n-th frame, the linear prediction cepstral coefficients (LPCC) are calculated


  6. For the n-th frame, the log area ratio (LAR) coefficients are calculated
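
Steps 2 to 6 can be sketched in Python under explicit conventions, all of which are assumptions rather than the chapter's formulas: the predictor is s(n) ≈ Σ a_j·s(n−j), so that A(z) = 1 − Σ a_j·z^(−j); the squared gain equals the final prediction error of the Durbin recursion; the LPCC follow the usual recursion for G/A(z); and the LAR are taken as ln((1 − k_i)/(1 + k_i)). All function names are hypothetical.

```python
import numpy as np

def autocorrelation(frame, p):
    """Autocorrelation R(0..p) of a Hamming-windowed frame."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Durbin recursion: returns LPC a[1..p], reflection k[1..p], final error."""
    a = np.zeros(p + 1)
    k = np.zeros(p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i] = acc / err
        a_new = a.copy()
        a_new[i] = k[i]
        a_new[1:i] = a[1:i] - k[i] * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k[i] ** 2
    return a[1:], k[1:], err

def lpc_to_cepstrum(a):
    """LPCC c[1..p] from LPC: c_m = a_m + sum_{k<m} (k/m) c_k a_{m-k}."""
    p = len(a)
    c = np.zeros(p)
    for m in range(1, p + 1):
        acc = a[m - 1]
        for k in range(1, m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

def log_area_ratio(k):
    """LAR from reflection coefficients (sign convention is an assumption)."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))

# Check on the autocorrelation of a first-order process, R(k) = 0.5**k.
r = 0.5 ** np.arange(3)
a, k, err = levinson_durbin(r, 2)
gain = np.sqrt(err)                 # assumed: G^2 equals the final error
lpcc = lpc_to_cepstrum(a)
lar = log_area_ratio(k)
```

For this test sequence the recursion recovers a single nonzero coefficient, which makes the result easy to check by hand.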


4. Method for calculating formants

For the n-th frame, the logarithmic power spectrum is calculated using the gain coefficient and the linear prediction coefficients (LPC) [3, 4]


For person identification or speech recognition, the analysis of vocalized sounds is limited to the frequency range from 0 to 3 kHz, and the first three formants F1, F2, F3 are used. For speech synthesis, the frequency range is limited to 0 to 4–5 kHz, and the first five formants F1, F2, F3, F4, F5 are used.
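
The spectrum and the formant search can be illustrated as follows. The sketch assumes the usual LPC spectrum 10·log10(G²/|A(e^(jω))|²) with A(z) = 1 − Σ a_j·z^(−j), and it treats formants simply as local maxima of that spectrum; the helper names and the two-pole test filter are hypothetical.

```python
import numpy as np

def lpc_log_power_spectrum(a, gain, fd, n_fft=1024):
    """Log power spectrum (dB) of the all-pole model G / A(z) on an FFT grid."""
    inv_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    A = np.fft.rfft(inv_filter, n=n_fft)          # A(z) evaluated on the unit circle
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fd)
    log_power = 10.0 * np.log10(gain ** 2 / np.abs(A) ** 2)
    return freqs, log_power

def pick_formants(freqs, log_power, count=3):
    """Frequencies of the first `count` local maxima of the spectrum."""
    inner = log_power[1:-1]
    is_peak = (inner > log_power[:-2]) & (inner > log_power[2:])
    return freqs[1:-1][is_peak][:count]

# Test model: one resonance with pole radius 0.95 at 1000 Hz, fd = 8000 Hz.
fd = 8000
radius, theta = 0.95, 2 * np.pi * 1000 / fd
a = [2 * radius * np.cos(theta), -radius ** 2]
freqs, log_power = lpc_log_power_spectrum(a, gain=1.0, fd=fd)
formants = pick_formants(freqs, log_power)
```

With a higher prediction order the model has more pole pairs, so more maxima appear in the spectrum; this is why the choice of p matters for formant extraction.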

Example 4

Figure 12 shows the logarithmic power spectrum of the central frame of the sound “A” for different prediction orders, with frame length N = 512 and sampling frequency fd = 22050 Hz.

Figure 12.

The logarithmic LPC power spectrum of the sound “A” at different prediction orders p.

As Figure 12 shows, formants (maxima in the spectrum) can be extracted already at p = 30.

Example 5

Figure 13 shows the sound “A”, and Figure 14 shows its logarithmic LPC power spectrum. Figure 15 shows the central frame of the sound “Sh”, and Figure 16 shows its logarithmic LPC power spectrum. In both cases the frame length is N = 512, the sampling frequency is fd = 22050 Hz (8 bits, mono), and the prediction order is p = 30.

Figure 13.

Sound “A”.

Figure 14.

Logarithmic LPC power spectrum of the sound “A” at prediction order p = 30.

Figure 15.

Sound “Sh”.

Figure 16.

Logarithmic LPC power spectrum of the sound “Sh” at prediction order p = 30.


5. Method of mel-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses mel-frequency cepstral coefficients (MFCC) as features [5, 6].

  1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which falls off steeply at high frequencies, is equalized by means of an LPF


    where α is the filtration parameter, 0 < α < 1.

  2. For the n-th frame, the spectrum is calculated using the DFT


    where w(m) is the Hamming window.

  3. For the n-th frame, the energy of the i-th mel-frequency band is calculated using the frequency transformation and the Bartlett window


    where E_im is the energy of the m-th mel-frequency band, w_m(k) is the Bartlett window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B⁻¹(b) is the function that converts frequency in mel to frequency in Hz, f_m is the normalized frequency, fmin and fmax are the minimum and maximum frequencies in Hz (for example, fmin = 0, fmax = fd/2), fd is the sampling frequency of the speech signal in Hz, and P is the number of mel-frequency bands.

  4. For the n-th frame, the mel-frequency cepstral coefficients (MFCC) are calculated using the inverse discrete cosine transform DCT-2


    where P̃ is the number of mel-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.
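
The four steps can be sketched in Python as follows. Several details are assumptions, since the chapter's equations are not reproduced here: the mel mapping B(f) = 1127·ln(1 + f/700), the equalizing filter in the common first-difference form s(m) − α·s(m−1) with α = 0.97, triangular (Bartlett) bands spaced uniformly on the mel axis, and an unnormalized DCT-II. The defaults P = 20 and 13 coefficients are the values reported later for the experiments.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log1p(f / 700.0)           # assumed form of B(f)

def mel_to_hz(b):
    return 700.0 * np.expm1(b / 1127.0)           # inverse mapping B^-1(b)

def mel_filterbank(P, n_fft, fd, f_min=0.0, f_max=None):
    """P triangular (Bartlett) windows spaced uniformly on the mel axis."""
    f_max = fd / 2.0 if f_max is None else f_max
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), P + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fd)
    bank = np.zeros((P, len(freqs)))
    for i in range(P):
        lo, mid, hi = edges[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, fd, P=20, n_coeffs=13, alpha=0.97):
    """MFCC of one frame: filter, window, DFT, mel energies, log, DCT-II."""
    frame = np.asarray(frame, dtype=float)
    # Assumed first-difference form of the equalizing filter
    emphasized = np.append(frame[0], frame[1:] - alpha * frame[:-1])
    windowed = emphasized * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    energies = mel_filterbank(P, len(frame), fd) @ power
    log_e = np.log(energies + 1e-12)              # small constant avoids log(0)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), np.arange(P) + 0.5) / P)
    return basis @ log_e

fs = 8000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / fs)
coeffs = mfcc(frame, fs)
```

Libraries differ in their mel formula, filter normalization, and DCT scaling, so coefficients from this sketch are not expected to match any particular toolkit numerically.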


6. Method of bark-frequency cepstral coefficients calculation

This method is based on homomorphic processing and uses bark-frequency cepstral coefficients (BFCC) as features [7, 8].

  1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT


    where w(m) is the Hamming window.

  2. The number of bark-frequency bands is calculated


    where ceil(f) is the function that rounds f up to the next integer, fd is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

  3. For the n-th frame, the energy of the bark-frequency bands is calculated


    where E_im is the energy of the i-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

  4. For the n-th frame, equal-loudness weighting is applied to the energy of the bark-frequency bands


    where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since hearing sensitivity differs across frequencies) and B⁻¹(b) is the function that converts frequency in bark to frequency in Hz.

  5. For the n-th frame, the intensity-loudness power law is applied to the energy of the bark-frequency bands


  6. For the n-th frame, the bark-frequency cepstral coefficients (BFCC) are calculated using the inverse discrete cosine transform DCT-2; beforehand, the energies Ẽ_n,0 and Ẽ_n,P−1 must be replaced with the energies Ẽ_n,1 and Ẽ_n,P−2, respectively


    where P̃ is the number of bark-frequency cepstral coefficients, 1 ≤ P̃ ≤ P.


7. Method for calculating perceptual linear prediction parameters

In this method, perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), and perceptual log area ratio (PLAR) coefficients are used as features [9, 10].

  1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum is calculated using the DFT


    where w(m) is the Hamming window.

  2. The number of bark-frequency bands is calculated


    where ceil(f) is the function that rounds f up to the next integer, fd is the sampling frequency of the speech signal in Hz, and B(f) is the function that converts frequency in Hz to frequency in bark.

  3. For the n-th frame, the energy of the bark-frequency bands is calculated


    where E_im is the energy of the i-th bark-frequency band and w_m(k) is the trapezoidal window for the m-th band.

  4. For the n-th frame, equal-loudness weighting is applied to the energy of the bark-frequency bands


    where v(f) is the equal-loudness weighting function (it approximates human auditory perception, since hearing sensitivity differs across frequencies) and B⁻¹(b) is the function that converts frequency in bark to frequency in Hz.

  5. For the n-th frame, the intensity-loudness power law is applied to the energy of the bark-frequency bands


  6. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT; beforehand, the energies Ẽ_n,0 and Ẽ_n,P−1 must be replaced with the energies Ẽ_n,1 and Ẽ_n,P−2, respectively


    where p is the order of linear prediction, ceil(fd/1000) ≤ p ≤ 5 + ceil(fd/1000), and ceil(f) is the function that rounds f up to the next integer.

  7. For the n-th frame, the perceptual linear prediction coefficients (PLPC) a_nj and the perceptual reflection coefficients (PRC) k_ni are calculated according to the Durbin procedure.

  8. For the n-th frame, the gain coefficient G_n is calculated


  9. For the n-th frame, the perceptual linear prediction cepstral coefficients (PLPCC) are calculated


  10. For the n-th frame, the perceptual log area ratio (PLAR) coefficients are calculated
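
Steps 2 and 4 to 6 can be sketched in Python using the formulas of Hermansky's PLP analysis [10], which are assumed here to match the chapter's lost equations: B(f) = 6·asinh(f/600), the rational equal-loudness curve, and a cube-root intensity-loudness law. The band count ceil(B(fd/2)) + 1 and the uniform spacing of band centres on the bark axis are further assumptions; the former reproduces the 17 bands reported for fd = 8 kHz in the experiments.

```python
import numpy as np

def hz_to_bark(f):
    return 6.0 * np.arcsinh(f / 600.0)            # assumed form of B(f)

def bark_to_hz(b):
    return 600.0 * np.sinh(b / 6.0)               # inverse mapping B^-1(b)

def num_bark_bands(fd):
    """Assumed band count: ceil(B(fd/2)) + 1."""
    return int(np.ceil(hz_to_bark(fd / 2.0))) + 1

def equal_loudness(f):
    """Equal-loudness weighting as a function of frequency in Hz."""
    w2 = (2.0 * np.pi * f) ** 2
    return (w2 + 56.8e6) * w2 ** 2 / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def plp_autocorrelation(band_energy, fd, p):
    """Autocorrelation R(0..p) from bark-band energies of one frame."""
    band_energy = np.asarray(band_energy, dtype=float)
    P = len(band_energy)
    centers_hz = bark_to_hz(np.linspace(0.0, hz_to_bark(fd / 2.0), P))
    loud = (equal_loudness(centers_hz) * band_energy) ** (1.0 / 3.0)
    loud[0], loud[-1] = loud[1], loud[-2]         # replace the edge bands
    # irfft treats `loud` as half of an even-symmetric spectrum,
    # so the result is a real autocorrelation sequence
    return np.fft.irfft(loud)[:p + 1]

fd = 8000
P = num_bark_bands(fd)
r = plp_autocorrelation(np.ones(P), fd, p=4)
```

The resulting values R(0..p) are what the Durbin procedure of step 7 takes as input.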


8. Method for calculating reconsidered perceptual linear prediction parameters

In this method, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients are used as features [7, 8].

  1. The signal s(m) is split into L frames of length ΔN. For the n-th frame, the spectrum, which falls off steeply at high frequencies, is equalized by means of an LPF


    where α is the filtration parameter, 0 < α < 1.

  2. For the n-th frame, the spectrum is calculated using the DFT


    where w(m) is the Hamming window.

  3. For the n-th frame, the energy of the i-th mel-frequency band is calculated using the frequency transformation and the Bartlett window


    where E_im is the energy of the m-th mel-frequency band, w_m(k) is the Bartlett window for the m-th band, B(f) is the function that converts frequency in Hz to frequency in mel, B⁻¹(b) is the function that converts frequency in mel to frequency in Hz, f_m is the normalized frequency, fmin and fmax are the minimum and maximum frequencies in Hz (for example, fmin = 0, fmax = fd/2), fd is the sampling frequency of the speech signal in Hz, and P is the number of mel-frequency bands.

  4. For the n-th frame, the values of the autocorrelation function are calculated using the inverse DFT


    where p is the order of linear prediction, ceil(fd/1000) ≤ p ≤ 5 + ceil(fd/1000), and ceil(f) is the function that rounds f up to the next integer.

  5. For the n-th frame, the reconsidered perceptual linear prediction coefficients (RPLPC) a_nj and the reconsidered perceptual reflection coefficients (RPRC) k_ni are calculated according to the Durbin procedure.

  6. For the n-th frame, the gain coefficient G_n is calculated


  7. For the n-th frame, the reconsidered perceptual linear prediction cepstral coefficients (RPLPCC) are calculated


  8. For the n-th frame, the reconsidered perceptual log area ratio (RPLAR) coefficients are calculated


9. The performance comparison of various features for person identification

For the speech signals containing vocal sounds, a sampling frequency of 8 kHz and 256 quantization levels were used. The sample length of a vocal sound is 256.

Table 1 presents the results of a numerical study of the LPC, RC, LPCC, LAR, MFCC, BFCC, PLPC, PRC, PLPCC, PLAR, RPLPC, RPRC, RPLPCC, and RPLAR coefficients obtained by the coding methods and used for biometric identification of people from the TIMIT database on vocal sounds by means of Gaussian mixture models (GMM).

Coefficient type        Identification probability    Number of coefficients
LAR coefficients        0.82                          12
PLAR coefficients       0.84                          4
RPLAR coefficients      0.83                          12

Table 1.

Numerical research results of the coefficients used for personality biometric identification.

For the coding methods used to analyze the speech signal, the filter order is 12 for linear prediction, 4 for perceptual linear prediction, and 12 for reconsidered perceptual linear prediction; the number of mel-frequency bands is 20, the number of bark-frequency bands is 17, and the number of subband-based cepstral parameters is 13.

The results presented in Table 1 show that the largest identification probability and the smallest number of coefficients are provided by coding a vocal sound of speech based on PRC.


10. Conclusion

The preliminary stage of biometric identification is speech signal structuring and feature extraction.

For calculating the fundamental tone, the following digital signal processing methods were considered and numerically investigated: the ACF (autocorrelation function) method, the AMDF (average magnitude difference function) method, the SIFT (simplified inverse filter transformation) method, the method based on wavelet analysis, the method based on cepstral analysis, and the HPS (harmonic product spectrum) method. For extracting speech signal features, the following digital signal processing methods were considered and numerically investigated: the digital bandpass filter bank; spectral analysis (Fourier transform, wavelet transform); homomorphic processing; linear predictive coding. These methods make it possible to extract linear prediction coefficients (LPC), reflection coefficients (RC), linear prediction cepstral coefficients (LPCC), log area ratio (LAR) coefficients, mel-frequency cepstral coefficients (MFCC), bark-frequency cepstral coefficients (BFCC), perceptual linear prediction coefficients (PLPC), perceptual reflection coefficients (PRC), perceptual linear prediction cepstral coefficients (PLPCC), perceptual log area ratio (PLAR) coefficients, reconsidered perceptual linear prediction coefficients (RPLPC), reconsidered perceptual reflection coefficients (RPRC), reconsidered perceptual linear prediction cepstral coefficients (RPLPCC), and reconsidered perceptual log area ratio (RPLAR) coefficients. Results of a numerical study of speech feature extraction methods were obtained for voice signals of people from the TIMIT (Texas Instruments and Massachusetts Institute of Technology) database. The PRC features proved to be the most effective.


  1. Oppenheim AV, Schafer RW. Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice Hall; 2010. p. 1108
  2. Mallat S. A Wavelet Tour of Signal Processing: The Sparse Way. Burlington, MA: Academic Press; 2008. p. 832. DOI: 10.1016/B978-0-12-374370-1.X0001-8
  3. Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ: Pearson Higher Education; 2011. p. 1042
  4. Markel JD, Gray AH. Linear Prediction of Speech. Berlin: Springer Verlag; 1976. p. 382
  5. Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1980;28(4):357-366
  6. Ganchev T, Fakotakis N, Kokkinakis G. Comparative evaluation of various MFCC implementations on the speaker verification task. In: Proceedings of SPECOM 2005. Vol. 1. Patras, Greece; 2005. pp. 191-194
  7. Josef R, Pollak P. Modified feature extraction methods in robust speech recognition. In: Proceedings of the 17th IEEE International Conference Radioelektronika. Brno, Czech Republic: IEEE; 2007. pp. 1-4
  8. Kumar P, Biswas A, Mishra AN, Chandra M. Spoken language identification using hybrid feature extraction methods. Journal of Telecommunications. 2010;1(2):11-15
  9. Huang X, Acero A, Hon H-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall; 2001. p. 980
  10. Hermansky H. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America. 1990;87(4):1738-1752. DOI: 10.1121/1.399423
