
1 Description

This website presents our solution for speech inpainting using Long Short-Term Memory (LSTM) networks. We designed multi-layer LSTM networks and trained them on two types of speech datasets: four single-speaker datasets and four multi-speaker datasets. Our study investigates the inpainting performance of the proposed models across the different datasets and with varying numbers of LSTM layers, in order to explore how multi-layer LSTM networks affect the prediction of speech samples in terms of perceived audio quality. The quality of the inpainted speech is evaluated through the Mean Opinion Score (MOS) and frequency analysis of spectrograms.

MOS is commonly used in the telecommunications industry to measure the perceived quality of audio and video signals, as defined by ITU-T Recommendation P.800. MOS is a subjective rating on a scale from 1 to 5, with a MOS of 5 being perceived as Excellent, 4 as Good, 3 as Fair, 2 as Poor and 1 as Bad. Two types of MOS are employed in this study, corresponding to Narrow Bandwidth (NB) and Wide Bandwidth (WB) speech signals. NB covers 300Hz to 3400Hz, while WB covers 50Hz to 7000Hz.
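As a small illustration of the scale and the two bands described above, the following Python sketch maps a MOS value to the nearest category label and checks whether a frequency falls inside the NB or WB range. The names mos_label and in_band are hypothetical helpers, and rounding a fractional score to the nearest category is our own simplification for illustration, not part of the recommendation.

```python
# Category labels of the five-point opinion scale described above.
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

# Band edges in Hz, as quoted in the text.
BANDS_HZ = {"NB": (300, 3400), "WB": (50, 7000)}


def mos_label(score: float) -> str:
    """Return the label of the category nearest to a MOS value.

    Rounding to the nearest category is an illustrative assumption.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError("MOS must lie in the range [1, 5]")
    nearest = int(score + 0.5)  # round half up
    return MOS_LABELS[nearest]


def in_band(freq_hz: float, band: str) -> bool:
    """Check whether a frequency lies inside the named band (edges inclusive)."""
    low, high = BANDS_HZ[band]
    return low <= freq_hz <= high


if __name__ == "__main__":
    print(mos_label(4.31))
    print(in_band(5000, "NB"), in_band(5000, "WB"))
```

For example, a 5000Hz component lies inside the WB range but outside the NB range, which is one reason the two MOS variants can differ for the same sample.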

On this web page, we will present sound samples and their corresponding MOS generated with the proposed models.


2 Inpainting process and Model architecture

The speech inpainting process and model architecture are shown in Figure 1. The left side of the figure shows the speech inpainting process, and the right side shows the structure of the LSTM model. Note that only the 5-layer LSTM model is shown.

Inpainting process and model structure
Figure 1 - The speech inpainting process and network structure of the proposed LSTM model. The orange blocks labelled with pre_i on the right side indicate the predicted parts of the speech, i.e., the inpainted speech. Note that the orange blocks have two border types, solid and dashed. A dashed border only marks the position of the inpainted signal relative to window_i, whereas a solid border marks the inpainted result itself.
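The process in Figure 1 can be read as: zero out a gap, then fill it block by block, each block (pre_i) being predicted from the window of samples just before it (window_i). The Python sketch below follows that reading only; it is not our released implementation. The 16 kHz sampling rate, the function names, the window and step sizes, and the hold_last predictor (a placeholder standing in for the trained LSTM) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: the sampling rate is not stated on this page


def gap_to_samples(gap_ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a gap length in milliseconds to a number of samples."""
    return int(round(gap_ms * sample_rate_hz / 1000))


def zero_gap(signal: Sequence[float], start: int, length: int) -> List[float]:
    """Return a copy of the signal with samples [start, start + length) set to zero."""
    out = list(signal)
    end = min(start + length, len(out))
    for i in range(start, end):
        out[i] = 0.0
    return out


def inpaint(
    signal: Sequence[float],
    start: int,
    length: int,
    window: int,
    step: int,
    predict: Callable[[Sequence[float], int], List[float]],
) -> List[float]:
    """Fill the gap block by block.

    Each block is predicted from the `window` samples immediately before it,
    so blocks predicted earlier feed into the context of later blocks.
    `predict(context, n)` must return exactly n samples.
    """
    if start < window:
        raise ValueError("not enough context before the gap")
    out = list(signal)
    end = min(start + length, len(out))
    pos = start
    while pos < end:
        n = min(step, end - pos)
        context = out[pos - window:pos]  # plays the role of window_i
        block = predict(context, n)      # plays the role of pre_i
        if len(block) != n:
            raise ValueError("predictor returned the wrong number of samples")
        out[pos:pos + n] = block
        pos += n
    return out


def hold_last(context: Sequence[float], n: int) -> List[float]:
    """Placeholder predictor (NOT an LSTM): repeat the last context sample."""
    return [context[-1]] * n


if __name__ == "__main__":
    ramp = [float(i) for i in range(20)]
    gapped = zero_gap(ramp, start=10, length=5)
    filled = inpaint(gapped, start=10, length=5, window=4, step=2, predict=hold_last)
    print(gapped[8:16])
    print(filled[8:16])
```

Under the assumed 16 kHz rate, the shortest gap on this page (20ms) spans 320 samples and the longest (1000ms) spans 16000 samples, so errors in early blocks have far more room to accumulate in the long-gap case.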

3 Inpainting results

3.1 Single-speaker Datasets

Single-speaker inpainting results-1000ms
Figure 2 - Inpainting results: the original signal, the signal with gap (1000ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 500ms

    Single-speaker inpainting results-500ms
    Figure 3 - Inpainting results: the original signal, the signal with gap (500ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 200ms

    Single-speaker inpainting results-200ms
    Figure 4 - Inpainting results: the original signal, the signal with gap (200ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 100ms

    Single-speaker inpainting results-100ms
    Figure 5 - Inpainting results: the original signal, the signal with gap (100ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 50ms

    Single-speaker inpainting results-50ms
    Figure 6 - Inpainting results: the original signal, the signal with gap (50ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 40ms

    Single-speaker inpainting results-40ms
    Figure 7 - Inpainting results: the original signal, the signal with gap (40ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 20ms

    Single-speaker inpainting results-20ms
    Figure 8 - Inpainting results: the original signal, the signal with gap (20ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • MOS

    MOS(NB) and MOS(WB) are shown below (single-speaker).

    Gap length   20ms   40ms   50ms   100ms   200ms   500ms   1000ms
    MOS_NB       4.31   4.28   4.27   4.22    4.12    4.00    3.66
    MOS_WB       4.42   4.41   4.40   4.36    4.23    4.00    2.58

3.2 Multi-speaker Datasets

    Multi-speaker inpainting results-1000ms
    Figure 9 - Inpainting results: the original signal, the signal with gap (1000ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 500ms

    Multi-speaker inpainting results-500ms
    Figure 10 - Inpainting results: the original signal, the signal with gap (500ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 200ms

    Multi-speaker inpainting results-200ms
    Figure 11 - Inpainting results: the original signal, the signal with gap (200ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 100ms

    Multi-speaker inpainting results-100ms
    Figure 12 - Inpainting results: the original signal, the signal with gap (100ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 50ms

    Multi-speaker inpainting results-50ms
    Figure 13 - Inpainting results: the original signal, the signal with gap (50ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 40ms

    Multi-speaker inpainting results-40ms
    Figure 14 - Inpainting results: the original signal, the signal with gap (40ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal
  • Gap length = 20ms

    Multi-speaker inpainting results-20ms
    Figure 15 - Inpainting results: the original signal, the signal with gap (20ms), and the inpainted signal, from top to bottom, in the time and frequency domains.
  • Original signal
  • Zeroed signal (Signal with gap)
  • Inpainted signal

  • MOS

    MOS(NB) and MOS(WB) are shown below (multi-speaker).

    Gap length   20ms   40ms   50ms   100ms   200ms   500ms   1000ms
    MOS_NB       4.08   4.07   4.06   4.02    3.29    2.75    2.16
    MOS_WB       3.92   3.85   3.84   3.85    3.44    2.76    2.03
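To compare the two MOS tables in Sections 3.1 and 3.2 directly, the short Python sketch below computes, for each gap length, how much higher the single-speaker score is than the multi-speaker score. The values are copied from the tables on this page; the variable and function names (SINGLE, MULTI, mos_gap) are hypothetical and introduced only for this illustration.

```python
# Gap lengths and MOS values copied from the tables in Sections 3.1 and 3.2.
GAPS_MS = [20, 40, 50, 100, 200, 500, 1000]

SINGLE = {
    "NB": [4.31, 4.28, 4.27, 4.22, 4.12, 4.00, 3.66],
    "WB": [4.42, 4.41, 4.40, 4.36, 4.23, 4.00, 2.58],
}
MULTI = {
    "NB": [4.08, 4.07, 4.06, 4.02, 3.29, 2.75, 2.16],
    "WB": [3.92, 3.85, 3.84, 3.85, 3.44, 2.76, 2.03],
}


def mos_gap(single, multi):
    """Per-gap-length difference (single-speaker minus multi-speaker), rounded to 2 decimals."""
    return [round(s - m, 2) for s, m in zip(single, multi)]


if __name__ == "__main__":
    for band in ("NB", "WB"):
        diffs = mos_gap(SINGLE[band], MULTI[band])
        for gap, diff in zip(GAPS_MS, diffs):
            print(f"MOS_{band} at {gap}ms: single - multi = {diff:+.2f}")
```

In the NB scores the difference stays near 0.2 up to 100ms and grows for gaps of 200ms and longer; this is a simple arithmetic comparison of the reported values, not an additional evaluation.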