1 Description

This website presents our solution for speech inpainting using Long Short-Term Memory (LSTM) networks. We designed multi-layer LSTM networks and trained them on two types of speech datasets: four single-speaker datasets and four multi-speaker datasets. Our study investigates the inpainting performance of the proposed models across the different datasets and numbers of LSTM layers, in order to explore how multi-layer LSTM networks affect the prediction of speech samples in terms of perceived audio quality. The quality of the inpainted speech is evaluated with the Mean Opinion Score (MOS) and a frequency analysis of the spectrogram.

MOS is commonly used in the telecommunications industry to measure the perceived quality of audio and video signals, as defined by ITU-T Recommendation P.800. MOS is a subjective rating on a scale from 1 to 5, where 5 is perceived as Excellent, 4 as Good, 3 as Fair, 2 as Poor and 1 as Bad. Two types of MOS are employed in this study, one for Narrow Bandwidth (NB) and one for Wide Bandwidth (WB) speech signals. NB covers 300Hz to 3400Hz, while WB covers 50Hz to 7000Hz.
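As a small illustration of the rating scale and the two bands described above, the sketch below maps a mean score to its nearest category and checks whether a frequency falls inside a band. The names `nearest_label` and `in_band` are hypothetical, and rounding a mean score to the nearest category is an illustrative assumption, not part of P.800.

```python
# Category labels of the 1-to-5 MOS scale, as listed in the text above.
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

# Band edges in Hz, as stated in the text above.
BANDS_HZ = {"NB": (300, 3400), "WB": (50, 7000)}


def nearest_label(mos: float) -> str:
    """Map a mean score to the nearest category (illustrative assumption)."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie in [1, 5]")
    # Round half up, then clamp to the top category.
    return MOS_LABELS[min(5, int(mos + 0.5))]


def in_band(freq_hz: float, band: str) -> bool:
    """Return True if freq_hz lies inside the named band (NB or WB)."""
    low, high = BANDS_HZ[band]
    return low <= freq_hz <= high
```

For example, a 5000Hz component lies inside the WB range but outside the NB range, so only a WB measure can reflect it.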

This page presents sound samples generated with the proposed models, together with their corresponding MOS.

2 Inpainting process and Model architecture

The speech inpainting process and model architecture are shown in Figure 1. The left side of the figure shows the speech inpainting process, and the right side shows the structure of the LSTM model. Note that only a 5-layer LSTM model is shown.

Figure 1 - The speech inpainting process and network structure of the proposed LSTM model. The orange blocks labelled pre_i on the right side indicate the predicted parts of the speech, i.e., the inpainted speech. Note that the orange blocks have two border types, solid and dashed. A dashed border only indicates the position of the inpainted signal relative to window_i, while a solid border marks the inpainted result itself.
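One plausible reading of Figure 1 is a sliding-window scheme in which each window_i of preceding samples is fed to the model, which returns a predicted block pre_i that is written into the gap. The sketch below shows only that loop, under this assumption. The function `inpaint_gap`, its parameters, and the placeholder predictor `hold_last` are hypothetical. A trained LSTM would take the place of `hold_last`, and the actual window and block sizes are not given on this page.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], List[float]]


def inpaint_gap(samples: Sequence[float], gap_start: int, gap_end: int,
                window_len: int, predict: Predictor) -> List[float]:
    """Fill samples[gap_start:gap_end] block by block from preceding context."""
    out = list(samples)
    pos = gap_start
    while pos < gap_end:
        # window_i: the samples immediately before the current position,
        # including any blocks that were already predicted.
        window = out[max(0, pos - window_len):pos]
        block = predict(window)  # pre_i
        if not block:
            raise ValueError("predictor returned an empty block")
        n = min(len(block), gap_end - pos)  # do not write past the gap
        out[pos:pos + n] = block[:n]
        pos += n
    return out


def hold_last(window: Sequence[float]) -> List[float]:
    """Placeholder predictor (not an LSTM): repeat the last context sample."""
    return [window[-1]] if window else [0.0]
```

Because each predicted block becomes part of the next window, prediction errors can accumulate over the gap in a loop of this kind.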

3 Inpainting results

3.1 Single-speaker Datasets

The context (transcript) of the audio signal is as follows. Note that the gap starts at 1.62 seconds and lasts for 1000ms, 500ms, 200ms, 100ms, 50ms, 40ms and 20ms respectively.
Original Context: No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=1000ms): No, it’s no use, I ca~~n never, never~~ forgive you, and it’s all over.
Zeroed Context(gap=500ms): No, it’s no use, I ca~~n never~~ never forgive you, and it’s all over.
Zeroed Context(gap=200ms): No, it’s no use, I ca~~n ne~~ver, never forgive you, and it’s all over.
Zeroed Context(gap=100ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=50ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=40ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=20ms): No, it’s no use, I can never, never forgive you, and it’s all over.
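The zeroing of a gap described above can be sketched as follows: convert the start time and gap length to sample indices, then set that range to zero. The helper `zero_gap` is hypothetical, and the 16 kHz sample rate used in the usage note is an assumption, since the page does not state the sample rate of the recordings.

```python
from typing import List, Sequence, Tuple


def zero_gap(samples: Sequence[float], sample_rate: int, start_s: float,
             gap_ms: float) -> Tuple[List[float], Tuple[int, int]]:
    """Return a zeroed copy of `samples` and the (start, end) gap indices."""
    start = int(round(start_s * sample_rate))
    length = int(round(gap_ms / 1000 * sample_rate))
    end = min(start + length, len(samples))  # clip the gap to the signal
    out = list(samples)  # leave the input untouched
    for i in range(start, end):
        out[i] = 0.0
    return out, (start, end)
```

With an assumed sample rate of 16000, calling `zero_gap(signal, 16000, 1.62, 20)` zeroes 320 samples starting at index 25920.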
Gap length = 1000ms

Figure 2 - The original signal, the signal with gap (1000ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 500ms

Figure 3 - The original signal, the signal with gap (500ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 200ms

Figure 4 - The original signal, the signal with gap (200ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 100ms

Figure 5 - The original signal, the signal with gap (100ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 50ms

Figure 6 - The original signal, the signal with gap (50ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 40ms

Figure 7 - The original signal, the signal with gap (40ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

Gap length = 20ms

Figure 8 - The original signal, the signal with gap (20ms) and the inpainted signal in the time and frequency domains, from top to bottom.

Original signal

Zeroed signal (Signal with gap)

Inpainted signal

MOS

MOS(NB) and MOS(WB) for the single-speaker dataset are shown below for each gap length.

Gap length	20ms	40ms	50ms	100ms	200ms	500ms	1000ms
MOS_NB	4.31	4.28	4.27	4.22	4.12	4.00	3.66
MOS_WB	4.42	4.41	4.40	4.36	4.23	4.00	2.58

3.2 Multi-speaker Datasets

The context (transcript) of the audio signal is as follows. Note that the gap starts at 2.88 seconds and lasts for 1000ms, 500ms, 200ms, 100ms, 50ms, 40ms and 20ms respectively.
Original Context: Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=1000ms): Then we’ll run home together ~~sometimes he saw~~ her hand stretched out to find his own
Zeroed Context(gap=500ms): Then we’ll run home together ~~sometimes h~~e saw her hand stretched out to find his own
Zeroed Context(gap=200ms): Then we’ll run home together ~~some~~times he saw her hand stretched out to find his own
Zeroed Context(gap=100ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=50ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=40ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=20ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Gap length = 1000ms