This website presents our solution for speech inpainting using Long Short-Term Memory (LSTM) networks. We designed multi-layer LSTM networks and trained them on two types of speech datasets: four single-speaker datasets and four multi-speaker datasets. Our study investigates how inpainting performance varies across the datasets and with the number of LSTM layers, so as to explore the effect of multi-layer LSTM networks on the prediction of speech samples in terms of perceived audio quality. The quality of the inpainted speech is evaluated through the Mean Opinion Score (MOS) and frequency analysis of spectrograms.
MOS is commonly used in the telecommunications industry to measure the perceived quality of audio and video signals, as defined by ITU-T Recommendation P.800. MOS is a subjective rating on a scale from 1 to 5, with a MOS of 5 perceived as Excellent, 4 as Good, 3 as Fair, 2 as Poor and 1 as Bad. Two types of MOS are employed in this study, corresponding to Narrow Bandwidth (NB) and Wide Bandwidth (WB) speech signals. NB covers 300 Hz to 3400 Hz, while WB covers 50 Hz to 7000 Hz.
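The five-point scale above can be written as a small lookup. The sketch below is illustrative only: the function name `mos_label` and the choice to round a fractional MOS to the nearest category are assumptions made here for readability, not part of ITU-T P.800 or of the evaluation used in this study.

```python
# Category labels of the five-point MOS scale described above.
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mos_label(score: float) -> str:
    """Return the label of the category nearest to a MOS value.

    Rounding to the nearest category is an illustrative assumption;
    a MOS is an average of ratings and is normally reported as a number.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError("MOS must lie in the range [1, 5]")
    nearest = int(score + 0.5)  # round half up to the nearest category
    return MOS_LABELS[nearest]
```

For example, `mos_label(4.31)` falls in the "Good" category under this rounding rule.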
On this web page, we present sound samples and the corresponding MOS values obtained with the proposed models.
2 Inpainting process and Model architecture
The speech inpainting process and model architecture are shown in Figure 1. The left side of the figure shows the speech inpainting process and the right side shows the structure of the LSTM model. Note that only the 5-layer LSTM model is shown.
3 Inpainting results
3.1 Single-speaker Datasets
The context of the audio signal is as follows. Note that the gap starts at 1.62 seconds and lasts for 1000ms, 500ms, 200ms, 100ms, 50ms, 40ms and 20ms respectively.
Original Context: No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=1000ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=500ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=200ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=100ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=50ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=40ms): No, it’s no use, I can never, never forgive you, and it’s all over.
Zeroed Context(gap=20ms): No, it’s no use, I can never, never forgive you, and it’s all over.
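The zeroed signals can be pictured as the original waveform with a run of samples set to zero. The sketch below is a minimal illustration under stated assumptions: the helper name `zero_gap`, the 16 kHz sampling rate and the placeholder signal are all hypothetical, since this page does not state the sampling rate or the code used to create the gaps.

```python
def zero_gap(samples, sample_rate, start_s, gap_ms):
    """Return a copy of `samples` with `gap_ms` milliseconds set to zero,
    starting `start_s` seconds into the signal."""
    start = int(round(start_s * sample_rate))
    length = int(round(gap_ms * sample_rate / 1000))
    # Clamp both ends so a gap near the end of the signal stays in range.
    start = min(start, len(samples))
    end = min(start + length, len(samples))
    out = list(samples)
    out[start:end] = [0.0] * (end - start)
    return out


SAMPLE_RATE = 16000                  # assumed value, not stated on this page
signal = [1.0] * (4 * SAMPLE_RATE)   # 4 s placeholder in place of real speech
zeroed = zero_gap(signal, SAMPLE_RATE, start_s=1.62, gap_ms=100)
```

At the assumed 16 kHz, a 100ms gap corresponds to 1600 zeroed samples, and the 1000ms gap to 16000.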
Gap length = 1000ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 500ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 200ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 100ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 50ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 40ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 20ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
MOS
MOS(NB) and MOS(WB) are shown below (single-speaker).
MOS    | 20ms | 40ms | 50ms | 100ms | 200ms | 500ms | 1000ms
MOS_NB | 4.31 | 4.28 | 4.27 | 4.22  | 4.12  | 4.00  | 3.66
MOS_WB | 4.42 | 4.41 | 4.40 | 4.36  | 4.23  | 4.00  | 2.58
3.2 Multi-speaker Datasets
The context of the audio signal is as follows. Note that the gap starts at 2.88 seconds and lasts for 1000ms, 500ms, 200ms, 100ms, 50ms, 40ms and 20ms respectively.
Original Context: Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=1000ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=500ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=200ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=100ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=50ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=40ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Zeroed Context(gap=20ms): Then we’ll run home together sometimes he saw her hand stretched out to find his own
Gap length = 1000ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 500ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 200ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 100ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 50ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 40ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
Gap length = 20ms
Original signal
Zeroed signal (Signal with gap)
Inpainted signal
MOS
MOS(NB) and MOS(WB) are shown below (multi-speaker).