Speech-Declipping Transformer with Complex Spectrogram and Learnable Temporal Features
Younghoo Kwon and Jung-Woo Choi
School of Electrical Engineering, KAIST
[arXiv]
Abstract
We present a transformer-based speech-declipping model that effectively recovers clipped signals across a wide range of input signal-to-distortion ratios (SDRs). Although recent time-domain deep neural network (DNN)-based declippers have outperformed traditional handcrafted and spectrogram-based DNN approaches, they still struggle with low-SDR inputs. To address this, we adopt a transformer-based architecture that operates in the time-frequency (TF) domain. The TF-transformer architecture has demonstrated remarkable performance in speech enhancement for low-SDR signals, but it is not necessarily optimal for time-domain artifacts such as clipping. To overcome this limitation of spectrogram-based DNNs, we design an additional convolutional block that extracts temporal features directly from time-domain waveforms. The joint analysis of the complex spectrogram and the learned temporal features allows the model to improve performance on both high- and low-SDR inputs. Our approach also preserves the unclipped portions of the speech signal during processing, preventing the degradation typically seen when only spectral information is used. In evaluations on the VoiceBank-DEMAND and DNS Challenge datasets, the proposed model consistently outperformed state-of-the-art (SOTA) declipping models across various metrics, demonstrating its robustness and generalizability.
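To make the notion of an "input SDR" concrete, the following is a minimal sketch of how a clean waveform might be hard-clipped so that the clipped version reaches a chosen SDR. It is not taken from the paper: the SDR definition (clean-signal energy over clipping-error energy), the bisection search, the function names `sdr_db`, `hard_clip`, and `clip_to_target_sdr`, and the synthetic test tone are all assumptions made for illustration.

```python
import numpy as np


def sdr_db(reference, estimate):
    """SDR in dB, assumed here as reference energy over error energy."""
    distortion = reference - estimate
    # Small constant avoids division by zero when nothing is clipped.
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(distortion ** 2) + 1e-12))


def hard_clip(x, threshold):
    """Symmetric hard clipping: samples beyond +/- threshold are saturated."""
    return np.clip(x, -threshold, threshold)


def clip_to_target_sdr(x, target_sdr_db, iters=50):
    """Bisect on the clipping threshold until the SDR matches the target.

    SDR grows monotonically with the threshold, so bisection applies.
    Targets must be above 0 dB (the SDR of an all-zero output).
    """
    lo, hi = 0.0, float(np.max(np.abs(x)))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sdr_db(x, hard_clip(x, mid)) < target_sdr_db:
            lo = mid  # too much distortion: raise the threshold
        else:
            hi = mid  # too little distortion: lower the threshold
    threshold = 0.5 * (lo + hi)
    return hard_clip(x, threshold), threshold


# Hypothetical stand-in for speech: one second of a two-tone signal at 16 kHz.
t = np.arange(16000) / 16000.0
x = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 660 * t)

clipped, threshold = clip_to_target_sdr(x, target_sdr_db=7.0)
print(f"threshold={threshold:.4f}, SDR={sdr_db(x, clipped):.2f} dB")
```

Samples whose magnitude stays below the threshold pass through `hard_clip` unchanged, which is the "unclipped portion" of the signal that a declipper should ideally leave intact.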
Overall Architecture
Performance Table
Speech Samples
VBDM dataset: 1 dB SDR input (p223_007)
1 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
VBDM dataset: 3 dB SDR input (p223_007)
3 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
VBDM dataset: 7 dB SDR input (p223_007)
7 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
VBDM dataset: 15 dB SDR input (p223_007)
15 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
DNS dataset: 1 dB SDR input (fileid_2)
1 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
DNS dataset: 3 dB SDR input (fileid_2)
3 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
DNS dataset: 7 dB SDR input (fileid_2)
7 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD
DNS dataset: 15 dB SDR input (fileid_2)
15 dB SDR clipped speech
Target Speech
Proposed method
DeFTAN-II
A-SPADE
T-UNet
DD
DDD