A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement @ Interspeech2022
This is a PyTorch implementation of the pipeline presented in the paper "A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement", published at Interspeech 2022.
Project page
Abstract
A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement
Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi
23rd INTERSPEECH conference
Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content.
Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning).
In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information in a speech enhancement model.
By conducting a series of controlled experiments, we observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance, considering both causal and non-causal models.
Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization.
Phonetic features are obtained from an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model.
We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model for phonetic features outperforms the ASR one in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.
Paper
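For intuition, here is a minimal PyTorch sketch of two of the evaluated settings: feature conditioning and perceptual supervision. The module names, shapes, and the concatenate-then-project fusion are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedBottleneck(nn.Module):
    # Illustrative feature conditioning: fuse frozen phonetic features into an
    # enhancement model's latent sequence by time-alignment, concatenation,
    # and a linear projection back to the latent dimension.
    def __init__(self, latent_dim: int, feat_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(latent_dim + feat_dim, latent_dim)

    def forward(self, latent: torch.Tensor, phon_feats: torch.Tensor) -> torch.Tensor:
        # latent: (B, T_latent, latent_dim); phon_feats: (B, T_feat, feat_dim)
        phon_feats = F.interpolate(
            phon_feats.transpose(1, 2), size=latent.shape[1], mode="linear"
        ).transpose(1, 2)  # resample the features to the latent frame rate
        return self.proj(torch.cat([latent, phon_feats], dim=-1))

def perceptual_loss(phon_encoder, enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    # Perceptual supervision: match phonetic features of the enhanced and clean
    # signals. phon_encoder is assumed to map (B, samples) -> (B, T, D).
    with torch.no_grad():
        target = phon_encoder(clean)
    return F.l1_loss(phon_encoder(enhanced), target)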
Cite
If you find this implementation useful, please consider citing our work:
@article{tal2022systematic,
title={A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement},
author={Tal, Or and Mandel, Moshe and Kreuk, Felix and Adi, Yossi},
journal={arXiv preprint arXiv:2206.11000},
year={2022}
}
Usage
Clone the repository
git clone https://github.com/slp-rl/SC-PhASE.git
cd SC-PhASE
Installation
pip install -r requirements.txt
Note: the torch installation may depend on your CUDA version; see Install torch.
Data preprocessing
- Download all parts of the Valentini dataset and unzip them
- Down-sample each directory using the script below (a torchaudio-based alternative is sketched at the end of this section):
bash data_preprocessing_scripts/general/audio_resample_using_sox.sh <path to data dir> <path to target dir>
- Generate json files:
python data_preprocessing_scripts/speech_enhancement/valentini_egs_script.py --project_dir <full path to current project root> --dataset_base_dir <full path to the downsampled audio root, containing all downsampled dirs> --spk <num speakers in {28,56}, default=28>
Note: valentini_egs_script.py assumes the following dataset structure:
root dir
├── noisy_trainset_{28/56}spk_wav
│   └── Downsampled audio files
├── clean_trainset_{28/56}spk_wav
│   └── Downsampled audio files
├── noisy_testset_wav
│   └── Downsampled audio files
└── clean_testset_wav
    └── Downsampled audio files
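If sox is not available, a rough torchaudio equivalent of the down-sampling step could look as follows. The 16 kHz target rate, the .wav-only globbing, and the function name are assumptions; the sox script above remains the canonical path.

from pathlib import Path
import torchaudio

def resample_dir(src_dir: str, dst_dir: str, target_sr: int = 16_000) -> None:
    # Downsample every .wav file in src_dir and write it to dst_dir.
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(src_dir).glob("*.wav")):
        wav, sr = torchaudio.load(str(wav_path))
        if sr != target_sr:
            wav = torchaudio.functional.resample(wav, sr, target_sr)
        torchaudio.save(str(dst / wav_path.name), wav, target_sr)

For example: resample_dir("noisy_trainset_28spk_wav", "downsampled/noisy_trainset_28spk_wav").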
Download pretrained HuBERT model weights
- Download the pretrained weights: link
- Copy the full path of the pretrained .pt file into the features_config.state_dict_path field in configurations/main_config.yaml (a quick feature-extraction sanity check is sketched below)
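As a sanity check of what "phonetic features from an intermediate layer" means here, one can extract HuBERT layer activations with torchaudio's bundled checkpoint. Note this sketch uses torchaudio's own weights rather than the state_dict_path wiring above, and the choice of layer 6 is purely illustrative:

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE        # 12-layer base model, 768-dim features
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)    # one second of dummy 16 kHz audio
with torch.inference_mode():
    layer_outputs, _ = model.extract_features(waveform)  # one tensor per layer

phonetic_feats = layer_outputs[6]                # (1, num_frames, 768), ~50 frames/sec
print(phonetic_feats.shape)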
Train
Example run commands can be found in run_commands_examples/
Example: train a Demucs (hidden=48, stride=4, resample=4) baseline:
python train.py \
dset=noisy_clean \
experiment_name=h48u4s4_baseline \
hidden=48 \
stride=4 \
resample=4 \
features_dim=768 \
features_dim_for_conditioning=768 \
include_ft=False \
get_ft_after_lstm=False \
use_as_conditioning=False \
use_as_supervision=False \
learnable=False \
ddp=True \
batch_size=16
Test
Test the whole pipeline for a single epoch:
python train.py dset=debug eval_every=1 epochs=1 experiment_name=test_pipeline