Code for paper "A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement"


A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement @ Interspeech2022

This is a PyTorch implementation of the pipeline presented in the paper A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement, published at Interspeech 2022.
Project page

Abstract

A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement
Or Tal , Moshe Mandel , Felix Kreuk , Yossi Adi
23rd INTERSPEECH conference

Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content.
Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning).
In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information in a speech enhancement model.
By conducting a series of controlled experiments, we observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance, considering both causal and non-causal models.
Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization.
Phonetic features are obtained using an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or by using a pre-trained Self-Supervised Learning (SSL) model.
We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model for phonetic features outperforms the ASR one in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations. Paper.

Pipeline overview

Cite

If you find this implementation useful, please consider citing our work:

@article{tal2022systematic,
  title={A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement},
  author={Tal, Or and Mandel, Moshe and Kreuk, Felix and Adi, Yossi},
  journal={arXiv preprint arXiv:2206.11000},
  year={2022}
}

Usage

Clone the repository

git clone https://github.com/slp-rl/SC-PhASE.git
cd SC-PhASE

Installation

pip install -r requirements.txt

Note: the torch installation may depend on your CUDA version; see Install torch

Data preprocessing

  1. Download all parts of the Valentini dataset and unzip them
  2. Down-sample each directory using:
    bash data_preprocessing_scripts/general/audio_resample_using_sox.sh <path to data dir> <path to target dir>
  3. Generate json files:
    python data_preprocessing_scripts/speech_enhancement/valentini_egs_script.py --project_dir <full path to current project root> --dataset_base_dir <full path to the downsampled audio root, containing all downsampled dirs> --spk <num speakers in {28,56}, default=28>

    Note: valentini_egs_script.py assumes the following dataset structure:
root dir
│
└─── noisy_trainset_{28/56}spk_wav
│   └─  Downsampled audio files
│
└─── clean_trainset_{28/56}spk_wav
│   └─  Downsampled audio files
│
└─── noisy_testset_wav
│   └─  Downsampled audio files
│
└─── clean_testset_wav
    └─  Downsampled audio files
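The layout above can be sanity-checked before running the egs script. Below is a minimal, hypothetical helper (not part of the repository) that lists any expected subdirectory missing under the dataset root; the directory names come from the tree above and `spk` corresponds to the `--spk` flag in step 3:

```python
import os

def check_valentini_layout(root: str, spk: int = 28) -> list:
    """Return the expected subdirectories that are missing under root."""
    expected = [
        f"noisy_trainset_{spk}spk_wav",
        f"clean_trainset_{spk}spk_wav",
        "noisy_testset_wav",
        "clean_testset_wav",
    ]
    # A directory counts as present only if it exists and is a directory.
    return [d for d in expected if not os.path.isdir(os.path.join(root, d))]
```

For example, `check_valentini_layout("/data/valentini", spk=28)` returns an empty list when the four downsampled directories are in place.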

Download pretrained HuBERT model weights

  1. Download the pretrained weights: link
  2. Copy the full path to the pretrained .pt file into the features_config.state_dict_path field in: configurations/main_config.yaml
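After step 2, the edited field in configurations/main_config.yaml would look roughly like this (the path below is a placeholder; only the features_config.state_dict_path field name is taken from the step above):

```yaml
features_config:
  state_dict_path: /home/user/checkpoints/hubert_pretrained.pt  # placeholder path
```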

Train

Examples of run commands can be found in: run_commands_examples/
Example of training a Demucs (hidden: 48, stride: 4, resample: 4) baseline:

python train.py \
dset=noisy_clean \
experiment_name=h48u4s4_baseline \
hidden=48 \
stride=4 \
resample=4 \
features_dim=768 \
features_dim_for_conditioning=768 \
include_ft=False \
get_ft_after_lstm=False \
use_as_conditioning=False \
use_as_supervision=False \
learnable=False \
ddp=True \
batch_size=16
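The baseline above disables all phonetic-feature flags. As a rough sketch, the first two injection settings from the paper presumably map onto overrides like the following; the flag names are taken from the baseline command, but the exact combinations are assumptions — see run_commands_examples/ for the authoritative commands. The commands are only echoed here (a dry run), not executed:

```shell
# Common Demucs baseline overrides, as in the example above.
COMMON="dset=noisy_clean hidden=48 stride=4 resample=4 ddp=True batch_size=16"

# i) feature conditioning (hypothetical flag combination)
echo "python train.py $COMMON experiment_name=h48u4s4_cond include_ft=True use_as_conditioning=True"

# ii) perceptual supervision (hypothetical flag combination)
echo "python train.py $COMMON experiment_name=h48u4s4_sup include_ft=True use_as_supervision=True"

# iii) regularization: no dedicated flag is visible in the baseline command;
# consult run_commands_examples/ for its override.
```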

Test

Test the whole pipeline for a single epoch:
python train.py dset=debug eval_every=1 epochs=1 experiment_name=test_pipeline
