Dissecting deepspeech.pytorch Part 1


deepspeech.pytorch is one of the implementations of Baidu’s DeepSpeech2 paper. I think deepspeech.pytorch is clean, relatively simple, and very educational.

I’m planning to dissect deepspeech.pytorch in two articles: part 1 for the overall concept and data processing, part 2 for the model. Part 2 is here.

What is DeepSpeech2?

DeepSpeech2 is an architecture for speech-to-text models. In a nutshell, you can train your own speech-to-text model with DeepSpeech2. BTW, DeepSpeech2 has two siblings, DeepSpeech and DeepSpeech3.

What was special about DeepSpeech?

DeepSpeech was one of the pioneering deep-learning-based end-to-end speech-to-text models. Since the paper was published in December 2015, many end-to-end models have appeared, but most of them are strongly influenced by DeepSpeech. Mozilla is even trying to make DeepSpeech the de facto open-source model for speech-to-text.

BTW, I’m not crazy about Mozilla’s implementation, partly because it has too many features and is difficult to understand, but mainly because it is TensorFlow-based.

Anyway, I think DeepSpeech2 is good enough even for production if you have the data. And deepspeech.pytorch is a good implementation because it’s simple and easy to understand, which means you can easily extend it.

Data

Before looking at the code, let’s check what kind of dataset we use. Since our goal is to predict text from audio, the input is audio and the output is text.

Input: audio

Output: text

ENTER TWO NINE EIGHT ONE

If you’re familiar with deep learning, you probably know how to handle image datasets, but since audio data is relatively uncommon, let’s dissect the source code.

Data preprocessing

First, you need to process the audio data into something a PyTorch model can understand.

Manifest CSV file

train.py takes a CSV file called a manifest file, which contains the paths to the wav files and the corresponding transcript text files.
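
Each row is simply wav_path,transcript_path. For example, a manifest might look like this (these paths are hypothetical, just to show the shape of the file):

/data/an4/wav/an251-fash-b.wav,/data/an4/txt/an251-fash-b.txt
/data/an4/wav/an253-fash-b.wav,/data/an4/txt/an253-fash-b.txt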

__getitem__ in SpectrogramDataset

self.ids is a list of the rows of the CSV file, so you can access a specific row containing the wav path and the transcript path. self.parse_audio and self.parse_transcript look important.

def __getitem__(self, index):
    sample = self.ids[index]
    audio_path, transcript_path = sample[0], sample[1]
    spect = self.parse_audio(audio_path)
    transcript = self.parse_transcript(transcript_path)
    return spect, transcript

BTW, self.ids is initialized like below. You can see the manifest file is just read and split by “,”.

with open(manifest_filepath) as f:
    ids = f.readlines()
ids = [x.strip().split(',') for x in ids]

self.parse_audio

parse_audio basically reads the audio file, applies some augmentation, and turns it into a spectrogram.

load_audio looks like this: read the audio file and normalize it.

from scipy.io.wavfile import read

def load_audio(path):
    try:
        sample_rate, sound = read(path)
    except Exception:
        print("path: ", path)
        raise  # without re-raising, `sound` would be undefined below
    sound = sound.astype('float32') / 32767  # normalize 16-bit PCM to [-1, 1]
    if len(sound.shape) > 1:
        if sound.shape[1] == 1:
            sound = sound.squeeze()
        else:
            sound = sound.mean(axis=1)  # multiple channels, average
    return sound

Getting a spectrogram from the audio looks like this. You can see librosa does the magic.

import numpy as np
import torch
import librosa

sample_rate = 16000
window_size = .02
window_stride = .01
window = 'hamming'

def parse_audio(audio_path, spec_augment_flg=True):
    y = load_audio(audio_path)
    n_fft = int(sample_rate * window_size)
    win_length = n_fft
    hop_length = int(sample_rate * window_stride)
    # STFT
    D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                     win_length=win_length, window=window)
    spect, phase = librosa.magphase(D)
    # S = log(S+1)
    spect = np.log1p(spect)
    spect = torch.FloatTensor(spect)

    # if self.normalize:  (always True here)
    mean = spect.mean()
    std = spect.std()
    spect.add_(-mean)
    spect.div_(std)
    if spec_augment_flg is True:
        spect = spec_augment(spect)
    return spect

Now our input data is a normalized log-magnitude spectrogram: a 2D tensor of shape (frequency bins, time frames).
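
As a quick sanity check (this snippet is mine, not from the repo), one second of 16 kHz audio with the settings above gives a 161 × 101 tensor:

import numpy as np
import librosa
import torch

y = np.zeros(16000, dtype=np.float32)   # one second of "audio" at 16 kHz
n_fft = int(16000 * 0.02)               # 320-sample (20 ms) window
hop_length = int(16000 * 0.01)          # 160-sample (10 ms) hop

D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                 win_length=n_fft, window='hamming')
spect = torch.FloatTensor(np.log1p(np.abs(D)))
print(spect.shape)   # torch.Size([161, 101]): n_fft/2 + 1 frequency bins, ~100 time frames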

self.parse_transcript

parse_transcript just opens and reads transcript_path and turns the text into numbers. Since there is no tokenization step, I think the string in transcript_path has to be tokenized in advance. Take the example ENTER TWO NINE EIGHT ONE: it is already tokenized.

def parse_transcript(self, transcript_path):
    with open(transcript_path, 'r', encoding='utf8') as transcript_file:
        transcript = transcript_file.read().replace('\n', '')
    transcript = list(filter(None, [self.labels_map.get(x) for x in list(transcript)]))
    return transcript

But how does it turn the text into numbers? It uses self.labels_map. Where does self.labels_map come from? I forgot to mention it: you can specify the labels with labels_path in train.py.

labels

The labels contain all the alphabet letters, plus comma, space, and underscore. So the text string is numericalized character by character with that alphabet and those symbols.
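
To make that concrete, here is a rough sketch of how a label list could be turned into self.labels_map (the actual label set and file format come from the labels JSON file passed to train.py; the list below is only an illustrative subset):

labels = ["_"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" "]  # illustrative subset; "_" is conventionally the CTC blank
labels_map = {labels[i]: i for i in range(len(labels))}

transcript = "ENTER TWO NINE EIGHT ONE"
encoded = [labels_map[c] for c in transcript]
print(encoded)   # every character, including spaces, becomes an integer index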

data augmentation

deepspeech.pytorch has two types of data augmentation: tempo and gain change, and spec augmentation.

Tempo and gain changes are applied in load_randomly_augmented_audio. sox does the magic.

def load_randomly_augmented_audio(path, sample_rate=16000, tempo_range=(0.85, 1.15),
                                  gain_range=(-6, 8)):
    """
    Picks tempo and gain uniformly, applies it to the utterance by using sox utility.
    Returns the augmented utterance.
    """
    low_tempo, high_tempo = tempo_range
    tempo_value = np.random.uniform(low=low_tempo, high=high_tempo)
    low_gain, high_gain = gain_range
    gain_value = np.random.uniform(low=low_gain, high=high_gain)
    audio = augment_audio_with_sox(path=path, sample_rate=sample_rate,
                                   tempo=tempo_value, gain=gain_value)
    return audio

SpecAugment is a more robust way to augment audio data. It applies computer-vision-style techniques to the spectrogram itself: in a nutshell, it masks or warps parts of the spectrogram.

You can see the details here.
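
As a rough illustration of the masking idea (this is not the repo’s spec_augment, and the full technique also includes time warping):

import random
import torch

def spec_augment_sketch(spect, max_freq_mask=20, max_time_mask=30):
    # zero out one random frequency band and one random time band of the spectrogram
    spect = spect.clone()
    n_freq, n_time = spect.shape

    f = random.randint(0, max_freq_mask)            # width of the frequency mask
    f0 = random.randint(0, max(0, n_freq - f))      # where the mask starts
    spect[f0:f0 + f, :] = 0

    t = random.randint(0, max_time_mask)            # width of the time mask
    t0 = random.randint(0, max(0, n_time - t))
    spect[:, t0:t0 + t] = 0
    return spect

augmented = spec_augment_sketch(torch.randn(161, 101))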

sampler

The sampler creates bins based on audio length, so utterances of similar duration end up in the same batch. Shuffling happens only at the bin level.
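
A simplified sketch of the idea (not the repo’s BucketingSampler, just the concept):

import random

def bucketed_batches(sample_durations, batch_size):
    # sort samples by duration, cut the sorted order into bins of batch_size,
    # then shuffle at the bin level so similar-length samples still share a batch
    order = sorted(range(len(sample_durations)), key=lambda i: sample_durations[i])
    bins = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(bins)
    return bins

batches = bucketed_batches([2.1, 0.8, 5.0, 1.2, 3.3, 0.9], batch_size=2)
print(batches)   # e.g. [[1, 5], [3, 0], [4, 2]], depending on the shuffle

This keeps padding small, because every batch contains clips of similar length.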

dataloader

After that, it works pretty much like a regular DataLoader.
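
The one non-standard part is batching: spectrograms in a batch have different numbers of time frames, so they need to be padded. Here is a simplified sketch of what a padding collate function can look like (the repo’s AudioDataLoader has its own collate function; the names below are mine):

import torch

def collate_sketch(batch):
    # batch is a list of (spect, transcript) pairs from SpectrogramDataset
    batch = sorted(batch, key=lambda x: x[0].shape[1], reverse=True)   # longest first
    freq_size = batch[0][0].shape[0]
    max_len = batch[0][0].shape[1]

    inputs = torch.zeros(len(batch), 1, freq_size, max_len)            # zero-padded spectrograms
    input_lengths = torch.IntTensor([spect.shape[1] for spect, _ in batch])
    target_lengths = torch.IntTensor([len(t) for _, t in batch])
    targets = torch.IntTensor([idx for _, t in batch for idx in t])    # flattened label indices

    for i, (spect, _) in enumerate(batch):
        inputs[i][0].narrow(1, 0, spect.shape[1]).copy_(spect)
    return inputs, targets, input_lengths, target_lengths

The length tensors matter because CTC loss needs to know how long each input and each target really is, ignoring the padding.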

Model

The architecture looks relatively simple at first, but there are many tricks. I will dissect the details in part 2.

CTC Loss

The core of DeepSpeech is the CTC loss. Since I’m not good at C++, I can’t check the source code; just install and use it.

git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc; mkdir build; cd build; cmake ..; make
export CUDA_HOME="/usr/local/cuda"
cd ../pytorch_binding && python setup.py install

CTC loss does something like this: it sums the probability over all possible alignments between the per-frame character predictions and the target text, so you don’t need frame-level alignment labels.
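
For what it’s worth, recent PyTorch versions ship a built-in torch.nn.CTCLoss as well. A minimal usage sketch, mostly to show the shapes involved (the numbers are arbitrary):

import torch
import torch.nn as nn

T, N, C = 101, 2, 28            # time steps, batch size, number of labels (incl. blank)
ctc_loss = nn.CTCLoss(blank=0)  # index 0 is the blank symbol ("_" in the labels file)

log_probs = torch.randn(T, N, C).log_softmax(2)           # model output: (time, batch, classes)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # encoded transcripts (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)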

Final thought

I think there are not that many models and datasets for speech-to-text compared to the computer vision domain. If more and more developers play with speech-to-text models, the situation could change. Of course, we also need open data for an ImageNet moment in speech-to-text.

Anyway, I created a Jupyter notebook to walk through the entire deepspeech.pytorch codebase. You can check it here.

I intentionally removed the apex part to make it work even on a CPU.
I hope it helps. [UPDATED] Since the CTC loss requires a GPU, you cannot actually run this code on a CPU…
