SpeechBrain CTC parameters
SpeechBrain is a framework for experimenting with neural networks in speech. It provides components and structures commonly used for speech tasks, and it is nicely integrated with PyTorch. It ships many recipes / quickstart configurations for various speech tasks, but its documentation is not there yet. In this article I'll walk through my exploration of SpeechBrain parameters, especially for automatic speech recognition (ASR) with connectionist temporal classification (CTC) loss.
Tokenizer parameters
SpeechBrain uses the SentencePiece tokenizer, and the example YAML has two configurations related to this tokenizer: token_type and character_coverage.
token_type maps to SentencePiece's model_type, which currently supports three modes: char, unigram, and bpe. While char and unigram are self-explanatory, bpe trains a byte-pair-encoding model with Google's unsupervised SentencePiece library, which is then used to tokenize the target text.
character_coverage is meant to be used in conjunction with token_type: bpe, and it indicates the fraction of characters that the SentencePiece model should cover. 1.0 means 100% of the characters are covered, and this is the default value, which makes sense since you normally want to cover every character in a language. However, for languages with rich character sets such as Chinese or Japanese, the SentencePiece documentation itself suggests reducing this towards 0.9995.
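To make this concrete, here is a sketch of how these two options can feed into the tokenizer definition in a recipe YAML. The surrounding values (save_folder, output_neurons, train_csv, and the wrd column name) are placeholders, and the argument names come from my reading of SpeechBrain's SentencePiece wrapper, so double-check them against the version you have installed.
# placeholder recipe values, adjust to your setup
token_type: unigram  # one of: char, unigram, bpe
character_coverage: 1.0  # lower towards 0.9995 for Chinese or Japanese

tokenizer: !new:speechbrain.tokenizers.SentencePiece.SentencePiece
    model_dir: !ref <save_folder>
    vocab_size: !ref <output_neurons>
    annotation_train: !ref <train_csv>
    annotation_read: wrd
    model_type: !ref <token_type>
    character_coverage: !ref <character_coverage>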
Augmentation
TimeDomainSpecAugment is an augmentation that can drop chunks of audio, drop frequency bands, or apply speed perturbation. The default augmentation for my experiments is speed perturbation, where we speed the input waveform up or down. Here is an example of where to use the augmentation.
# wavs are the raw waveforms and wav_lens their relative lengths, as PyTorch tensors
wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
# apply the augmentation directly to the waveforms;
# wav_lens is required by this augmentation.
# hparams is the configuration loaded from the SpeechBrain YAML,
# and hparams.augmentation is instantiated as TimeDomainSpecAugment
wavs = self.hparams.augmentation(wavs, wav_lens)
and here is the example YAML that gets translated into the hparams object.
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]
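The same class also exposes the chunk-dropping and frequency-dropping behaviour mentioned above. Below is a sketch of a fuller configuration; the argument names (perturb_prob, drop_freq_prob, drop_chunk_prob, drop_chunk_length_low/high) reflect my reading of speechbrain.lobes.augment.TimeDomainSpecAugment and may differ between SpeechBrain versions, so verify them against your installed release.
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]
    # probabilities of applying each sub-augmentation (assumed argument names)
    perturb_prob: 1.0
    drop_freq_prob: 1.0
    drop_chunk_prob: 1.0
    # length range of the dropped audio chunks, in samples (assumed argument names)
    drop_chunk_length_low: 1000
    drop_chunk_length_high: 2000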
SpecAugment is an augmentation procedure that can do frequency masking, time masking, and time warping. This augmentation is applied to the log mel spectrogram, or more generally to a latent feature. Frequency masking zeroes out a band of values along the frequency axis of the log mel spectrogram, while time masking zeroes out values along the time axis. Finally, time warping shifts a point on the time axis left or right, stretching one portion of the spectrogram and compressing the neighbouring one.
Here is sample code that shows where to apply the augmentation with a wav2vec model.
wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
# Create latent feature from raw wavs with wav2vec2 model
feats = self.modules.wav2vec2(wavs)
# apply SpecAugment augmentation
feats = self.hparams.augmentation(feats)
and here is an example of how to define SpecAugment in the SpeechBrain YAML file.
augmentation: !new:speechbrain.lobes.augment.SpecAugment
    freq_mask: True
    time_warp: True
    time_mask: True
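If you need finer control over the masks, the class also accepts mask counts and warp settings. The names below (time_warp_window, n_freq_mask, n_time_mask) are based on my reading of speechbrain.lobes.augment.SpecAugment and may differ between SpeechBrain versions, so treat this as a sketch and check the signature you actually have installed.
augmentation: !new:speechbrain.lobes.augment.SpecAugment
    time_warp: True
    time_warp_window: 5    # assumed name: maximum warp distance, in frames
    freq_mask: True
    n_freq_mask: 2         # assumed name: number of frequency masks to apply
    time_mask: True
    n_time_mask: 2         # assumed name: number of time masks to apply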