The acoustic model training described here is based on CMU Sphinx,
which is geared for use on Linux.
However, if, like me, you have little experience with Linux, it is easier to work on an Ubuntu installation inside Windows.
You will spend a lot of time preparing audio files and correcting transcription files for your acoustic model,
and the tools for preparing transcriptions and audio are plentiful on Windows,
where navigation is also much simpler.
Installation
To install Ubuntu on Windows 10 via WSL, follow the steps at
https://docs.microsoft.com/en-us/windows/wsl/install-win10
Install Ubuntu 18.04 LTS.
Later versions of Ubuntu no longer ship Python 2.7,
which may cause problems running Sphinxtrain.
Next, download the necessary CMU Sphinx files for PocketSphinx, Sphinxbase and Sphinxtrain.
Extract and install all three into the same top-level directory:
https://github.com/cmusphinx/pocketsphinx
https://github.com/cmusphinx/sphinxbase
https://github.com/cmusphinx/sphinxtrain
Install each into your own top-level directory using these three steps:
./autogen.sh --prefix=/mnt/(your drive letter, e.g. d)/(your directory name)/usr/local
make
make install
Install Perl if it is not already present.
CMU Sphinx uses Python 2.7, which comes with Ubuntu 18.04;
however, you will also need to install the extra Python modules NumPy and SciPy.
Use these commands on your Ubuntu WSL terminal:
sudo apt update
sudo apt-get install perl
sudo apt-get install python-numpy
sudo apt-get install python-scipy
Final essential tweak:
Edit …sphinxtrain\scripts\lib\SphinxTrain\Config.pm.
On line 51, add the leading dot so it reads: $ST::CFG_FILE = "./etc/sphinx_train.cfg";
Similarly, change the CFG file path in any other files that produce an error on your training run.
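If you prefer not to edit the file by hand, the same change can be made with sed from the WSL terminal. The sketch below runs against a scratch copy so it is safe to try; it assumes the original line simply lacks the leading "./" — point CONFIG at your real sphinxtrain/scripts/lib/SphinxTrain/Config.pm when applying it for real.

```shell
# Demo on a scratch file; set CONFIG to the real Config.pm when applying.
CONFIG=$(mktemp)
# Assumed original form of line 51, without the leading "./":
echo '$ST::CFG_FILE = "etc/sphinx_train.cfg";' > "$CONFIG"
# Prepend "./" to the relative config path:
sed -i 's|"etc/sphinx_train.cfg"|"./etc/sphinx_train.cfg"|' "$CONFIG"
cat "$CONFIG"   # $ST::CFG_FILE = "./etc/sphinx_train.cfg";
```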
File setup
Prepare your audio files, transcripts and language model file.
Place the audio files in the 'wav' subfolder and the transcripts and language model in the 'etc' subfolder.
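As a sketch of what the training directory ends up looking like, the commands below create an empty skeleton. The database name "mydb" is a placeholder, and the file names follow the <db>_train.* / <db>_test.* convention used in the CMU tutorial — adjust them to match your own data.

```shell
# Skeleton of a sphinxtrain directory; "mydb" is a placeholder name.
DB=mydb
mkdir -p "$DB"/wav "$DB"/etc
cd "$DB"
# Control, dictionary and transcription files expected by sphinxtrain
# (empty placeholders here; yours must contain real data):
touch etc/${DB}.dic etc/${DB}.phone etc/${DB}.filler \
      etc/${DB}_train.fileids etc/${DB}_train.transcription \
      etc/${DB}_test.fileids  etc/${DB}_test.transcription \
      etc/${DB}.lm.DMP
ls etc
```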
Run
Launch an Ubuntu terminal, then navigate to your training directory.
cd /mnt/(your drive letter, e.g. d)/(your directory name)
Set the paths for your program.
export PATH=/mnt/(your drive letter, e.g. d)/(your directory name)/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/mnt/(your drive letter, e.g. d)/(your directory name)/usr/local/lib
export PKG_CONFIG_PATH=/mnt/(your drive letter, e.g. d)/(your directory name)/usr/local/lib/pkgconfig
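Since these exports are lost when the terminal closes, it can help to keep them in a small script and source it in each new session. The sketch below uses /mnt/d/sphinx as a stand-in for /mnt/(your drive letter)/(your directory name).

```shell
# Write the environment setup once; "source" it in each new terminal.
# /mnt/d/sphinx is a placeholder for your actual install root.
SPHINX_ROOT=/mnt/d/sphinx
cat > sphinx_env.sh <<EOF
export PATH=$SPHINX_ROOT/usr/local/bin:\$PATH
export LD_LIBRARY_PATH=$SPHINX_ROOT/usr/local/lib
export PKG_CONFIG_PATH=$SPHINX_ROOT/usr/local/lib/pkgconfig
EOF
source sphinx_env.sh
echo "$PKG_CONFIG_PATH"   # /mnt/d/sphinx/usr/local/lib/pkgconfig
```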
Do an initial run without changing the config file settings.
sphinxtrain -t (your folder name) setup
sphinxtrain run
Note the Word Error Rate (WER) and Sentence Error Rate (SER) in the output.
These are your main metrics of acoustic model performance.
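For reference, WER is the sum of substitution, deletion and insertion errors divided by the number of words in the reference transcripts, and SER is the fraction of sentences containing at least one error. A quick awk sketch of the WER arithmetic, with made-up counts:

```shell
# WER = (S + D + I) / N, as a percentage.
# The counts below are illustrative, not from a real run.
awk 'BEGIN {
  S = 12; D = 3; I = 5;   # substitutions, deletions, insertions
  N = 200;                # words in the reference transcripts
  printf "WER: %.1f%%\n", 100 * (S + D + I) / N
}'
# prints: WER: 10.0%
```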
Testing and improving your model
The configuration settings that achieve the best result are essentially a 'black box', found by trial and error while comparing WER results.
The following changes to the configuration file sphinx_train.cfg are useful:
- Line 144 – $CFG_FINAL_NUM_DENSITIES = 32;
This sets the number of Gaussians. More Gaussians produce a better WER,
but hide transcript errors and slow down decoding/recognition,
which is a practical downside when your acoustic model runs on a device.
- Line 167 – $CFG_N_TIED_STATES = 1375;
This is the number of senones.
It should change relative to the size of your audio data.
- Line 173 – $CFG_CROSS_PHONE_TREES = 'yes';
Train a single decision tree for all phones.
- Line 196 –
# Calculate an LDA/MLLT transform?
$CFG_LDA_MLLT = 'yes';
# Dimensionality of LDA/MLLT output
$CFG_LDA_DIMENSION = 22;
The LDA/MLLT transform yields a 25%+ better WER.
You may change the LDA dimension to try for better results.
- Line 208 – $CFG_QUEUE_TYPE = "Queue::POSIX";
This enables multi-threading, using up to 16/24 threads, which dramatically speeds up training.
For multi-threading to take effect, also change line 168:
# How many parts to run Forward-Backward estimation in
$CFG_NPART = 16;
and line 277:
# Define how many pieces to split decode in
$DEC_CFG_NPART = 16;
- Line 270 – You may also change the decoder settings and experiment with different values
for language weight, beam width and word-beam width, e.g.
$DEC_CFG_LANGUAGEWEIGHT = "12";
$DEC_CFG_BEAMWIDTH = "1e-140";
$DEC_CFG_WORDBEAM = "1e-90";
$DEC_CFG_WORDPENALTY = "0.2";
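Since tuning is trial and error, the edit-and-retrain loop can be scripted. The sketch below sweeps $CFG_FINAL_NUM_DENSITIES over several values on a scratch copy of the config; it only echoes where the real training command would go — on your actual setup you would replace the echo with a sphinxtrain run and record the resulting WER for each value.

```shell
# Sweep the Gaussian density setting across several candidate values.
# Uses a scratch file; point CFG at your real etc/sphinx_train.cfg and
# replace the echo with an actual training run plus WER bookkeeping.
CFG=$(mktemp)
echo '$CFG_FINAL_NUM_DENSITIES = 8;' > "$CFG"
for g in 8 16 32; do
  sed -i 's/= [0-9]*;/= '"$g"';/' "$CFG"
  echo "would run sphinxtrain with: $(cat "$CFG")"
done
```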
Sources
Follow the tutorial provided by CMU in order to set up your trainer:
https://cmusphinx.github.io/wiki/tutorialam/
Working with LDA/MLLT:
https://cmusphinx.github.io/wiki/ldamllt/
More help can be found on the SourceForge CMUSphinx forum:
https://sourceforge.net/p/cmusphinx/discussion/