Transcripts – Quraan Acoustic Model

Basic Qur’aan transcript

Click to download the latest Qur’aan transcript File

This transcript is based on an Uthmani Qur’aan script from http://tanzil.net/docs/home.
The script has been modified to make it machine readable.

All silent letters were removed
Every alif was replaced by hamza
Tanween was dropped and replaced by noon saakin
All the inner pause/stop markings were removed, while preserving the stopping sound
Alif superscript was used to replace all short maddah sounds
Shaddah was removed and replaced with equivalent characters
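
As an illustration of the kind of substitution listed above, here is a minimal Python sketch for the tanween rule. The exact rules used to build the transcript are not spelled out here, so the mapping (tanween mark becomes the matching short vowel, then noon, then sukoon) and the function name expand_tanween are assumptions, not the script that produced the transcript.

```python
# Illustrative only: this mapping is an assumption about how tanween
# could be rewritten as a short vowel followed by noon saakin.
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
NOON, SUKUN = "\u0646", "\u0652"

TANWEEN_TO_VOWEL = {FATHATAN: FATHA, DAMMATAN: DAMMA, KASRATAN: KASRA}


def expand_tanween(text: str) -> str:
    """Replace each tanween mark with its short vowel + noon + sukoon."""
    out = []
    for ch in text:
        if ch in TANWEEN_TO_VOWEL:
            out.append(TANWEEN_TO_VOWEL[ch] + NOON + SUKUN)
        else:
            out.append(ch)  # every other character is left untouched
    return "".join(out)
```

Writing the characters as code-point escapes avoids pasting look-alike letters into the script itself.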

Feel free to change these conventions if doing so improves speech recognition;
however, remember that you will have to make the same modification to all transcripts and dictionary entries.
Strictly use only letters from the original transcript and the phonelist provided here.
Do not copy and paste characters from elsewhere.
UTF coding is very specific and causes major problems if letters with
a similar shape but a different UTF code are used.
Also, there are two different characters for hamza, yeh and noon which even share the same UTF code
and are impossible to distinguish visually in some word processors.
(Use the Notepad++ search options to track these down should they enter your transcript accidentally; Microsoft’s Notepad is useless for this.)
Also, remember to convert all your text files to UNIX format (LF line endings).
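
The two checks above can also be scripted. The following is a minimal Python sketch, not part of the original toolchain: it reports every character that is not in an allowed set, and it normalises line endings to UNIX format. The function names and the idea of building the allowed set from the reference transcript are assumptions made for illustration.

```python
import unicodedata


def find_unexpected_chars(text, allowed):
    """List (line, column, code point, Unicode name) for each character
    that is neither whitespace nor in the allowed set."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in allowed and not ch.isspace():
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


def to_unix_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical usage (file names are placeholders):
#   reference = open("reference_transcript.txt", encoding="utf-8").read()
#   mine = open("my_transcript.txt", encoding="utf-8").read()
#   for hit in find_unexpected_chars(mine, set(reference)):
#       print(hit)
```

Printing the code point next to each hit makes look-alike letters easy to tell apart, for example Arabic yeh (U+064A) and Farsi yeh (U+06CC).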

Possible improvements to consider:

Shaddah produces quite a few errors, especially at word boundaries
Click to view a paper which demonstrates the issues with word boundaries and shadda
Long maddah ( ٓ ) often produces errors, especially if the reciter is inconsistent with the length
AlifLaamMeem ( الٓمٓ ), etc. are written as a single word, but occasionally produce an error in recognition
This transcript was written to reflect stopping at all inner pauses.
If useful, an alternate script can be written for non-pause recitation.
The transcript above has been cleaned and corrected as much as possible, but may still contain hidden errors.
Please report any errors to the admin at quraa632@quraanacousticmodel.com

Cleaning Transcripts

After your sphinxtrain run, you will find a generated file called (your model name).align in the Results folder.

Open this in Notepad++

Click Search -> Find -> Mark
Check Bookmark line and Regular expression
Enter the following code in the ‘Find what’ field:
(?-s)^.+(?=\R.+\R.*Error = 0.00%)|^.+(?=\R.*Error = 0.00%)|^.*Error = 0.00%.*\R|\G.+
Click Mark All
This will mark all the lines where no error was found
Then click Search -> Bookmark -> Remove bookmarked lines

Next, click Search -> Find -> Mark
Enter the following line in the ‘Find what’ field:
Insertions: 0 Deletions: 0 Substitutions: 0
Click Mark All
Then click Search -> Bookmark -> Remove bookmarked lines

This should leave you with a list of all the errors found during training, which will be used to clean your existing transcript.
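
The same filtering can be done outside Notepad++. Below is a minimal Python sketch under one stated assumption about the .align layout: each utterance is a block of four lines (reference, hypothesis, a summary line starting with "Words:" that contains "Error = N%", and an "Insertions: ..." line). Check this against your own file before relying on it; the function name error_blocks is hypothetical.

```python
import re

# Matches the per-utterance error rate, e.g. "Error = 33.33%".
ERROR_RE = re.compile(r"Error = (\d+(?:\.\d+)?)%")


def error_blocks(lines):
    """Return the 4-line blocks (reference, hypothesis, summary, counts)
    whose summary line reports a non-zero error rate."""
    blocks = []
    for i, line in enumerate(lines):
        match = ERROR_RE.search(line)
        # Assumption: per-utterance summary lines start with "Words:".
        if not (line.startswith("Words:") and match):
            continue
        if i < 2 or i + 1 >= len(lines):
            continue  # incomplete block, skip it
        if float(match.group(1)) > 0.0:
            blocks.append(lines[i - 2 : i + 2])
    return blocks


# Hypothetical usage (the file name is a placeholder):
#   with open("mymodel.align", encoding="utf-8") as f:
#       lines = f.read().splitlines()
#   for block in error_blocks(lines):
#       print("\n".join(block), end="\n\n")
```

Unlike the bookmark approach, this leaves the original .align file untouched, so it can be rerun after each training pass.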

Open the .align file alongside your transcript file in Notepad++.
Then make corrections to your transcript.
You will require several runs of training in order to find all the transcription errors.

Tools

Notepad++, a versatile freeware text editor, can be downloaded here:
https://notepad-plus-plus.org/downloads/

Sources

Arabic Unicode Reference https://en.wikipedia.org/wiki/Arabic_(Unicode_block)

Quraan Uthmani and Emlai text and database files http://tanzil.net/docs/home

Click to view a paper which demonstrates the issues with word boundaries and shadda