Basic Qur’aan transcript
Click to download the latest Qur’aan transcript File
This transcript is based on an Uthmani Qur’aan script from http://tanzil.net/docs/home.
The script has been modified to make it machine readable.
- All silent letters were removed
- All alifs were replaced by hamza
- Tanween was dropped and replaced by noon saakin
- All inner pause/stop markings were removed, while preserving the stopping sound
- Alif superscript was used to replace all short maddah sounds
- Shaddah was removed and replaced with equivalent characters
Feel free to change these conventions if doing so improves speech recognition.
Remember, however, that you will have to make the same modification to all transcripts and dictionary entries.
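Because any change has to be applied identically to every transcript and dictionary entry, scripting it is safer than editing by hand. The sketch below illustrates one plausible reading of the tanween rule listed above: each tanween mark becomes its short vowel followed by noon with sukun. The code points chosen here (in particular U+0652 for sukun) and the function name `replace_tanween` are assumptions for illustration; check them against the characters actually used in the transcript and phonelist before relying on it.

```python
# Assumed code points; verify against the published transcript and phonelist.
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
NOON, SUKUN = "\u0646", "\u0652"

# Hypothetical mapping: tanween mark -> short vowel + noon + sukun.
TANWEEN_TO_NOON = {
    FATHATAN: FATHA + NOON + SUKUN,
    DAMMATAN: DAMMA + NOON + SUKUN,
    KASRATAN: KASRA + NOON + SUKUN,
}

_TABLE = str.maketrans(TANWEEN_TO_NOON)


def replace_tanween(text: str) -> str:
    """Return text with every tanween mark expanded as described above."""
    return text.translate(_TABLE)
```

Running the same function over every transcript and every dictionary entry keeps the two in step.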
Strictly use only letters from the original transcript and phonelist provided here.
Do not copy and paste characters from elsewhere.
UTF coding is very specific, and using letters with a similar shape but a different UTF code causes major problems.
Also, there are two different characters each for hamza, yeh and noon which even share the same UTF code
and are impossible to distinguish visually in some word processors.
(Use the Notepad++ search options to try to track these down should they enter your transcript accidentally – Microsoft’s Notepad is useless for this).
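As an alternative to searching by hand, look-alike letters with a different code point can be listed automatically. The sketch below treats every character found in the original transcript as allowed and reports anything else in an edited transcript, with its line, column and Unicode name. The function name `find_foreign_chars` is hypothetical, and the approach only catches characters whose code point differs from those in the reference text.

```python
import unicodedata


def find_foreign_chars(reference_text: str, candidate_text: str):
    """List characters in candidate_text that never occur in reference_text.

    Returns tuples of (line number, column, code point, Unicode name).
    """
    allowed = set(reference_text)
    hits = []
    for line_no, line in enumerate(candidate_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in allowed:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits
```

In practice both arguments would be read from files opened with `encoding="utf-8"`, the first being the unmodified transcript and the second your working copy.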
Also, remember to convert all your text files to UNIX format (LF line endings).
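The conversion can also be scripted. This is a minimal sketch that assumes the files are UTF-8 encoded, so that replacing the line-ending bytes cannot damage any Arabic character; it would not be safe for UTF-16 files. The function names are illustrative.

```python
from pathlib import Path


def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to UNIX (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite one file in place with LF line endings (UTF-8 assumed)."""
    p = Path(path)
    p.write_bytes(to_unix_newlines(p.read_bytes()))
```

Keep a backup of the originals, since `convert_file` overwrites the file it is given.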
Possible improvements to consider:
- Shaddah produces quite a few errors, especially at word boundaries
Click to view a paper which demonstrates the issues with word boundaries and shadda
- Long maddah ( ٓ ) often produces errors, especially if the reciter is inconsistent with the length
- AlifLaamMeem ( الٓمٓ ), etc. are written as a single word, but occasionally produce an error in recognition
- This transcript was written to reflect stopping at all inner pauses.
If useful, an alternate script can be written for non-pause recitation.
- The transcript above has been cleaned and corrected as much as possible, but may still contain hidden errors.
Please report any errors to the admin at quraa632@quraanacousticmodel.com
Cleaning Transcripts
After your sphinxtrain run, you will find a generated file called (your model name).align in the Results folder.
Open this in Notepad++
Click Search -> Find -> Mark
Check Bookmark line and Regular expression
Enter the following code in the ‘Find what’ field:
(?-s)^.+(?=\R.+\R.Error = 0.00%)|^.+(?=\R.Error = 0.00%)|^.Error = 0.00%.\R|\G.+
Click Mark All
This will mark all the lines where no error was found
Then click Search -> Bookmark -> Remove bookmarked lines
Next, click Search -> Find -> Mark
Enter the following line in the ‘Find what’ field:
Insertions: 0 Deletions: 0 Substitutions: 0
Click Mark All
Then click Search -> Bookmark -> Remove bookmarked lines
This should leave you with a list of all the errors found during training, which you will use to clean your existing transcript.
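The same filtering can be approximated outside Notepad++. The sketch below assumes the .align file is laid out in blocks of two transcript lines, then a statistics line containing "Error = ...%", then a line of insertion, deletion and substitution counts. That layout is inferred from the steps above rather than confirmed, so inspect your own file first. The function name `keep_error_blocks` is hypothetical.

```python
ZERO_ERROR = "Error = 0.00%"
ZERO_COUNTS = "Insertions: 0 Deletions: 0 Substitutions: 0"


def keep_error_blocks(lines):
    """Drop error-free entries, mirroring the two Notepad++ passes above."""
    drop = set()
    for i, line in enumerate(lines):
        if ZERO_ERROR in line:
            # First pass: the statistics line and the two lines before it.
            drop.update(j for j in (i - 2, i - 1, i) if j >= 0)
        if line.strip() == ZERO_COUNTS:
            # Second pass: every all-zero counts line.
            drop.add(i)
    return [line for i, line in enumerate(lines) if i not in drop]
```

To use it, read the .align file with `encoding="utf-8"`, pass `text.splitlines()` to the function, and write the returned lines to a new file so the original stays available for comparison.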

Open the .align file alongside your transcript file in Notepad++.
Then make corrections to your transcript.
You will require several runs of training in order to find all the transcription errors.
Tools
The most versatile freeware word processor is Notepad++, found here:
https://notepad-plus-plus.org/downloads/
Sources
Arabic Unicode Reference https://en.wikipedia.org/wiki/Arabic_(Unicode_block)
Quraan Uthmani and Emlai text and database files http://tanzil.net/docs/home
Click to view a paper which demonstrates the issues with word boundaries and shadda