I have heard the term ‘phoneme’ used inappropriately, so I will try to put my linguistics background to use by clarifying some notions from phonetics and phonology. A better understanding of these concepts could help us define the problem of synthesizing speech more precisely, and thus better design the architecture of our neural networks.
First, the most primitive notion about sounds in language is that of a phonological system. Each language has one. It defines what sound differences are to be interpreted as differences in meaning.
A phoneme is a family of sounds that are understood as the same sound for the purpose of determining meaning in a given language. This notion only makes sense as part of a phonological system, a system of oppositions between phonemes that serves to contrast different meanings. For example, in English, the difference between the pronunciations of the words
‘feel’ /f iy l/
‘fill’ /f ih l/
is a single phoneme change in the vowel. But in French, those two families of vowel sounds count as a single phoneme (a single family of sounds), and thus a unilingual francophone would only hear the same word spoken twice (pronounced slightly differently, if he is attentive):
‘fil’ /f iy l/
The two phonetic variants [iy] and [ih] of the French phoneme /iy/ in that example are called allophones.
A phone is a unit of speech sound, and allophones are phones belonging to a single phoneme. So two phones can belong to different phonemes or be allophones of the same phoneme, depending on the language.
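To make this concrete, here is a toy Python sketch of how a phonological system assigns phones to phonemes. The two dictionaries are invented miniature systems built only from the example above, not real phoneme inventories:

# Toy model: a phonological system maps each phone to the phoneme
# (family of sounds) it belongs to in a given language.
english = {"iy": "/iy/", "ih": "/ih/"}  # [iy] and [ih] contrast in English
french = {"iy": "/iy/", "ih": "/iy/"}   # ...but both belong to /iy/ in French

def contrast(system, phone_a, phone_b):
    # Two phones can distinguish meanings only if the system
    # assigns them to different phonemes.
    return system[phone_a] != system[phone_b]

print(contrast(english, "iy", "ih"))  # True:  'feel' vs 'fill'
print(contrast(french, "iy", "ih"))   # False: the same word 'fil' twice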
To distinguish phones from phonemes, the usual notation is to use slashes /…/ for phonemic transcriptions (phonemes) and square brackets […] for phonetic transcriptions (phones). (In the preceding examples, I have used the same phonetic symbols as in the TIMIT corpus, but other phonetic alphabets are also used.)
When realizing a given phoneme, the corresponding allophone can be selected based on the neighboring phonemes (phonemic context), sometimes even on the neighboring words or parts of words (morphological context), and also based on the dialect. Now, a phone is considered a unit for the purposes of linguistic description, but it can nonetheless itself be realized in different ways, depending on a variety of factors.
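As a deliberately simplified illustration of context-dependent allophone selection, here is a sketch of the American English flapping rule, where /t/ between two vowels is often realized as the flap (which TIMIT writes as dx). The realize function and the partial vowel list are mine, for illustration only:

VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uw", "uh", "er", "ax"}

def realize(phoneme, prev_phone, next_phone):
    # Very simplified: /t/ between two vowels is often realized
    # as the flap [dx] (as in 'butter'); elsewhere we just keep
    # the default realization.
    if phoneme == "t" and prev_phone in VOWELS and next_phone in VOWELS:
        return "dx"
    return phoneme

print(realize("t", "ah", "er"))  # 'dx' -- flapped, as in 'butter'
print(realize("t", "s", "ih"))   # 't'  -- no flapping after /s/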
In the TIMIT corpus, the phonetic symbols used in the phonetic transcriptions of the spoken sentences represent phones, not phonemes. A quasi-phonemic transcription of all the words of the corpus can be found in the file:
TIMITDIC.TXT “This file contains a dictionary of all the words in the TIMIT prompts.”
in the documentation folder of the corpus.
The transcription in this lexicon is called quasi-phonemic because it mostly represents phonemes, but makes a few more sound distinctions (frequent in speech) than are strictly necessary to distinguish meanings. Note that the same basic symbols are used for phones in the phonetic transcriptions of the sentences and for (quasi-)phonemes in the lexicon, but in the lexicon they are enclosed in slashes. For each group of allophones, a unique symbol is used to represent the corresponding phoneme. Also, in the TIMIT phonetic transcriptions, some symbols are introduced to distinguish the closure phase (during which airflow ceases) from the release phase of stop consonants (b, d, g, p, t, k). For more information, please refer to these two files, found in the same folder:
PHONCODE.DOC “This file contains a table of all the phonemic and phonetic symbols used in the TIMIT lexicon and in the phonetic transcriptions.”
TIMITDIC.DOC “TIMIT Lexicon Documentation”
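Since we will probably need to read this lexicon programmatically, here is a hedged Python sketch of a parser. I am assuming the format described in TIMITDIC.DOC (one entry per line of the form word followed by its transcription between slashes, comment lines starting with ‘;’, and the digits 1 and 2 marking stress); load_lexicon is a name I am inventing here:

import re

def load_lexicon(path="TIMITDIC.TXT"):
    # Parse the lexicon into {word: [phoneme symbols]}, assuming one
    # entry per line of the form: word  /ph ph .../
    # with ';' starting comment lines and digits 1/2 marking stress.
    lexicon = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";"):
                continue  # skip comments and blank lines
            word, trans = line.split("/", 1)
            # Strip the stress digits to keep bare phoneme symbols.
            phonemes = [re.sub(r"[12]", "", p) for p in trans.strip("/ ").split()]
            lexicon[word.strip()] = phonemes
    return lexicon

lexicon = load_lexicon()
print(lexicon.get("fill"))  # expected to be something like ['f', 'ih', 'l']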
I also recommend this book on phonetics:
Raphael, L. J., Borden, G. J., & Harris, K. S. (2007). Speech science primer: Physiology, acoustics, and perception of speech. Lippincott Williams & Wilkins.
Returning to the architecture of the neural network: considering what I have discussed in this post, I think it might be better to have two separate hidden layers to represent phonemes and phones, respectively, and probably a third layer to represent other articulatory factors.
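To make this suggestion slightly more tangible, here is a very loose NumPy sketch of a forward pass with one hidden layer per level of representation. All layer sizes and weights are invented for illustration, and this is only one way such a stack could be wired, not a worked-out design:

import numpy as np

# All sizes invented purely for illustration.
N_PHONEMES, N_PHONES, N_ARTIC, N_ACOUSTIC = 40, 60, 20, 128
rng = np.random.default_rng(0)
W_phone = rng.normal(0.0, 0.1, (N_PHONEMES, N_PHONES))  # phoneme layer -> phone layer
W_artic = rng.normal(0.0, 0.1, (N_PHONES, N_ARTIC))     # phone layer -> articulatory layer
W_out = rng.normal(0.0, 0.1, (N_ARTIC, N_ACOUSTIC))     # articulatory layer -> acoustic output

def forward(phoneme_onehot):
    # One hidden layer per level of description:
    # phones, then articulatory factors, then acoustic features.
    phones = np.tanh(phoneme_onehot @ W_phone)
    artic = np.tanh(phones @ W_artic)
    return artic @ W_out

x = np.zeros(N_PHONEMES)
x[3] = 1.0                # a one-hot phoneme as input
print(forward(x).shape)   # (128,) -- one frame of acoustic features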
But even before considering the architecture, maybe we ought to ask ourselves what we want the input to our system to be. If we still want the input to be phonemes, then training it with examples of phone-to-sound correspondences (.PHN files to .WAV files) will not be enough. We will also have to make it learn phoneme-to-phone correspondences. That could mean first taking the word transcriptions (.WRD files) and looking up each word in the lexicon (TIMITDIC.TXT) to get the corresponding phonemes, then establishing the correspondence with the phones (.PHN files).
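Here is a hedged sketch of that lookup step, reusing the load_lexicon sketch above and assuming the usual TIMIT annotation format (one ‘begin_sample end_sample token’ line per unit in both .WRD and .PHN files):

def read_annotations(path):
    # A TIMIT .WRD or .PHN file has one 'begin_sample end_sample token'
    # line per word or phone, respectively.
    with open(path) as f:
        return [(int(begin), int(end), token)
                for begin, end, token in (line.split() for line in f if line.strip())]

def phonemes_for_utterance(wrd_path, lexicon):
    # Look each word of the utterance up in the lexicon to get the
    # phoneme sequence that the phones of the .PHN file realize.
    phonemes = []
    for _begin, _end, word in read_annotations(wrd_path):
        phonemes.extend(lexicon[word.lower()])
    return phonemes

# Hypothetical usage, for one utterance:
# phones = [token for _b, _e, token in read_annotations("SA1.PHN")]
# print(phonemes_for_utterance("SA1.WRD", lexicon))

Note that the realized phones will not match the dictionary phonemes one for one (there are insertions, deletions, and substitutions in actual speech), so the phoneme-to-phone alignment itself would still have to be computed, for example by dynamic programming, or learned by the network.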