Basic Phonetics and Phonology concepts

I have heard the term ‘phoneme’ used inappropriately, so I will try to put my linguistics background to use by clarifying some phonetics and phonology notions. A better understanding of these concepts could help us define the problem of synthesizing speech more precisely and thus better design the architecture of our neural networks.

First, the most primitive notion about sounds in language is that of a phonological system. Each language has one. It defines what sound differences are to be interpreted as differences in meaning.

A phoneme is a family of sounds that are understood as the same sound for the purpose of determining meaning in a given language. This notion only makes sense as part of a phonological system, a system of oppositions between phonemes that serves to contrast different meanings. For example, in English, the difference between the pronunciations of the words

‘feel’  /f iy l/

and

‘fill’  /f ih l/

is a change of a single phoneme, the vowel. But in French, those two families of vowel sounds are considered a single phoneme (a single family of sounds), and thus a unilingual francophone would only hear the same word spoken twice (pronounced slightly differently, if listening attentively):

‘fil’ /f iy l/

In that example, the two phonetic variants [iy] and [ih] of the French phoneme /iy/ are called allophones.

A phone is a unit of speech sound and allophones are phones belonging to a single phoneme. So two phones can be considered different phonemes or allophones of the same phoneme, depending on the language.

To distinguish phones from phonemes, the usual notation is to use slashes /…/ for phonemic transcriptions (phonemes) and square brackets […] for phonetic transcriptions (phones). (In the preceding examples, I have used the same phonetic symbols as in the TIMIT corpus, but other phonetic alphabets are also used.)
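
To make this relation concrete, here is a minimal sketch in Python. The groupings below cover only the two phones from the example above, and both the tables and the helper function are purely illustrative, not part of any real phoneme inventory.

```python
# Illustrative only: the same two phones grouped differently depending on
# the language's phonological system (symbols follow the TIMIT convention).

english_phonemes = {
    "iy": ["iy"],        # 'feel' /f iy l/
    "ih": ["ih"],        # 'fill' /f ih l/  -> two distinct phonemes
}

french_phonemes = {
    "iy": ["iy", "ih"],  # 'fil' /f iy l/   -> [iy] and [ih] are allophones
}

def same_phoneme(phone_a, phone_b, phoneme_table):
    """True if the two phones are allophones of one phoneme in this system."""
    return any(phone_a in allophones and phone_b in allophones
               for allophones in phoneme_table.values())

print(same_phoneme("iy", "ih", english_phonemes))  # False: contrast in meaning
print(same_phoneme("iy", "ih", french_phonemes))   # True: same phoneme
```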

When realizing a given phoneme, the corresponding allophone can be selected based on the neighboring phonemes (phonemic context), sometimes even the neighboring words or parts of words (morphological context), and also based on the dialect. Now, a phone is considered a unit for the purposes of linguistic description, but it can nonetheless itself be realized in different ways, depending on a variety of factors.

In the TIMIT corpus, the phonetic symbols used in the phonetic transcriptions of the spoken sentences represent phones, not phonemes. A quasi-phonemic transcription of all the words of the corpus can be found in the file:

TIMITDIC.TXT  “This file contains a dictionary of all the words in the TIMIT prompts.”

in the folder:

/data/lisa/data/timit/raw/TIMIT/DOC/

The transcription in this lexicon is called quasi-phonemic because it mostly represents phonemes, but makes a few more sound distinctions (frequent in speech) than are strictly necessary to distinguish meanings. Note that the same basic symbols are used for phones in the phonetic transcriptions of the sentences and for (quasi-)phonemes in the lexicon, but in the lexicon they are enclosed in slashes. For each group of allophones, a unique symbol is used to represent the corresponding phoneme.

Also, in the TIMIT phonetic transcriptions, some symbols are introduced to distinguish the closure phase (during which airflow ceases) from the release phase of the stop consonants (b, d, g, p, t, k). For more information, please refer to these two files, found in the same folder:

PHONCODE.DOC  “This file contains a table of all the phonemic and phonetic symbols used in the TIMIT lexicon and in the phonetic transcriptions.”

TIMITDIC.DOC “TIMIT Lexicon Documentation”
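
To make the closure/release distinction a little more concrete, here is a minimal sketch in Python. It assumes the TIMIT closure symbols bcl, dcl, gcl, pcl, tcl and kcl for the stops b, d, g, p, t, k; the exact symbol set should still be checked against PHONCODE.DOC.

```python
# Sketch: relating TIMIT closure symbols to their stop consonants and
# collapsing closure + release into a single symbol. Check PHONCODE.DOC
# for the authoritative list of symbols.

CLOSURE_TO_STOP = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}

def collapse_closures(phones):
    """Merge each closure with the release that follows it, if present."""
    merged = []
    i = 0
    while i < len(phones):
        phone = phones[i]
        if phone in CLOSURE_TO_STOP:
            stop = CLOSURE_TO_STOP[phone]
            # The release does not always occur (e.g. before another stop),
            # so keep a single symbol whether or not it is present.
            if i + 1 < len(phones) and phones[i + 1] == stop:
                i += 1
            merged.append(stop)
        else:
            merged.append(phone)
        i += 1
    return merged

# A hypothetical phone sequence with closure + release pairs:
print(collapse_closures(["bcl", "b", "ae", "dcl", "d", "ao", "gcl", "g"]))
# -> ['b', 'ae', 'd', 'ao', 'g']
```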

I also recommend this book on phonetics:

Raphael, L. J., Borden, G. J., & Harris, K. S. (2007). Speech science primer: Physiology, acoustics, and perception of speech. Lippincott Williams & Wilkins.

Returning to the architecture of the neural network: considering what I have discussed in this post, it might be better to have two separate hidden layers to represent phonemes and phones, respectively, and probably a third layer to represent other articulatory factors.
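
As a very rough illustration of what such a layering could look like, here is a sketch in Python with NumPy. All the layer sizes, the non-linearities and the way the layers are combined are placeholder assumptions, not a worked-out design.

```python
import numpy as np

# Rough sketch: a phoneme layer feeding a phone layer, plus a separate
# layer of other articulatory/contextual factors, combined before the
# output. Every size and weight here is a placeholder.

rng = np.random.default_rng(0)
N_PHONEMES, N_PHONES, N_CONTEXT, N_HIDDEN, N_OUT = 40, 60, 10, 128, 1

W_phoneme_to_phone = rng.normal(0.0, 0.1, (N_PHONEMES, N_PHONES))
W_phone_to_hidden = rng.normal(0.0, 0.1, (N_PHONES, N_HIDDEN))
W_context_to_hidden = rng.normal(0.0, 0.1, (N_CONTEXT, N_HIDDEN))
W_hidden_to_out = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUT))

def forward(phoneme_onehot, context_features):
    """One forward pass; the phone layer mediates between phonemes and sound."""
    phone_layer = np.tanh(phoneme_onehot @ W_phoneme_to_phone)
    hidden = np.tanh(phone_layer @ W_phone_to_hidden
                     + context_features @ W_context_to_hidden)
    return hidden @ W_hidden_to_out   # e.g. one acoustic frame value

phoneme = np.zeros(N_PHONEMES)
phoneme[5] = 1.0                      # some phoneme, one-hot encoded
context = rng.normal(size=N_CONTEXT)  # placeholder articulatory factors
print(forward(phoneme, context).shape)
```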

But even before considering the architecture, maybe we ought to ask ourselves what we want the input to our system to be. If we still want the input to be phonemes, then training it with examples of phone-to-sound correspondences (.PHN files to .WAV files) will not be enough. We will also have to have it learn phoneme-to-phone correspondences. That could imply first taking the word transcriptions (.WRD files), looking up each word in the lexicon (the TIMITDIC.TXT file) to get the corresponding phonemes, and then aligning those phonemes with the phones (.PHN files).
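
Here is a minimal sketch of that lookup step in Python. It assumes the TIMITDIC.TXT layout of one word per line followed by its transcription enclosed in slashes (with ‘;’ starting comment lines) and the usual ‘start-sample end-sample word’ layout of the .WRD files; both should be checked against the actual corpus files, and the alignment with the .PHN phones is left as the remaining (harder) step.

```python
# Sketch, under the assumptions stated above; verify the exact file
# formats against TIMITDIC.DOC and the corpus itself.

def load_lexicon(path):
    """Read TIMITDIC.TXT into {word: [phonemes]} (stress digits are kept)."""
    lexicon = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";"):
                continue  # skip blank and comment lines
            word, _, transcription = line.partition("/")
            lexicon[word.strip()] = transcription.rstrip("/").split()
    return lexicon

def words_to_phonemes(wrd_path, lexicon):
    """Replace each word of a .WRD transcription by its lexicon phonemes."""
    sequence = []
    with open(wrd_path) as f:
        for line in f:
            start, end, word = line.split()
            sequence.append((int(start), int(end), lexicon.get(word.lower())))
    return sequence

# lexicon = load_lexicon("TIMITDIC.TXT")
# print(words_to_phonemes("SA1.WRD", lexicon))
# The remaining step would be to align these phonemes with the phones and
# time marks of the corresponding .PHN file.
```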

8 thoughts on “Basic Phonetics and Phonology concepts”

  1. Pingback: Some statistics of the training set | DAVIDTOB

    1. Pierre-Luc Vaudry (post author)

      I just looked at Laurent’s equivalence file. On the one hand, acoustically it makes sense to say that the stop consonant closures are silences, because in effect the airflow is completely stopped during that phase of the articulation (in both the mouth and nose). On the other hand, it is nonetheless an important part of the articulation of those consonants. In some contexts, the release phase (the one in which airflow is resumed in a burst) does not even occur, for example when another stop consonant follows immediately (see the book reference in the main post). I think that ignoring the stop consonant closures during training may negatively impact the ability of the system to produce the corresponding phonemes. We may want to speak with Laurent and Yann Dauphin about how the phoneme dictionary was obtained.

  2. Pingback: Markov assumption and regression | Random Mumbling

  3. Pingback: First understanding | Speech synthesis experiments

  4. davidscottkrueger

    First: I’m not sure why, but it says there are 5 comments and I only see 3 (before mine). If someone can explain it, and help me access any comments I may be missing, much obliged.

    Thanks for the post! One thing I am curious about (which may not be very relevant for the project) is how phones are defined and how much agreement exists over what the set of phones should be. As you have defined phones/phonemes, it sounds like every real world phonological system could be considered a sub-system of some ideal total phonological system, and that is what I am wondering about… it might occur that different languages have similar phones, but not identical. Like for the “feel”/”fill” example, I guess either word could be interpreted as “fil” in French. But if a native French speaker were speaking in French and we inserted an audio clip of me saying “fill” where they said “fil”, I’m not sure how intelligible that would be. So what I’m saying is maybe it is not entirely accurate to say that [iy] and [ih] are allophones in French. Or maybe [ih] does not refer to exactly the same sound in different languages…

    I guess another thing I am kind of curious about is whether it is really sensible to define phones strictly in terms of their space in audio-space. What I mean is: hypothetically, might the same phone pronounced by one intelligible speaker be interpreted as a different phone if the same exact audio utterance were to have as its source another intelligible speaker with a very different accent?

    On a related note, I’ve often heard phones/phonemes defined by how a speaker physically produces the sound (rather than by reference to the audio signal itself).

    1. Pierre-Luc Vaudry (post author)

      As for the number of comments, it seems to include the number of “pingbacks” also.

      For the rest, you ask very good questions; I just wanted to keep it as simple as possible in the context of this post and for the purpose of this project. We can discuss it further in person if you wish, and I encourage you to consult the book reference I gave in the main post.

      What I can say is that it is true that phones are not pronounced exactly the same in every language. Phonologically the notion of phone is derived from the notion of phoneme, and the notion of phoneme is derived from the notion of phonological system of a given language.

      Another thing is that the production and understanding of speech are interrelated and it is easier for one to identify a phone if one is able to pronounce it. So although we obviously cannot “hear” how a sound is produced, we associate it with how it could be articulated. But on the other hand, the way the same phone is articulated varies according to the speed with which it is pronounced, for example.

