Project Proposal: English Synthetic Speech with a Portuguese Accent

PDM 351

Project Proposal

C. Serge de Souza

desouza@computing.edu.au

Supervisor: Andrew Marriott

ABSTRACT

Curtin University is a partner in the European Union Fifth Framework Research and Technology Project. This project aims to define and implement tools that make man-machine interaction more effective and more natural through multimodal web interfaces and shared or augmented environments. The Talking Head, which is part of this project, includes a text-to-speech synthesiser containing a natural language processor and a digital signal processor, which together allow it to produce speech from text. The text can be marked up using the Speech Markup Language to add emotion to the speech. The other components of the head are the brain, which uses artificial intelligence, the persona and the face. The head is meant to simulate a human and has a large number of applications, such as a newsreader or a virtual storyteller. The applications are not limited to humans, given that an animal can also be represented. The aim of this project is to add an accent to the talking head so that it can represent a foreigner more realistically. The accent will be a Portuguese accent over English speech. The accent is added by mapping the English phonemes to Portuguese phonemes and catering for phonemes that do not exist in Portuguese; the speech is then generated using a Portuguese voice.

1 Introduction

ACCENTS IN SPOKEN LANGUAGE

ACCENTS IN TTS

ENGLISH TO PORTUGUESE

2 OBJECTIVES

3 APPROACH

4 PLAN

5 Reference List

Introduction

The talking head (TH) is a virtual 3D computer-animated head that is being developed in collaboration with a consortium of 11 universities and organisations as part of the Interface project (Interface Homepage, 2001). The talking head is meant to simulate a human being using various audio and video tools. These tools are intended to make the talking head seem more natural and thus to facilitate communication with a human being. They include OpenGL, MPEG-4 and the Virtual Human Markup Language, VHML (VHML Homepage, 2001), which includes the Speech Markup Language (SML); the talking head also makes use of artificial intelligence and of a text-to-speech (TTS) synthesiser. The TH may be used in a wide range of applications, such as an interactive newsreader, an overseas news correspondent, a storyteller, a mentoring system (Mentor Homepage, 2001), a virtual lecturer or a virtual salesperson. The TH is capable of expressing emotion through facial expression using facial animation parameters (FAP), which modify the geometry of the face. This feature is supported by the Facial Animation Markup Language (FAML). A texture map can also be applied to the TH, giving it a closer resemblance to the individual from whom the texture map was derived (Tschirren, 2000). This feature improves the realism of the TH.

The TTS module is treated as a black box; it allows the TH to speak using text provided by the brain. The TTS synthesiser "comprises a natural language processing (NLP) module, capable of producing a phonetic transcription of the text to be read, together with the desired intonation and rhythm, and a digital signal processing (DSP) module, which transforms the symbolic information it receives into natural-sounding speech" (Dutoit, 1997, p.14). The TH currently produces emotive speech using SML (Stallo, 2000). The output from the natural language processor (NLP) is modified before it is supplied to the digital signal processor (DSP) in order to obtain the desired emotion in the output voice. This is a significant improvement over the flat, neutral voice of the synthesiser and makes the TH sound more natural.

It is necessary to add some personality to the TH; the fact that it can be made to resemble an existing person and that it can express emotion through facial expression and speech are important improvements. If the TH were required to represent a foreigner, we could apply a texture map representing a person of whichever ethnicity we wanted, but the voice would remain a problem. Currently the TH only allows for a pure English voice (or that of any other language supported by the TTS synthesiser). Thus, if we were representing a person of foreign origin, we would have to use a pure English voice, which would not always be appropriate. It would be better to have the option of using a voice with an accent, because this would increase both the number of applications and the realism of the TH.

ACCENTS IN SPOKEN LANGUAGE

Accents are normally due to what linguists call first language (L1) interference or "negative transfer of the structures and patterns of one first language to the second language" (Major, 1987, p.185). The first language does not have to be the speaker's mother tongue. For example, a person whose mother tongue is French and who has a good knowledge of Spanish may speak Portuguese with interference from both French and Spanish, even though Spanish is not their mother tongue. This is, however, quite a complex situation because factors other than L1 interference are involved.

Linguists also recognise other factors that cause people to speak with an accent; the more important ones are age, developmental factors and style.

The age argument recognises that "older learners have difficulty learning how to speak a foreign language without an accent" (Major, 1987, p.185). According to Major (1987, p.186), Scovel (1969) "claims that learning native pronunciation in a foreign language after puberty is impossible", but Hill (1970), also cited in Major, claims that it is possible provided the "affective and cultural factors are present".

The developmental factors argument covers cases where the accents and mistakes are comparable to those made by a child learning a language, or cannot be linked directly to L1 interference (Major, 1987, p.190). These include the improvisation of sounds in languages that the person has never heard or knows little about: the speaker replaces the natural sound with an incorrect sound that cannot be explained by L1 interference.

The style factor argument holds that the speaker adapts their speech to the context. For example, a person may mispronounce a word in conversation but pronounce it correctly when reading it from a word list. Another case is that of a person fluent in both L1 and L2: when speaking to monolingual L1 speakers, this person will pronounce L2 words with the same mistakes as those speakers, but in the company of native L2 speakers will pronounce the same words correctly (Major, 1987, p.192).

L1 interference is the most commonly accepted cause of accents and was the foundation of what is now known as Contrastive Analysis. However, L1 interference by itself cannot explain all of the errors made (Major, 1987, p.187). The theory has since progressed and, according to Azevedo (1981, pp.6-7) citing Lado (1968, pp.124-125), it is possible to "predict probabilistically many of the distortions that a speaker of L1 is most likely to introduce into L2 as he learns it [...] the inventory of distortions does not represent behaviour that will be exhibited by every subject on every trial. It represents behaviour that is likely to appear with greater than random frequency, and it represents the pressures that have to be overcome" by the speaker of a foreign language.

ACCENTS IN TTS

To add an accent to the TH, the approach suggested by Mike Hamilton on his web site (Cross-language synthesis with MBROLA, 2000) and by e-mail (Hamilton, M. 2001 March 21) is to take the phoneme output of an English NLP and "map each English phoneme to the nearest sounding phoneme in the other language", and finally to provide for the cases where the English phonemes are unpronounceable in the target language. The modified output is then synthesised using a voice in the language of the accent we want to add. For example, if we wanted a Spanish accent we would use a Spanish voice for the final output.

This work was followed up by Anne Warlus (Transcribe program for mbrola, 2001), who implemented it in Perl following the same approach, although according to Hamilton the method used is different (Hamilton, M. 2001 March 27).

This approach is consistent with the linguistic theory that accents are due to L1 interference.
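The mapping step described above can be sketched as follows. This is only a minimal illustration, assuming SAMPA-like phoneme symbols; the function name `map_phonemes` and both tables are hypothetical, and their entries are placeholders rather than values taken from Hamilton, Warlus or Azevedo (1981).

```python
# Hypothetical table: English phoneme -> nearest-sounding Portuguese phoneme.
# The entries are illustrative placeholders, not data from Azevedo (1981).
NEAREST = {
    "i:": "i",
    "I": "i",
    "N": "n",
    "t": "t",
    "k": "k",
    "s": "s",
}

# Hypothetical fallbacks for English phonemes assumed to be unpronounceable
# in the target language. Each maps to a list, so a phoneme can be replaced
# by several phonemes or dropped altogether (empty list).
FALLBACK = {
    "T": ["t"],  # "th" as in "think"
    "D": ["d"],  # "th" as in "this"
    "h": [],     # assumed to be dropped
}


def map_phonemes(english, nearest=NEAREST, fallback=FALLBACK):
    """Map a sequence of English phonemes to target-language phonemes."""
    mapped = []
    for phoneme in english:
        if phoneme in nearest:
            mapped.append(nearest[phoneme])
        elif phoneme in fallback:
            mapped.extend(fallback[phoneme])
        else:
            # Fail loudly rather than pass an unknown symbol to the voice.
            raise KeyError(f"no mapping for English phoneme {phoneme!r}")
    return mapped


# "think", transcribed here as T I N k for the purpose of the example.
print(map_phonemes(["T", "I", "N", "k"]))
```

The mapped sequence would then be passed to the synthesiser with the voice of the target language selected; that step depends on the TTS package and is not shown.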

ENGLISH TO PORTUGUESE

In order to do the mapping, it is necessary to have the correspondence between English phones and Portuguese phones. This can be obtained from the book by Azevedo (1981), which was suggested by Schutz (2001 May 5). The book concentrates more on the mapping of phones from Portuguese to English, but there are chapters dedicated to Portuguese phones on the one hand and to English phones on the other. From these chapters it should be possible to derive a mapping of English phones to Portuguese. This can be complemented by Schutz's work on the correspondence of vowels (Schutz, 2001a) and of consonants (Schutz, 2001b).

OBJECTIVES

The main objective of this project is to add a foreign accent to the TH. The accent will be that of a person whose mother tongue is Portuguese and will thus speak English with a Portuguese accent.

If necessary, this new feature will then have to be integrated with the existing modifications that have been made to the output of the NLP. The TH should be able to speak with an accent and at the same time display emotion.

Learning how to work in a group is another objective of this project. Work will be carried out in collaboration with other students involved in different areas of the Interface project. Sharing of knowledge and resources will be one aspect of this collaboration, in particular with students involved in the development of the TTS package.

An additional objective is to attempt to make it possible to specify the degree of interference of Portuguese in the speech output by the TH. This requires additional information on the probability of distortion of given phonemes (Lado, 1968, pp.124-125).
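One possible way to realise a degree of interference is to treat each substitution as a random event whose probability is scaled by a single parameter. The sketch below assumes this design; the function `apply_interference`, the `degree` parameter and the probabilities in the table are all hypothetical and invented for illustration, since real values would have to come from contrastive data. The sketch also ignores the question of which voice renders the phonemes that are left unchanged.

```python
import random

# Hypothetical table: English phoneme -> (substitute, probability of distortion).
# The probabilities are invented for illustration only.
DISTORTION = {
    "T": ("t", 0.9),
    "D": ("d", 0.8),
    "I": ("i", 0.6),
}


def apply_interference(phonemes, degree, rng=None, table=DISTORTION):
    """Substitute phonemes at random, scaled by a degree between 0 and 1."""
    if not 0.0 <= degree <= 1.0:
        raise ValueError("degree must be between 0 and 1")
    rng = rng or random.Random()
    result = []
    for phoneme in phonemes:
        # Phonemes missing from the table are never substituted.
        substitute, probability = table.get(phoneme, (phoneme, 0.0))
        # rng.random() lies in [0, 1), so a degree of 0 never substitutes.
        if rng.random() < degree * probability:
            result.append(substitute)
        else:
            result.append(phoneme)
    return result


# A seeded generator makes a run repeatable, which helps with testing.
print(apply_interference(["D", "I", "s"], degree=0.5, rng=random.Random(1)))
```

With a degree of 0 the input is returned unchanged, and with a degree of 1 each phoneme is substituted with the full probability given in the table.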

The success of the implementation will be determined by means of a survey. The survey will determine whether there is a perceptible difference in the speech before and after the implementation of the accent, and whether the origin of the accent can be identified. The origin may be specified by stating whether the voice sounds French, Chinese or like that of a speaker of any other language. If the degree of interference feature is implemented, this will also be taken into account in the survey.
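As an illustration of how the survey answers on origin might be summarised, the sketch below computes the share of respondents who name the intended origin. The function `identification_rate` and the sample responses are hypothetical; the actual survey design is not fixed by this proposal.

```python
from collections import Counter


def identification_rate(responses, target="Portuguese"):
    """Return the share of responses naming the target origin, and the tally."""
    if not responses:
        raise ValueError("no survey responses")
    # Normalise spacing and capitalisation before counting.
    tally = Counter(response.strip().title() for response in responses)
    return tally[target] / len(responses), tally


# Made-up responses, for illustration only.
rate, tally = identification_rate(["Portuguese", "french", "Portuguese ", "Chinese"])
print(rate)   # 0.5
print(tally.most_common())
```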

APPROACH

A software engineering approach will be taken here. The method followed will be the incremental model. The first step will be to modify the voice in order to add the accent.

Once this has been done, an attempt will be made to add the ability to specify the degree of interference of L1, which is Portuguese in our case.

The next step will be to integrate the work done into the existing system.

At each step of the development process testing will be performed in order to verify the correctness of the implementation.

If, due to time restrictions, some of the steps cannot be implemented, they will be left out. The main goal is to implement the accent feature; the degree of interference is a secondary objective.

The Elan Speech Cube will be the TTS synthesiser used, unless it is determined to be inappropriate for this project. If the Speech Cube is used, it may not be necessary to integrate with the speech emotion feature of the TH, given that this feature has not yet been tested with the new TTS synthesiser. At this point it appears that the Speech Cube only supports Brazilian Portuguese, which differs from European Portuguese in many respects, one of them being phonetics. We will assume that it is possible to use the information provided by Azevedo (1981) to implement the accent and obtain the desired results, because we will not use the phonemes present in Brazilian Portuguese that are absent from European Portuguese. If this assumption is incorrect, we will revert to the MBROLA TTS synthesiser.

Figure: Software Engineering Incremental Model for the project

PLAN

Mapping: The phonemes will be mapped from English to Portuguese and cases where the English phoneme does not exist in Portuguese will be dealt with (1 week).

Software Writing: during this phase the accent module will be written following the Software Engineering approach described in the previous section, that is, an incremental approach with testing at the end of each step (5 weeks).

System Integration: the accent module will be integrated into the TH (1 week).

System Testing: the system will be tested with respect to the TTS section in order to determine that it is working correctly (2 weeks).

Evaluation: during this phase the implementation will be evaluated through a survey (1 week). A report based on the data collected will then be generated (1 week).

Project write-up: the project report will be written during these last few weeks (4 weeks).

Reference List

Azevedo, M. (1981). A Contrastive Phonology of Portuguese and English. Washington D.C.: Georgetown University Press.

Beard, S., Crossman, B., Cechner, P. and Marriott, A. (1999). "FAQbot", Proceedings of Pan Sydney Area Workshop on Visual Information Processing, Nov 1999, University of Sydney, Australia.

Cross-language synthesis with MBROLA [Online] Available: http://www.hamilton.net.au/mbrola.html [2000 June 22]

Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers.

Interface Homepage [Online] Available: http://www.interface.computing.edu.au/ [2001, May].

Hamilton, M. (2001, March 21) Re: TTS of English with an accent. E-mail to C.S. de Souza (desouza@cs.curtin.edu.au).

Hamilton, M. (2001, March 27) Re: Thank you. E-mail to C.S. de Souza (desouza@cs.curtin.edu.au).

Hill, J. (1970): "Foreign Accents, Language Acquisition and Cerebral Dominance Revisited." Language Learning, 20, pp.247-248.

Lado, R. (1968). Contrastive Linguistics in a Mentalistic Theory of Language Learning. Georgetown University: Round Table on Languages and Linguistics 1968. Edited by James E. Alatis. Washington D.C.: Georgetown University Press. 123-135.

Major, R.C. (1987). Foreign Accent: Recent Research and Theory. International review of applied linguistics in language teaching 25 (2), 185-202.

Mentor Homepage [Online] Available: http://www.mentor.computing.edu.au/ [2001 May].

Schutz, R. (2001, May 5) Re: Attn: Ricardo Schutz, Projecto Interface. E-mail to C.S. de Souza (desouza@cs.curtin.edu.au).

Schutz, R. (2001a) A Questão das Vogais - English and Portuguese Vowels Phonemes Compared [Online] Available: http://www.sk.com.br/sk-voga.html [2001 June].

Schutz, R. (2001b) As Consoantes em Inglês e Português - English and Portuguese Consonant Phonemes Compared [Online] Available: http://www.sk.com.br/sk-conso.html [2001 June].

Scovel, T. (1969): "Foreign Accents, Language Acquisition, and Cerebral Dominance." Language Learning, 19, pp.245-253.

Stallo, J. (2000) "Simulating emotional speech for a talking head", Honours Dissertation, Curtin University of Technology, Perth, Australia.

Transcribe program for mbrola [Online] Available: http://tcts.fpms.ac.be/synthesis/mbrola/tts/xlang/

Tschirren, B. (2000) "Realism and Believability in MPEG-4 Facial Models", Honours Dissertation, Curtin University of Technology, Perth, Australia.

VHML Homepage [Online] Available: http://www.vhml.org/ [2001, May 22].