Presentation
This is the home screen of the application, where you can register to create an IBM account or, if you already have one, log in. The application also offers a demo that converts written text into audio files; to open it, click "Watch the demo".

Once you click "Watch the demo", the following screen appears. Starting from a text, the service creates audio files with a cadence and intonation appropriate to the chosen language. It offers 27 voices (13 neural and 14 standard) in 7 languages; for natural-sounding audio, neural voices (V3, enhanced DNN) are recommended. The selected voices support expressive synthesis (SSML) and voice transformation. The language of the text must match the language of the selected voice; otherwise the pronunciation will not be correct. The audio is returned in MP3 format, which can be played in players such as VLC and Audacity.

After entering the text, click the "speak" button to hear the audio for the written text. To download the audio, right-click, choose the "copy audio address" option, open the address in another browser window, and download the file from there.

As indicated above, the selected voices support expressive synthesis (SSML). Speech Synthesis Markup Language is a standard markup language for controlling aspects of speech synthesis such as pronunciation, volume, tone, and speed, and for inserting pauses. For example, in the following screen you can see the "break" element with its "time" attribute, which sets the length of the pause in seconds or milliseconds. Another example is the "prosody" element, which controls the tone, speaking speed, and volume of the text; its "rate" attribute changes the speed at which the text is pronounced. SSML also provides other features: expressive SSML (how the text should be expressed when it is spoken), SSML for voice transformation, and phonemes, which specify the phonetic spelling used to pronounce a word.

To define the phonetic pronunciation of a word, use the "phoneme" element, which takes two attributes. The first, "alphabet", specifies the notation of the pronunciation; it can be set to "ibm" (pronunciation defined in SPR) or "ipa" (pronunciation defined in IPA). The second, "ph", defines the pronunciation itself, that is, how the word enclosed in the element should be pronounced. The SPR and IPA symbols for the various languages can be found in the application's documentation section.
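The elements described above can be combined in a single SSML text. The following is a minimal sketch in Python, not taken from the application: the sentences, the pause length, the "slow" rate value, and the IPA string are illustrative assumptions. It only builds the markup and checks that it is well-formed XML; it does not call the service.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML text. The wording, the 500 ms pause, the "slow" rate and
# the IPA transcription are assumptions chosen for the example.
ssml = (
    '<speak>'
    'Hello.<break time="500ms"/>'
    '<prosody rate="slow">This sentence is spoken more slowly.</prosody> '
    'I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.'
    '</speak>'
)

# Parsing raises an error if a tag is left unclosed or an attribute is malformed.
root = ET.fromstring(ssml)

# Collect each SSML element directly inside <speak> with its attributes.
elements = [(el.tag, dict(el.attrib)) for el in root]
for tag, attrs in elements:
    print(tag, attrs)
```

The string held in `ssml` is what would be pasted into the text box of the demo in place of plain text.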
SSML for voice transformation expands the range of available voices through the "voice-transformation" element, which can enclose a whole text, a phrase, or a single word. Voice transformation is supported only for some of the available voices. Transformations can be built-in or custom. Built-in transformations apply predefined changes selected through the "type" attribute, which can be set to "Young" (gives a youthful intonation) or "Soft" (gives more softness); the "strength" attribute, which takes a value from 0% to 100%, sets how strongly the change is applied. Custom transformations give you finer control over the voice. To use them, set the "type" attribute to "Custom" and then add a set of further attributes. For example, in the following screen you can see the attribute "glottal_tension", which increases or decreases the glottal tension of the voice; the attribute "timbre", which changes the timbre of the voice; and the attribute "rate", which increases or decreases the speed of speech.
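As a rough illustration of the two kinds of transformation, the sketch below builds one built-in and one custom "voice-transformation" element in Python. The helper name `voice_transformation`, the sample sentences, and the attribute values (80%, -80%, -20%) are hypothetical and chosen only for the example; consult the application documentation for the values each attribute accepts.

```python
import xml.etree.ElementTree as ET

def voice_transformation(text, **attrs):
    """Wrap text in a voice-transformation element (hypothetical helper)."""
    element = ET.Element("voice-transformation", attrs)
    element.text = text
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(element, encoding="unicode")

# Built-in transformation: "type" selects the preset, "strength" its intensity.
young = voice_transformation("Hello, how are you?", type="Young", strength="80%")

# Custom transformation: "type" is "Custom", followed by a set of attributes.
# The percentage values here are assumptions, not recommended settings.
custom = voice_transformation(
    "Hello, how are you?", type="Custom", glottal_tension="-80%", rate="-20%"
)

# Place both fragments inside one SSML text.
ssml = "<speak>" + young + " " + custom + "</speak>"
print(ssml)
```

Building the element with an XML library rather than by string concatenation ensures that special characters in the text are escaped and that the tags are always closed.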