SSML tags support of Omilia TTS

Besides plain text, the input text for prompts in OCP miniApps® can contain Speech Synthesis Markup Language (SSML), thanks to the Omilia TTS engine.

The SSML of the TTS engine is based on the W3C SSML specification. However, not all SSML elements and attributes are supported. This document defines the SSML elements and attributes that can be used.

Table of Supported SSML Elements

Here is a table containing the SSML elements that are currently supported:

Element	Tag	Usage
speak	`<speak>`	Encapsulates SSML text
break	`<break>`	Adds a pause in audio
say-as	`<say-as>`	Provides control over how text should be pronounced or interpreted
prosody	`<prosody>`	Provides control over the speaking rate and volume

Description of SSML elements

speak

The speak element is the root element of SSML text. No attributes of speak are currently supported.

The use of the speak element is optional for the TTS engine, which adds it automatically when it is missing. Note that this differs from most other text-to-speech solutions, which usually require the speak element.

Example

<speak>Hello world. How are you?</speak>
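
The optional root element can be illustrated with a small client-side sketch. The helper name `ensure_speak_root` is hypothetical and only mirrors the behavior described above; it is not the engine's own implementation, which adds the element by itself.

```python
import re


def ensure_speak_root(text: str) -> str:
    """Wrap text in <speak>...</speak> unless it already has that root.

    Hypothetical helper: the TTS engine adds the root element itself,
    so this only shows what the resulting input looks like.
    """
    stripped = text.strip()
    # Treat the input as already rooted when it opens with <speak ...>
    # and closes with </speak>.
    if re.match(r"<speak[\s>]", stripped) and stripped.endswith("</speak>"):
        return stripped
    return f"<speak>{stripped}</speak>"
```

With this sketch, plain text and SSML that already has a root both end up in the same shape, which can be convenient when the same prompt is also sent to a solution that requires the speak element.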

break

The break element is used to manually insert pauses or breaks in the speech output of the TTS engine. The use of the break element is optional.

The <break> tag can take two attributes: time and strength. The time attribute specifies the duration of the pause in seconds or milliseconds, and the strength attribute specifies the relative strength of the pause. If no attributes are given, a medium strength pause (<break strength="medium"/>) is assumed by default.

Attributes	Description
`time`	The duration of the break in seconds or milliseconds (e.g. "1.5s" or "300ms")
`strength`	The relative strength of the pause. Valid values are: `x-weak`, `weak`, `medium`, `strong`, `x-strong` and `none`

Example

<speak>Let me think... <break time="0.8s"/> Ok... <break strength="medium"/> I think, I am ready</speak>
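
When prompts are generated programmatically, it can help to check the break attributes before sending the text. The following is a minimal sketch: the function name `break_tag` and the exact time pattern (a non-negative decimal followed by `s` or `ms`) are assumptions based on the examples above, not a definition taken from the engine.

```python
import re

# Strength values listed in the attribute table above.
VALID_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Assumed time syntax: a non-negative decimal number followed by "s" or "ms".
TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(ms|s)$")


def break_tag(time=None, strength=None):
    """Build a <break/> tag, rejecting values outside the documented set."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.match(time):
            raise ValueError(f"unsupported break time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in VALID_STRENGTHS:
            raise ValueError(f"unsupported break strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        # No attributes: the engine assumes a medium strength pause.
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"
```

For example, `break_tag(time="0.8s")` produces the first tag used in the example above, and `break_tag()` produces a bare `<break/>`.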

say-as

The say-as element is used to specify how text should be pronounced or interpreted by the TTS engine. The use of the say-as element is optional.

The <say-as> tag has a required attribute interpret-as, which is the main indicator of how the text should be verbalized. The optional attribute format can be used together with some interpret-as values to give more detail.

Attributes	Description
`interpret-as`	Provides the main indication of how to verbalize the text
`format`	Used in conjunction with some values of `interpret-as` to provide more detailed information on how or what to verbalize
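
A small builder can keep the two attributes consistent when say-as tags are generated from code. This is a sketch under assumptions: the name `say_as` is hypothetical, the set of accepted values is copied from the table of interpret-as values in this document, and restricting `format` to `date` and `time` reflects only the values for which this document describes a format.

```python
from xml.sax.saxutils import escape, quoteattr

# interpret-as values documented on this page.
INTERPRET_AS = {
    "characters", "spell-out", "number", "cardinal", "ordinal",
    "date", "time", "digits", "fraction", "telephone", "unit",
}

# Assumption: format is only meaningful for the values that describe it here.
FORMAT_ALLOWED = {"date", "time"}


def say_as(text, interpret_as, fmt=None):
    """Build a <say-as> element with escaped text and quoted attributes."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    if fmt is not None and interpret_as not in FORMAT_ALLOWED:
        raise ValueError(f"format is not described for {interpret_as!r}")
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # escape() protects &, < and > inside the spoken text.
    return f"<say-as {attrs}>{escape(text)}</say-as>"
```

Escaping the text matters for inputs such as `AT&T`, which would otherwise produce malformed markup.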

"interpret-as" values

You can check out the interpret-as values in detail below.

Value	Description	Example
`characters`/ `spell-out`	Both `characters` and `spell-out` options indicate spelling out each letter of the text, as in a-b-c.	`<speak>Can you spell <say-as interpret-as="characters">AB3C</say-as>, faster than me?</speak>`
`number`/ `cardinal`	Both `number` and `cardinal` options indicate that the text should be vocalized as numeric.	`<speak>The meaning of life is <say-as interpret-as="number">42</say-as>. I know it!</speak>`
`ordinal`	The `ordinal` option indicates that the text should be vocalized as an ordinal number.	`<speak>She came <say-as interpret-as="ordinal">1st</say-as> in the race.</speak>`
`date`	The `date` option indicates that the text is a date and should be vocalized accordingly. The values in the date text may be separated by `/`(slash), `-`(dash) or `.`(dot). The optional attribute `format` can also be used to indicate which value is which part of the provided date. In the `format` attribute, the `d`(day), `m`(month) and `y`(year) character symbols are expected either in single character notation, e.g. `mdy`, or in multi-character notation, e.g. `ddmmyyyy`.	`<speak>Your appointment is at <say-as interpret-as="date" format="mdy">10-12-2023</say-as>.</speak>`
`time`	The `time` option indicates that the encapsulated text is a time and should be vocalized accordingly. The values in the time text may be separated by `:`(colon). A daytime marker can also be included at the end of the text: `a.m.`, `am`, `A.M.`, `AM` or `a` for before noon and `p.m.`, `pm`, `P.M.`, `PM` or `p` for after noon. The optional attribute `format` can also be used to indicate if the given time is in 12-hour format (`hms12`) or in 24-hour format (`hms24`), with `hms12` being used by default when not specified.	`<speak>Our shop opens at <say-as interpret-as="time" format="hms12">09:30 AM</say-as>.</speak>`
`digits`	The `digits` option indicates that the text is a number that should be vocalized digit by digit.	`<speak>The code is <say-as interpret-as="digits">5862</say-as>.</speak>`
`fraction`	The `fraction` option indicates that the text is a fraction or a simple mathematical expression to be vocalized as digits accordingly.	`<speak>He ate <say-as interpret-as="fraction">2+1/2</say-as> pieces of cake.</speak>`
`telephone`	The `telephone` option indicates that the text is a telephone number and should be vocalized accordingly. Valid telephone numbers are considered up to fifteen(15) digit numbers without a plus(`+`) sign in front or up to twelve(12) digit numbers with a plus(`+`) sign in the beginning. The numbers may contain spaces, commas(`,`), dots(`.`), dashes(`-`) or parentheses(`()`) as separators between the digits, but these separators are not taken into account and neither are they verbalized. The only symbol that is verbalized is the plus(`+`) sign in the beginning.	`<speak>You can call <say-as interpret-as="telephone">+30(532)-91-1234</say-as> for more information.</speak>`
`unit`	The `unit` option indicates that the text is a number followed by either an abbreviated form of a measurement unit or a fully verbalized form of a measurement unit, which should be vocalized all together correctly in singular or plural form.	`<speak>My yard is <say-as interpret-as="unit">21 foot</say-as> wide.</speak>`
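
The telephone rule above (up to 15 digits without a leading plus sign, up to 12 digits with one, separators ignored) can be checked before the text is sent. The sketch below is only a client-side reading of that rule; the function name `is_valid_telephone` is hypothetical and the engine may apply additional checks.

```python
import re

# Separators that the description above says are ignored:
# space, comma, dot, dash and parentheses.
SEPARATORS = re.compile(r"[ ,.\-()]")


def is_valid_telephone(text):
    """Return True if text fits the digit limits described for telephone."""
    text = text.strip()
    has_plus = text.startswith("+")
    body = text[1:] if has_plus else text
    digits = SEPARATORS.sub("", body)
    # After removing separators, only ASCII digits may remain.
    if not re.fullmatch(r"[0-9]+", digits):
        return False
    limit = 12 if has_plus else 15
    return len(digits) <= limit
```

The number from the table example, `+30(532)-91-1234`, contains 11 digits after the plus sign, so it passes this check.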

prosody

The prosody element is used to specify the speaking rate and the volume of the tagged text in the speech output of the TTS engine. The use of the prosody element is optional.

The <prosody> tag currently supports two attributes, rate and volume, which control the rate and the volume of the speech. Both are optional.

Attributes	Description
`rate`	Controls the rate of the speech
`volume`	Controls the volume of the speech

"rate" values

You can check out the rate values in detail below.

Value	Description	Example
relative number	The relative number is a multiplier modifying the default rate, which is represented as a numerical factor within the range of 0.5 to 2.0. A value of 1 signifies no alteration to the original rate, 0.5 indicates a reduction to half of the original rate, and 2 denotes a doubling of the original rate.	`<speak>Please speak <prosody rate="0.6">as slow as I speak now.</prosody></speak>`
constant values	A set of constant values that affect the speech rate. Valid values are: `x-slow`, `slow`, `medium`, `fast`, `x-fast` and `default`	`<speak>How fast can you say:<prosody rate="fast">how can a clam cram in a clean cream can?</prosody></speak>`
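
The two kinds of rate values can be validated with one helper before a prompt is built. This is a minimal sketch, assuming plain text content and a hypothetical function name `prosody_rate`; the range and the constants are the ones listed above.

```python
from xml.sax.saxutils import escape

# Constant rate values listed above.
RATE_CONSTANTS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}


def prosody_rate(text, rate):
    """Wrap plain text in <prosody rate="..."> after validating the rate."""
    if isinstance(rate, str) and rate in RATE_CONSTANTS:
        value = rate
    else:
        number = float(rate)  # raises ValueError for non-numeric strings
        if not 0.5 <= number <= 2.0:
            raise ValueError(f"rate must be within 0.5 and 2.0, got {rate!r}")
        # The "g" format drops a trailing ".0", so 2.0 is written as "2".
        value = f"{number:g}"
    return f'<prosody rate="{value}">{escape(text)}</prosody>'
```

Rejecting out-of-range factors on the client side is a design choice of this sketch; how the engine itself treats such values is not stated here.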

"volume" values

You can check out the volume values in detail below.

Value	Description	Example
absolute number	The absolute number value is represented as a percentage within the range of 0.0 to 150.0, the default volume being 100.0. For example, a value of 50.0 indicates a decrease to half of the original volume, and 150.0 indicates a 50% increase in volume.	`<speak>Whisper <prosody volume="0.4">because they might hear us</prosody> or maybe not.</speak>`
constant values	A set of constant values that control the speech volume. Valid values are: `silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud` and `default`	`<speak>All day, <prosody volume="loud">I have been singing loudly to you</prosody> and only you.</speak>`
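
Volume values can be checked in the same way as rate values. The sketch below assumes plain text content and a hypothetical function name `prosody_volume`; it only enforces the percentage range of 0.0 to 150.0 and the constants listed above.

```python
from xml.sax.saxutils import escape

# Constant volume values listed above.
VOLUME_CONSTANTS = {
    "silent", "x-soft", "soft", "medium", "loud", "x-loud", "default",
}


def prosody_volume(text, volume):
    """Wrap plain text in <prosody volume="..."> after validating the volume."""
    if isinstance(volume, str) and volume in VOLUME_CONSTANTS:
        value = volume
    else:
        number = float(volume)  # raises ValueError for non-numeric strings
        if not 0.0 <= number <= 150.0:
            raise ValueError(f"volume must be within 0.0 and 150.0, got {volume!r}")
        # The "g" format drops a trailing ".0", so 50.0 is written as "50".
        value = f"{number:g}"
    return f'<prosody volume="{value}">{escape(text)}</prosody>'
```

For example, `prosody_volume("quietly", 50.0)` requests half of the default volume, since 100.0 is the default.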