SSML tag support in Omilia TTS

In addition to plain text, prompts in OCP miniApps® can use Speech Synthesis Markup Language (SSML) in their input text, thanks to the Omilia TTS engine.

The SSML of the TTS engine is based on the W3C SSML specification; however, not all SSML elements and attributes are supported. This document defines the SSML elements and attributes that can be used.

Table of Supported SSML Elements

The following SSML elements are currently supported:

| Element | Tag | Usage |
|---------|-----|-------|
| speak | <speak> | Encapsulates SSML text |
| break | <break> | Adds a pause in audio |
| say-as | <say-as> | Provides control over how text should be pronounced or interpreted |
| prosody | <prosody> | Provides control over the speaking rate and volume |

Description of SSML elements

speak

The speak element is the root element of SSML text. No attributes of speak are currently supported.

The use of the speak element is optional for the TTS engine, which adds it automatically if it is missing. Note that this differs from most other text-to-speech solutions, where speak is usually a required element.

Example

<speak>Hello world. How are you?</speak>
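
The optional-root behavior described above can be mirrored on the client side, for example when the same prompt text must also be sent to a service that requires the root element. The following is a minimal sketch, not the engine's own implementation; the function name ensure_speak_root is hypothetical, and it assumes a bare <speak> tag because no attributes of speak are supported.

```python
def ensure_speak_root(text: str) -> str:
    """Wrap prompt text in <speak>...</speak> unless it already has that root.

    Assumption: the root is always written as a bare <speak> tag,
    since no attributes of speak are currently supported.
    """
    stripped = text.strip()
    if stripped.startswith("<speak>") and stripped.endswith("</speak>"):
        return stripped
    return f"<speak>{stripped}</speak>"


if __name__ == "__main__":
    print(ensure_speak_root("Hello world. How are you?"))
```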

break

The break element is used to manually insert pauses in the speech output of the TTS engine. The use of the break element is optional.

The <break> tag can take two attributes: time and strength. The time attribute specifies the duration of the pause in seconds or milliseconds, and the strength attribute specifies the relative strength of the pause. If no attributes are given, a medium strength pause (<break strength="medium"/>) will be assumed by default.

| Attribute | Description |
|-----------|-------------|
| time | The duration of the break in seconds or milliseconds (e.g. "1.5s" or "300ms") |
| strength | The relative strength of the pause. Valid values are: x-weak, weak, medium, strong, x-strong and none |

Example

<speak>Let me think... <break time="0.8s"/> Ok... <break strength="medium"/> I think, I am ready</speak>
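
When prompts are generated programmatically, it can help to check the break attributes before the text is sent to the engine. The sketch below builds a <break> tag from the two attributes described above. The helper name break_tag is hypothetical, and the accepted duration syntax is an assumption inferred from the "1.5s" and "300ms" examples; the engine may accept other forms.

```python
import re

# Strength values listed in the attribute table above.
VALID_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Assumption: durations look like "1.5s" or "300ms".
_TIME_PATTERN = re.compile(r"\d+(\.\d+)?(s|ms)")


def break_tag(time=None, strength=None):
    """Return a <break/> tag, validating the optional time and strength."""
    attrs = []
    if time is not None:
        if not _TIME_PATTERN.fullmatch(time):
            raise ValueError(f"unsupported break time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in VALID_STRENGTHS:
            raise ValueError(f"unsupported break strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        # With no attributes, the engine assumes a medium-strength pause.
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


if __name__ == "__main__":
    print(f"<speak>Let me think... {break_tag(time='0.8s')} Ok...</speak>")
```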

say-as

The say-as element specifies how text should be pronounced or interpreted by the TTS engine. The use of the say-as element is optional.

The <say-as> tag has a required attribute, interpret-as, which is the main indicator of how the text should be verbalized. An optional format attribute can be used with some interpret-as values to give more detail.

| Attribute | Description |
|-----------|-------------|
| interpret-as | Provides the main indication of how to verbalize the text |
| format | Used in conjunction with some values of interpret-as to provide more detailed information on how or what to verbalize |

"interpret-as" values

You can check out the interpret-as values in detail below.

Value

Description

Example

characters / spell-out

Both characters and spell-out options indicate spelling out each letter of the text, as in a-b-c.

<speak>Can you spell <say-as interpret-as="characters">AB3C</say-as>, faster than me?</speak>

number / cardinal

Both number and cardinal options indicate that the text should be vocalized as a cardinal number.

<speak>The meaning of life is <say-as interpret-as="number">42</say-as>. I know it!</speak>

ordinal

The ordinal option indicates that the text should be vocalized as an ordinal number.

<speak>She came <say-as interpret-as="ordinal">1st</say-as> in the race.</speak>

date

The date option indicates that the text is a date and should be vocalized accordingly. The values in the date text may be separated by a slash (/), a dash (-) or a dot (.).

The optional format attribute can also be used to indicate which value corresponds to which part of the date. In the format attribute, the symbols d (day), m (month) and y (year) are expected either in single-character notation, e.g. mdy, or in multi-character notation, e.g. ddmmyyyy.

<speak>Your appointment is on <say-as interpret-as="date" format="mdy">10-12-2023</say-as>.</speak>

time

The time option indicates that the encapsulated text is a time and should be vocalized accordingly. The values in the time text may be separated by a colon (:).

A daytime marker can also be included at the end of the text: a.m., am, A.M., AM or a for before noon, and p.m., pm, P.M., PM or p for after noon. The optional format attribute can also be used to indicate whether the given time is in 12-hour format (hms12) or 24-hour format (hms24); hms12 is used by default when format is not specified.

<speak>Our shop opens at <say-as interpret-as="time" format="hms12">09:30 AM</say-as>.</speak>

digits

The digits option indicates that the text is a number that should be vocalized digit by digit.

<speak>The code is <say-as interpret-as="digits">5862</say-as>.</speak>

fraction

The fraction option indicates that the text is a fraction or a simple mathematical expression to be vocalized as digits accordingly.

<speak>He ate <say-as interpret-as="fraction">2+1/2</say-as> pieces of cake.</speak>

telephone

The telephone option indicates that the text is a telephone number and should be vocalized accordingly.

Valid telephone numbers are numbers of up to fifteen (15) digits without a leading plus (+) sign, or up to twelve (12) digits with a leading plus (+) sign. The numbers may contain spaces, commas (,), dots (.), dashes (-) or parentheses as separators between the digits; these separators are ignored and are not verbalized. The only symbol that is verbalized is the leading plus (+) sign.

<speak>You can call <say-as interpret-as="telephone">+30(532)-91-1234</say-as> for more information.</speak>

unit

The unit option indicates that the text is a number followed by either an abbreviated form of a measurement unit or a fully verbalized form of a measurement unit, which should be vocalized all together correctly in singular or plural form.

<speak>My yard is <say-as interpret-as="unit">21 foot</say-as> wide.</speak>
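
The interpret-as values and the telephone rules above lend themselves to a small client-side check before a prompt is submitted. The following is a sketch under stated assumptions: the names say_as and is_valid_telephone are hypothetical, the set of values is copied from the entries above, and the telephone check only mirrors the digit limits and separators described in this document rather than the engine's actual parser.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# interpret-as values described in this document.
INTERPRET_AS_VALUES = {
    "characters", "spell-out", "number", "cardinal", "ordinal", "date",
    "time", "digits", "fraction", "telephone", "unit",
}


def is_valid_telephone(text: str) -> bool:
    """Check the documented limits: 15 digits without '+', 12 digits with '+'."""
    stripped = text.strip()
    has_plus = stripped.startswith("+")
    body = stripped[1:] if has_plus else stripped
    # Only digits and the documented separators may follow the optional plus.
    if not re.fullmatch(r"[0-9 ,.\-()]+", body):
        return False
    digit_count = sum(ch.isdigit() for ch in body)
    max_digits = 12 if has_plus else 15
    return 1 <= digit_count <= max_digits


def say_as(text: str, interpret_as: str, fmt=None) -> str:
    """Wrap text in a <say-as> tag after validating the interpret-as value."""
    if interpret_as not in INTERPRET_AS_VALUES:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    if interpret_as == "telephone" and not is_valid_telephone(text):
        raise ValueError(f"not a valid telephone number: {text!r}")
    attrs = f'interpret-as="{interpret_as}"'
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # Escape &, < and > so the wrapped text cannot break the markup.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


if __name__ == "__main__":
    print(say_as("10-12-2023", "date", "mdy"))
    print(say_as("+30(532)-91-1234", "telephone"))
```

Validating on the client side is a design choice rather than a requirement; it simply surfaces malformed input earlier than a synthesis request would.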

prosody

The prosody element is used to specify the speaking rate and volume of the tagged text synthesized by the TTS engine. The use of the prosody element is optional.

The <prosody> tag currently supports two attributes, rate and volume, both of which are optional.

| Attribute | Description |
|-----------|-------------|
| rate | Controls the rate of the speech |
| volume | Controls the volume of the speech |

"rate" values

You can check out the rate values in detail below.

Value

Description

Example

relative number

The relative number is a multiplier modifying the default rate, which is represented as a numerical factor within the range of 0.5 to 2.0. A value of 1 signifies no alteration to the original rate, 0.5 indicates a reduction to half of the original rate, and 2 denotes a doubling of the original rate.

<speak>Please speak <prosody rate="0.6">as slow as I speak now.</prosody></speak>

constant values

A set of constant values that affect the speech rate. Valid values are:

  • x-slow

  • slow

  • medium

  • fast

  • x-fast

  • default

<speak>How fast can you say: <prosody rate="fast">how can a clam cram in a clean cream can?</prosody></speak>
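
Because rate accepts either a multiplier in the range 0.5 to 2.0 or one of the constant values, a small helper can reject out-of-range input before synthesis. This is a minimal sketch; the names rate_attribute and prosody_rate are hypothetical, and it assumes the range bounds are inclusive.

```python
# Constant rate values listed above.
RATE_CONSTANTS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}


def rate_attribute(rate) -> str:
    """Return a rate="..." attribute for a constant or a 0.5-2.0 multiplier."""
    if isinstance(rate, str) and rate in RATE_CONSTANTS:
        return f'rate="{rate}"'
    try:
        factor = float(rate)
    except (TypeError, ValueError):
        raise ValueError(f"unsupported rate: {rate!r}") from None
    # Assumption: both ends of the documented range are allowed.
    if not 0.5 <= factor <= 2.0:
        raise ValueError(f"rate multiplier out of range 0.5-2.0: {factor}")
    return f'rate="{factor:g}"'


def prosody_rate(text: str, rate) -> str:
    """Wrap text in a <prosody> tag that sets only the rate."""
    return f"<prosody {rate_attribute(rate)}>{text}</prosody>"


if __name__ == "__main__":
    print(prosody_rate("as slow as I speak now.", 0.6))
```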

"volume" values

You can check out the volume values in detail below.

Value

Description

Example

absolute number

The absolute number value is a percentage in the range of 0.0 to 150.0, with 100.0 being the default volume. For example, a value of 50.0 indicates a decrease to half of the original volume, and 150.0 indicates a 50% increase in volume.

<speak>Whisper <prosody volume="40.0">because they might hear us</prosody> or maybe not.</speak>

constant values

A set of constant values that control the speech volume. Valid values are:

  • silent

  • x-soft

  • soft

  • medium

  • loud

  • x-loud

  • default

<speak>All day, <prosody volume="loud">I have been singing loudly to you</prosody> and only you.</speak>
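
Since the absolute volume is a percentage rather than a multiplier, it is easy to pass a fraction such as 0.5 by mistake when half volume is intended. The sketch below converts a fraction of the default volume into the percentage form and builds the tag. The names volume_from_fraction and prosody_volume are hypothetical, and the inclusive range check is an assumption based on the description above.

```python
# Constant volume values listed above.
VOLUME_CONSTANTS = {
    "silent", "x-soft", "soft", "medium", "loud", "x-loud", "default",
}


def volume_from_fraction(fraction: float) -> str:
    """Convert a fraction of the default volume (1.0 = unchanged) to a percentage."""
    percent = fraction * 100.0
    if not 0.0 <= percent <= 150.0:
        raise ValueError(f"volume out of range 0.0-150.0: {percent}")
    return f"{percent:.1f}"


def prosody_volume(text: str, volume) -> str:
    """Wrap text in <prosody>, given a constant name or a percentage number."""
    if isinstance(volume, str) and volume in VOLUME_CONSTANTS:
        value = volume
    else:
        percent = float(volume)
        if not 0.0 <= percent <= 150.0:
            raise ValueError(f"volume out of range 0.0-150.0: {percent}")
        value = f"{percent:.1f}"
    return f'<prosody volume="{value}">{text}</prosody>'


if __name__ == "__main__":
    half = volume_from_fraction(0.5)
    print(prosody_volume("because they might hear us", half))
```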
