In addition to plain text, prompts in OCP miniApps® can use Speech Synthesis Markup Language (SSML) in their input text, thanks to the Omilia TTS engine.
The SSML of the TTS engine is based on the W3C SSML specification. However, not all SSML elements and attributes are supported. This document defines the SSML elements and attributes that can be used.
The following SSML elements are currently supported:
| Element | Tag | Usage |
|---|---|---|
| speak | `<speak>` | Encapsulates SSML text |
| break | `<break>` | Adds a pause in the audio |
| say-as | `<say-as>` | Provides control over how text should be pronounced or interpreted |
| prosody | `<prosody>` | Provides control over the speaking rate and volume |
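Because only a subset of SSML is supported, it can help to check a prompt before sending it to the engine. The following is a minimal sketch, not part of the product: the function name `unsupported_elements` is hypothetical, and it assumes the prompt is well-formed XML with a single root element and no XML namespaces.

```python
import xml.etree.ElementTree as ET

# Elements listed as supported in the table above.
SUPPORTED_ELEMENTS = {"speak", "break", "say-as", "prosody"}


def unsupported_elements(ssml: str) -> set:
    """Return the names of elements in `ssml` that are not in the supported set.

    Assumption: `ssml` is well-formed XML with one root element and no
    namespaces (a namespace would change the tag names ElementTree reports).
    """
    root = ET.fromstring(ssml)
    # root.iter() walks the root element and all of its descendants.
    return {el.tag for el in root.iter() if el.tag not in SUPPORTED_ELEMENTS}
```

An empty result means that every element in the prompt appears in the table above. It does not check attributes or attribute values.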
The `speak` element is the root element of SSML text. No attributes of `speak` are currently supported.
The use of the `speak` element is optional for the TTS engine, which adds it if it is missing. Note that this differs from most other text-to-speech solutions, where `speak` is usually a required element.
Example
<speak>Hello world. How are you?</speak>
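If the same prompt text also has to be sent to a system that requires the root element, the wrapping can be done up front. This is only an illustrative sketch: `ensure_speak` is a hypothetical helper, and it is not a description of how the TTS engine itself adds the element. It relies on the fact that `speak` takes no attributes here, so a plain string check is enough.

```python
def ensure_speak(text: str) -> str:
    """Wrap `text` in <speak>...</speak> unless it is already wrapped.

    Assumption: <speak> carries no attributes, so the opening tag is
    always exactly "<speak>".
    """
    stripped = text.strip()
    if stripped.startswith("<speak>") and stripped.endswith("</speak>"):
        return stripped
    return f"<speak>{stripped}</speak>"
```

For example, `ensure_speak("Hello world.")` returns `<speak>Hello world.</speak>`, while text that is already wrapped is returned unchanged.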
The `break` element is used to manually insert appropriate pauses or breaks in the speech output of the TTS engine. The use of the `break` element is optional.
The `<break>` tag can take two attributes: `time` and `strength`. The `time` attribute specifies the duration of the pause in seconds or milliseconds, and the `strength` attribute specifies the relative strength of the pause. If no attributes are given, a `medium` strength pause (`<break strength="medium"/>`) is assumed by default.
| Attribute | Description |
|---|---|
| `time` | The duration of the break in seconds or milliseconds (e.g. `"1.5s"` or `"300ms"`) |
| `strength` | The relative strength of the pause. Valid values are: `x-weak`, `weak`, `medium`, `strong`, `x-strong` and `none` |
Example
<speak>Let me think... <break time="0.8s"/> Ok... <break strength="medium"/> I think, I am ready</speak>
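When prompts are assembled in code, a small builder can catch invalid `break` attributes early. The sketch below is an assumption-laden illustration: `break_tag` is a hypothetical helper, and the accepted `time` syntax (a non-negative decimal number followed by `s` or `ms`) is inferred from the examples in the table above rather than taken from an engine specification.

```python
import re
from typing import Optional

# Strength values listed in the attribute table above.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Assumed time syntax: digits, optional decimal part, then "s" or "ms".
TIME_PATTERN = re.compile(r"\d+(\.\d+)?(ms|s)")


def break_tag(time: Optional[str] = None, strength: Optional[str] = None) -> str:
    """Build a <break/> tag, validating the attributes that are given."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError(f"invalid break time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"invalid break strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        # With no attributes, the engine assumes a medium-strength pause.
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"
```

For example, `break_tag(time="0.8s")` produces `<break time="0.8s"/>`, and a value such as `time="fast"` raises `ValueError` instead of reaching the engine.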
The `say-as` element is used to specify how text should be pronounced or interpreted in the speech synthesis of the TTS engine. The use of the `say-as` element is optional.
The `<say-as>` tag has a required attribute, `interpret-as`, which is the main indicator of how the text should be verbalized. An optional `format` attribute can be used together with some `interpret-as` values.
| Attribute | Description |
|---|---|
| `interpret-as` | Provides the main indication of how to verbalize the text |
| `format` | Used in conjunction with some values of `interpret-as` to provide more detailed information on how or what to verbalize |
The `interpret-as` values are described in detail below.
| Value | Description | Example |
|---|---|---|
| `characters` / `spell-out` | Both the `characters` and `spell-out` options indicate spelling out each character of the text, as in a-b-c. | `<speak>Can you spell <say-as interpret-as="characters">AB3C</say-as>, faster than me?</speak>` |
| `number` / `cardinal` | Both the `number` and `cardinal` options indicate that the text should be vocalized as a number. | `<speak>The meaning of life is <say-as interpret-as="number">42</say-as>. I know it!</speak>` |
| `ordinal` | The `ordinal` option indicates that the text should be vocalized as an ordinal number. | `<speak>She came <say-as interpret-as="ordinal">1st</say-as> in the race.</speak>` |
| `date` | The `date` option indicates that the text is a date and should be vocalized accordingly. The values in the date text may be separated by `/` (slash), `-` (dash) or `.` (dot). The optional `format` attribute can also be used to indicate which value is which part of the provided date. In the `format` attribute, the `d` (day), `m` (month) and `y` (year) symbols are expected either in single-character notation, e.g. `mdy`, or in multi-character notation, e.g. `ddmmyyyy`. | `<speak>Your appointment is at <say-as interpret-as="date" format="mdy">10-12-2023</say-as>.</speak>` |
| `time` | The `time` option indicates that the encapsulated text is a time and should be vocalized accordingly. The values in the time text may be separated by `:` (colon). A daytime marker can also be included at the end of the text: `a.m.`, `am`, `A.M.`, `AM` or `a` for before noon and `p.m.`, `pm`, `P.M.`, `PM` or `p` for after noon. The optional `format` attribute can also be used to indicate whether the given time is in 12-hour format (`hms12`) or in 24-hour format (`hms24`); `hms12` is used by default when `format` is not specified. | `<speak>Our shop opens at <say-as interpret-as="time" format="hms12">09:30 AM</say-as>.</speak>` |
| `digits` | The `digits` option indicates that the text is a number that should be vocalized digit by digit. | `<speak>The code is <say-as interpret-as="digits">5862</say-as>.</speak>` |
| `fraction` | The `fraction` option indicates that the text is a fraction or a simple mathematical expression to be vocalized as digits accordingly. | `<speak>He ate <say-as interpret-as="fraction">2+1/2</say-as> pieces of cake.</speak>` |
| `telephone` | The `telephone` option indicates that the text is a telephone number and should be vocalized accordingly. A valid telephone number has up to fifteen (15) digits without a leading plus (`+`) sign, or up to twelve (12) digits with a leading plus (`+`) sign. The number may contain spaces, commas (`,`), dots (`.`), dashes (`-`) or parentheses (`()`) as separators between the digits; these separators are ignored and are not verbalized. The only symbol that is verbalized is the leading plus (`+`) sign. | `<speak>You can call <say-as interpret-as="telephone">+30(532)-91-1234</say-as> for more information.</speak>` |
| `unit` | The `unit` option indicates that the text is a number followed by either an abbreviated or a fully verbalized form of a measurement unit, which should be vocalized together correctly in singular or plural form. | `<speak>My yard is <say-as interpret-as="unit">21 foot</say-as> wide.</speak>` |
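The `telephone` rules above are specific enough to check in code before a number is wrapped in `<say-as>`. The following is a rough sketch under stated assumptions: `is_valid_telephone` is a hypothetical helper, it treats only ASCII digits as digits, and it requires at least one digit, which the description above does not state explicitly.

```python
import re

# Separators that the description says are ignored: space, comma, dot,
# dash and parentheses.
SEPARATORS = re.compile(r"[ ,.\-()]")


def is_valid_telephone(text: str) -> bool:
    """Check `text` against the documented digit limits for telephone numbers."""
    text = text.strip()
    has_plus = text.startswith("+")
    body = text[1:] if has_plus else text
    digits = SEPARATORS.sub("", body)
    # Anything left over must be ASCII digits only (assumption: at least one).
    if not re.fullmatch(r"[0-9]+", digits):
        return False
    # Up to 12 digits with a leading plus sign, up to 15 without one.
    limit = 12 if has_plus else 15
    return len(digits) <= limit
```

With this sketch, the number from the table, `+30(532)-91-1234`, passes (11 digits with a leading plus sign), while a 13-digit number with a leading plus sign is rejected.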
The `prosody` element is used to specify the speaking rate and volume of the tagged text in the speech synthesis of the TTS engine. The use of the `prosody` element is optional.
The `<prosody>` tag currently has two attributes, `rate` and `volume`, which control the rate and the volume of the speech. Both are optional.
| Attribute | Description |
|---|---|
| `rate` | Controls the rate of the speech |
| `volume` | Controls the volume of the speech |
The `rate` values are described in detail below.
| Value | Description | Example |
|---|---|---|
| relative number | The relative number is a multiplier that modifies the default rate, expressed as a numerical factor within the range of 0.5 to 2.0. A value of 1 signifies no alteration to the original rate, 0.5 indicates a reduction to half of the original rate, and 2 denotes a doubling of the original rate. | `<speak>Please speak <prosody rate="0.6">as slow as I speak now.</prosody></speak>` |
| constant values | A set of constant values that affect the speech rate. Valid values are: `x-slow`, `slow`, `medium`, `fast`, `x-fast`, `default` | `<speak>How fast can you say:<prosody rate="fast">how can a clam cram in a clean cream can?</prosody></speak>` |
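A `rate` value is therefore either one of the constants or a factor between 0.5 and 2.0. The check below is a minimal sketch: `is_valid_rate` is a hypothetical helper, it assumes both ends of the range are inclusive, and it accepts any string that Python's `float()` can parse, which may be more lenient than the engine.

```python
# Constant rate values listed in the table above.
RATE_CONSTANTS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}


def is_valid_rate(value: str) -> bool:
    """Return True if `value` is a rate constant or a factor in [0.5, 2.0]."""
    if value in RATE_CONSTANTS:
        return True
    try:
        factor = float(value)
    except ValueError:
        return False
    # Assumption: the documented range 0.5 to 2.0 is inclusive at both ends.
    return 0.5 <= factor <= 2.0
```

For example, `is_valid_rate("0.6")` and `is_valid_rate("fast")` both return `True`, while `is_valid_rate("2.5")` returns `False`.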
The `volume` values are described in detail below.
| Value | Description | Example |
|---|---|---|
| absolute number | The absolute number value is represented as a percentage within the range of 0.0 to 150.0, the default volume being 100.0. For example, a value of 50.0 indicates a decrease to half of the original volume, and 150.0 indicates a 50% increase in volume. | `<speak>Whisper <prosody volume="0.4">because they might hear us</prosody> or maybe not.</speak>` |
| constant values | A set of constant values that control the speech volume. Valid values are: `silent`, `x-soft`, `soft`, `medium`, `loud`, `x-loud`, `default` | `<speak>All day, <prosody volume="loud">I have been singing loudly to you</prosody> and only you.</speak>` |
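When the volume comes from application data, a small builder can validate it and escape the wrapped text at the same time. This is an illustrative sketch only: `prosody_volume` is a hypothetical helper, it assumes the 0.0 to 150.0 range is inclusive, and formatting numbers with one decimal place is a choice made here, not a documented requirement.

```python
from xml.sax.saxutils import escape

# Constant volume values listed in the table above.
VOLUME_CONSTANTS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}


def prosody_volume(text: str, volume) -> str:
    """Wrap `text` in <prosody volume="...">.

    `volume` is either a constant name (str) or a percentage (number)
    in the assumed inclusive range 0.0 to 150.0.
    """
    if isinstance(volume, str):
        if volume not in VOLUME_CONSTANTS:
            raise ValueError(f"unknown volume constant: {volume!r}")
        value = volume
    else:
        percent = float(volume)
        if not 0.0 <= percent <= 150.0:
            raise ValueError(f"volume out of range: {volume!r}")
        value = f"{percent:.1f}"
    # escape() replaces &, < and > so plain text cannot break the markup.
    return f'<prosody volume="{value}">{escape(text)}</prosody>'
```

For example, `prosody_volume("hi", "loud")` returns `<prosody volume="loud">hi</prosody>`, and a percentage above 150 raises `ValueError`.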