15 TTS Servers

Emacspeak produces spoken output by communicating with one of many speech servers. This section documents the communication protocol between the client application, i.e., Emacspeak, and the Text to Speech (TTS) server. It is primarily intended for developers who wish to create new speech servers that comply with this protocol, or to write client applications that use the Emacspeak speech servers.

For additional notes on how to log and view TTS server commands when developing a speech server, see http://emacspeak.blogspot.com/2015/04/howto-log-speech-server-output-to-aid.html.

15.1 High-level Overview

The TTS server reads commands from standard input. The speech-server script can be used to make a TTS server communicate via a TCP socket. The client application uses speech server commands to make specific requests of the server; the server listens for these requests in a non-blocking read-eval-print loop (REPL) and executes requests as they become available. Requests fall into two groups, covered in the subsections that follow: commands that produce or queue output, and commands that set the state of the server.

All commands are of the form

commandWord {arguments}

The braces are optional if the command argument contains no white space. The speech server maintains a current state that determines various characteristics of spoken output such as speech rate, punctuation mode, etc. (see the commands that set state for the complete list). The client application queues the text and non-speech audio output to be produced before asking the server to dispatch the set of queued requests, i.e., to start producing output.
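The command form described above can be sketched in a few lines of Python. This is a minimal illustration, not code from Emacspeak: the helper name format_command is hypothetical, and the sketch does not attempt to handle arguments that themselves contain braces, which the text above does not specify.

```python
def format_command(word, *args):
    """Render one command: the command word followed by its arguments."""
    parts = [word]
    for arg in args:
        text = str(arg)
        # Braces are optional when the argument has no white space,
        # so this sketch adds them only when they are needed.
        if any(ch.isspace() for ch in text):
            parts.append("{" + text + "}")
        else:
            parts.append(text)
    return " ".join(parts)


print(format_command("q", "hello world"))
print(format_command("l", "a"))
print(format_command("d"))
```

A client that prefers uniform output could brace every argument unconditionally; both forms are consistent with the rule stated above.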

Once the server has been asked to produce output, it removes items from the front of the queue, sends the requisite commands to the underlying TTS engine, and waits for the engine to acknowledge that the request has been completely processed. This is a non-blocking operation, i.e., if the client application generates additional requests, these are processed immediately.

The above design allows the Emacspeak TTS server to be highly responsive; client applications can queue large amounts of text (typically queued a clause at a time to achieve the best prosody), ask the TTS server to start speaking, and interrupt the spoken output at any time.
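The queue, dispatch, and interrupt behavior described above can be modeled with a toy in-memory server. This is a sketch under stated assumptions rather than the actual server implementation: the class ToyServer, its step method, and the spoken list (which stands in for the TTS engine) are all hypothetical, and only the commands q, d, and s are modeled.

```python
from collections import deque


class ToyServer:
    """Toy model of the output queue; not the real Emacspeak server."""

    def __init__(self):
        self.queue = deque()    # pending requests
        self.speaking = False   # True once a dispatch has been requested
        self.spoken = []        # stands in for the underlying TTS engine

    def handle(self, line):
        """Process one command line without blocking on speech output."""
        word, _, rest = line.strip().partition(" ")
        arg = rest.strip()
        if arg.startswith("{") and arg.endswith("}"):
            arg = arg[1:-1]
        if word == "q":
            self.queue.append(arg)
        elif word == "d":
            self.speaking = True
        elif word == "s":
            # Interrupt: flush all pending requests.
            self.queue.clear()
            self.speaking = False

    def step(self):
        """Hand the next queued item to the engine if dispatch was requested."""
        if self.speaking and self.queue:
            self.spoken.append(self.queue.popleft())
        if not self.queue:
            self.speaking = False


server = ToyServer()
server.handle("q {first clause,}")
server.handle("q {second clause.}")
server.handle("d")
server.step()       # the first clause reaches the engine
server.handle("s")  # the second clause is flushed before it is spoken
server.step()
print(server.spoken)
```

Because handle never waits for speech to finish, a stop request arriving between two calls to step discards everything still in the queue.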

15.1.1 Commands That Queue Output

This section documents commands that either produce spoken output, or queue output to be produced on demand. Commands that place the request on the queue are clearly marked.

version

Speaks the version of the TTS engine. Produces output immediately.

tts_say text

Speaks the specified text immediately. The text is not pre-processed in any way; contrast this with the primary way of speaking text, which is to queue text before asking the server to process the queue.

Note that this command needs to handle the special syntax for morpheme boundaries ‘[*]’. The ‘[*]’ syntax is specific to the Dectalk family of synthesizers; servers for other TTS engines need to map this pattern to the engine-specific code for each engine. As an example, see servers/outloud. A morpheme boundary results in synthesizing compound words such as left bracket with the right intonation; using a space would result in that phrase being synthesized as two separate words.
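A server for another engine might perform this mapping with a simple substitution. The sketch below is illustrative only: the function name is hypothetical, and the replacement code "<mb>" is a made-up placeholder, not the code used by servers/outloud or by any real engine.

```python
DECTALK_MORPHEME_BOUNDARY = "[*]"


def map_morpheme_boundary(text, engine_code):
    """Replace each Dectalk morpheme boundary with an engine-specific code."""
    return text.replace(DECTALK_MORPHEME_BOUNDARY, engine_code)


# "<mb>" is a placeholder chosen for this example only.
print(map_morpheme_boundary("left[*]bracket", "<mb>"))
```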

l c

Speaks c, a single character, as a letter. The character is spoken immediately. This command uses the TTS engine’s capability to speak a single character with the ability to flush speech immediately. Client applications wishing to produce character-at-a-time output, e.g., when providing character echo during keyboard input, should use this command.

d

This command is used to dispatch all queued requests. It was renamed to a single character command (like many of the commonly used TTS server commands) to work more effectively over slow (9600) dialup lines. The effect of calling this command is for the TTS server to start processing items that have been queued via earlier requests.

s

Stop speech immediately. Spoken output is interrupted, and all pending requests are flushed from the queue.

q text

Queues text to be spoken. No spoken output is produced until a dispatch request is received via execution of command d.
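The usual client-side pattern of queuing text a clause at a time and then dispatching can be sketched as follows. This is a minimal illustration under stated assumptions: the function speak is hypothetical, the clause splitter is a deliberately simple heuristic, and the sketch assumes that each command is written on its own line.

```python
import io
import re


def speak(stream, text):
    """Queue text one clause at a time, then ask the server to dispatch."""
    # Split after clause-ending punctuation followed by white space.
    pieces = re.split(r"(?<=[,;:.!?])\s+", text)
    for clause in (p.strip() for p in pieces):
        if clause:
            stream.write("q {" + clause + "}\n")
    stream.write("d\n")  # nothing is spoken until this dispatch request
    stream.flush()


# An in-memory stream stands in for the server's standard input.
buffer = io.StringIO()
speak(buffer, "Hello, world. Goodbye.")
print(buffer.getvalue(), end="")
```

In a real client the stream would be the standard input of the server process, or a socket when the server is reached over TCP.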

c codes

Queues synthesis codes to be sent to the TTS engine. Codes are sent to the engine with no further transformation or processing. The codes are inserted into the output queue and will be dispatched to the TTS engine at the appropriate point in the output stream.

a filename

Queues the audio file identified by filename, an ogg file, for playing.

p filename

Dispatches the audio file identified by filename, an ogg file, for playing.

t freq length

Queues a tone to be played at the specified frequency and having the specified length. Frequency is specified in hertz and length is specified in milliseconds.

sh duration

Queues the specified duration of silence. Silence is specified in milliseconds.
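As an illustration of the non-speech commands, the following sketch builds a short sequence of tones separated by silence. The helper names are hypothetical, and formatting the values as integers is an assumption made for this example; the units follow the descriptions above (hertz and milliseconds).

```python
def tone(freq_hz, length_ms):
    """Command that queues a tone; frequency in hertz, length in milliseconds."""
    return f"t {int(freq_hz)} {int(length_ms)}"


def silence(duration_ms):
    """Command that queues silence; duration in milliseconds."""
    return f"sh {int(duration_ms)}"


# Two tones separated by a short pause, followed by a dispatch request.
commands = [tone(440, 100), silence(50), tone(880, 100), "d"]
print("\n".join(commands))
```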

15.1.2 Commands That Set State

tts_reset

Immediately reset TTS engine to default settings. Stops all speech and clears the queue.

tts_set_punctuations mode

Queues setting TTS engine to the specified punctuation mode. Typically, TTS servers provide at least three modes:

  • None: Do not speak punctuation characters.
  • Some: Speak some punctuation characters. Used for English prose.
  • All: Speak out all punctuation characters; useful in programming modes.

tts_set_speech_rate rate

Immediately change speech rate. The interpretation of this value is typically engine specific.

tts_set_character_scale factor

Queues a change to the character scale factor, which is applied to the speech rate when speaking individual characters. Thus, setting speech rate to 500 and character scale to 1.2 will cause command l to use a speech rate of 500 * 1.2 = 600.
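The arithmetic above amounts to a single multiplication. In the sketch below the function name is hypothetical, and rounding to a whole number is an assumption made for this example, since the interpretation of rate values is engine specific.

```python
def character_rate(speech_rate, character_scale):
    """Rate used when speaking single characters (command l)."""
    # Rounding is an assumption of this sketch, not part of the protocol.
    return round(speech_rate * character_scale)


print(character_rate(500, 1.2))
```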

tts_split_caps flag

Queues changing of the state of split caps processing. Turn this on to speak mixed-case (AKA Camel Case) identifiers.

tts_sync_state punct splitcaps caps rate

Immediately apply the passed settings.

This ensures atomicity, i.e., all state settings in the TTS engine are applied in one shot. Note that setting these values one at a time instead might result in some utterances being spoken with a partially set state.

  • punct: see tts_set_punctuations
  • splitcaps: see tts_split_caps
  • caps: engine specific implementation of capital letter handling
  • rate: see tts_set_speech_rate

set_next_lang say_it

Immediately switch to the next language, as maintained internally on the TTS server. If say_it is non-nil, announce the language change via the TTS.

set_previous_lang say_it

Immediately switch to the previous language, as maintained internally on the TTS server. If say_it is non-nil, announce the language change via the TTS.
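One way a server might keep track of the current language for set_next_lang and set_previous_lang is sketched below. The class LanguageRing is hypothetical, and wrapping around at either end of the list is an assumption of this sketch; the text above does not say how a real server orders or cycles its languages.

```python
class LanguageRing:
    """Toy model of the language state kept on the server."""

    def __init__(self, languages):
        self.languages = list(languages)
        self.index = 0

    def current(self):
        return self.languages[self.index]

    def next(self):
        # Wrap-around is an assumption made for this example.
        self.index = (self.index + 1) % len(self.languages)
        return self.current()

    def previous(self):
        self.index = (self.index - 1) % len(self.languages)
        return self.current()


ring = LanguageRing(["en", "fr", "de"])
print(ring.next())
print(ring.previous())
```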

set_lang language:voice say_it

Immediately switch to the requested language and/or voice, separated by a colon. Examples:

  • set_lang "en"
  • set_lang "en:whisper"
  • set_lang ":whisper"

If say_it is non-nil, announce the selected language and voice via the TTS.
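The language:voice argument shown in the examples above can be split on the colon. This is a minimal sketch; the function name is hypothetical, and representing a missing part as None is a choice made for this example.

```python
def parse_lang_spec(spec):
    """Split a 'language:voice' argument into its two optional parts."""
    language, _, voice = spec.partition(":")
    return (language or None, voice or None)


for spec in ("en", "en:whisper", ":whisper"):
    print(spec, parse_lang_spec(spec))
```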

set_preferred_lang alias lang

Immediately set an alias in the TTS server, mapping, for example, "en" to "en_GB". This alias can later be used by set_lang.
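The alias mapping can be pictured as a small lookup table that set_lang consults. The sketch below is hypothetical throughout (class and method names are invented for this example) and only illustrates the idea of resolving an alias to a preferred language.

```python
class LanguageAliases:
    """Toy alias table; not the data structure used by any real server."""

    def __init__(self):
        self.aliases = {}

    def set_preferred(self, alias, lang):
        self.aliases[alias] = lang

    def resolve(self, lang):
        # Names without an alias are returned unchanged.
        return self.aliases.get(lang, lang)


table = LanguageAliases()
table.set_preferred("en", "en_GB")
print(table.resolve("en"))
print(table.resolve("fr"))
```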