A downloadable voice synthesiser for Windows

Vox | [vɑks] is an experimental voice synthesiser I made using Unreal Engine 5 and MetaSounds. It is not a realistic text-to-speech synthesiser, but rather a playful tool.

Will you be able to tame it and make it say or sing what you want through the controls? See how it reacts to your inputs.

I made a video that explains what the project is about and shows how it works.

Statement

I primarily see Vox as an entity that responds to the user's inputs. Through the controls, the user clumsily tries to tame the entity and experiences how it reacts, both sonically and visually.

Furthermore, I like how this experience makes you appreciate how complex human speech is. We don't think about it when we speak, but manually controlling a speech model to generate some kind of language is extremely difficult. Choosing and executing the right sequence of phonemes with one hand while controlling the pitch with the other to imitate human speech is definitely a challenge.

The main inspiration for this piece was the Voder, invented by Homer Dudley at Bell Telephone Laboratories in 1937. The operators of the Voder had to train full-time for an entire year to use the machine properly.
Thankfully, Vox does not generate consonants and focuses on vowels and pitch, so it should only take you a few minutes to produce some goofy and uncanny primitive language. The lack of consonants probably plays a role in how silly it sounds.

Audio

While there is a lot to explore in procedural audio, the method is still rarely used in games and interactive media in general. Carefully designed systems that generate dynamic content and adapt to the context in real time suit interactive media very well. I hope that this piece contributes to the collective effort to make audio in interactive media more interactive.

Moreover, realistic voice synthesis is normally used in serious contexts, and I think that realistic and, especially, not-so-realistic voice synthesis (as is the case here) has a lot of potential in digital media and the arts.

Visuals

The visuals of Vox are the result of a happy accident. First, I tried to use Unreal Engine's MetaHuman feature to make a very realistic human head and animate it with the amplitude of the sound synthesis and the player inputs. I thought the contrast between the realistic head and the goofy voice would be uncanny and hilarious. However, I didn't manage to use MetaHuman as I wanted, so I started looking at alternatives. Then, I realized using photographs of my face saying the different vowels at different pitches worked even better than what I had planned with MetaHuman while looking more unique.

Photography-based visuals are uncommon in games and interactive media as a whole. I think it's an interesting medium to work with in that context, and it's an unexpected thing to combine with procedural audio. While it doesn't provide the flexibility of procedural visuals, it helps make this piece unique, eccentric, and uncanny.

How to use

In the Speech mode, you generate each phone by pressing a key while changing the pitch with the mouse. It's very hard to control the vowels and the pitch at the same time, but when it's done right, it almost sounds like some kind of weird broken language.

These controls are inspired by the Voder, invented by Homer Dudley at Bell Telephone Laboratories in 1937.

In the Drone Choir mode, you have four voices to play with and you won't have to manually manage the pitch. Play with the controls to get some moody drone music or some click-like throat singing.

Controls

Ideally, you need a keyboard and a mouse, but it's possible to manage without a mouse. Without a keyboard, you won't be able to control the 12 vowels in the Speech mode, but the Drone Choir mode will work fine with the mouse only.

Here are the alternative keyboard controls:

Shift: toggle the visibility of the visual interface

Speech mode

Q/W/E/R/A/S/D/F/Z/X/C/V (hold): play one of the 12 vowels
1/2/3: change the voice archetype between child, female, and male
Y/O (hold): decrease/increase pitch a lot
U/I (hold): decrease/increase pitch a little
H/L (hold): decrease/increase the cord ripple frequency a lot (if the setting has been enabled in the Options screen)
J/K (hold): decrease/increase the cord ripple frequency a little (if the setting has been enabled in the Options screen)
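
As a rough picture of the Speech mode layout above, the keys can be thought of as a small lookup table. This is only a sketch in Python, not how Vox is actually built: the vowel numbering, the order of the archetypes, and the pitch rates are all made-up assumptions.

```python
# Hypothetical sketch of the Speech mode keyboard layout.
# The vowel numbering and the pitch rates are assumptions, not Vox's real values.

VOWEL_KEYS = "QWERASDFZXCV"  # 12 keys, one per vowel
VOWEL_INDEX = {key: i for i, key in enumerate(VOWEL_KEYS)}

# Assumed order: 1 = child, 2 = female, 3 = male.
ARCHETYPE_KEYS = {"1": "child", "2": "female", "3": "male"}

# Signed pitch change in semitones per second while a key is held (assumed values).
# Y/O move the pitch a lot, U/I move it a little.
PITCH_RATE = {"Y": -12.0, "O": 12.0, "U": -2.0, "I": 2.0}


def update_pitch(pitch, held_keys, dt):
    """Return the new pitch (in semitones) after holding `held_keys` for `dt` seconds."""
    rate = sum(PITCH_RATE.get(key, 0.0) for key in held_keys)
    return pitch + rate * dt
```

For example, `update_pitch(0.0, ["O"], 0.5)` raises the pitch by half of the assumed "a lot" rate, while holding Y and O together cancels out.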

Drone Choir mode

For the whole choir

1: start all the voices
2: stop all the voices
3: change the scale
4: change the preset (what voice archetypes the choir consists of)
7: increase the interpolation time when vowels change
8: decrease the interpolation time when vowels change
9: increase the interpolation time when notes change
0: decrease the interpolation time when notes change

For each voice

Q/W/E/R: start/stop a voice
A/S/D/F: change the vowel
Z/X/C/V: change the note
Y/U/I/O: change the voice archetype
H/J/K/L: change the octave the voice is singing
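
The per-voice keys above form a grid: each group of four keys is one action, and the position of a key within its group picks one of the four voices. A minimal sketch of that idea in Python follows. The assumption that the first key of each group belongs to the first voice, and the action names themselves, are hypothetical.

```python
# Hypothetical sketch of the Drone Choir per-voice key grid.
# Assumption: the n-th key of each group controls the n-th voice.

ACTION_KEYS = {
    "toggle": "QWER",     # start/stop a voice
    "vowel": "ASDF",      # change the vowel
    "note": "ZXCV",       # change the note
    "archetype": "YUIO",  # change the voice archetype
    "octave": "HJKL",     # change the octave
}


def decode_key(key):
    """Return (voice_index, action) for a per-voice key, or None for any other key."""
    if len(key) != 1:
        return None
    key = key.upper()
    for action, keys in ACTION_KEYS.items():
        if key in keys:
            return keys.index(key), action
    return None
```

With this layout, `decode_key("D")` refers to the vowel of the third voice, and a key outside the grid, such as `"P"`, returns `None`.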

Accessibility

If you are interested in the project and need some critical accessibility features that are currently missing, please reach out in the comments and I will see what I can do.

Hearing

There are a few audio accessibility options available from the Options screen:

Options screen

Mono

Enabling this will make all the sound sources monophonic, which is useful if you cannot hear well in one ear. You can only hear the difference in the Drone Choir mode.

To enable the mono option, press the down arrow key 5 times from the Options screen and press Enter. It's disabled by default.
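
To give an idea of what a mono option does to the sound, here is a small Python sketch that averages the left and right channels and sends the result to both ears. It is an assumption about the general technique, not a description of how Vox implements it. The function name is hypothetical.

```python
def to_mono(left, right):
    """Mix two channels down to mono and return the same signal for both ears.

    `left` and `right` are lists of samples of equal length.
    """
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    return mono, list(mono)
```

A sound that was only in the left channel, such as `to_mono([1.0, 0.0], [0.0, 0.0])`, ends up at half amplitude in both channels, so nothing is lost for a listener who relies on one ear.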

Hyperacusis Equaliser

Hyperacusis is an increased sensitivity to sound.

If you suffer from hyperacusis and are particularly sensitive to a specific frequency range, this equaliser can be useful to you.
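
For the curious, an equaliser band that turns down one frequency range can be sketched as a peaking filter. The Python below uses the widely published "Audio EQ Cookbook" peaking formulas; whether Vox's equaliser works this way is an assumption, and the function names and the example values (a 12 dB cut at 3 kHz) are hypothetical.

```python
import cmath
import math


def peaking_filter(sample_rate, centre_hz, gain_db, q=1.0):
    """Return normalised biquad coefficients (b, a) for a peaking EQ band.

    A negative `gain_db` cuts the band around `centre_hz`.
    """
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * amp, -2 * math.cos(w0), 1 - alpha * amp]
    a = [1 + alpha / amp, -2 * math.cos(w0), 1 - alpha / amp]
    # Divide everything by a[0] so the first denominator coefficient is 1.
    return [x / a[0] for x in b], [x / a[0] for x in a]


def gain_at(b, a, sample_rate, freq_hz):
    """Return the linear gain of the filter at `freq_hz`."""
    z = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate)
    numerator = b[0] + b[1] * z + b[2] * z * z
    denominator = a[0] + a[1] * z + a[2] * z * z
    return abs(numerator / denominator)
```

Calling `peaking_filter(48000, 3000, -12.0)` gives a band that reduces sound around 3 kHz by 12 dB while leaving very low frequencies untouched, which `gain_at` can confirm.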

Vision

Unfortunately, there is no screen reader support, as it is still an experimental feature in Unreal Engine. However, the main menu has UI sounds that should help you know where you are. For each button you select with the arrow keys, the voice synthesiser pronounces the vowels contained in that button's text.

For example, to access the Speech Mode:

Launch the program.
Press any key twice to skip the intro. You should hear "ee" as in Speech.
Press the Enter key to access the Speech Mode.

To access the Drone Choir Mode:

Launch the program.
Press any key twice to skip the intro. You should hear "ee" as in Speech.
Press the down arrow key. You should hear "oh ahyuh" as in Drone Choir.
Press the Enter key to access the Drone Choir Mode.

To access the Options screen:

Launch the program.
Press any key twice to skip the intro. You should hear "ee" as in Speech.
Press the down arrow key twice. You should hear "oh ahuh" as in Options.
Press the Enter key to access the Options screen.

To leave the Speech Mode, the Drone Choir Mode, or the Options screen, simply press the Escape key; it will bring you back to the main menu. To quit the game from the main menu, press the Escape key once; you will hear the vowel in "Quit". Press the Escape key again to confirm.

Controls binding

Unfortunately, the controls are not configurable. I will try to do better in my next projects.

Platforms

The program is only available for Windows for now, but I may be able to make a build for both Linux and macOS. Please tell me in the comments if you are interested. If there is enough demand, I will look into it.

Thanks

Miina, Quentin, Clément, Florent, @AutSciPerson, Aaron & Dan, and the Redhill audio team, thank you all for the great feedback and advice :)

More information

Status	Released
Category	Other
Platforms	Windows
Author	Martin Bussy-Pâris
Made with	Unreal Engine
Tags	artgame, choir, drone-music, Experimental, generative-music, Procedural Generation, procedural-audio, sound-synthesis, speech-synthesis, voice-synthesis
Average session	A few minutes
Languages	English
Inputs	Keyboard, Mouse
Accessibility	Color-blind friendly, High-contrast, Blind friendly
Links	YouTube
