Attention-based neural speech synthesis systems such as Tacotron 2 now achieve near human-level quality. These recent breakthroughs, however, still have some limitations:

  1. Limited controllability
  2. Slow convergence

http://humelo.dothome.co.kr/data/file/research/453099649_ayHm1vdX_22c23076f2a55bf3974687f4c317dfb41a17311f.png

In particular, the lack of phoneme-level or syllable-level duration control limits the application of neural speech synthesis in content markets such as video, games, and music. Our study, Phonemic-level Duration Control Using Attention Alignment for Neural Speech Synthesis (https://ieeexplore.ieee.org/document/8683827), solves these problems with an additional loss function and by embedding timing information from a phoneme-level duration extractor. This study was published as a conference paper in an ICASSP 2019 oral session. You can hear our Rap-like speech synthesis demo below.
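The core idea of extracting phoneme durations from an attention alignment can be sketched as follows: count how many decoder frames attend most strongly to each phoneme, then convert the frame counts to time. This is only an illustrative sketch, not the paper's implementation; the function name, the toy alignment matrix, and the hop size are all assumptions.

```python
import numpy as np

# Toy attention alignment: rows are decoder frames, columns are phonemes.
# In a trained Tacotron-style model this comes from the attention module;
# here it is fabricated as a roughly monotonic alignment for illustration.
alignment = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])

def phoneme_durations(alignment, n_phonemes, hop_seconds=0.0125):
    """Estimate per-phoneme durations by counting the decoder frames
    whose attention weight peaks on each phoneme (hop size assumed)."""
    peaks = alignment.argmax(axis=1)             # dominant phoneme per frame
    frames = np.bincount(peaks, minlength=n_phonemes)
    return frames * hop_seconds                  # duration in seconds

print(phoneme_durations(alignment, 3))
```

With durations extracted this way, a timing embedding can be fed back into the synthesizer so that each phoneme's length becomes an explicit, controllable input rather than an implicit byproduct of attention.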

http://humelo.dothome.co.kr/data/file/research/453099649_QZIT1o5B_45c902eb944a93fd03df6365eab94b6ef394ce52.png

http://humelo.dothome.co.kr/data/file/research/453099649_5HQ7NKcd_e26f1fc46a299a8eea6e3cb6b32893d852be93a8.png