Attention-based neural speech synthesis systems such as Tacotron 2 now achieve near human-level quality. These recent breakthroughs, however, still have some limitations:

  1. Limited controllability
  2. Slow convergence

http://humelo.dothome.co.kr/data/file/research/453099649_ayHm1vdX_22c23076f2a55bf3974687f4c317dfb41a17311f.png

In particular, the lack of phoneme-level or syllable-level duration control limits the application of neural speech synthesis in content markets such as video, games, and music. Our study, Phonemic-level Duration Control Using Attention Alignment for Neural Speech Synthesis (https://ieeexplore.ieee.org/document/8683827), solves these problems with an additional loss function and by embedding timing information from a phoneme-level duration extractor. This study was published as a conference paper in an ICASSP 2019 oral session. You can hear our Rap-like speech synthesis demo below.
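The core idea of extracting phoneme durations from an attention alignment can be sketched as follows: count how many decoder frames attend most strongly to each phoneme, then convert the frame counts to time. This is only an illustrative sketch, not the paper's implementation; the function name, the toy alignment matrix, and the hop size are all assumptions.

```python
import numpy as np

# Toy attention alignment: rows are decoder frames, columns are phonemes.
# In a trained Tacotron-style model this comes from the attention module;
# here it is fabricated as a roughly monotonic alignment for illustration.
alignment = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])

def phoneme_durations(alignment, n_phonemes, hop_seconds=0.0125):
    """Estimate per-phoneme durations by counting the decoder frames
    whose attention weight peaks on each phoneme (hop size assumed)."""
    peaks = alignment.argmax(axis=1)             # dominant phoneme per frame
    frames = np.bincount(peaks, minlength=n_phonemes)
    return frames * hop_seconds                  # duration in seconds

print(phoneme_durations(alignment, 3))
```

With durations extracted this way, a timing embedding can be fed back into the synthesizer so that each phoneme's length becomes an explicit, controllable input rather than an implicit byproduct of attention.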

http://humelo.dothome.co.kr/data/file/research/453099649_QZIT1o5B_45c902eb944a93fd03df6365eab94b6ef394ce52.png

http://humelo.dothome.co.kr/data/file/research/453099649_5HQ7NKcd_e26f1fc46a299a8eea6e3cb6b32893d852be93a8.png