Attention-based neural speech synthesis systems such as Tacotron 2 now achieve near-human quality. These recent breakthroughs, however, still have limitations.
In particular, the lack of phoneme-level or syllable-level duration control limits the application of neural speech synthesis in content markets such as video, games, and music. Our study, "Phonemic-level duration control using attention alignment for neural speech synthesis" (https://ieeexplore.ieee.org/document/8683827), which adds a dedicated loss function and embeds timing information from a phoneme-level duration extractor, addresses this problem. The work was published as a conference paper in an ICASSP 2019 oral session. You can hear our rap-style speech synthesis demo below.
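As an illustration of the general idea behind a phoneme-level duration extractor, one common heuristic is to read durations off a Tacotron-style attention alignment matrix: each decoder frame is assigned to the phoneme it attends to most strongly, and a phoneme's duration is the number of frames assigned to it. The sketch below shows this heuristic with NumPy; it is an assumption for illustration, not necessarily the exact extractor used in the paper.

```python
import numpy as np

def durations_from_alignment(attention):
    """Estimate per-phoneme durations (in decoder frames) from an
    attention alignment matrix.

    attention: (num_frames, num_phonemes) array of attention weights.
    Each frame is assigned to its most-attended phoneme; a phoneme's
    duration is the count of frames assigned to it.
    """
    assignments = attention.argmax(axis=1)  # most-attended phoneme per frame
    num_phonemes = attention.shape[1]
    return np.bincount(assignments, minlength=num_phonemes)

# Toy alignment: 6 decoder frames over 3 phonemes (roughly monotonic)
att = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])
print(durations_from_alignment(att))  # → [2 2 2]
```

Durations extracted this way can then be embedded and fed back to the synthesizer as timing information, which is the kind of conditioning the study builds on.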