Text-to-Speech: A Historical Perspective
A brief historical account.
1960s: The first computer-based speech synthesis systems were developed. In 1961, John Larry Kelly Jr. and colleagues at Bell Labs used an IBM 704 computer to synthesize speech, famously including the song "Daisy Bell". This work built on earlier electromechanical devices such as the Pattern Playback, developed around 1950 by Franklin Cooper and his colleagues at Haskins Laboratories, which converted painted spectrogram patterns back into audible speech.
1970s: Linear predictive coding (LPC) was introduced: a method for encoding the spectral envelope of a digital speech signal as the coefficients of an all-pole filter. It became the standard analysis technique for much of the speech synthesis and coding research that followed.
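To make the idea concrete, here is a minimal sketch (in Python with NumPy) of how LPC coefficients can be estimated from a single speech frame using the classical autocorrelation method and the Levinson-Durbin recursion; the frame, sample rate, and model order below are illustrative choices, not taken from any particular historical system.

    import numpy as np

    def lpc_coefficients(frame, order):
        """Estimate LPC predictor coefficients for one windowed speech frame
        via the autocorrelation method and the Levinson-Durbin recursion."""
        n = len(frame)
        # Autocorrelation of the frame at lags 0..order.
        r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

        a = np.zeros(order + 1)
        a[0] = 1.0
        error = r[0]  # prediction-error energy so far
        for i in range(1, order + 1):
            # Reflection (PARCOR) coefficient for this stage.
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
            prev = a[1:i].copy()
            a[1:i] = prev + k * prev[::-1]  # symmetric coefficient update
            a[i] = k
            error *= 1.0 - k * k
        # a[0] == 1; the model predicts s[t] from the previous `order` samples.
        return a

    # Example: 10th-order LPC on a synthetic 25 ms frame at 8 kHz.
    fs = 8000
    t = np.arange(int(0.025 * fs)) / fs
    frame = np.sin(2 * np.pi * 200 * t) * np.hamming(len(t))
    print(lpc_coefficients(frame, order=10))

The resulting all-pole filter captures the spectral envelope of the frame; driving it with a pulse train or noise reconstructs voiced or unvoiced speech, which is essentially what 1970s LPC synthesizers did.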
1980s: The first commercially available TTS systems came onto the market. DECtalk, released by Digital Equipment Corporation in 1984 and built on Dennis Klatt's formant-synthesis work at MIT, could read arbitrary text aloud in a clear, intelligible voice. Stephen Hawking's famous voice came from hardware based on the same Klatt formant synthesizer.
1990s: The focus of TTS research shifted toward concatenative synthesis, in which databases of short recorded speech fragments (such as diphones) are stitched together to produce complete utterances. This approach could sound very natural but required large amounts of storage.
2000s: As computational power and storage capacity grew, unit-selection synthesis, a refinement of the concatenative approach, became dominant. Rather than keeping one recording per unit, it searches a large database of recorded speech at synthesis time for the sequence of units that minimizes a combination of target cost (how well each unit matches the desired sound and prosody) and join cost (how smoothly adjacent units connect); a sketch of this search follows below.
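The sketch below shows the core of that search in Python; the unit representation and cost functions are hypothetical stand-ins, not taken from any production system. Each position in the utterance has several candidate recorded units, and a Viterbi (dynamic programming) pass picks the sequence with the lowest combined target and join cost.

    import numpy as np

    def select_units(targets, candidates, target_cost, join_cost):
        """Pick one database unit per position, minimizing total
        target cost + join cost with a Viterbi search."""
        best = [np.array([target_cost(targets[0], u) for u in candidates[0]])]
        back = []
        for i in range(1, len(targets)):
            tc = np.array([target_cost(targets[i], u) for u in candidates[i]])
            # trans[p, j]: cheapest path ending in previous unit p, then unit j.
            trans = best[-1][:, None] + np.array(
                [[join_cost(p, u) for u in candidates[i]]
                 for p in candidates[i - 1]])
            back.append(trans.argmin(axis=0))
            best.append(trans.min(axis=0) + tc)
        # Trace the cheapest path back through the lattice.
        path = [int(best[-1].argmin())]
        for bp in reversed(back):
            path.append(int(bp[path[-1]]))
        path.reverse()
        return [candidates[i][j] for i, j in enumerate(path)]

    # Toy usage: units are (phone, pitch) pairs; costs compare pitch distance.
    tcost = lambda spec, u: abs(spec[1] - u[1])
    jcost = lambda a, b: abs(a[1] - b[1])
    targets = [("h", 100), ("e", 110), ("l", 105)]
    cands = [[("h", 90), ("h", 102)],
             [("e", 120), ("e", 108)],
             [("l", 100)]]
    print(select_units(targets, cands, tcost, jcost))

Real systems score units on phonetic context, duration, pitch, and spectral match at the joins, over databases containing many hours of speech, but the search has this same shape.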
2010s: With the rise of deep learning, neural network-based approaches to TTS began to emerge, notably WaveNet (DeepMind, 2016), which generates raw audio waveforms sample by sample with stacks of dilated causal convolutions, and Tacotron (Google, 2017), a sequence-to-sequence model that maps text to spectrograms. These models produced highly natural, fluent speech approaching human quality, and major companies such as Google and Baidu adopted these techniques in their voice-assistant products.
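As a flavor of the architecture, here is a minimal PyTorch sketch of the stacked dilated causal convolutions at the heart of WaveNet. It omits the gated activations, skip connections, and quantized softmax output of the real model, so read it as an illustration of the receptive-field idea rather than a working synthesizer.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1-D convolution that only sees past samples (left padding only)."""
        def __init__(self, channels, kernel_size, dilation):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation)

        def forward(self, x):  # x: (batch, channels, time)
            return self.conv(F.pad(x, (self.pad, 0)))

    class DilatedStack(nn.Module):
        """Dilated causal convolutions with dilations 1, 2, 4, ...; the
        receptive field over past samples doubles with every layer."""
        def __init__(self, channels=32, layers=8, kernel_size=2):
            super().__init__()
            self.layers = nn.ModuleList(
                [CausalConv1d(channels, kernel_size, dilation=2 ** i)
                 for i in range(layers)])

        def forward(self, x):
            for layer in self.layers:
                x = x + torch.tanh(layer(x))  # residual connection
            return x

    # Eight layers with kernel size 2 see 2**8 = 256 past samples.
    model = DilatedStack()
    x = torch.randn(1, 32, 16000)  # one second of (already embedded) audio
    print(model(x).shape)          # torch.Size([1, 32, 16000])

The exponential growth of the dilation factor is what lets such a network model the long-range structure of raw audio at a tractable depth.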
2020s: Continued advances in deep learning led to even more natural-sounding TTS systems. Customizable voice models, real-time synthesis, and multilingual support became more common, opening up applications ranging from audiobooks to customer service.

