Talk #1: Lhotse: a speech data representation library for the modern deep learning ecosystem (Piotr Żelasko, Meaning)
Speech data is notoriously difficult to work with due to a variety of codecs, recording lengths, and metadata formats. We present Lhotse, a speech data representation library that draws upon lessons learned from the Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes, as well as data preparation recipes for over 30 popular speech corpora. Various datasets can be easily combined and re-purposed for different tasks. The library handles multi-channel recordings, long recordings, local and cloud storage, and lazy and on-the-fly operations, among other features. We introduce the Cut and CutSet concepts, which simplify common data wrangling tasks for audio and help incorporate the acoustic context of speech utterances. Finally, we show how Lhotse leverages PyTorch's data API abstractions and adapts them to handle speech data for deep learning.
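The abstract's central idea, a plain JSON description of recordings, supervisions, and "cuts" that bind them together, can be illustrated with a small stdlib-only sketch. Note this is a hypothetical simplification for intuition: the field names and the `make_cuts` helper are illustrative inventions, not Lhotse's actual schema or API.

```python
import json

# Illustrative manifests in the spirit of a JSON description format.
# (Simplified, hypothetical fields -- not Lhotse's exact schema.)
recordings = [
    {"id": "rec-001", "source": "audio/rec-001.wav",
     "sampling_rate": 16000, "duration": 10.5},
]
supervisions = [
    {"id": "sup-001", "recording_id": "rec-001",
     "start": 0.0, "duration": 4.2, "text": "hello world"},
]

def make_cuts(recordings, supervisions):
    """Pair each supervised span with its recording -- the 'cut' idea:
    a window of audio plus the metadata that falls inside it."""
    by_rec = {r["id"]: r for r in recordings}
    return [
        {
            "recording": by_rec[s["recording_id"]],
            "start": s["start"],
            "duration": s["duration"],
            "text": s["text"],
        }
        for s in supervisions
    ]

cuts = make_cuts(recordings, supervisions)

# The whole description round-trips through plain JSON,
# so it is easy to store, inspect, filter, and combine.
serialized = json.dumps(cuts)
assert json.loads(serialized) == cuts
print(len(cuts))
```

Because everything is plain data, combining corpora reduces to concatenating lists of such dictionaries, which is one reason a JSON-based description format composes well across datasets.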
Dr. Piotr Żelasko is an expert in automatic speech recognition (ASR) and spoken language understanding, with extensive experience in developing practical and scalable ASR solutions for industrial use. He received his Ph.D. from AGH University of Science and Technology in Kraków, Poland. He worked with successful speech processing start-ups Techmo (Poland) and IntelligentWire (USA, acquired by Avaya), and held a research scientist position at the Center for Language and Speech Processing at Johns Hopkins University (USA). He is currently the Chief Scientific Officer at Meaning (meaning.team).
Talk #2: Speech Synthesis: Let the machines speak! (Jan Vainer, Meaning)
With the rising popularity of personal voice assistants such as Alexa or Siri, the quality of text-to-speech systems has become a determining factor of success. Speech synthesis has come a long way in terms of voice naturalness, clarity, quality of prosody, and the ability to control system outputs. However, synthetic speech still sounds too monotonous in many situations, often coming across as inappropriate in dialogue or when reading an exciting novel. We are going to summarise both new and some of the older methods for speech synthesis and discuss the current problems and the future of TTS.
Jan Vainer is an applied research scientist mainly interested in generative models of speech, including text-to-speech (TTS) and speech-to-speech (S2S) modeling. He graduated from Charles University in Prague with a Master's degree in Artificial Intelligence. He published an Interspeech 2020 paper based on his award-winning diploma thesis on the SpeedySpeech TTS model. He is currently working at Meaning.