The New Technology of AI-generated Voice

Whisper is an automatic speech recognition system from OpenAI. (Photo/OpenAI)

Just two years ago, on November 30, 2022, OpenAI launched ChatGPT. The company's stated goal is to advance artificial intelligence (AI) in ways that benefit all of humanity. Since then, AI has evolved significantly and reached new technological feats. A few months ago, on February 15, 2024, OpenAI unveiled Sora, its AI video generator, which drew attention for its lifelike videos. It has since been made available to a number of filmmakers and artists, who use AI to enhance their work. AI is also used to generate images. OpenAI's image generator, DALL-E, is being integrated with newer and more accurate ChatGPT models, so users can get images alongside text answers and visualize the solutions the model provides.

OpenAI has also created Whisper, an automatic speech recognition (ASR) system that transcribes audio files. The original model was trained on 680,000 hours of multilingual audio. The newer version 3 (v3) was trained on a total of about five million hours of audio, which lowers the number of mistakes it makes while transcribing. Looked at closely, that training data came in two parts: one million hours of weakly labeled audio and four million hours of audio that was labeled automatically by an earlier version of Whisper. Training on this much varied speech helps the model handle voice patterns and accents from different parts of the world.

Whisper is trained to handle several tasks: multilingual speech recognition, speech translation, language identification, and voice activity detection. It can also be combined with other AI tools from OpenAI to extend its uses, and it comes in a variety of sizes to suit a given application.

Whisper is built as a simple encoder-decoder Transformer. The audio is split into 30-second pieces, and each piece is converted into a log-Mel spectrogram, a kind of picture of how the energy at different pitches changes over time. The spectrogram is passed through the encoder, which turns it into an internal representation of the sound. Lastly, the decoder uses that representation to predict the text of the transcript, one piece at a time.

According to one description of the models, “The smallest, Tiny, comes in at 39 million parameters and requires around 1 GB of VRAM to run. The base version stands at 74 million parameters and is around 16 times faster at processing audio than the previous model. The largest, the aptly named Large, stands at a whopping 1550 million parameters and needs 10 GB of VRAM to run.” In simpler terms, a gigabyte (GB) is a unit of memory size, used here to measure VRAM. VRAM, also known as video RAM, is the random access memory on a graphics card. It traditionally holds the image data a computer needs to draw to the screen, which is why it matters for video games and 3D graphic design programs. It is also where an AI model like Whisper is loaded when it runs on a graphics card.
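For readers who like to see the numbers, the 30-second chunking step described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenAI's code. The 16,000 samples-per-second rate and the 10-millisecond step between spectrogram frames are not mentioned in this article; they are taken from the published Whisper paper, and the function names are made up for this example.

```python
# Illustrative sketch only: mirrors the chunking idea, not OpenAI's implementation.
SAMPLE_RATE = 16_000   # samples per second (assumption: value from the Whisper paper)
CHUNK_SECONDS = 30     # length of each audio piece, as described in the article
HOP_MS = 10            # step between spectrogram frames (assumption: Whisper paper)

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per 30-second piece


def split_into_chunks(samples):
    """Cut a list of audio samples into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        # Pad a short final chunk with silence so every chunk has the same length.
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


def frames_per_chunk():
    """Number of spectrogram frames (time steps) in one 30-second chunk."""
    hop_samples = SAMPLE_RATE * HOP_MS // 1000  # 160 samples between frames
    return CHUNK_SAMPLES // hop_samples


if __name__ == "__main__":
    audio = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence as stand-in audio
    chunks = split_into_chunks(audio)
    print(len(chunks), "chunks of", len(chunks[0]), "samples each")
    print(frames_per_chunk(), "spectrogram frames per chunk")
```

A 70-second recording ends up as three equal-sized pieces, with the last one mostly padding. Keeping every piece the same length is what lets the encoder treat each spectrogram as a fixed-size input.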
Whisper could be especially useful for game developers who want to take a game global. Voiced dialogue recorded in a variety of languages can be transcribed automatically, likely in far less time than it would take to transcribe even one language by hand.
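As a rough illustration of the game example above, here is a small Python sketch that turns timestamped transcript segments into the SRT subtitle format. The segment layout (a list of dictionaries with "start", "end", and "text" fields) is an assumption modeled on the kind of output a speech recognizer produces, the dialogue lines are invented, and both function names are hypothetical. No speech recognition happens here; the sketch only shows what a developer might do with a finished transcript.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT subtitle text from timestamped segments (hypothetical layout)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = to_srt_timestamp(seg["start"])
        end = to_srt_timestamp(seg["end"])
        # Each SRT entry is: a counter, a time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Invented example dialogue standing in for a real transcript.
    example_segments = [
        {"start": 0.0, "end": 2.5, "text": " Welcome, traveler."},
        {"start": 2.5, "end": 5.0, "text": " The gate is closed."},
    ]
    print(segments_to_srt(example_segments))
```

Once dialogue is in a subtitle file like this, translators can work from the text and its timing instead of listening through the audio line by line.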

Overall, a closer look at Whisper, the AI speech recognition system, reveals a wide range of possibilities for how it can be used. It can make a variety of tasks more efficient and accessible, which makes it all the more important to the technology of the future.

Written by Divya Saha
