OpenAI’s new AI model automatically recognizes speech and translates it to English

Benj Edwards / Ars Technica

On Wednesday, OpenAI released a new open source AI model called Whisper that recognizes and translates audio at a level that approaches human recognition ability. It can transcribe interviews, podcasts, conversations, and more.

OpenAI trained Whisper on over 680,000 hours of audio data and matching transcripts collected from the web. According to OpenAI, this open-collection approach led to “improved robustness to accents, background noise, and technical language.” Whisper can also detect the spoken language and translate it into English.

OpenAI describes Whisper as an encoder-decoder Transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model’s output. OpenAI offers this overview of Whisper’s operation:

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
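The front half of that pipeline — chunking the waveform and building a log spectrogram — can be sketched in plain NumPy. This is an illustrative simplification, not Whisper’s actual code: the real model resamples audio to 16 kHz and uses an 80-channel log-Mel filter bank, while this sketch computes a basic log-magnitude spectrogram with the same 30-second chunk length.

```python
import numpy as np

SAMPLE_RATE = 16_000           # Whisper works on 16 kHz mono audio
CHUNK_SECONDS = 30             # each chunk covers 30 seconds
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_into_chunks(audio: np.ndarray) -> list:
    """Split a mono waveform into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks

def log_spectrogram(chunk: np.ndarray, n_fft: int = 400, hop: int = 160) -> np.ndarray:
    """Simplified log-magnitude spectrogram (Whisper actually applies a Mel filter bank)."""
    window = np.hanning(n_fft)
    frames = [chunk[i:i + n_fft] * window
              for i in range(0, len(chunk) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log10(np.maximum(mags, 1e-10))

# 45 seconds of synthetic "audio" becomes two 30-second chunks
audio = np.random.default_rng(0).standard_normal(45 * SAMPLE_RATE).astype(np.float32)
chunks = split_into_chunks(audio)
spec = log_spectrogram(chunks[0])
print(len(chunks), spec.shape)
```

Each resulting spectrogram is what the encoder consumes; the decoder then emits text tokens interleaved with the special task tokens described above.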

By open-sourcing Whisper, OpenAI hopes to introduce a new foundation model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open source computer vision model that arguably ignited the recent era of rapidly progressing image synthesis technology such as DALL-E 2 and Stable Diffusion.

At Ars Technica, we tested Whisper from the code available on GitHub, and we fed it multiple samples, including a podcast episode and a particularly difficult-to-understand section of audio taken from a phone interview. Although it took some time while running on a standard Intel desktop CPU (the technology doesn’t work in real time yet), Whisper did a good job of transcribing the audio into text through its Python demonstration program — far better than some AI-powered audio transcription services we have tried in the past.
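For readers who want to reproduce this, the Whisper repository documents a pip install straight from GitHub plus a simple command-line interface. The commands below follow that documentation; the file name `interview.mp3` is just a placeholder for your own audio, and ffmpeg must be installed on the system.

```shell
# install Whisper from the GitHub repository (requires ffmpeg)
pip install git+https://github.com/openai/whisper.git

# transcribe a local audio file; model sizes range from "tiny" to "large",
# trading speed for accuracy
whisper interview.mp3 --model small

# detect the spoken language and translate the speech into English
whisper interview.mp3 --model small --task translate
```

On a CPU the smaller models are considerably faster, which matches our experience that transcription runs well short of real time on desktop hardware.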

Enlarge / An example of console output from OpenAI’s Whisper demo while it transcribes a podcast.

Benj Edwards / Ars Technica

With the right setup, Whisper could easily be used to transcribe interviews and podcasts — and potentially translate podcasts produced in non-English languages into English on your local machine, for free. That is a potent combination that may eventually disrupt the transcription industry.
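The same transcribe-or-translate choice is exposed through Whisper’s Python API, per the project’s README. A minimal sketch (the file name `podcast_episode.mp3` is a placeholder, and the `whisper` package and ffmpeg must be installed):

```python
import whisper

# model sizes range from "tiny" to "large"; smaller models run faster on CPU
model = whisper.load_model("small")

# transcribe the audio in its original spoken language
result = model.transcribe("podcast_episode.mp3")
print(result["text"])

# or detect the language and translate the speech directly into English
english = model.transcribe("podcast_episode.mp3", task="translate")
print(english["text"])
```

Because everything runs locally, no audio ever leaves your machine — part of what makes the free, on-device combination notable.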

As with nearly every major new AI model these days, Whisper brings both positive advantages and the potential for misuse. On Whisper’s model card (under the “Broader implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”