Nvidia Introduces an AI Model That Modifies Sounds Using Text Alone and Generates Novel Audio

IBL News | New York

On Monday, Nvidia unveiled a new AI model that understands and generates sound as humans do.

Called Fugatto (Foundational Generative Audio Transformer Opus 1), the model generates or transforms any mix of music, voices, and sounds based on prompts composed of any combination of text and audio files.

However, Santa Clara, California-based Nvidia, the world’s largest supplier of chips and software for AI systems, said it is still debating whether and how to release it publicly.

For example, Fugatto can create a music snippet from a text prompt, remove instruments from or add them to an existing song, change the accent or emotion in a voice, and even let people produce sounds never heard before.

Another use case: an online course narrated in the voice of any family member or friend.

Music producers can use Fugatto to prototype or edit an idea for a song quickly, trying out different styles, voices, and instruments. They could also add effects and enhance the overall audio quality of an existing track.

“This thing is wild, and the idea that I can create entirely new sounds on the fly in the studio is incredible,” said Ido Zmishlany, a multi-platinum producer and songwriter and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups.

Fugatto is a foundational generative transformer model that builds on Nvidia’s prior work in speech modeling, vocoding, and understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Other players like Runway and Meta have introduced models that generate audio or video from a text prompt.