SING: Significantly faster AI-based audio

WHAT THE RESEARCH IS:

The Symbol-to-Instrument Neural Generator (SING) is a neural audio synthesizer that generates musical notes from a specified instrument, pitch, velocity (the force with which a note is played), and other inputs. Human evaluators assessed whether the notes SING creates sound natural, as if they were played on an actual flute, guitar, or other instrument. In many cases, they rated SING's output as more realistic than that of similar artificial intelligence networks, even though the system requires a fraction of the time for training and audio generation. In one second of computation, SING can produce 512 seconds of audio.

HOW IT WORKS:

Typical AI-based audio generation systems create audio samples one by one. SING uses the same kinds of training data, such as data sets of existing musical note recordings, but produces its audio in much larger batches, generating the waveform for as many as 1,024 audio samples at once. This approach, together with end-to-end training, significantly reduces the amount of computational power needed. In tests using the NSynth data set, SING was able to generate notes from almost 1,000 different instruments, with dozens of pitches per instrument and five velocity levels. In one experiment, the notes SING generated for pitches not seen during training were rated closer to the ground truth than those of a state-of-the-art encoder based on DeepMind's WaveNet in almost 70 percent of cases. SING's generation time is also 2,500x faster, and its training time is 32x faster.
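To make the difference between one-by-one and batched generation concrete, here is a minimal sketch that counts how many generation steps each approach needs for a single note. It is not SING's implementation: the 16 kHz sample rate and 4-second note length are assumptions, the function names are hypothetical, and a plain sine wave stands in for the neural network.

```python
import math

SAMPLE_RATE = 16_000   # assumption: 16 kHz audio
FRAME_SIZE = 1_024     # from the text: up to 1,024 samples per step


def sequential_steps(num_samples, samples_per_step):
    """Number of generation steps needed to cover num_samples."""
    return math.ceil(num_samples / samples_per_step)


def generate(num_samples, samples_per_step, step_fn):
    """Build a waveform by calling step_fn once per chunk.

    step_fn(start, count) is a hypothetical stand-in for a model call
    that returns `count` samples starting at index `start`.
    """
    waveform = []
    for start in range(0, num_samples, samples_per_step):
        count = min(samples_per_step, num_samples - start)
        waveform.extend(step_fn(start, count))
    return waveform


def sine_step(start, count, freq=440.0):
    """Placeholder 'model': a 440 Hz sine wave, not a neural network."""
    return [math.sin(2 * math.pi * freq * (start + i) / SAMPLE_RATE)
            for i in range(count)]


note_samples = 4 * SAMPLE_RATE  # a 4-second note (assumed length)
one_by_one = sequential_steps(note_samples, 1)
in_frames = sequential_steps(note_samples, FRAME_SIZE)
print(one_by_one, in_frames)  # 64000 63
```

Under these assumptions, a sample-by-sample generator needs 64,000 sequential steps for one note, while a generator that emits 1,024 samples per step needs 63. The step count alone does not explain the reported speedups, since each step's cost depends on the model, but it illustrates why producing audio in large blocks leaves far fewer sequential operations.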

WHY IT MATTERS:

SING opens up new opportunities for creating high-quality audio in real time that sounds closer to what real musical instruments produce, especially when compared with traditional synthesizers. This research could potentially be used to separate a song into individual audio files, one for each instrument or voice, or even to interpret a musical score and produce audio in the style of a musician. The system's faster generation and training times could make previously compute-heavy applications, such as automatic music generation, more accessible to researchers with limited training resources.