Neural egg separation: Extracting audio from noise

WHAT THE RESEARCH IS:

A new method for identifying distinct images and sounds within noisy environments, a long-standing challenge for machine learning (ML) systems. Called Neural Egg Separation (NES) — a reference to separating egg whites and yolks — this approach isolates audio and visual sources through a series of comparisons between signals that are clear and ones that are obscured.

HOW IT WORKS:

Humans are inherently good at separating out individual sounds and visuals, such as hearing another person’s voice at a crowded cocktail party or spotting an animal as it moves through bushes. But applications that rely on machine learning often struggle with this task. A fully supervised approach to solving this problem, which would involve training on clean samples of every source, requires a volume of training data that might not be feasible to collect. An entirely unsupervised approach would risk models making inaccurate assumptions about the sources of mixed signals.

The researchers propose NES, a semi-supervised approach that combines aspects of training and estimation. In this iterative method, the system separates known and unknown distributions by repeatedly mixing signals together and analyzing the results. In the process, the system gradually injects more of the known signal into the mix, while the model continuously improves at isolating and extracting it. The experiments show that NES significantly outperforms other methods that employ similar amounts of supervision, and that it is competitive even with systems that use full supervision.
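
The iterative loop described above can be illustrated with a small toy sketch. This is not the researchers' implementation: it swaps the neural network for a linear least-squares separator, assumes both sources are zero-mean signals lying in different linear subspaces, and starts from the crude guess that the unknown source is half of each mixture. The function names (iterative_separation, demo) and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def iterative_separation(mixtures, known_samples, num_rounds=5, seed=0):
    """Toy iterative separation of an unknown source from a known one.

    mixtures:      (n, d) array; each row is unknown source + known source.
    known_samples: (m, d) array of clean examples of the known source.
    Returns an (n, d) array of estimates of the unknown source.
    """
    rng = np.random.default_rng(seed)
    n = mixtures.shape[0]
    # Starting guess (an assumption of this sketch): half of each mixture.
    estimate = 0.5 * mixtures
    for _ in range(num_rounds):
        # Build synthetic mixtures whose components are known exactly:
        # the current estimates plus randomly drawn clean known-source samples.
        idx = rng.integers(0, known_samples.shape[0], size=n)
        synthetic = estimate + known_samples[idx]
        # Fit a linear separator that maps synthetic mixtures back to the
        # estimates they were built from (stand-in for training a network).
        weights, *_ = np.linalg.lstsq(synthetic, estimate, rcond=None)
        # Apply the separator to the real mixtures to refine the estimates.
        estimate = mixtures @ weights
    return estimate


def demo(seed=0):
    """Compare the error of the starting guess with the refined estimate."""
    rng = np.random.default_rng(seed)
    n, d, k = 2000, 6, 3
    # Hypothetical sources: each lives in its own random k-dim subspace.
    basis_unknown = rng.normal(size=(k, d))
    basis_known = rng.normal(size=(k, d))
    unknown = rng.normal(size=(n, k)) @ basis_unknown
    known_in_mix = rng.normal(size=(n, k)) @ basis_known
    known_clean = rng.normal(size=(n, k)) @ basis_known
    mixtures = unknown + known_in_mix
    estimate = iterative_separation(mixtures, known_clean)
    err_start = np.mean((0.5 * mixtures - unknown) ** 2)
    err_end = np.mean((estimate - unknown) ** 2)
    return err_start, err_end


if __name__ == "__main__":
    start, end = demo()
    print(f"mean squared error: start={start:.4f}, end={end:.4f}")
```

The ground-truth unknown source is used only inside demo to measure error; the separation loop itself sees just the mixtures and the clean samples of the known source. A linear separator is enough for this toy setting only because the two sources occupy different subspaces; real audio or images would call for a learned nonlinear model.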

WHY IT MATTERS:

In addition to generally improving ML systems’ ability to understand audio and visual input in realistically cluttered and noisy conditions, this approach could lead to tools that actually enhance people’s natural capacity to isolate signals. Related applications could range from making audio clearer in videos recorded at concerts (or similar events) to developing AR-based applications that could amplify, in real time, a specific audio source or visual feature.