Knowledge Boosting During Low-latency Inference

Vidya Srinivas¹   Malek Itani¹   Tuochao Chen¹

Sefik Emre Eskimez²   Takuya Yoshioka³   Shyamnath Gollakota¹

¹ University of Washington, ² Microsoft, ³ AssemblyAI

25th Interspeech Conference (Interspeech 2024)



[Paper]   [Code]   [TSE Dataset Part 1]   [TSE Dataset Part 2]   [SS Dataset Part 1]   [SS Dataset Part 2]  

Abstract

Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks, or 48 ms. Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications.
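To make the setup concrete, below is a minimal, hypothetical sketch of the inference loop described above: a small on-device model emits output every 8 ms chunk, while a larger remote model processes the same chunks and sends back hint embeddings that only become available after a 6-chunk (48 ms) communication delay. All names, layer sizes, the 128-sample chunk length (8 ms at 16 kHz), and the concatenation-based hint fusion are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

CHUNK = 128   # samples per 8 ms chunk at 16 kHz (assumption)
DELAY = 6     # communication delay in chunks (6 * 8 ms = 48 ms)

class SmallStreamer(nn.Module):
    """Hypothetical on-device model: processes one chunk at a time and
    fuses a time-delayed hint from the remote model when available."""
    def __init__(self, dim=64, hint_dim=128):
        super().__init__()
        self.encode = nn.Linear(CHUNK, dim)
        self.fuse = nn.Linear(dim + hint_dim, dim)  # assumed fusion: concat + linear
        self.decode = nn.Linear(dim, CHUNK)
        self.hint_dim = hint_dim

    def forward(self, chunk, hint=None):
        h = torch.relu(self.encode(chunk))
        if hint is None:  # no hint has arrived yet (first DELAY chunks)
            hint = torch.zeros(chunk.shape[0], self.hint_dim)
        h = torch.relu(self.fuse(torch.cat([h, hint], dim=-1)))
        return self.decode(h)

class LargeRemote(nn.Module):
    """Hypothetical remote model: larger, operates on delayed input, emits hints."""
    def __init__(self, dim=512, hint_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(CHUNK, dim), nn.ReLU(), nn.Linear(dim, hint_dim))

    def forward(self, chunk):
        return self.net(chunk)

small, large = SmallStreamer(), LargeRemote()
mixture = torch.randn(1, 100 * CHUNK)      # 100 chunks (~0.8 s) of audio
chunks = mixture.split(CHUNK, dim=-1)

outputs, hint_queue = [], []               # the queue models the 48 ms link delay
for t, chunk in enumerate(chunks):
    hint_queue.append(large(chunk))        # hint computed remotely at chunk t...
    hint = hint_queue[t - DELAY] if t >= DELAY else None  # ...usable only DELAY chunks later
    outputs.append(small(chunk, hint))     # small model still outputs every 8 ms

enhanced = torch.cat(outputs, dim=-1)
print(enhanced.shape)                      # torch.Size([1, 12800])
```

The key point the sketch illustrates is that the small model never waits on the network: it always produces a low-latency output, and the remote model's hints improve later chunks despite arriving 48 ms late.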

Target applications with knowledge boosting for augmented audio.

The knowledge boosting framework used for a source separation network.






Audio Samples

Below, we show comparisons of our method (KB) against a model of the same size as our small model (model without KB). Each KB audio sample is shown at a delay of 6 audio chunks (each 8 ms), corresponding to a communication delay of 48 ms.



Sample 1 (Task: Target Speech Extraction)

Input Mixture



Target   |   Model without KB   |   KB (Delay 48 ms)






Sample 2 (Task: Target Speech Extraction)

Input Mixture



Target   |   Model without KB   |   KB (Delay 48 ms)






Sample 3 (Task: Source Separation)

Input Mixture



Target 1   |   Model without KB   |   KB (Delay 48 ms)


Target 2   |   Model without KB   |   KB (Delay 48 ms)


Keywords: model collaboration, source separation, audio enhancement