Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement

Longbiao Cheng1, Ashutosh Pandey2, Buye Xu2, Tobi Delbruck1, Shih-Chii Liu1

1Institute of Neuroinformatics, University of Zurich and ETH Zurich
2Reality Labs Research, Meta

Introduction

As shown below, when a Recurrent Neural Network (RNN) processes natural signals such as speech, the activations of some neurons change slowly across time steps.


Activation patterns of neurons in a Gated Recurrent Unit (GRU) layer over time steps. The GRU layer is from a speech enhancement model. Each row represents a neuron, while columns show activations at different time steps.

From this observation, we propose a new method that reduces the computation of conventional RNNs by updating only a selected subset of neurons at each step.

Dynamic Gated RNN (DG-RNN)

In conventional RNN models, every neuron in the hidden state is updated at each step. In contrast, DG-RNN introduces a novel component: a binary select gate \(\boldsymbol{g}_t\). This gate dynamically determines which subset of neurons should be updated at each step \(t\).

Neurons that are not selected by the select gate skip their update at that step and keep their values from the previous hidden state. This selective updating reduces computation.


Illustration of the update processes of (A) conventional RNN and (B) DG-RNN at step \(t\).
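Conceptually, a single DG-RNN step can be sketched as follows. This is a minimal NumPy sketch with hypothetical names; for clarity the conventional update is computed for all neurons and then masked, whereas an efficient implementation would compute it only for the selected neurons, which is where the savings come from.

```python
import numpy as np

def dg_rnn_step(x_t, h_prev, rnn_step, g_t):
    """One DG-RNN step with a binary select gate g_t (1 = update, 0 = keep).

    rnn_step(x_t, h_prev) is the update of the underlying conventional RNN cell.
    Here the update is computed for all neurons and then masked for clarity;
    an efficient implementation computes it only for the selected neurons.
    """
    h_candidate = rnn_step(x_t, h_prev)            # conventional update
    return g_t * h_candidate + (1 - g_t) * h_prev  # unselected neurons keep h_{t-1}

# Toy usage: a 4-neuron state where only the first two neurons are updated.
h_prev = np.zeros(4)
g_t = np.array([1.0, 1.0, 0.0, 0.0])
h_t = dg_rnn_step(np.ones(3), h_prev, lambda x, h: np.full(4, 0.5), g_t)
# h_t == [0.5, 0.5, 0.0, 0.0]
```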

Dynamic GRU (D-GRU)

When applying DG-RNN to the GRU, no extra parameters are needed, because the GRU's built-in update gate can be reused to select which neurons to update.

The GRU hidden state update equations at step \(t\) are: \[\begin{align} \color{blue}{\text{Reset Gate: }} \boldsymbol{r}_t &= \sigma(\mathbf{W}_{ir}\boldsymbol{x}_t + \boldsymbol{b}_{ir} + \mathbf{W}_{hr}\boldsymbol{h}_{t-1} + \boldsymbol{b}_{hr}) \\ \color{blue}{\text{Candidate State: }} \boldsymbol{c}_t &= \tanh(\mathbf{W}_{ic}\boldsymbol{x}_t + \boldsymbol{b}_{ic} + \boldsymbol{r}_t \ast (\mathbf{W}_{hc}\boldsymbol{h}_{t-1} + \boldsymbol{b}_{hc})) \\ \color{green}{\text{Update Gate: }} \boldsymbol{z}_t &= \sigma(\mathbf{W}_{iz}\boldsymbol{x}_t + \boldsymbol{b}_{iz} + \mathbf{W}_{hz}\boldsymbol{h}_{t-1} + \boldsymbol{b}_{hz}) \\ \text{State Update: } \boldsymbol{h}_t &= \boldsymbol{z}_t \ast \boldsymbol{c}_t + (1 - \boldsymbol{z}_t) \ast \boldsymbol{h}_{t-1} \end{align}\]

For a neuron \(j\), when \(z^j_t\) is close to 1, the hidden state \(h^j_t\) is largely replaced by the candidate state \(c^j_t\). Conversely, when \(z^j_t\) is close to 0, \(h^j_t\) stays close to \(h^j_{t-1}\).

In the proposed D-GRU, we update only the neurons with the top-\(A\) largest values in the update gate \(\boldsymbol{z}_t\). For neurons that are not selected, the computation of the reset gate \(r_t^j\) and the candidate state \(c_t^j\) can be skipped.
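A minimal single-step NumPy sketch of this selection (unbatched, with hypothetical parameter names; an actual implementation would be vectorized over frames and batches):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_gru_step(x_t, h_prev, params, A):
    """One D-GRU step. The update gate z_t is always computed and acts as the
    select gate; only the top-A neurons (largest z) compute their reset gate
    and candidate state, while the rest keep their previous hidden-state values.

    params is a dict of the standard GRU weights/biases (hypothetical layout):
    W_iz, b_iz, W_hz, b_hz, W_ir, b_ir, W_hr, b_hr, W_ic, b_ic, W_hc, b_hc.
    """
    p = params
    # Update gate (always computed, see the z_t equation above)
    z = sigmoid(p["W_iz"] @ x_t + p["b_iz"] + p["W_hz"] @ h_prev + p["b_hz"])
    sel = np.argsort(z)[-A:]              # indices of the top-A update-gate values

    h_t = h_prev.copy()                   # unselected neurons keep h_{t-1}
    # Reset gate and candidate state, computed only for the selected rows
    r = sigmoid(p["W_ir"][sel] @ x_t + p["b_ir"][sel]
                + p["W_hr"][sel] @ h_prev + p["b_hr"][sel])
    c = np.tanh(p["W_ic"][sel] @ x_t + p["b_ic"][sel]
                + r * (p["W_hc"][sel] @ h_prev + p["b_hc"][sel]))
    # Standard GRU interpolation, applied only to the selected neurons
    h_t[sel] = z[sel] * c + (1.0 - z[sel]) * h_prev[sel]
    return h_t
```

Setting \(A\) to the full hidden size recovers the conventional GRU update.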

Since the update gate \(\boldsymbol{z}_t\) must always be computed, it accounts for one third of the GRU computation; the reset gate and candidate state (the remaining two thirds) are computed only for the selected fraction of neurons. The total computation of the D-GRU is therefore \((1+2\mathcal{P})/3\) of that of a conventional GRU, where \(\mathcal{P}\) is the ratio of selected neurons.
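For example, with \(\mathcal{P} = 50\%\) the D-GRU requires \((1 + 2 \times 0.5)/3 \approx 67\%\) of the GRU computation, and with \(\mathcal{P} = 25\%\) only \((1 + 2 \times 0.25)/3 = 50\%\).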

Demo

Audio examples comparing the speech enhancement performance of D-GRU-based networks (with \(\mathcal{P} \in \{25\%, 50\%, 75\%\}\)) and conventional GRU-based networks (\(\mathcal{P} = 100\%\)).

Scroll down for more samples. Zoom in to see the spectrogram details.


Cite

@inproceedings{cheng2024dynamic,
title={Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement},
author={Cheng, Longbiao and Pandey, Ashutosh and Xu, Buye and Delbruck, Tobi and Liu, Shih-Chii},
year=2024,
booktitle={Proc. INTERSPEECH 2024},
}