Adaptive Computation Time
My notes for the paper: Adaptive Computation Time for Recurrent Neural Networks¹.
Additive vs multiplicative halting probability
Multiplicative: In the paper (footnote 1), the authors thoroughly discuss their considerations for deciding the computation time. They acknowledge that using the halting values $h_t^n$ directly as the halting probability at step $n$ might seem more straightforward. Under this view, the overall halting probability becomes $$p_t^n = h_t^n \prod_{u=1}^{n-1} (1 - h_t^u).$$ The factors $(1 - h_t^u)$ for the previous update steps encode that the computation has not halted before step $n$.
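Below is a minimal NumPy sketch (mine, not from the paper) of turning per-step halting values into this product-form distribution; the function name and example values are made up for illustration.

```python
import numpy as np

def multiplicative_halting_probs(h):
    """Product-form halting distribution: p[n] = h[n] * prod_{u<n} (1 - h[u]),
    where h[n] in (0, 1) is the halting value at update step n (hypothetical helper)."""
    h = np.asarray(h, dtype=float)
    survived = np.cumprod(1.0 - h)                      # prob. of not having halted after each step
    not_halted_before = np.concatenate(([1.0], survived[:-1]))
    return h * not_halted_before

# Three update steps with halting values 0.2, 0.5, 0.9:
p = multiplicative_halting_probs([0.2, 0.5, 0.9])
print(p, p.sum())   # [0.2, 0.4, 0.36], sum ~0.96; the leftover mass is the prob. of never halting in 3 steps
```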
As the $p_t^n \in (0, 1)$ are relatively independent of one another and $\sum_n p_t^n$ is not constrained to reach $1$ within any fixed number of steps, this approach does not prevent the update depth from growing arbitrarily. The model can of course be trained to lower the expected ponder time $\rho_t = \sum_n n \, p_t^n$ (see the sketch after this list), but it was observed in experiments that the resulting model is undesirable in two ways:
- $h_t^1$ usually sits just below the halting threshold, the intermediate $h_t^n$ stay close to $0$, and only the final $h_t^N$ is large enough to halt the update.
- because the expectation is kept low, $p_t^N \ll p_t^1$; the network compensates by learning output states of much larger magnitude at step $N$, so the final output is still dominated by the final state.
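As a quick illustration of the incentive behind these two bullets, here is a small sketch (my own naming) of the expected ponder time $\rho_t = \sum_n n \, p_t^n$: pushing probability mass toward early steps lowers $\rho_t$, which is exactly the pressure that produces the degenerate solution above.

```python
import numpy as np

def expected_ponder_time(p):
    """rho_t = sum_n n * p_t^n, counting update steps from 1 (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    steps = np.arange(1, len(p) + 1)
    return float(np.sum(steps * p))

# Pushing probability mass onto the first step gives a much lower expected
# ponder time than leaving it on the last step.
print(expected_ponder_time([0.90, 0.05, 0.05]))  # ~1.15
print(expected_ponder_time([0.05, 0.05, 0.90]))  # ~2.85
```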
Additive: In contrast, the additive approach enforces the constraint $\sum_n p_t^n = 1$, so the remaining probability mass decreases monotonically as the number of updates grows and the computation must eventually halt. Although the total ponder time (the total number of updates over all positions) is not differentiable, it is penalized in the loss to avoid spending unnecessary computation. This approach still has one drawback, however: performance is sensitive to the time penalty $\tau$, which is not intuitive to choose as a hyperparameter.
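For contrast, here is a sketch of the additive halting rule as I understand it: halting values are accumulated until their running sum reaches $1 - \epsilon$, the remainder $R$ is assigned to the final step so the probabilities sum exactly to $1$, and the ponder cost is $N + R$. The function name and example values are my own, and a practical implementation would also cap the maximum number of steps.

```python
import numpy as np

def additive_halting(h, eps=0.01):
    """Accumulate halting values until their running sum reaches 1 - eps,
    then assign the remainder R to the final step so that sum(p) == 1.
    Returns the probabilities and the ponder cost N + R (hypothetical helper)."""
    p, running = [], 0.0
    for n, h_n in enumerate(h, start=1):
        if running + h_n >= 1.0 - eps:
            remainder = 1.0 - running
            p.append(remainder)
            return np.array(p), n + remainder   # ponder cost rho_t = N(t) + R(t)
        p.append(h_n)
        running += h_n
    raise ValueError("halting values never reached 1 - eps; "
                     "a real implementation would cap the number of steps")

# The position halts at step 3; the remainder 0.3 becomes p_t^3:
p, rho = additive_halting([0.3, 0.4, 0.8])
print(p, p.sum(), rho)   # p sums to 1; ponder cost is about 3.3
```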