
A^2-Nets: Double Attention Networks

Introduction

Deep CNNs have been widely applied to image and video understanding, but they are limited by their convolutional operators, which are:

  1. Dedicated to capturing local features and relations

  2. Inefficient at modeling long-range interdependencies

Though stacking multiple convolutional operators can enlarge the receptive field, it also comes with unfavorable side effects:

  • It makes the model unnecessarily deep and large
  • Features far away from a location have to pass through a stack of layers before they can affect that location
  • The features a distant location sees are therefore delayed ones, coming from several layers behind

This paper proposes the double attention block, which enables a convolutional layer to sense the entire spatio-temporal space from its adjacent layer immediately.

Model

The double attention mechanism consists of two steps:

  1. Gather features from the entire space into a compact set through second-order attention pooling

  2. Select and distribute the gathered features to each location via a second attention mechanism

Let $X \in R^{c \times d \times h \times w}$ denote the input tensor of a spatio-temporal convolutional layer, with $c$ the number of channels, $d$ the temporal dimension, and $h, w$ the spatial dimensions of the input frames. For every spatio-temporal input location $i=1,\dots, dhw$ with local feature $v_i$, the new feature is defined as:
$$
z_i = F_{distr}(G_{gather}(X), v_i)
$$

$G_{gather}$: aggregates features from the entire input space

$F_{distr}$: distributes the gathered information to each location $i$, conditioned on the local feature $v_i$

The First Attention Step: Feature Gathering

Bilinear pooling

Compared with conventional average and max pooling, bilinear pooling captures second-order feature statistics and better preserves complex relations.

Given two feature maps $A=[a_1, \cdots, a_{dhw}]\in R^{m \times dhw}$ and $B=[b_1, \cdots, b_{dhw}]\in R^{n \times dhw}$, the output $G=[g_1, \dots, g_n]\in R^{m\times n}$ of bilinear pooling of $A$ and $B$ is defined as:
$$
G_{bilinear}(A,B)= AB^T=\sum_{\forall i}a_ib_i^T \tag{2} \label{eq2}
$$
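As a quick sanity check of Eqn. ($\ref{eq2}$), the short sketch below (PyTorch, with toy sizes chosen purely for illustration) verifies that the matrix product $AB^T$ equals the sum of per-location outer products $a_i b_i^T$:

```python
import torch

# Toy check of Eqn. (2): bilinear pooling A B^T equals the sum of
# per-location outer products a_i b_i^T. Sizes are arbitrary.
m, n, dhw = 4, 3, 10
A = torch.randn(m, dhw)
B = torch.randn(n, dhw)

G_matmul = A @ B.T                                           # (m, n)
G_outer = sum(torch.outer(A[:, i], B[:, i]) for i in range(dhw))
print(torch.allclose(G_matmul, G_outer, atol=1e-5))          # True
```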

Rewriting $B$ as $B = [\bar b_1; \cdots; \bar b_n]$, where $\bar b_i$ is a $dhw$-dimensional row vector, Eqn. ($\ref{eq2}$) can be reformulated as:
$$
g_i = A\bar b_i^T=\sum_{\forall j}\bar b_{ij}a_j
$$
Applying a softmax to $B$ to ensure $\sum_j \bar b_{ij}=1$, we get:
$$
g_i = A\,\mathrm{softmax}(\bar b_i)^T
$$
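A similarly small sketch of the attention view: once each row $\bar b_i$ goes through a softmax, every gathered vector $g_i$ becomes a convex combination of the local features $a_j$ (PyTorch again, illustrative sizes only):

```python
import torch

# After a row-wise softmax on B, each g_i = A softmax(b_i)^T is a
# weighted average of the columns a_j, with weights summing to 1.
m, n, dhw = 4, 3, 10
A = torch.randn(m, dhw)
B = torch.randn(n, dhw)

attn = torch.softmax(B, dim=1)   # each row of B now sums to 1 over the dhw locations
G = A @ attn.T                   # (m, n): n gathered global descriptors
print(attn.sum(dim=1))           # tensor of ones
print(G.shape)                   # torch.Size([4, 3])
```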

Benefits of second-order attention pooling

  1. Captures global features when $\bar b_i$ attends densely to all locations

  2. Captures the existence of a specific semantic concept when $\bar b_i$ attends sparsely to a specific region

Implementation Detail

In the implementation, $A=\phi(X;W_\phi)$ and $B=\mathrm{softmax}(\theta(X;W_\theta))$ are the outputs of two different convolutional layers that transform the input $X$.
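A minimal sketch of this gathering step for the 2D (image) case, assuming $\phi$ and $\theta$ are $1\times 1$ convolutions; the channel sizes `c_m` and `c_n` below are illustrative choices, not values taken from the paper:

```python
import torch
import torch.nn as nn

# Feature gathering: A and B come from two 1x1 convolutions on X,
# B is softmax-normalized over the spatial locations, and G = A B^T.
c, c_m, c_n, h, w = 64, 32, 32, 8, 8
x = torch.randn(1, c, h, w)

phi = nn.Conv2d(c, c_m, kernel_size=1)    # produces A
theta = nn.Conv2d(c, c_n, kernel_size=1)  # produces the attention maps B

A = phi(x).flatten(2)                             # (1, c_m, h*w)
B = torch.softmax(theta(x).flatten(2), dim=-1)    # softmax over the h*w locations
G = torch.bmm(A, B.transpose(1, 2))               # (1, c_m, c_n): gathered global features
print(G.shape)
```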

The Second Attention Step: Feature Distribution

Given the gathered features $G_{gather}(X)$, the paper then selects a subset of feature vectors from $G_{gather}(X)$ according to the local feature $v_i$ at each location $i$, using soft attention:
$$
z_i = \sum_{\forall j} v_{ij}\, g_j = G_{gather}(X)\, v_i, \quad \text{where } \sum_{\forall j} v_{ij}=1
$$

$V= \mathrm{softmax}(\rho(X; W_\rho))$, where $W_\rho$ contains the parameters of this convolutional layer.
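Continuing the sketch above, the distribution step could look like the following; `rho` is assumed to be another $1\times 1$ convolution, and the softmax runs over the `c_n` gathered vectors so that each location's weights sum to one:

```python
import torch
import torch.nn as nn

# Feature distribution: every location i picks a soft combination of the
# c_n gathered vectors in G, producing one output feature z_i per location.
c, c_m, c_n, h, w = 64, 32, 32, 8, 8
x = torch.randn(1, c, h, w)
G = torch.randn(1, c_m, c_n)                     # stand-in for G_gather(X)

rho = nn.Conv2d(c, c_n, kernel_size=1)
V = torch.softmax(rho(x).flatten(2), dim=1)      # (1, c_n, h*w), weights sum to 1 per location
Z = torch.bmm(G, V).view(1, c_m, h, w)           # (1, c_m, h, w): distributed features
print(Z.shape)
```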

Double Attention Block

The overall architecture can be formulated as:
$$
\begin{split}
Z &= F_{distr}(G_{gather}(X), V) \\
&= G_{gather}(X)\,\mathrm{softmax}(\rho(X; W_\rho)) \\
&= \left[\phi(X; W_\phi)\,\mathrm{softmax}(\theta(X; W_\theta))^T\right]\mathrm{softmax}(\rho(X; W_\rho))
\end{split}
$$
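Putting the two steps together, a minimal 2D sketch of the full block might look as follows; the output $1\times 1$ projection and the residual addition reflect common usage when inserting such a block into a backbone, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DoubleAttention(nn.Module):
    """Sketch of an A^2 block for the 2D case; c_m, c_n and the output
    projection are illustrative assumptions."""
    def __init__(self, c, c_m, c_n):
        super().__init__()
        self.phi = nn.Conv2d(c, c_m, 1)    # A = phi(X; W_phi)
        self.theta = nn.Conv2d(c, c_n, 1)  # B = softmax(theta(X; W_theta))
        self.rho = nn.Conv2d(c, c_n, 1)    # V = softmax(rho(X; W_rho))
        self.out = nn.Conv2d(c_m, c, 1)    # project back to c channels

    def forward(self, x):
        b, c, h, w = x.shape
        A = self.phi(x).flatten(2)                           # (b, c_m, hw)
        B = torch.softmax(self.theta(x).flatten(2), dim=-1)  # attention over locations
        V = torch.softmax(self.rho(x).flatten(2), dim=1)     # attention over the c_n global vectors
        G = torch.bmm(A, B.transpose(1, 2))                  # gather: (b, c_m, c_n)
        Z = torch.bmm(G, V).view(b, -1, h, w)                # distribute: (b, c_m, h, w)
        return x + self.out(Z)                               # residual addition (common usage)

x = torch.randn(2, 64, 8, 8)
block = DoubleAttention(c=64, c_m=32, c_n=32)
print(block(x).shape)    # torch.Size([2, 64, 8, 8])
```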