Introduction
This paper tackles object localization under a weakly supervised paradigm, where only image-level labels indicate the presence of an object.
Current Weakly Supervised Object Localization (WSOL) methods model the missing object locations as latent variables and alternate between object candidate generation and candidate refinement to optimize the localization model. However, such optimization is non-convex and prone to getting stuck in local minima.
To address this problem, this paper proposes a Multi-view Learning Localization Network (ML-LocNet), which exploits the complementary and consensus properties of different views to describe the target regions more accurately and mitigate the over-fitting issue. ML-LocNet is independent of the backbone architecture and can easily be combined with existing WSOL models.
Model
The model learning consists of two phases:
- Augment the region representation with multi-view feature concatenation, and mine object instances with image-level supervision.
- Utilize a multi-view co-training algorithm to refine the localization model with the mined instances.
Phase One: Multi-view Representation Learning
In the first phase, the network aggregates region scores to compute the classification loss.
Region Dropout
Since high-scored regions tend to focus on object parts instead of the whole object, the network quickly converges to local minima due to overfitting.
To solve this problem, this paper performs a random dropout on the region proposals $\mathcal R$ (obtained with Edge Boxes in this paper) and passes only the remaining subset $\mathcal R'$ to the ROI pooling layers.
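A minimal sketch of this region dropout, assuming proposals are stored as an $(N,4)$ box array; the function name and the keep ratio are illustrative, not values taken from the paper:

```python
import numpy as np

def region_dropout(proposals: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Randomly keep a subset R' of the region proposals R.

    proposals: (N, 4) array of boxes (x1, y1, x2, y2).
    keep_ratio: fraction of proposals passed on to ROI pooling
                (illustrative value; not stated in the paper).
    """
    n_keep = max(1, int(len(proposals) * keep_ratio))
    keep = np.random.choice(len(proposals), size=n_keep, replace=False)
    return proposals[keep]
```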
Multi-view Features
The base network is divided into three branches, with each branch representing one view. This paper adds a convolutional block, called the view adaptation block, to each view.
Formally, for a feature layer of size $m\times n$ with $p_I$ channels, the view adaptation block is a small $3\times3\times p_I \times p_O$ kernel that produces an output feature map of size $m\times n\times p_O$.
Then each view is followed by an ROI pooling layer that projects the region proposals $\mathcal R$ onto the feature map, producing the region features $\phi_i(x,\mathcal R)$.
Finally, the features from the different views are combined with weights to form the final representation:
$$
\phi(x,\mathcal R) = [\alpha_1\phi_1(x,\mathcal R)\ \alpha_2\phi_2(x,\mathcal R)\ \alpha_3\phi_3(x,\mathcal R)]
$$
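A sketch of how the view adaptation blocks and the weighted concatenation could look in PyTorch; the channel sizes, ROI pooling settings, and weights $\alpha_i$ are placeholder assumptions, not values from the paper:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class MultiViewFeatures(nn.Module):
    def __init__(self, branches, p_in=512, p_out=512, alphas=(1.0, 1.0, 1.0)):
        super().__init__()
        self.branches = nn.ModuleList(branches)   # three backbone branches (views)
        # one 3x3 view adaptation block per view
        self.adapt = nn.ModuleList(
            nn.Conv2d(p_in, p_out, kernel_size=3, padding=1) for _ in branches
        )
        self.alphas = alphas

    def forward(self, x, rois, output_size=(7, 7), spatial_scale=1 / 16):
        feats = []
        for branch, adapt, a in zip(self.branches, self.adapt, self.alphas):
            fmap = adapt(branch(x))                           # m x n x p_O map
            phi = roi_pool(fmap, rois, output_size, spatial_scale)
            feats.append(a * phi)                             # alpha_i * phi_i(x, R)
        return torch.cat(feats, dim=1)                        # concatenation
```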
Two-stream Network
The concatenated output $\phi(x, \mathcal R)$ is then branched into the two-stream architecture of WSDDN to obtain category-specific scores.
Given an image $x$ with region proposals $\mathcal R$ and image-level labels $y\in\{1,-1\}^C$, where $y_c=1$ ($y_c=-1$) indicates the presence (absence) of an object class $c$, the score of region $r$ corresponding to class $c$ is defined as:
$$
x_{cr} = \frac{e^{\phi^{cr}(x,f_{c8C})}}{\sum_{i=1}^C e^{\phi^{ir}(x, f_{c8C})}}\cdot\frac{e^{\phi^{cr}(x,f_{c8R})}}{\sum_{j=1}^{|\mathcal{R}|}e^{\phi^{cj}(x, f_{c8R})}}
$$
$\phi(x, f_{c8C}), \phi(x, f_{c8R})$: the outputs of the $f_{c8C}$ and $f_{c8R}$ layers, respectively, each of size $C\times|\mathcal R|$, where $C$ is the number of categories and $|\mathcal R|$ is the number of regions
Based on $x_{cr}$, the category-specific image score is defined as:
$$
\phi^c(x,w_{cls}) =\sum_{j=1}^{|\mathcal R|}x_{cj}
$$
$w_{cls}$: the parameters of the non-linear mapping from input $x$ to the classification output.
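A compact sketch of the two-stream scoring and the image-level aggregation, assuming the $f_{c8C}$ and $f_{c8R}$ outputs are given as $(C, |\mathcal R|)$ tensors:

```python
import torch
import torch.nn.functional as F

def two_stream_scores(phi_c: torch.Tensor, phi_r: torch.Tensor):
    """WSDDN-style two-stream scoring.

    phi_c, phi_r: (C, |R|) outputs of the fc8C / fc8R layers.
    Returns region scores x_cr of shape (C, |R|) and image scores phi^c of shape (C,).
    """
    sigma_c = F.softmax(phi_c, dim=0)   # softmax over classes, per region
    sigma_r = F.softmax(phi_r, dim=1)   # softmax over regions, per class
    x = sigma_c * sigma_r               # element-wise product -> x_cr
    return x, x.sum(dim=1)              # phi^c(x, w_cls) = sum_j x_cj
```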
Loss function
The network is trained by back-propagating an image-level binary log loss:
$$
L_{cls}(x,y) = -\sum_{i=1}^C \log(y_i(\phi^i(x, w_{cls})-1/2)+1/2)
$$
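A direct transcription of this loss, assuming scores $\phi^i(x,w_{cls})\in[0,1]$ and labels in $\{1,-1\}$; the clamp is a numerical-stability assumption:

```python
import torch

def image_level_log_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary log loss over image-level labels.

    scores: (C,) tensor of phi^i(x, w_cls), each in [0, 1].
    labels: (C,) tensor with entries in {1, -1}.
    """
    return -torch.log((labels * (scores - 0.5) + 0.5).clamp_min(1e-8)).sum()
```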
Phase Two: Multi-view Co-training
The second phase introduces a multi-view instance refinement procedure, which trains the network with instance-level loss and refines the network via a multi-view co-training strategy.
Initial Object Instances
To relieve the issue that high-scored regions tend to focus on object parts rather than the whole object, this paper considers the top-scored regions as soft voters. The object heatmap $H^c$ for class $c$, which gives the confidence that pixel $p$ lies in an object, is computed as:
$$
H^c(p)=\sum_r x_{cr}D_r(p)/Z
$$
$D_r(p)$: equals $1$ when the $r$-th region proposal contains $p$, and $0$ otherwise
$Z$: a normalization constant ensuring $\max H^c(p)=1$
The heatmap $H^c$ is binarized with threshold $T=0.5$, and the tightest bounding box that encloses the largest connected component is chosen as the mined object instance.
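A sketch of the heatmap voting and instance mining using SciPy connected components; the image-size handling and the threshold default are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def mine_instance(boxes: np.ndarray, scores: np.ndarray, h: int, w: int, thresh: float = 0.5):
    """Compute the class heatmap H^c and mine one object instance.

    boxes:  (R, 4) proposals (x1, y1, x2, y2); scores: (R,) region scores x_cr.
    thresh: binarization threshold T.
    """
    heat = np.zeros((h, w), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes.astype(int), scores):
        heat[y1:y2 + 1, x1:x2 + 1] += s           # sum_r x_cr * D_r(p)
    heat /= max(heat.max(), 1e-8)                 # normalize so max H^c(p) = 1
    labeled, n = ndimage.label(heat > thresh)     # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(heat > thresh, labeled, range(1, n + 1))
    ys, xs = np.where(labeled == int(np.argmax(sizes)) + 1)
    return xs.min(), ys.min(), xs.max(), ys.max() # tightest enclosing box
```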
Multi-view Co-training
- Construct different views with different convolutional parameters
- Run random mini-batch sampling for each view
- For instance mining, the mean localized outputs of any two views are used as the mined object instance for the remaining view during the next training iteration (see the sketch after this list)
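A schematic of the pseudo-label exchange among the three views; merging by box averaging is one simple choice assumed here, not necessarily the paper's exact rule:

```python
import numpy as np

def merge_boxes(b1, b2):
    """Average two mined boxes (one simple merging choice; illustrative)."""
    return (np.asarray(b1) + np.asarray(b2)) / 2.0

def exchange_pseudo_labels(mined_per_view):
    """mined_per_view[i]: box mined by view i for an image.

    Each view's supervision for the next step comes from the merged
    boxes of the other two views.
    """
    return {
        i: merge_boxes(*[mined_per_view[j] for j in range(3) if j != i])
        for i in range(3)
    }
```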
Weighted Loss
For a relocalized object $x_c^o$ with label $y_c^o=1$, the weighted loss in the next training step is defined as:
$$
L_{cls} (x_c^o, y_c^o, M_{k+1}) = -\phi^c(x_c^o,w_{loc}^k)\log\phi^c(x_c^o,w_{loc}^{k+1})
$$
$\phi^c(x_c^o, w_{loc}^k)$: the detection score returned by network $M_k$, where $w_{loc}^k$ denotes the parameters of $M_k$
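A minimal sketch of this weighted loss, where the previous network's score acts as a fixed (detached) weight on the current log score; the clamp is a numerical-stability assumption:

```python
import torch

def weighted_instance_loss(score_prev: torch.Tensor, score_curr: torch.Tensor) -> torch.Tensor:
    """-phi^c(x_c^o, w_loc^k) * log phi^c(x_c^o, w_loc^{k+1}).

    score_prev: detection score from M_k, treated as a constant weight.
    score_curr: detection score from the network M_{k+1} being trained.
    """
    return -score_prev.detach() * torch.log(score_curr.clamp_min(1e-8))
```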
The whole algorithm is summarized as:
- Phase one: train the multi-view network with region dropout, multi-view feature concatenation, and the image-level loss $L_{cls}$, then mine initial object instances from the heatmaps $H^c$.
- Phase two: iteratively co-train the three views, training each view with the weighted instance-level loss on instances mined by the other two views, until convergence.