Introduction
This paper focuses on object detection under a weakly supervised paradigm, where only image-level labels indicating the presence of an object are available.
The main challenge in weakly supervised object detection is how to disentangle object instances from the complex background.
Existing methods model the object locations as latent variables and optimize them with various heuristics. However, such optimization is non-convex and prone to getting stuck in poor local minima.
To address this problem, this paper proposes a zigzag learning strategy for weakly supervised object detection. It progressively feeds images into the learning model in an easy-to-difficult order to build up detection ability, and introduces a masking regularization strategy on the last convolutional feature maps that randomly erases the most discriminative regions during training to avoid overfitting.
Model
The overall architecture consists of three modules:
- The first module estimates image difficulty automatically
- The second module progressively adds samples to network training in an easy-to-difficult order
- The third module regularizes the highly responsive patches to enhance the generalization ability of the model
Step1 - Estimate Image Difficulty
This paper utilizes WSDDN as the baseline network.
Given an image $x$ with region proposals $\mathcal R$ (obtained by Edge Boxes in this paper) and image-level labels $y\in \{1,-1\}^C$, where $y_c=1\ (y_c=-1)$ indicates the presence (absence) of an object of class $c$, the score of region $r$ for class $c$ is defined as:
$$
x_{cr} = \frac{e^{\phi^{cr}(x,f_{c8C})}}{\sum_{i=1}^{C}e^{\phi^{ir}(x, f_{c8C})}}\cdot\frac{e^{\phi^{cr}(x,f_{c8R})}}{\sum_{j=1}^{|\mathcal{R}|}e^{\phi^{cj}(x, f_{c8R})}}
$$
$\phi(x, f_{c8C})$ and $\phi(x, f_{c8R})$: the outputs of the $f_{c8C}$ and $f_{c8R}$ layers, respectively, each of size $C\times|\mathcal R|$, where $C$ is the number of categories and $|\mathcal R|$ is the number of regions
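As a rough NumPy sketch of this two-stream scoring (the array names `phi_cls` and `phi_det` for the raw $f_{c8C}$ and $f_{c8R}$ outputs are my own, not the paper's):

```python
import numpy as np

def wsddn_region_scores(phi_cls: np.ndarray, phi_det: np.ndarray) -> np.ndarray:
    """Two-stream WSDDN scoring.

    phi_cls, phi_det: raw fc8C / fc8R outputs of shape (C, |R|).
    Returns x with x[c, r] = (softmax over classes) * (softmax over regions).
    """
    # Classification stream: softmax over classes for each region (column-wise).
    cls = np.exp(phi_cls - phi_cls.max(axis=0, keepdims=True))
    cls /= cls.sum(axis=0, keepdims=True)
    # Detection stream: softmax over regions for each class (row-wise).
    det = np.exp(phi_det - phi_det.max(axis=1, keepdims=True))
    det /= det.sum(axis=1, keepdims=True)
    # Element-wise product gives the region scores x_{cr}.
    return cls * det
```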
For each class with $y_c=1$ in image $x$, the region scores $x_{cr}\ (r\in\{1,\dots,|\mathcal R|\})$ are sorted in descending order to form the sorted list $x_{cr'}$, with $r'$ a permutation of $\{1,\dots,|\mathcal R|\}$. The accumulated scores of $x_{cr'}$ are then computed to obtain $X_c\in \mathbb R^{|\mathcal R|}$, with
$$
X_{cr} = \frac{\sum_{j=1}^{r} x_{cr'(j)}}{\sum_{j=1}^{|\mathcal R|}x_{cj}}
$$
$X_c$ lies in the range $[0, 1]$ and indicates how quickly the region scores accumulate. If the top scores concentrate on only a few regions, then $X_c$ converges quickly to 1 and the target object is easier to pick out.
This paper introduces Energy Accumulated Scores (EAS) to quantify the convergence of $X_c$.
$$
EAS(X_c, t) = \frac{X_{cj_{[t]}}}{j_{[t]}},\quad j_{[t]}=\min\{\, j : X_{cj} \geq t \,\}.
$$
$t$: a threshold
The mean Energy Accumulated Score (mEAS) is then defined as the mean of the EAS values at eleven equally spaced energy levels $\{0, 0.1,\dots,1\}$:
$$
mEAS(X_c) = \frac{1}{11}\sum_{t\in\{0,0.1,\dots,1\}}EAS(X_c,t)
$$
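A minimal NumPy sketch of the accumulation and the EAS/mEAS computation for a single class, assuming `x_c` holds the region scores $x_{cr}$ from the previous step (names are illustrative):

```python
import numpy as np

def meas(x_c: np.ndarray, levels=np.linspace(0.0, 1.0, 11)) -> float:
    """mEAS for one class: mean EAS over the energy levels 0, 0.1, ..., 1."""
    # Sort region scores in descending order and accumulate; X_c ends at exactly 1.
    X_c = np.cumsum(np.sort(x_c)[::-1])
    X_c /= X_c[-1]
    eas = []
    for t in levels:
        j_t = int(np.argmax(X_c >= t)) + 1   # smallest (1-based) index with X_cj >= t
        eas.append(X_c[j_t - 1] / j_t)
    return float(np.mean(eas))
```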
Mining object instances
To relieve the issue that high-scoring regions tend to focus on object parts rather than the whole object, this paper treats the top-scored regions as soft votes rather than picking a single region. The object heatmap $H^c$ for class $c$, which gives the confidence that pixel $p$ lies in an object, is computed as:
$$
H^c(p)=\sum_r x_{cr}D_r(p)/Z
$$
$D_r(p)=1$ when the $r$th region proposal contains pixel $p$, and $D_r(p)=0$ otherwise
$Z$: a normalization constant ensuring $\max_p H^c(p)=1$
The heatmap $H^c$ is binarized with threshold $T=0.5$, and the tightest bounding box enclosing the largest connected component is chosen as the mined object instance.
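A possible NumPy/SciPy sketch of the mining step, assuming proposals are given as `(x1, y1, x2, y2)` boxes and using `scipy.ndimage.label` for the connected components (the box convention and helper choices are my assumptions):

```python
import numpy as np
from scipy import ndimage

def mine_object_instance(boxes: np.ndarray, scores: np.ndarray,
                         height: int, width: int, thresh: float = 0.5):
    """Soft-voting heatmap H^c and the mined box for one class.

    boxes : (|R|, 4) proposals as (x1, y1, x2, y2); scores : (|R|,) region scores x_{cr}.
    """
    heat = np.zeros((height, width), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes.astype(int), scores):
        heat[y1:y2 + 1, x1:x2 + 1] += s          # every covering proposal votes with its score
    heat /= heat.max()                           # normalize so that max_p H^c(p) = 1

    # Binarize with threshold T and keep the largest connected component.
    labeled, n = ndimage.label(heat >= thresh)
    if n == 0:
        return heat, None
    sizes = np.bincount(labeled.ravel())[1:]     # component sizes, background label 0 dropped
    ys, xs = np.where(labeled == int(np.argmax(sizes)) + 1)
    return heat, (xs.min(), ys.min(), xs.max(), ys.max())   # tightest enclosing box
```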
Step2 - Progressive Detection Network
Given the image difficulty scores and the mined seed positive instances, the training images $\mathcal D$ are split into $K$ folds $\mathcal D=\{\mathcal D_1, \dots, \mathcal D_K\}$ in an easy-to-difficult order.
The detector alternates between model training and object relocalization:
- Train a Fast R-CNN detector on the first fold $\mathcal D_1$ to obtain a trained model $M_{\mathcal D_1}$
- Use the trained model $M_{\mathcal D_1}$ to discover object instances in fold $\mathcal D_2$, then retrain on the enlarged set and repeat fold by fold (see the sketch below)
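A hedged sketch of this alternation, where `train_fast_rcnn` and `relocalize` are hypothetical placeholders for the actual detector training and relocalization code:

```python
def progressive_training(folds, seed_instances):
    """Easy-to-difficult training over the K folds.

    `train_fast_rcnn` and `relocalize` are hypothetical placeholders for the
    actual Fast R-CNN training and object-relocalization routines.
    """
    train_images = list(folds[0])
    annotations = dict(seed_instances)                   # image -> mined box(es)
    model = train_fast_rcnn(train_images, annotations)   # M_{D_1}
    for fold in folds[1:]:
        # Discover object instances in the next (harder) fold with the current model ...
        annotations.update(relocalize(model, fold))
        # ... then enlarge the training set and retrain.
        train_images += list(fold)
        model = train_fast_rcnn(train_images, annotations)
    return model
```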
Weighted Loss
The weighted loss of region $x_c^o$ in the next training step is defined as:
$$
L_{cls} (x_c^o, y_c^o, M_{k+1}) = -\phi^c(x_c^o,M_k)\log\phi^c(x_c^o,M_{k+1})
$$
$x_c^o$: the relocalized object with instance label $y_c^o = 1$
$\phi^c(x_c^o, M_k)$: the detection score returned by network $M_k$
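A minimal PyTorch-style sketch of this weighted loss, where the previous model's score acts as a fixed weight on the standard log-loss (the batch reduction and the epsilon are my assumptions):

```python
import torch

def weighted_cls_loss(score_prev: torch.Tensor, score_curr: torch.Tensor) -> torch.Tensor:
    """Weighted loss for relocalized objects.

    score_prev: phi^c(x_c^o, M_k), detached so it acts as a fixed confidence weight.
    score_curr: phi^c(x_c^o, M_{k+1}), the score being optimized.
    """
    eps = 1e-8                                   # numerical safety for the log
    return -(score_prev.detach() * torch.log(score_curr + eps)).mean()
```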
Step3 - Convolutional Feature Masking Regularization
Due to the lack of object annotations, the initial seeds inevitably include inaccurate samples, and the detector tends to overfit to these inaccurate instances. To avoid this issue, this paper proposes a regularization strategy that randomly masks out the most discriminative regions during training, which forces the network to also attend to less discriminative details.
Given an image $x$ and the mined object $x_c^o$ for each $y_c=1$, a region $\Omega\subset x_c^o$ with $S_\Omega/S_{x_c^o}=\tau$ is randomly selected, where $S_\Omega$ denotes the area of region $\Omega$, and is masked out by setting $\phi(\Omega, f_{conv})=0$.
$\phi(x,f_{conv})$: the last convolutional feature maps
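A rough PyTorch sketch of the masking step, assuming the mined box has already been projected to feature-map coordinates and scaling each side by $\sqrt{\tau}$ so the masked area ratio is roughly $\tau$ (both assumptions, not the paper's exact recipe):

```python
import torch

def mask_conv_features(feat: torch.Tensor, box, tau: float = 0.5) -> torch.Tensor:
    """Zero out a random sub-region Omega of the mined box on the last conv feature map.

    feat : (C, H, W) feature map phi(x, f_conv).
    box  : (x1, y1, x2, y2) of the mined object, already in feature-map coordinates.
    tau  : target area ratio S_Omega / S_{x_c^o}; the value 0.5 is only a placeholder.
    """
    x1, y1, x2, y2 = map(int, box)
    bw, bh = x2 - x1, y2 - y1
    # Scale each side by sqrt(tau) so the masked area is roughly tau * box area.
    mw, mh = max(1, int(bw * tau ** 0.5)), max(1, int(bh * tau ** 0.5))
    mx = x1 + torch.randint(0, max(1, bw - mw + 1), (1,)).item()
    my = y1 + torch.randint(0, max(1, bh - mh + 1), (1,)).item()
    masked = feat.clone()
    masked[:, my:my + mh, mx:mx + mw] = 0        # set phi(Omega, f_conv) = 0
    return masked
```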
The whole process is summarized as: estimate image difficulty via mEAS and mine seed instances from the heatmaps, then progressively train the detector fold by fold with the weighted loss and the convolutional feature masking regularization.
Question
In the image-difficulty estimation step, given an image $x$ with label $y\in\{1,-1\}^C$, we compute $mEAS(X_c)$ for each $y_c=1$, while in the progressive learning step the training images are divided into folds according to mEAS.
What is the mEAS for an image $x$? Is it simply computed as $mEAS(X) =\sum_{c=1}^C mEAS(X_c)$?
Given the observation that different categories have different detection difficulties, there may exist situations where an image with low mEAS has a quite high $mEAS(X_c)$ for some category $c$, or an image with high mEAS has a quite low $mEAS(X_c)$ for some $c$. In the former case, $x_c^o$ cannot be effectively used, while in the latter, $x_c^o$ may introduce inaccurate instances.
So maybe we should consider designing a more elegant image-difficulty criterion that fully exploits the differences between categories and the semantic context among them, to obtain better performance.