
Human Pose Estimation with Parsing Induced Learner

Introduction

Human pose estimation, which aims to estimate the joint locations of the human body, is a fundamental task in computer vision. However, distracting factors, e.g., occlusion, self-similarity, large deformation, and huge variation in pose configuration and appearance, often lead to inaccurate joint localization and even false joint categorization.

Human body parts, generated by human parsing methods, can provide useful contextual cues to help localize body joints.

Some existing works exploit parsing information to improve pose estimation performance; however, they generally perform human body parsing and pose estimation separately and only use the parsing results to refine body joint localization as post-processing.

This paper designs a novel Parsing Induced Learner (PIL) to learn to fast adapt the pose estimation model conditioned on the parsing information extracted from a specific sample.

Model

Given an input RGB image $I\in R^{M\times N \times 3}$ of size $M\times N$, our goal is to detect the locations $P = \{(x_i, y_i)\}_{i=1}^J$ of human joints with the assistance of the corresponding human parsing map $S\in \{0,1,\dots,L\}^{M\times N}$ of $I$.

$(x_i, y_i)$ : coordinates of the $i$th joint

$J, L$: the numbers of joint and body part categories, respectively

0 in $S$: the background category

The parsing induced pose estimation model is formulated as:
$$
f_{[\theta,\theta']}:I\to P,\ \text{where}\ \theta' = g(I,S) \tag{1} \label{eq1}
$$

$\theta, \theta'$: learnable parameters

This paper designs a Parsing Induced Learner (PIL) to explicitly learn the function $g(\cdot)$. The proposed PIL consists of:

  1. A parsing encoder $E_{\theta^S}^S(\cdot)$ for extracting parsing features
  2. A parameter adapter $K_\phi(\cdot)$ for learning the dynamic parameters $\theta'$ from the features output by the parsing encoder $E_{\theta^S}^S(\cdot)$.

Therefore, $\theta'$ can be formulated as:
$$
\theta' = g(I,S) := K_\phi(E_{\theta^S}^S(I))
$$

Pose Encoder

The pose encoder $E_{\theta^P}^P(\cdot)$ extracts discriminative features $F^P = E_{\theta^P}^P(I)$ from the input image $I$.

This paper implements it with a VGG16-based Fully Convolutional Network (FCN) and with an Hourglass network, respectively, for two model variants.
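As an illustration, below is a minimal PyTorch sketch of the VGG16-based FCN variant of the pose encoder. The truncation point, the 256 output channels, and the use of a torchvision backbone are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a VGG16-based FCN pose encoder (backbone choice and
# output channel count are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn
import torchvision

class PoseEncoder(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.backbone = vgg.features             # fully convolutional trunk, 512 channels out
        self.head = nn.Conv2d(512, out_channels, kernel_size=1)

    def forward(self, image):                     # image: (B, 3, M, N)
        return self.head(self.backbone(image))    # F^P: (B, out_channels, M/32, N/32)
```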

PIL

The proposed PIL consists of a parsing encoder for extracting the parsing features, and a parameter adapter for learning the dynamic parameters $\theta'$.

Parsing Encoder

The parsing encoder $E_{\theta^S}^S(\cdot)$ extracts features $F^S = E_{\theta^S}^S(I)$ for both parsing and pose estimation.

It is likewise implemented with a VGG16-based FCN and an Hourglass network, respectively.

Parameter adapter

The parameter adapter $K_\phi(\cdot)$ is a one-shot learner that predicts the dynamic parameters $\theta'$ by taking in the output $F^S$ of $I$ from the parsing encoder network. The tensor $\theta'\in R^{h\times h \times c}$ serves as the predicted dynamic convolutional kernels of the pose encoder network.

$h=7$: the convolutional kernel size

$c=c_i\times c_o$: the number of channels to learn for adaptive convolution

$c_i, c_o$: the number of input and output channels, respectively

This paper implements it as a small CNN with learnable parameters $\phi$.

In practice, it is infeasible to directly predict all the convolutional parameters due to their large scale. To solve this problem, this paper performs the following factorization to reduce the number of free parameters:
$$
\theta' = U * \tilde{\theta} *_c V
$$

$*$: the convolution operation

$*_c$: the channel-wise convolution

$U, V$: parameters to learn for the adaptive convolution

$\tilde{\theta}\in R^{h\times h\times c_i}$: the actual parameters predicted by the parameter adapter
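A minimal sketch of the parameter adapter $K_\phi(\cdot)$ under these definitions: a small CNN mapping $F^S$ to the reduced kernel $\tilde{\theta}\in R^{h\times h\times c_i}$. The layer sizes and the use of adaptive average pooling to reach the $h\times h$ spatial size are assumptions.

```python
# Sketch of the parameter adapter K_phi (layer sizes and pooling are assumptions).
import torch.nn as nn
import torch.nn.functional as F

class ParameterAdapter(nn.Module):
    def __init__(self, parsing_channels, c_i, h=7):
        super().__init__()
        self.h = h
        self.body = nn.Sequential(
            nn.Conv2d(parsing_channels, c_i, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_i, c_i, kernel_size=3, padding=1),
        )

    def forward(self, parsing_feat):                        # F^S: (B, C_S, H, W)
        x = self.body(parsing_feat)
        # pool to an h x h grid so the output can act as an h x h dynamic kernel
        return F.adaptive_avg_pool2d(x, (self.h, self.h))   # theta_tilde: (B, c_i, h, h)
```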

Adaptive Convolution

Given $F^P$, this paper first applies a $1\times 1$ convolution with $V$, then conducts the dynamic channel-wise (grouped) convolution with $\tilde{\theta}$, and finally applies another $1\times 1$ convolution with $U$ to generate $F^a$:
$$
F^a = \theta' * F^P = U * \tilde{\theta} *_c V * F^P
$$

$F^a$: the features extracted by the dynamic parameters $\theta'$

$U\in R^{1\times 1\times c_i\times c_o}, V\in R^{1\times 1\times c_i \times c_i}$: learnable parameters
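A hedged sketch of this factorized adaptive convolution in PyTorch: the $1\times 1$ convolutions with $V$ and $U$ are ordinary layers, while the channel-wise convolution with $\tilde{\theta}$ is realized as a per-sample depthwise (grouped) convolution. Looping over the batch is one simple way to apply a different predicted kernel to each sample; it is an implementation choice, not necessarily the paper's.

```python
# Sketch of F^a = U * theta_tilde *_c V * F^P (per-sample dynamic depthwise conv).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv(nn.Module):
    def __init__(self, c_i, c_o, h=7):
        super().__init__()
        self.V = nn.Conv2d(c_i, c_i, kernel_size=1, bias=False)   # 1x1 conv with V
        self.U = nn.Conv2d(c_i, c_o, kernel_size=1, bias=False)   # 1x1 conv with U
        self.h = h

    def forward(self, pose_feat, theta_tilde):   # F^P: (B, c_i, H, W); theta_tilde: (B, c_i, h, h)
        x = self.V(pose_feat)
        outs = []
        for xb, kb in zip(x, theta_tilde):
            # channel-wise (depthwise) convolution with the sample-specific kernel
            outs.append(F.conv2d(xb.unsqueeze(0), kb.unsqueeze(1),
                                 padding=self.h // 2, groups=kb.size(0)))
        return self.U(torch.cat(outs, dim=0))    # F^a: (B, c_o, H, W)
```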

Feature Fusion

This paper regards $F^a$ as a residual component and fuses it with the original features $F^P$ via addition:
$$
F^{P*} = F^P + F^a
$$

$F^{P*}$: the final feature refined by parsing information for human pose estimation

Classifiers

This paper implements two $1\times 1$ convolutions as linear classifiers: $C_{\omega^P}^P$ on $F^{P*}$ to predict the confidence map for each joint type, and $C_{\omega^S}^S$ on $F^S$ to generate the parsing prediction.

$\omega^P, \theta^P$ together instantiate $\theta$ in Eqn.($\ref{eq1}$).
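A minimal sketch of the two classifier heads as $1\times 1$ convolutions; the feature channel counts and the example values of $J$ and $L$ below are assumptions.

```python
# Sketch of the linear classifiers (channel counts and J, L values are example assumptions).
import torch.nn as nn

J, L = 16, 19   # example numbers of joints and body-part categories (assumptions)
pose_classifier = nn.Conv2d(256, J, kernel_size=1)         # C^P_{omega^P} applied to F^{P*}
parsing_classifier = nn.Conv2d(256, L + 1, kernel_size=1)  # C^S_{omega^S} applied to F^S (L parts + background)
```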

Loss function

This paper defines the following loss function for training:
$$
\mathcal{L} := \mathcal{L}^P(C_{\omega^P}^P(E_{[\theta^P, \theta']}^P(I)),\hat{P}) + \beta\,\mathcal{L}^S(C_{\omega^S}^S(E_{\theta^S}^S(I)),\hat{S})
$$

$\hat{P}, \hat{S}$: pose and parsing annotations

$\mathcal{L}^P$: the pose loss function, Mean Square Error loss

$\mathcal{L}^S$: the parsing loss function, Cross Entropy loss

$\beta$: a trade-off coefficient
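A minimal sketch of this loss under the stated choices (MSE for the pose confidence maps, cross-entropy for parsing); the tensor shapes are assumptions.

```python
# Sketch of L = L^P + beta * L^S (shapes are assumptions).
import torch.nn.functional as F

def pil_loss(pose_pred, pose_gt, parsing_logits, parsing_gt, beta=1.0):
    # pose_pred, pose_gt: (B, J, H, W) joint confidence maps
    # parsing_logits: (B, L + 1, H, W); parsing_gt: (B, H, W) with labels in {0, ..., L}
    return F.mse_loss(pose_pred, pose_gt) + beta * F.cross_entropy(parsing_logits, parsing_gt)
```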

Pose estimation

Single-person pose estimation

Directly output the position with the maximum response for each type of body joint.
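For concreteness, a minimal sketch of this decoding step: take the argmax of each joint's confidence map.

```python
# Sketch: pick the location of the maximum response for each joint type.
import torch

def decode_single_person(conf_maps):              # conf_maps: (J, H, W)
    J, H, W = conf_maps.shape
    flat_idx = conf_maps.view(J, -1).argmax(dim=1)
    xs, ys = flat_idx % W, flat_idx // W
    return torch.stack([xs, ys], dim=1)           # (J, 2) giving (x_i, y_i) per joint
```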

Multi-person pose estimation

Perform Non-Maximum Suppression (NMS) on the predicted confidence maps to find joint candidates.
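One common way to do this (the exact procedure is an assumption): keep only local maxima of each confidence map above a score threshold, e.g., via a max-pooling trick.

```python
# Sketch: local-maximum NMS on confidence maps (window size and threshold are assumptions).
import torch
import torch.nn.functional as F

def joint_candidates(conf_maps, thresh=0.1, window=5):          # conf_maps: (J, H, W)
    pooled = F.max_pool2d(conf_maps.unsqueeze(0), window,
                          stride=1, padding=window // 2).squeeze(0)
    peaks = (conf_maps == pooled) & (conf_maps > thresh)
    # per joint type: (y, x) coordinates and scores of the surviving candidates
    return [(peaks[j].nonzero(), conf_maps[j][peaks[j]]) for j in range(conf_maps.size(0))]
```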

Question

  1. We can apply parametric pose NMS to refine the results of multi-person pose estimation.

Parametric pose NMS

RMPE: Regional Multi-person Pose Estimation

NMS Scheme

  1. The most confident pose is selected as the reference, and poses close to it are eliminated by applying the elimination criterion.
  2. This process is repeated on the remaining pose set until redundant poses are eliminated and only unique poses are reported (a minimal sketch of this greedy loop is given after this list).
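A minimal sketch of the greedy scheme; `dist_fn` is assumed to wrap the distance $d(\cdot,\cdot|\Lambda)$ defined below with its parameters fixed, and `eta` is the elimination threshold.

```python
# Sketch of the greedy pose NMS loop (dist_fn and eta are supplied by the caller).
def pose_nms(poses, scores, dist_fn, eta):
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        ref = order.pop(0)                       # most confident remaining pose = reference
        kept.append(ref)
        # eliminate poses judged redundant with the reference (criterion f = 1)
        order = [i for i in order if dist_fn(poses[i], poses[ref]) > eta]
    return [poses[i] for i in kept]
```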

Elimination Criterion

This paper defines a pose similarity to eliminate poses that are too close and too similar to each other. The elimination criterion is written as follows:
$$
f(P_i, P_j|\Lambda,\eta) = \mathbb{1}[d(P_i,P_j|\Lambda)\leq \eta]
$$

$P_i$: a pose with $m$ joints denoted as $\{\langle k_i^1, c_i^1\rangle,\dots,\langle k_i^m, c_i^m\rangle\}$

$k_i^j, c_i^j$: the location and confidence score of the $j^{\text{th}}$ joint, respectively

$d(P_i, P_j|\Lambda)$: a pose distance metric to measure the pose similarity

$\eta$: a threshold for the elimination criterion

$\Lambda$: a parameter set of function $d(\cdot)$

If $d(\cdot)$ is smaller than $\eta$, the output of $f(\cdot)$ should be 1, indicating that pose $P_i$ should be eliminated due to redundancy with the reference pose $P_j$.

Pose Distance

A soft matching function that counts the number of matched joints between two poses is defined as:
$$
K_{Sim}(P_i,P_j|\sigma_1) =
\begin{cases}
\sum_n \tanh \frac{c_i^n}{\sigma_1}\cdot \tanh\frac{c_j^n}{\sigma_1}, & \text{if}\ k_j^n\ \text{is within}\ \mathcal{B}(k_i^n) \\
0, & \text{otherwise}
\end{cases}
$$

$B_i$: the box for $P_i$

$\mathcal{B}(k_i^n)$: a box centered at $k_i^n$, each dimension of which is 1/10 of the original box $B_i$

$\sigma_1$: a parameter

The distance between parts is written as:
$$
H_{Sim}(P_i,P_j|\sigma_2) = \sum_n \exp\left[-\frac{(k_i^n-k_j^n)^2}{\sigma_2}\right]
$$
The final distance function is written as:
$$
d(P_i, P_j|\Lambda)=K_{Sim}(P_i, P_j|\sigma_1)+\lambda H_{Sim}(P_i,P_j|\sigma_2)
$$

$\lambda$: a weight parameter

$\Lambda = \{\sigma_1,\sigma_2,\lambda\}$
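A hedged sketch of these similarity terms and the combined distance, assuming each pose is given as an $m\times 2$ array of joint locations and an $m$-vector of confidences, and that $\mathcal{B}(k_i^n)$ is an axis-aligned box centered at $k_i^n$ whose sides are 1/10 of the reference box $B_i$.

```python
# Sketch of K_Sim, H_Sim and d = K_Sim + lambda * H_Sim (box handling is an assumption).
import numpy as np

def k_sim(ki, ci, kj, cj, box_wh, sigma1):
    # joints of P_j falling inside B(k_i^n): a box centered at k_i^n with sides box_wh / 10
    half = np.asarray(box_wh, dtype=float) / 10.0 / 2.0
    inside = np.all(np.abs(kj - ki) <= half, axis=1)
    return float(np.sum(np.tanh(ci / sigma1) * np.tanh(cj / sigma1) * inside))

def h_sim(ki, kj, sigma2):
    return float(np.sum(np.exp(-np.sum((ki - kj) ** 2, axis=1) / sigma2)))

def pose_distance(ki, ci, kj, cj, box_wh, sigma1, sigma2, lam):
    return k_sim(ki, ci, kj, cj, box_wh, sigma1) + lam * h_sim(ki, kj, sigma2)
```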

Optimization

The four parameters in the elimination criterion $f(P_i,P_j|\Lambda, \eta)$ are optimized to achieve the maximal mAP (mean average precision) on the validation set.

To avoid an intractable search in a 4D space, this paper optimizes two parameters at a time while fixing the other two, in an iterative manner.
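A minimal sketch of this alternating scheme; `evaluate_map` is a hypothetical routine that runs the NMS with a given parameter setting and returns the validation mAP, and the parameter pairing, grids, and iteration count are assumptions.

```python
# Sketch of the alternating two-at-a-time grid search (pairing, grids, rounds,
# and evaluate_map are assumptions / hypothetical).
import itertools

def optimize_nms_params(evaluate_map, grids, init, n_rounds=5):
    params = dict(init)                                       # {'sigma1': .., 'sigma2': .., 'lam': .., 'eta': ..}
    for _ in range(n_rounds):
        for a, b in [('sigma1', 'sigma2'), ('lam', 'eta')]:   # optimize one pair, fix the rest
            best = max(itertools.product(grids[a], grids[b]),
                       key=lambda v: evaluate_map({**params, a: v[0], b: v[1]}))
            params[a], params[b] = best
    return params
```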