Hierarchical Contextual Refinement Networks for Human Pose Estimation

Introduction

Predicting human pose is a challenging problem due to the high flexibility of joints and possible occlusions. One of the challenges comes from the heterogeneous flexibilities and complexities of human joints.

However, most existing approaches ignore these differences in joint complexity and deal with all joints together in a holistic way.

This paper takes the complexities of joints into consideration.

The authors devise a complexity-aware hierarchical model that distributes body joints into different layers according to their complexities. Easy joints are estimated first, and difficult ones are addressed later, utilizing the estimation results of the easier ones.

Four layers of HCRN (a data-structure sketch follows this list)

  1. neck
  2. head, left/right shoulder, left/right hip
  3. left/right elbow, left/right knee
  4. left/right wrist, left/right ankle
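
To make the hierarchy concrete, here is a minimal data-structure sketch in Python. The layer assignment follows the list above; the parent mapping (which layer-$(l-1)$ joint each joint refines from) is an assumption based on the kinematic chain and is not stated explicitly in these notes.

```python
# Four-layer joint hierarchy, following the list above.
HIERARCHY = {
    1: ["neck"],
    2: ["head", "l_shoulder", "r_shoulder", "l_hip", "r_hip"],
    3: ["l_elbow", "r_elbow", "l_knee", "r_knee"],
    4: ["l_wrist", "r_wrist", "l_ankle", "r_ankle"],
}

# Assumed contextual (parent) joint for each joint in layers 2-4,
# inferred from the kinematic chain rather than stated in the notes.
PARENT = {
    "head": "neck", "l_shoulder": "neck", "r_shoulder": "neck",
    "l_hip": "neck", "r_hip": "neck",
    "l_elbow": "l_shoulder", "r_elbow": "r_shoulder",
    "l_knee": "l_hip", "r_knee": "r_hip",
    "l_wrist": "l_elbow", "r_wrist": "r_elbow",
    "l_ankle": "l_knee", "r_ankle": "r_knee",
}
```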

Model

The goal is to estimate the positions $\mathcal{P}=\{P_i\}_{i=1}^N$ of $N$ joints $\mathcal{J}=\{J_i\}_{i=1}^N$ for a given human body image $I$.

Front-end CNN

This paper extracts the feature $F\in R^{c\times h\times w}$ as the input to HCRN from “fc7” of VGG16, “res5c” of Res101, and the highest-level features of the last stage of CPM and of HG, respectively.

c: the channel dimension of the discriminative representation

h, w: the spatial height and width of $F$
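
As an illustration of the front-end, here is a hedged sketch using a torchvision VGG16 backbone; the exact cut point (“fc7” of VGG16, “res5c” of Res101, or the last stage of CPM/HG) depends on the chosen backbone, so truncating at the last conv block below is only a stand-in.

```python
import torch
import torchvision.models as models

# Illustrative front-end CNN: keep only VGG16's conv blocks as a stand-in
# for the backbone feature F of shape (c, h, w).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone = vgg.features               # conv blocks only; output channels c = 512

image = torch.randn(1, 3, 224, 224)   # a dummy input image I
with torch.no_grad():
    F = backbone(image)               # F: (1, 512, 7, 7) for a 224x224 input
print(F.shape)
```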

HCRN

F is then fed into the HCRN to generate heatmaps $\mathcal{H} = \{H_i\}_{i=1}^N$ for all joints. Each element in a heatmap $H_i$ indicates the probability that the corresponding location contains the $i$-th joint.

CRUs are organized into a complexity-aware hierarchy of human joints, and HCRN applies CRUs layer-by-layer to generate heatmaps for all joints.

The position $P_i$ for joint $J_i$ is localized by taking the location with the maximum confidence score on heatmap $H_i$.
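
A small sketch of this peak-picking step, assuming the heatmaps are collected in a PyTorch tensor of shape $(N, h, w)$; mapping the heatmap coordinates back to image coordinates (undoing the backbone stride) is omitted.

```python
import torch

def localize_joints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Pick the location with the maximum confidence score on each heatmap H_i.

    heatmaps: tensor of shape (N, h, w), one map per joint.
    Returns an (N, 2) integer tensor of (row, col) peaks in heatmap coordinates.
    """
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.view(n, -1).argmax(dim=1)          # peak index per joint
    rows = torch.div(flat_idx, w, rounding_mode="floor")
    cols = flat_idx % w
    return torch.stack([rows, cols], dim=1)
```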

Contextual Refinement Unit (CRU)

CRU estimates the heatmap $H_i^l$ for joint $J_i^l$ in layer $l$ according to the deep representation $F$ and the heatmap $H_{i*}^{l-1}$ of the corresponding joint $J_{i*}^{l-1}$ in layer $l-1$.

There are four steps of calculations.

Step 1: Heatmap Initialization

The first step is to estimate the initial heatmap $O^l$ for a joint $J^l$ in layer $l$
$$
O^l=\sigma(W_O^l * F + B_O^l)
$$

*: a $1\times 1$ convolutional operator

$\sigma(\cdot)​$: the sigmoid activation function

$W_O^l\in R^{c\times 1\times 1}, B_O^l\in R^{1\times 1\times 1}$: weight and bias parameters

c: the channel dimension of F
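
A minimal PyTorch sketch of this step, assuming one CRU per joint so the $1\times 1$ convolution has a single output channel; the module name `HeatmapInit` is hypothetical.

```python
import torch
import torch.nn as nn

class HeatmapInit(nn.Module):
    """Step 1: O^l = sigmoid(W_O^l * F + B_O^l) via a 1x1 convolution."""

    def __init__(self, c: int):
        super().__init__()
        # W_O^l in R^{c x 1 x 1}, B_O^l in R: a 1x1 conv with one output channel.
        self.conv = nn.Conv2d(c, 1, kernel_size=1)

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (batch, c, h, w) -> O^l: (batch, 1, h, w)
        return torch.sigmoid(self.conv(F))
```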

Step 2: Heatmap Diffusion

The second step is to spread the confidence score of each position in the heatmap $H^{l-1}$ from the previous layer to its neighbors. CRU performs a probability diffusion operation $\Phi(\cdot)$ on the deformation information of $H^{l-1}$ to generate the diffused heatmap $\hat H^{l-1}$. For a pixel $P$ in $H^{l-1}$, the probability diffusion function $\Phi(\cdot)$ is
$$
\Phi(H^{l-1}(P))=\max_{\delta\in\mathcal{N}}(H^{l-1}(P+\delta)-W_d^{l-1}d(\delta))
$$

$\delta=(\delta_x, \delta_y)$: the position offset

$\mathcal{N}=[-r,r]\times[-r,r]$: the range of $\delta$ defining the fusion field ($r = 7$ in this paper)

$d(\delta)=[\delta_x, \delta_y, \delta_x^2, \delta_y^2]$: the deformation feature

$W_d^{l-1}​$: a 4D weight vector
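
A naive sketch of the diffusion operation under the definitions above: it enumerates every offset $\delta$ in the $(2r+1)\times(2r+1)$ window and takes the element-wise maximum of the shifted, deformation-penalized maps. The function name `diffuse_heatmap` is hypothetical, and a real implementation would likely vectorize this (e.g., with unfold).

```python
import torch
import torch.nn.functional as nnf

def diffuse_heatmap(H: torch.Tensor, w_d: torch.Tensor, r: int = 7) -> torch.Tensor:
    """Step 2: probability diffusion Phi applied to the previous-layer heatmap.

    H   : (batch, 1, h, w) heatmap H^{l-1}.
    w_d : 4-d weight vector W_d^{l-1} for d(delta) = [dx, dy, dx^2, dy^2].
    """
    # Pad with -inf so out-of-range neighbors never win the max.
    padded = nnf.pad(H, (r, r, r, r), mode="constant", value=float("-inf"))
    _, _, h, w = H.shape
    out = torch.full_like(H, float("-inf"))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # H^{l-1}(P + delta): crop of the padded map shifted by (dx, dy).
            shifted = padded[:, :, r + dy : r + dy + h, r + dx : r + dx + w]
            d = torch.tensor([dx, dy, dx * dx, dy * dy],
                             dtype=H.dtype, device=H.device)
            penalty = (w_d * d).sum()          # W_d^{l-1} . d(delta), a scalar
            out = torch.maximum(out, shifted - penalty)
    return out
```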

Step 3: Heatmap Stacking

The third step is to stack the initial heatmap $O^l$ with the diffused heatmap $\hat{H}^{l-1}$
$$
C^l = O^l \bigoplus \hat{H}^{l-1}
$$

$\bigoplus​$: the concatenation operator

Step 4: Heatmap Refinement

The fourth step is to refine the initial heatmap $O^l$ by selecting proper contextual information from $\hat{H}^{l-1}$ to generate the estimation result $H^l$ for joint $J^l$ in layer $l$.
$$
H^l=\sigma(W_C^l * C^l + B_C^l)
$$

*: a $1\times 1$ convolutional operator

$\sigma(\cdot)$: the sigmoid activation function

$W_C^l\in R^{2\times 1\times 1}, B_C^l\in R^{1\times 1\times 1}$: weight and bias parameters
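
Putting the four steps together, here is a minimal sketch of a single CRU for one joint; it reuses the hypothetical `HeatmapInit` and `diffuse_heatmap` helpers sketched in the earlier steps.

```python
import torch
import torch.nn as nn

class CRU(nn.Module):
    """Contextual Refinement Unit: steps 1-4 for a single joint J^l."""

    def __init__(self, c: int, r: int = 7):
        super().__init__()
        self.heat_init = HeatmapInit(c)            # step 1 (sketched above)
        self.w_d = nn.Parameter(torch.zeros(4))    # W_d^{l-1}: 4D weight vector
        self.r = r
        # Step 4: W_C^l in R^{2x1x1}, B_C^l in R, as a 1x1 conv over the 2-channel stack.
        self.refine = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, F: torch.Tensor, H_prev: torch.Tensor) -> torch.Tensor:
        O = self.heat_init(F)                                  # step 1: initial heatmap O^l
        H_hat = diffuse_heatmap(H_prev, self.w_d, self.r)      # step 2: diffused H^{l-1}
        C = torch.cat([O, H_hat], dim=1)                       # step 3: channel-wise stacking
        return torch.sigmoid(self.refine(C))                   # step 4: refined heatmap H^l
```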

Questions

  1. Is the order of complexity fixed for joints in every image?
  2. The worst situation: what if the neck (Layer 1) is occluded?
  3. How to apply this framework to multi-person pose estimation?
  4. How to apply it to video human pose estimation?