
Pose Partition Networks for Multi-Person Pose Estimation

Introduction

Multi-person pose estimation aims to localize body joints of multiple persons captured in a 2D monocular image.

Challenges

It is challenging due to the highly complex joint configurations, partial or even complete joint occlusion, significant overlap between neighboring persons, an unknown number of persons, and the difficulty of allocating joints to multiple persons.

Existing multi-person pose estimation approaches usually perform joint detection and joint partition separately, and mainly fall into two categories.

Top-down Strategy

  1. Detect persons
  2. Perform pose estimation for each single person individually

Superiority

Avoid complex joint partitions

Drawback

  1. The performance is critically limited by the quality of person detections
  2. Suffer from high joint detection complexity, which linearly increases with the number of persons in the image

Bottom-up Strategy

  1. Detect all joint candidates
  2. Partition them to corresponding person instances according to affinities

Superiority

  1. Lower joint detection complexity than top-down ones
  2. Better robustness to errors from early commitment

Drawback

High complexity of partitioning joints to corresponding persons

PPN

This paper proposes a novel Pose Partition Network (PPN) based on the Hourglass network for learning a joint detector and a dense regressor simultaneously.

  1. Model person detection and joint partition with a dense regression module that collects votes from joint candidates in a carefully designed embedding space
  2. Perform a local greedy inference algorithm to obtain joint categorization and association by assuming independence among person detections

Model

Given an image $I$, PPN infers the joint locations $p$, labels $u$, and proximities $b$ that maximize the likelihood.

The target conditional distribution with learnable parameters $\Theta$ is defined as:
$$
P(p,u,b|I,\Theta) = \sum_g P(p,u,b,g|I,\Theta) = \sum_g \underbrace{P(p|I,\Theta)P(g|I,\Theta, p)}_{\text{partition generation}}\underbrace{P(u,b|I,\Theta, p, g)}_{\text{joint configuration}}
$$

$I$: an image containing multiple persons

$p=\{p_1,p_2,\dots,p_N\}$: spatial coordinates of the $N$ joint candidates from all persons in $I$, with $p_v=(x_v,y_v)^T, \forall v=1,\dots,N$

$u=\{u_1,u_2,\dots,u_N\}$: the labels of the corresponding joint candidates, where $u_v\in\{1,2,\dots,K\}$ and $K$ is the number of joint categories

$b\in \mathbb{R}^{N\times N}$: the proximities between joints, where $b_{v,w}$ encodes the proximity between the $v$th joint candidate $(p_v, u_v)$ and the $w$th joint candidate $(p_w,u_w)$, i.e., the probability that they come from the same person; only joints falling in the same partition have non-zero proximities

$g = \{g_1, g_2, \dots, g_M\}$: latent variables encoding joint partitions; each $g_i$ is a collection of joint candidates (without labels) belonging to a specific person detection

Maximizing the above likelihood probability gives the optimal estimation for multiple persons in $I$.

Since directly maximizing this likelihood is computationally intractable, this paper instead maximizes its lower bound induced by a single "optimal" partition:
$$
P(p,u,b|I,\Theta) \geq P(p|I,\Theta)\max_g P(g|I,\Theta,p)P(u,b|I,\Theta,p,g)
$$
Then, $P(p,u,b,g|I,\Theta)$ is further factorized as:
$$
P(p,u,b,g|I,\Theta)=P(p,g|I,\Theta)\times \prod_{g_i\in g}P(u_{g_i}|I,\Theta,p,g_i)P(b_{g_i}|I, \Theta, p,g_i, u)
$$

$u_{g_i}, b_{g_i}$: the labels of joints falling in the partition $g_i$ and their proximities, respectively

$P(p,u,b,g|I,\Theta)$ is defined as a Gibbs distribution:
$$
P(p,u,b,g|I,\Theta)\propto \exp\{-E(p,u,b,g)\}
$$

$E(p,u,b,g)$ is the energy function for the joint distribution $P(p,u,b,g|I,\Theta)$

$$
E(p,u,b,g)=-\varphi(p,g) - \sum_{g_i\in g}\Big(\sum_{p_v\in g_i}\psi(p_v,u_v) + \sum_{p_v,p_w\in g_i}\phi(p_v, u_v, p_w, u_w)\Big) \tag{1} \label{eq1}
$$

$\varphi(p,g)$: scores the quality of the joint partitions $g$ generated from the joint candidates $p$ for the input image $I$

$\phi(p_v, u_v, p_w, u_w)$: scores how likely the position $p_v$ with label $u_v$ and the position $p_w$ with label $u_w$ belong to the same person

Joint Candidate Detection

Joint confidence maps, constructed by modeling joint locations as Gaussian peaks, encode the probability of each joint being present at every position in the image.

For a position $p_v$ in the given image,
$$
C_j^i(p_v) = \exp(-||p_v-p_j^i||_2^2/\sigma^2)
$$

$C_j$: the confidence map for the $j$th joint

$C_j^i$: the confidence map of the $j$th joint for the $i$th person

$$
C_j(p_v) = \max_i C_j^i(p_v)
$$
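For concreteness, the Gaussian-peak construction above can be sketched in NumPy; the function name and the list-of-peaks input format are illustrative, not from the paper:

```python
import numpy as np

def joint_confidence_map(peaks, height, width, sigma=2.0):
    """Confidence map for one joint type: a Gaussian bump at each person's
    joint location, merged by a pixel-wise max (C_j = max_i C_j^i)."""
    ys, xs = np.mgrid[0:height, 0:width]
    conf = np.zeros((height, width))
    for x_j, y_j in peaks:  # one (x, y) peak per person i
        c_i = np.exp(-((xs - x_j) ** 2 + (ys - y_j) ** 2) / sigma ** 2)
        conf = np.maximum(conf, c_i)
    return conf
```

Taking the max (rather than the sum) keeps the peak value at 1 even where the Gaussians of two nearby persons overlap.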

In experiment, this paper:

  1. Find peaks with confidence greater than a threshold $\tau$ (set to 0.1) on the predicted maps $\tilde{C}$ for all joint types
  2. Perform NMS to obtain the joint candidate set $\tilde{p}=\{p_1,p_2,\dots,p_N\}$
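The two candidate-extraction steps above can be sketched as a simple greedy NMS; the suppression radius and helper name are assumptions, since the paper does not specify the NMS details:

```python
import numpy as np

def find_joint_candidates(conf_map, tau=0.1, nms_radius=3):
    """Greedy peak picking with NMS: repeatedly take the highest-confidence
    pixel above tau and zero out its neighborhood (radius is an assumption)."""
    c = conf_map.copy()
    candidates = []
    while c.max() >= tau:
        y, x = np.unravel_index(int(c.argmax()), c.shape)
        candidates.append((x, y))
        # suppress the neighborhood of the selected peak
        c[max(0, y - nms_radius):y + nms_radius + 1,
          max(0, x - nms_radius):x + nms_radius + 1] = 0.0
    return candidates
```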

Pose Partition via Dense Regression

The proposed partition model transforms all joint candidates into an embedding space $\mathcal{H}$ to collectively determine centroid hypotheses of their corresponding person instances, and partitions joints into different person instances in a single feed-forward pass.

In $\mathcal{H}$, each person corresponds to a single point, and each point $h_*\in \mathcal{H}$ represents a hypothesis about the centroid location of a specific person instance.

The probability of generating the joint partition $g_*$ at location $h_*$ is computed by summing the votes from all joint candidates:
$$
P(g_*|h_*) \propto \sum_j w_j \Big(\sum_{p_v\in \tilde{p}}1[\tilde{C}_j(p_v)\geq\tau]\exp\{-||f_j(p_v)-h_*||_2^2\}\Big)
$$

$1[\cdot]$: the indicator function

$w_j$: the weight for votes from the $j$th joint category (fixed to 1 for all joint categories in this paper)

$f_j:p\to\mathcal{H}$: densely transforms every pixel in the image to the embedding space $\mathcal{H}$
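Assuming the thresholded candidates have already been mapped into $\mathcal{H}$, the vote accumulation for a hypothesis $h_*$ might look like the following (the dict layout and names are assumptions for illustration):

```python
import numpy as np

def partition_score(h_star, votes, weights=None):
    """Accumulate votes for a centroid hypothesis h_star.
    `votes` maps joint category j to an array of embedded positions
    f_j(p_v) for candidates that passed the confidence threshold."""
    score = 0.0
    for j, emb in votes.items():
        w_j = 1.0 if weights is None else weights[j]  # paper fixes w_j = 1
        d2 = np.sum((np.asarray(emb) - np.asarray(h_star)) ** 2, axis=1)
        score += w_j * np.exp(-d2).sum()
    return score
```

Each vote contributes a Gaussian of its distance to $h_*$, so hypotheses near many votes score highest.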

To learn $f_j$, a target regression map $T_j^i$ for the $j$th joint of the $i$th person is defined as:
$$
T_j^i(p_v) =
\begin{cases}
o_{j,v}^i/Z & \text{if } p_v \in \mathcal{N}_j^i, \\
0 & \text{otherwise},
\end{cases}
\qquad
o_{j,v}^i = p_c^i-p_v = (x_c^i-x_v,\ y_c^i-y_v)^T
$$

$p_c^i$: the centroid position of the $i$th person

$Z=\sqrt{H^2+W^2}$: the normalization factor, with $H,W$ as the height and width of image $I$

$\mathcal{N}_j^i = \{p_v \mid ||p_v-p_j^i||_2 \leq r\}$: the neighborhood of the $j$th joint of the $i$th person, with $r$ a neighborhood size constant (set to 7 in this paper)

Then, the target regression map $T_j$ for the $j$th joint, averaged over all persons, is written as:
$$
T_j(p_v)=\frac{1}{N_v}\sum_i T_j^i(p_v)
$$

$N_v$: the number of non-zero vectors at position $p_v$ across all persons
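A sketch of building the target regression map for one joint of one person, following the definitions above (function name and argument layout are assumptions):

```python
import numpy as np

def target_regression_map(joint_xy, centroid_xy, height, width, r=7):
    """Target map T_j^i for one joint of one person: within the radius-r
    neighborhood of the joint, each pixel p_v stores the normalized offset
    (p_c^i - p_v)/Z to the person centroid; zero elsewhere."""
    Z = np.sqrt(height ** 2 + width ** 2)  # normalization factor
    ys, xs = np.mgrid[0:height, 0:width]
    mask = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2 <= r ** 2
    offsets = np.stack([centroid_xy[0] - xs, centroid_xy[1] - ys], axis=-1) / Z
    return np.where(mask[..., None], offsets, 0.0)
```

Averaging the per-person maps $T_j^i$ pixel-wise, with $N_v$ counting non-zero vectors, then gives $T_j$.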

After predicting the regression map $\tilde{T}_j$, this paper defines the transformation function $f_j$ for position $p_v$ as:
$$
f_j(p_v)=p_v+Z\tilde{T}_j(p_v)
$$
The score $\varphi(p,g)$ is defined as:
$$
\varphi(p,g)=\sum_i \log P(g_i|h_i) \tag{2} \label{eq2}
$$
After generating $P(g_*|h_*)$, this paper adopts agglomerative clustering to find peaks by clustering the votes in the embedding space $\mathcal{H}$, assuming that the set of joint candidates casting votes into each cluster corresponds to a joint partition $g_i$:
$$
g_i = \{p_v \mid p_v\in\tilde{p}, \tilde{C}_j(p_v)\geq \tau, f_j(p_v)\in\mathcal{C}_i\}
$$

$\mathcal{C} = \{\mathcal{C}_1,\dots,\mathcal{C}_M\}$: the clustering result on the vote set $h=\{h_v \mid h_v=f_j(p_v), \tilde{C}_j(p_v)\geq \tau, p_v\in\tilde{p}\}$, with $\mathcal{C}_i$ the $i$th cluster and $M$ the number of clusters
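The clustering step could be approximated with a minimal single-linkage scheme over the votes; this is a pure-NumPy stand-in, since the paper only states that agglomerative clustering is used, so the linkage choice and `merge_dist` threshold are assumptions:

```python
import numpy as np

def partition_votes(votes, merge_dist=1.0):
    """Single-linkage clustering via union-find: votes closer than
    merge_dist end up in the same cluster; each cluster of voters is
    one joint partition g_i."""
    n = len(votes)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    dists = np.linalg.norm(votes[:, None, :] - votes[None, :, :], axis=-1)
    for v in range(n):
        for w in range(v + 1, n):
            if dists[v, w] <= merge_dist:
                parent[find(v)] = find(w)
    return np.array([find(v) for v in range(n)])  # cluster id per vote
```

Each distinct label indexes one cluster $\mathcal{C}_i$; mapping the labels back to the candidates that cast the votes yields the partitions.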

Local Greedy Inference for Pose Estimation

As shown in the energy function $\ref{eq1}$, the score $\varphi(p,g)$ becomes a constant once the joint partition is obtained according to $\ref{eq2}$. The optimization thus simplifies to:
$$
\tilde{u},\tilde{b} = \arg\min_{u,b}\Big(-\sum_{g_i\in\tilde{g}}\Big(\sum_{p_v\in g_i}\psi(p_v, u_v)+\sum_{p_v,p_w\in g_i}\phi(p_v, u_v, p_w, u_w)\Big)\Big)
$$

$\tilde{g}$: the generated partition set

$\psi(p_v, u_v)$: the confidence score at $p_v$ from the $u_v$th joint detector

$\phi(p_v, u_v, p_w, u_w)$: the similarity score of the votes of $p_v$ and $p_w$ in the embedding space

$$
\phi(p_v, u_v, p_w, u_w) = 1[\tilde{C}_{u_v}(p_v)\geq \tau]\,1[\tilde{C}_{u_w}(p_w)\geq\tau]\exp\{-||h_v-h_w||_2^2\}
$$

$h_v = p_v + Z\tilde{T}_{u_v}(p_v)$

$h_w=p_w+Z\tilde{T}_{u_w}(p_w)$
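The pairwise proximity term follows directly from the definitions above; a minimal sketch with illustrative names:

```python
import numpy as np

def proximity(h_v, h_w, conf_v, conf_w, tau=0.1):
    """Pairwise proximity phi: zero unless both candidates pass the
    confidence threshold, then a Gaussian of the distance between their
    votes h_v, h_w in the embedding space."""
    if conf_v < tau or conf_w < tau:
        return 0.0
    d2 = float(np.sum((np.asarray(h_v) - np.asarray(h_w)) ** 2))
    return float(np.exp(-d2))
```

Candidates whose votes land close together in $\mathcal{H}$ (i.e., that point to the same centroid) thus get proximity near 1.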

Implementation

PPN utilizes the Hourglass module to learn image representations and then splits into two branches:

  1. Produce dense regression maps via one $3\times3$ convolution on the feature maps followed by one $1\times1$ convolution
  2. Produce joint detection confidence maps

Losses to learn joint detection and dense regression:
$$
\begin{gather}
L_{joint}^t\triangleq\sum_j\sum_v||\tilde{C}_j^t(p_v)-C_j(p_v)||_2^2\\
L_{regression}^t\triangleq\sum_j\sum_v||\tilde{T}_j^t(p_v)-T_j(p_v)||_2^2
\end{gather}
$$

$\tilde{C}_j^t,\tilde{T}_j^t$: the predicted joint confidence maps and regression maps at the $t$th stage, respectively

The total loss:
$$
L = \sum_{t=1}^T(L_{joint}^t+\alpha L_{regression}^t)
$$

$T = 8$: the number of hourglass modules used in this paper

$\alpha=1$: the weighting factor
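Putting the two losses together across stages, a NumPy sketch of the total training objective (array shapes and names are illustrative):

```python
import numpy as np

def total_loss(pred_confs, target_conf, pred_regs, target_reg, alpha=1.0):
    """Total objective: per-stage joint-detection and regression L2 losses,
    summed over the T hourglass stages (T = 8 in the paper)."""
    loss = 0.0
    for c_t, t_t in zip(pred_confs, pred_regs):  # one entry per stage t
        loss += np.sum((c_t - target_conf) ** 2)         # L_joint^t
        loss += alpha * np.sum((t_t - target_reg) ** 2)  # L_regression^t
    return float(loss)
```

Supervising every intermediate stage against the same targets is the standard stacked-hourglass intermediate-supervision scheme.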