Introduction
Human parsing and pose estimation are two challenging tasks in human body configuration analysis. They are highly correlated and can provide beneficial clues for each other: human pose offers structure information for body part segmentation and labeling, while human parsing facilitates localizing body joints.
However, existing methods usually train task-specific models separately and only leverage the guidance information from the other task for post-processing.
This paper proposes a novel Mutual Learning to Adapt (MuLA) model to exploit mutual guidance information between human parsing and pose estimation in both the training and inference phases. The model:
- Introduces a learning-to-adapt mechanism where guidance information from one task is transferred to modify the model parameters of the other, parallel task
- Recurrently performs model adaptation, continuously refining both models
Model
Given the notations:
$I\in R^{H\times W\times 3}$: an RGB image with height H and width W
$S=\{s_i\}_{i=1}^{H\times W}$: the human parsing result of $I$, where $s_i\in\{0,\dots,P\}$ is the semantic part label of the $i$th pixel and $P$ is the total number of semantic part categories (0 represents the background category)
$J = \{(x_i, y_i)\}_{i=1}^N$: the body joint locations of the human instance in $I$, where $(x_i, y_i)$ represents the spatial coordinates of the $i$th body joint and $N$ is the number of joint categories.
The proposed MuLA aims at simultaneously predicting the human parsing result $S$ and pose $J$ by fully exploiting their mutual benefits. It can be formulated as the following recurrent learning process:
$$
\begin{gather}
S^{(t)} = g_{[\psi^{(t)}, \psi_*^{(t)}]}(F_S^{(t)}),\ \text{where}\ \psi_*^{(t)} = h'(F_J^{(t)},\hat{J}) \\
J^{(t)} = g_{[\phi^{(t)}, \phi_*^{(t)}]}(F_J^{(t)}),\ \text{where}\ \phi_*^{(t)} = g'(F_S^{(t)},\hat{S}) \tag{1} \label{eq1}
\end{gather}
$$
$t$: the iteration index
$g_{[\psi^{(t)}, \psi_*^{(t)}]}(\cdot), g_{[\phi^{(t)}, \phi_*^{(t)}]}(\cdot)$: the parsing and pose models, respectively
$\psi_*^{(t)}, \phi_*^{(t)}$: adaptive parameters predicted from the other task's guidance
$F_S^{(t)}, F_J^{(t)}$: the extracted features for parsing and pose prediction, with $F_S^{(1)}=F_J^{(1)}= I$
$h'(\cdot, \cdot), g'(\cdot, \cdot)$: the adapting functions
$\hat{S},\hat{J}$: the parsing and pose annotations for I
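To make the recurrence in Eqn. (1) concrete, below is a minimal PyTorch-style sketch of one unrolled MuLA stage. All module names, channel sizes, and layer depths are illustrative placeholders rather than the paper's implementation; in particular, the adapters here are simplified to plain convolutions on the other task's features, while the paper's dynamic-kernel adaptation is sketched in the mutual adaptation section below.

```python
import torch
import torch.nn as nn

class MuLAStage(nn.Module):
    """One unrolled MuLA iteration (sketch). Channel sizes are placeholders,
    and the adapters are simplified to fixed 1x1 convolutions; the paper's
    dynamic-kernel adaptation is sketched separately below."""
    def __init__(self, c=64, parts=20, joints=16):
        super().__init__()
        self.enc_S = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())  # E^S
        self.enc_J = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())  # E^J
        self.adapt_S = nn.Conv2d(c, c, 1)        # stand-in for pose -> parsing guidance
        self.adapt_J = nn.Conv2d(c, c, 1)        # stand-in for parsing -> pose guidance
        self.cls_S = nn.Conv2d(c, parts + 1, 1)  # C^S: semantic part probability maps
        self.cls_J = nn.Conv2d(c, joints, 1)     # C^J: joint confidence maps
        self.map_S = nn.Conv2d(parts + 1, c, 1)  # maps S^(t) back to feature space
        self.map_J = nn.Conv2d(joints, c, 1)     # maps J^(t) back to feature space
        self.proj_S = nn.Conv2d(c, c, 1)         # maps tailored parsing features forward
        self.proj_J = nn.Conv2d(c, c, 1)         # maps tailored pose features forward

    def forward(self, F_S, F_J):
        R_S, R_J = self.enc_S(F_S), self.enc_J(F_J)    # preliminary representations
        bar_S = R_S + self.adapt_S(R_J)                # parsing features tailored by pose
        bar_J = R_J + self.adapt_J(R_S)                # pose features tailored by parsing
        S, J = self.cls_S(bar_S), self.cls_J(bar_J)    # stage-t predictions
        F_S_next = self.proj_S(bar_S) + self.map_S(S)  # input for stage t+1 (parsing)
        F_J_next = self.proj_J(bar_J) + self.map_J(J)  # input for stage t+1 (pose)
        return S, J, F_S_next, F_J_next

# Unrolled for T = 2 stages; a stem lifts the image to F_S^(1) = F_J^(1):
stem = nn.Conv2d(3, 64, 3, padding=1)
stages = nn.ModuleList([MuLAStage() for _ in range(2)])
F_S = F_J = stem(torch.randn(1, 3, 256, 256))
for stage in stages:
    S, J, F_S, F_J = stage(F_S, F_J)
```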
The proposed MuLA model includes three components:
- A representation encoding module
- A mutual adaptation module
- A classification module
Representation Encoding Module
The representation encoding module consists of two encoders $E_{\psi_e^{(t)}}^S(\cdot)$ and $E_{\phi_e^{(t)}}^J(\cdot)$ that transform $F_S^{(t)}$ and $F_J^{(t)}$ into high-level preliminary representations $R_S^{(t)}$ and $R_J^{(t)}$ for human parsing and pose estimation, respectively:
$$
\begin{gather}
R_S^{(t)} = E_{\psi_e^{(t)}}^S(F_S^{(t)}), \\
R_J^{(t)} = E_{\phi_e^{(t)}}^J(F_J^{(t)})
\end{gather}
$$
This paper implements the encoders with two architectures:
- VGG16-FCN: a fully convolutional version of the 16-layer VGG network, with the total stride reduced from 32 to 8 by removing the last two max-pooling layers (a stride-reduction sketch follows this list)
- Hourglass network: with the output layer re-targeted to semantic part labeling instead of joint confidence regression
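As a rough illustration of the VGG16-FCN stride change, the sketch below drops the last two max-pooling layers from torchvision's VGG-16 trunk; whether the paper additionally dilates the later convolutions to preserve receptive fields is not covered in these notes.

```python
import torch.nn as nn
from torchvision.models import vgg16

def vgg16_fcn_stride8():
    """Sketch: fully convolutional VGG-16 trunk with total stride 8.
    Drops the classifier head and the last two max-pooling layers,
    so only the first three pools (stride 2 each) remain: 2^3 = 8."""
    layers = list(vgg16(weights=None).features)
    pools = [i for i, m in enumerate(layers) if isinstance(m, nn.MaxPool2d)]
    drop = set(pools[-2:])  # indices of the last two max-pooling layers
    return nn.Sequential(*(m for i, m in enumerate(layers) if i not in drop))
```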
Mutual Adaptation Module
The mutual adaptation module includes two adapters $A_{\psi_\alpha^{(t)}}(\cdot)$ and $A_{\phi_\alpha^{(t)}}(\cdot)$ that predict adaptive parameters $\psi_*^{(t)}\in R^{h\times h\times c}$ and $\phi_*^{(t)}\in R^{h\times h\times c}$.
$h$: the kernel size
$c= c_i\times c_o$: the number of kernels, with $c_i$ and $c_o$ the input and output channel numbers, respectively
$$
\begin{gather}
\psi_*^{(t)} = h'(F_J^{(t)},\hat{J}) := A_{\psi_\alpha^{(t)}}(E_{\phi_e^{(t)}}^J(F_J^{(t)})), \\
\phi_*^{(t)} = g'(F_S^{(t)},\hat{S}) := A_{\phi_\alpha^{(t)}}(E_{\psi_e^{(t)}}^S(F_S^{(t)}))
\end{gather}
$$
To reduce the number of parameters the adapters $A_{\psi_\alpha^{(t)}}(\cdot)$ and $A_{\phi_\alpha^{(t)}}(\cdot)$ need to predict, this paper decomposes $\psi_*^{(t)}$ and $\phi_*^{(t)}$ via:
$$
\psi_*^{(t)} = U_S^{(t)} \otimes \tilde{\psi}_*^{(t)} \otimes_c V_S^{(t)}\ \text{and}\ \phi_*^{(t)} = U_J^{(t)} \otimes \tilde{\phi}_*^{(t)} \otimes_c V_J^{(t)}
$$
$\otimes$: convolution operation
$\otimes_c$: channel-wise convolution operation
$U_S^{(t)}/U_J^{(t)}, V_S^{(t)}/V_J^{(t)}$: parameter bases implemented with $1\times 1$ convolutions
$\tilde{\psi}_*^{(t)}\in R^{h\times h\times c_i}, \tilde{\phi}_*^{(t)}\in R^{h\times h\times c_i}$: the actual parameters predicted by $A_{\psi_\alpha^{(t)}}(\cdot)$ and $A_{\phi_\alpha^{(t)}}(\cdot)$
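This decomposition shrinks what each adapter must predict from $h\times h\times c_i\times c_o$ values to $h\times h\times c_i$. As an illustration with made-up sizes (not the paper's settings): for $h=3$ and $c_i=c_o=256$, direct prediction would require $3\cdot 3\cdot 256\cdot 256\approx 590\mathrm{K}$ values, while the decomposed form needs only $3\cdot 3\cdot 256 = 2304$.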
Then, $\psi_*^{(t)}$ and $\phi_*^{(t)}$ are applied to $R_S^{(t)}$ and $R_J^{(t)}$ as dynamic convolution kernels to learn complementary representations $R_{S*}^{(t)}$ and $R_{J*}^{(t)}$:
$$
\begin{gather}
R_{S*}^{(t)} = \psi_*^{(t)} \otimes R_S^{(t)} = U_S^{(t)} \otimes \tilde{\psi}_*^{(t)} \otimes_c V_S^{(t)} \otimes R_S^{(t)}, \\
R_{J*}^{(t)} = \phi_*^{(t)} \otimes R_J^{(t)} = U_J^{(t)} \otimes \tilde{\phi}_*^{(t)} \otimes_c V_J^{(t)} \otimes R_J^{(t)}
\end{gather}
$$
Finally, the complementary and preliminary representations are fused via addition to generate the tailored representations $\bar R_S^{(t)}$ and $\bar R_J^{(t)}$:
$$
\bar R_S^{(t)} = R_S^{(t)} + R_{S*}^{(t)}\ \text{and}\ \bar R_J^{(t)}=R_J^{(t)} + R_{J*}^{(t)}
$$
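Below is a minimal sketch of this decomposed dynamic convolution plus the fusion step, written for the parsing branch ($\bar R_S^{(t)} = R_S^{(t)} + U_S \otimes \tilde{\psi}_* \otimes_c V_S \otimes R_S^{(t)}$). The adapter head here pools the other task's features globally to predict spatially shared depthwise kernels; that head design, like the channel sizes, is my assumption rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedDynamicAdapt(nn.Module):
    """Sketch of R_bar = R + U (x) psi_tilde (x)_c V (x) R.
    The adapter predicts only the depthwise kernels psi_tilde in R^{h x h x c_i};
    U and V are ordinary learned 1x1 convolutions (the parameter bases)."""
    def __init__(self, c=64, c_i=16, h=3):
        super().__init__()
        self.h, self.c_i = h, c_i
        self.V = nn.Conv2d(c, c_i, 1)   # parameter basis V (1x1 conv)
        self.U = nn.Conv2d(c_i, c, 1)   # parameter basis U (1x1 conv)
        # adapter head (assumed design): global context -> h*h kernel per channel
        self.adapter = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c_i * h * h, 1),
        )

    def forward(self, R, R_other):
        B = R.size(0)
        x = self.V(R)                                    # (B, c_i, H, W)
        kernels = self.adapter(R_other)                  # (B, c_i*h*h, 1, 1)
        kernels = kernels.view(B * self.c_i, 1, self.h, self.h)
        # channel-wise (depthwise) dynamic conv, one kernel set per sample:
        x = x.reshape(1, B * self.c_i, *x.shape[2:])     # fold batch into channels
        x = F.conv2d(x, kernels, padding=self.h // 2, groups=B * self.c_i)
        x = x.reshape(B, self.c_i, *x.shape[2:])
        return R + self.U(x)                             # fuse: R_bar = R + R*

# Example: tailor parsing features using pose features (shapes illustrative)
adapt = DecomposedDynamicAdapt()
bar_R_S = adapt(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```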
Classification Module
The classification module uses two classifiers $C_{\psi_w^{(t)}}^S(\cdot)$ and $C_{\phi_w^{(t)}}^J(\cdot)$, implemented as $1\times1$ convolutional layers, to predict semantic part probability maps $S^{(t)}$ and body joint confidence maps $J^{(t)}$.
$[\psi_e^{(t)}, \psi_w^{(t)}], [\phi_e^{(t)},\phi_w^{(t)}]$: together instantiate parameters $\psi^{(t)}$ and $\phi^{(t)}$ in Eqn. $\ref{eq1}$, respectively.
The inputs $F_S^{(t+1)}$ and $F_J^{(t+1)}$ for the next stage are:
$$
\begin{gather}
F_S^{(t+1)} = M_{\psi_m^{(t)}}^S(E_{[\psi_e^{(t)},\psi_*^{(t)}]}(F_S^{(t)}),\ S^{(t)}), \\
F_J^{(t+1)} = M_{\phi_m^{(t)}}^J(E_{[\phi_e^{(t)},\phi_*^{(t)}]}(F_J^{(t)}),\ J^{(t)})
\end{gather}
$$
$M_{\psi_m^{(t)}}^S(\cdot, \cdot), M_{\phi_m^{(t)}}^J(\cdot, \cdot)$: two mapping modules
$E_{[\psi_e^{(t)},\psi_*^{(t)}]}, E_{[\phi_e^{(t)},\phi_*^{(t)}]}$: the derived adaptive encoders
Concretely, this paper applies $1\times1$ convolutions to $S^{(t)}$ and $J^{(t)}$ to map the predictions into the representation space, applies $1\times1$ convolutions to $\bar R_S^{(t)}$ and $\bar R_J^{(t)}$ to map the highest-level representations of the previous stage into preliminary representations for the following stage, and integrates the two via addition to obtain $F_S^{(t+1)}$ and $F_J^{(t+1)}$ (see the sketch below).
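Under that description, one mapping module reduces to two $1\times1$ convolutions and an addition; a minimal sketch (channel sizes are placeholders):

```python
import torch.nn as nn

class MappingModule(nn.Module):
    """Sketch of M^S / M^J: a 1x1 conv maps the stage prediction back into
    the representation space, another 1x1 conv maps the tailored features
    forward, and the two are summed to form F^(t+1)."""
    def __init__(self, pred_ch, feat_ch=64):
        super().__init__()
        self.from_pred = nn.Conv2d(pred_ch, feat_ch, 1)  # on S^(t) or J^(t)
        self.from_feat = nn.Conv2d(feat_ch, feat_ch, 1)  # on bar R^(t)

    def forward(self, bar_R, pred):
        return self.from_feat(bar_R) + self.from_pred(pred)  # next-stage input
```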
Training and Inference
MuLA is trained end-to-end by minimizing the joint loss accumulated over all $T$ stages:
$$
\mathcal{L} = \sum_{t=1}^T\Big(\mathcal{L}^S\big(C_{\psi_w^{(t)}}^S(E_{[\psi_e^{(t)},\psi_*^{(t)}]}(F_S^{(t)})),\hat{S}\big)+\beta\,\mathcal{L}^J\big(C_{\phi_w^{(t)}}^J(E_{[\phi_e^{(t)},\phi_*^{(t)}]}(F_J^{(t)})),\hat{J}\big)\Big)
$$
$T$: the total number of iterations in MuLA
$\mathcal{L}^S(\cdot, \cdot)$: the loss function for human parsing, a per-pixel cross-entropy loss
$\mathcal{L}^J(\cdot, \cdot)$: the loss function for human pose estimation, a mean squared error (MSE) loss on confidence maps
$\beta$: a trade-off weight
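A minimal sketch of this objective, assuming parsing targets come as per-pixel class indices and pose targets as ground-truth confidence maps (the exact target encodings and the $\beta$ value are placeholders):

```python
import torch.nn.functional as F

def mula_loss(S_preds, J_preds, S_gt, J_gt, beta=1.0):
    """Joint MuLA loss (sketch): sum over the T stage outputs of
    per-pixel cross-entropy (parsing) + beta * MSE (pose heatmaps)."""
    loss = 0.0
    for S_t, J_t in zip(S_preds, J_preds):  # one (S^(t), J^(t)) pair per stage
        loss = loss + F.cross_entropy(S_t, S_gt)    # L^S: S_t is (B, P+1, H, W) logits
        loss = loss + beta * F.mse_loss(J_t, J_gt)  # L^J: J_t is (B, N, H, W) maps
    return loss
```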
Human parsing
The category with the maximum probability at each position of $S^{(T)}$ is output as the semantic part label.
Pose estimation
Single-person case
The position with the maximum confidence in each confidence map of $J^{(T)}$ is taken as the location of the corresponding body joint type.
Multi-person case
Non-Maximum Suppression (NMS) is performed on each confidence map in $J^{(T)}$ to generate joint candidates (see the sketch below).
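A sketch of these inference rules; the max-pooling NMS window and confidence threshold below are illustrative choices, not values from the paper:

```python
import torch
import torch.nn.functional as F

def parsing_labels(S_T):
    """Per-pixel argmax over part categories: (B, P+1, H, W) -> (B, H, W)."""
    return S_T.argmax(dim=1)

def single_person_joints(J_T):
    """Peak of each confidence map: (B, N, H, W) -> (B, N, 2) as (x, y)."""
    B, N, H, W = J_T.shape
    idx = J_T.flatten(2).argmax(dim=2)      # flat index of each map's maximum
    return torch.stack((idx % W, idx // W), dim=2)

def joint_candidates(J_T, thresh=0.1, window=5):
    """Multi-person NMS via max-pooling; window/threshold are illustrative."""
    peaks = F.max_pool2d(J_T, window, stride=1, padding=window // 2)
    keep = (J_T == peaks) & (J_T > thresh)  # local maxima above threshold
    return keep.nonzero()                   # rows of (batch, joint, y, x) indices
```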