CVPR2018_相关文章文类


关键点检测

  • Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer

|Abstract| | - | | Human body part parsing, or human semantic part segmentation, is fundamental to many computer vision tasks. In conventional semantic segmentation methods, the ground truth segmentations are provided, and fully convolutional networks (FCN) are trained in an end-to-end scheme. Although these methods have demonstrated impressive results, their performance highly depends on the quantity and quality of training data. In this paper, we present a novel method to generate synthetic human part segmentation data using easily-obtained human keypoint annotations. Our key idea is to exploit the anatomical similarity among human to transfer the parsing results of a person to another person with similar pose. Using these estimated results as additional training data, our semi-supervised model outperforms its strong-supervised counterpart by 6 mIOU on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human parsing results. Our approach is general and can be readily extended to other object/animal parsing task assuming that their anatomical similarity can be annotated by keypoints. The proposed model and accompanying source code will be made publicly available. |

  • Detect-and-Track: Efficient Pose Estimation in Videos

|Abstract| | - | | This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two-stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint predictions linked over the entire video. For frame-level pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D extension of this model, which leverages temporal information over small clips to generate more robust frame predictions. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our approach achieves an accuracy of 55.2% on the validation and 51.8% on the test set using the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state of the art performance on the ICCV 2017 PoseTrack keypoint tracking challenge. |

  • Human Pose Estimation With Parsing Induced Learner

|Abstract| | - | | Human pose estimation still faces various difficulties in challenging scenarios. Human parsing, as a closely related task, can provide valuable cues for better pose estimation, which however has not been fully exploited. In this paper, we propose a novel Parsing Induced Learner to exploit parsing information to effectively assist pose estimation by learning to fast adapt the base pose estimation model. The proposed Parsing Induced Learner is composed of a parsing encoder and a pose model parameter adapter, which together learn to predict dynamic parameters of the pose model to extract complementary useful features for more accurate pose estimation. Comprehensive experiments on benchmarks LIP and extended PASCAL-Person-Part show that the proposed Parsing Induced Learner can improve performance of both single- and multi-person pose estimation to new state-of-the-art. Cross-dataset experiments also show that the proposed Parsing Induced Learner from LIP dataset can accelerate learning of a human pose estimation model on MPII benchmark in addition to achieving outperforming performance. |

  • The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation

|Abstract| | - | | Traditional methods of motion segmentation use powerful geometric constraints to understand motion, but fail to leverage the semantics of high-level image understanding. Modern CNN methods of motion analysis, on the other hand, excel at identifying well-known structures, but may not precisely characterize well-known geometric constraints. In this work, we build a new statistical model of rigid motion flow based on classical perspective projection constraints. We then combine piecewise rigid motions into complex deformable and articulated objects, guided by semantic segmentation from CNNs and a second object-level statistical model. This combination of classical geometric knowledge combined with the pattern recognition abilities of CNNs yields excellent performance on a wide range of motion segmentation benchmarks, from complex geometric scenes to camouflaged animals. |

  • 3D Human Pose Estimation in the Wild by Adversarial Learning
  • PoseTrack: A Benchmark for Human Pose Estimation and Tracking
  • 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning
  • V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map
  • RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews From Unsupervised Viewpoints
  • 3D Pose Estimation and 3D Model Retrieval for Objects in the Wild
  • Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals
  • Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation
  • Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes - The Importance of Multiple Scene Constraints

实例分割

  • Deep Extreme Cut: From Extreme Points to Object Segmentation

|Abstract| | - | | This paper explores the use of extreme points in an object (left-most, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos. We do so by adding an extra channel to the image in the input of a convolutional neural network (CNN), which contains a Gaussian centered in each of the extreme points. The CNN learns to transform this information into a segmentation of an object that matches those extreme points. We demonstrate the usefulness of this approach for guided segmentation (grabcut-style), interactive segmentation, video object segmentation, and dense segmentation annotation. We show that we obtain the most precise results to date, also with less user input, in an extensive and varied selection of benchmarks and datasets. All our models and code are publicly available on http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr. |

  • Learning to Segment Object Candidates
  • Learning to Refine Object Segments
  • MaskLab: Instance Segmentation by Refining Object Detection With Semantic and Direction Features
  • Weakly Supervised Instance Segmentation Using Class Peak Response
  • Motion Segmentation by Exploiting Complementary Geometric Models
  • SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation

相关资料