To better equip the research community in evaluating and improving robotic perception solutions for warehouse picking challenges, the PRACSYS lab at Rutgers University provides a new, rich dataset and software for utilizing it. The dataset contains 10,368 registered depth and RGB images, complete with hand-annotated 6DOF poses for 24 of the APC objects (mead_index_cards excluded). Also provided are 3D mesh models of the 25 APC objects, which may be used for training recognition algorithms.

Perception for Warehouse Picking

There is significant interest in warehouse automation, which involves picking and placing products stored in shelving units. This interest is exemplified by the first Amazon Picking Challenge (APC), which brought together multiple academic and industrial teams from around the world. The robotic challenge involved perception, motion planning and grasping of 25 different objects placed in a semi-structured way inside the bins of an Amazon-Kiva Pod. Solving such problems can significantly alter the logistics of distributing products. Frequently, manipulation research on pick-and-place tasks has focused on flat surfaces, such as tabletops. These are relatively simpler problems, which do not involve many of the issues that often arise in warehouse automation, where the presence of shelves plays a critical role.

Amazon Picking Challenge Items with Broad Attribute Categorization

Accurate pose estimation is crucial for successfully picking an object inside a shelf. In flexible warehouses, this pose will not be known a priori but must be detected from sensors, especially visual ones. The increasing availability of RGB-D sensors, which can sense both color and depth in a scene, brings the hope that such problems can be solved easily. But warehouse shelves have narrow, dark and obscuring bins that complicate object detection. Clutter can further challenge detection through the presence of multiple objects in the scene. A variety of objects can arise, some of which may be texture-less and not easily identifiable from color, while others are reflective and virtually undetectable by a depth sensor. Furthermore, some popular depth sensors exhibit limits in terms of minimum and maximum sensing range that make it harder for a manipulator to utilize them. Thus, RGB-D-based object detection and pose estimation is an active research area and a critical capability for warehouse automation.

Our Data Collection Hardware Setup

Objects and 3D Mesh Models

The selected objects correspond to those that were used during the first Amazon Picking Challenge (APC), which took place in Seattle during May 2015. The same is true for the shelving unit, which is the one provided by Amazon for the purposes of the competition. The set of APC objects exhibits good variety in terms of characteristics such as size, shape, texture and transparency, and its items are good candidates for objects that need to be transported by robotic units in warehouses. The provided dataset comes together with 3D mesh object models for each of the APC competition objects.

For most objects, 3D CAD scale models were first constructed. Texture was then applied using the open-source MeshLab software. For simple geometric shapes, such as cuboids, this combination of CAD modeling and texturing is sufficient and can yield results of similar quality to more involved techniques. Several more complicated object models were produced using 3D photogrammetric reconstruction from multiple views of a monocular camera.

Dataset Design & Extent

Data collection was performed using a Microsoft Kinect v1 2.5D RGB-D camera mounted securely to the end joint of a Motoman Dual-arm SDA10F robot. Two LED lighting strips were added to the camera so as to control the illumination of the environment across images. The position of the camera was calibrated prior to data collection to ensure accurate transformations between the base of the robot, the camera, and the detected and annotated ground truth poses.

In designing this dataset, the intention was to provide the community with a large-scale representation of the problem of 6DOF pose estimation using current 2.5D RGB-D technology in a cluttered warehouse environment. This involved representing the challenge in such a way that would allow researchers to determine the effect of several parameters, such as clutter and object type, on success ratio and accuracy. Thus, for each object-pose combination we collected data: (1) with only the object of interest occupying the bin, (2) with a single additional item of clutter within the bin, and (3) with two additional items of clutter. In this way, the dataset allows one to parse out the degree to which environmental clutter affects accuracy. The dataset presents these varying clutter scenarios for each item inside each of the 12 bins within the shelving unit.

An Example Shelf Arrangement

Additionally, while each object and its accompanying clutter items are rotated through the bins of the Kiva Pod, the object's pose remains consistent within a given bin. The pose of an object changes, however, each time the object is placed in a different bin. This ensures that the set of chosen poses represents good coverage of likely positions for the objects of interest.

In order to provide better coverage of the scene and the ability to perform pose estimation from multiple vantage points, data were collected from 3 separate positions (referred to, here, as “mapping” positions) located in front of the bin: i) one directly in front of the center of the bin at a distance of 48cm, ii) a second roughly 10cm to the left of the first position, and iii) a third the same distance to the right of the first position. Four 2.5D images were collected at each mapping position. In all, the dataset can be broken down into the following parameters:

• 24 Objects of interest
• 12 Bin locations per object
• 3 Clutter states per bin
• 3 Mapping positions per clutter state
• 4 Frames per mapping position

Considering all these parameters, the dataset ends up corresponding to 10,368 2.5D images from different viewpoints, for different objects and varying amounts of clutter.
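The total image count follows directly from the parameters listed above; a quick sanity check of the arithmetic:

```python
# Dataset parameters, as listed above
objects = 24          # objects of interest (mead_index_cards excluded)
bins = 12             # bin locations per object
clutter_states = 3    # 1, 2, or 3 items in the bin of interest
map_positions = 3     # camera mapping positions per clutter state
frames = 4            # frames captured per mapping position

total_images = objects * bins * clutter_states * map_positions * frames
print(total_images)  # 10368
```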

For each image, there is a YAML file available containing the transformation matrices (rotation, translation) between: (1) the base of the robot and the camera, (2) the camera and the ground truth pose of the object, and (3) the base of the robot and ground truth pose of the object.
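As a sketch of how these three transformations relate (variable names here are illustrative, not the actual YAML keys): the base-to-object transform (3) is the composition of base-to-camera (1) and camera-to-object (2), so any one of the three can be cross-checked against the other two. With 4x4 homogeneous matrices:

```python
def matmul4(A, B):
    """Multiply two 4x4 homogeneous transformation matrices (row-major nested lists)."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Illustrative values only (identity rotations, pure translations); the real
# matrices come from the per-image YAML files.
base_T_camera = [[1, 0, 0, 0.10],
                 [0, 1, 0, 0.00],
                 [0, 0, 1, 0.50],
                 [0, 0, 0, 1]]
camera_T_object = [[1, 0, 0, 0.00],
                   [0, 1, 0, 0.05],
                   [0, 0, 1, 0.25],
                   [0, 0, 0, 1]]

# (3) base->object equals (1) composed with (2)
base_T_object = matmul4(base_T_camera, camera_T_object)
print(base_T_object[0][3], base_T_object[1][3], base_T_object[2][3])  # 0.1 0.05 0.75
```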

Shortly, we will also be releasing a small ROS package which we hope will help the robotics community work with this dataset and evaluate their own algorithms against it.


Sample Dataset: Sample Dataset (~200MB)

If you’d like to preview our dataset in order to ensure that its use is right for your project, please consider downloading the sample images above. These images come from three (3) separate static scenes of the bin environment. Each scene consists of twelve objects (one per bin), captured from 3 viewpoints. These three particular scenes feature the SAME set of object poses, with varying amounts of environmental clutter.

Full Dataset: Full Dataset (~5.0GB)

This is the entire 10,368 RGB-D image and ground truth pose dataset. All frames in this file come organized together in a single directory, so that algorithms can easily be evaluated on the entire dataset at once. In a separate directory (work_orders/) we include the original “work orders” for every scene contained in the dataset. If desired, these files can be used to piece back together the full original scene (12 bin-object combinations + specific clutter items) from which the frames originated.

Object Models: Object Models (~80MB)

Full 3D object model meshes are provided in this tarball for all 25 Amazon Picking Challenge objects. We provide models in OBJ format for primary use in detection, though simplified meshes are also included in STL format.

Camera Calibrations File: Calibration File

The calibration file contains OpenCV Matrices for rgb_calibration and ir_calibration.

Instructions for Use

We provide depth and RGB images in PNG format, which are readily imported into MATLAB, R, and other major data analysis platforms. Transformation matrices, including that of the ground truth object pose, are collected into a single YAML file corresponding to each individual RGB-D image pair.

We provide, and recommend using, the Rutgers 3D object models, as the ground truth annotations relate directly to these models. If you would like to use your own set of models, or replace some of the available models with others, it is important that the principal axes and origin of your models mimic the axes used in our model of that object.

Naming Conventions

Files within the dataset follow a strict naming convention in order to be easily parsed based on researchers’ needs. The naming convention is as follows:


obj_name: the name of the object, using APC naming
f_type: file type; any one of {image, depth, pose}
bin: {A-L} corresponding to the top-to-bottom left-to-right locations of the bins in the shelf
clutter: {1-3} indicates the number of items in the bin of interest (including target object)
map_pos: camera mapping position; any of {1-3}
frame: any of {1-4}
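Assuming the fields are concatenated in the order listed and separated by underscores (an assumption to verify against the actual files; the example filename below is hypothetical), a parser could split the fields from the right, since object names themselves contain underscores:

```python
FIELDS = ("obj_name", "f_type", "bin", "clutter", "map_pos", "frame")

def parse_name(stem):
    """Split a dataset file stem into its six fields.

    Object names may contain underscores (e.g. crayola_64_ct), so we split
    from the right: the last five underscore-separated tokens are f_type,
    bin, clutter, map_pos and frame, and the remainder is obj_name.
    """
    parts = stem.rsplit("_", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected stem: {stem!r}")
    return dict(zip(FIELDS, parts))

# Hypothetical stem following the assumed convention
info = parse_name("crayola_64_ct_image_A_1_2_3")
print(info["obj_name"], info["f_type"], info["bin"])  # crayola_64_ct image A
```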

External Links for More Detail

Details on the Amazon Picking Challenge including naming conventions for objects
MeshLab: Open source tool for viewing and editing 3D mesh objects
ArXiv submission with additional details about our dataset


If you have questions regarding the dataset or would like to contact the authors, please e-mail us:


The R U PRACSYS team would like to thank the sponsors of Rutgers University’s participation in the Amazon Picking Challenge: