Robot grasping – why depth helps
Getting to grips with grasping
We often take for granted our ability to interact with the world around us – to pick up objects, turn them around, throw, drop, hold or put down, regardless of what that object is. We automatically adjust our grasp for delicate objects and for larger or smaller things. If you take a look at the objects around you right now, consider the variety of sizes, shapes and materials. A robot trying to pick up that variety of objects needs to locate them, evaluate their size, shape, material and more, and then turn that information into movements and commands for the gripper. Even just continuing to hold an object once it has been picked up can be challenging – what if the object moves? If you’ve ever tried to win a stuffed toy in a ‘claw’ game machine in an arcade, this is probably a problem you’re familiar with. Aiming the robotic arm in the right place is only part of the problem.
Picking up objects
In this blog post from the Berkeley Artificial Intelligence Research (BAIR) group, the researchers discuss using RGB systems to pick up objects – while it’s possible, a 2D image based system requires “many months of training time with robots physically executing grasps”. That physical training time is expensive, and the resulting data still needs to be accurately labeled, so it is not an ideal solution. The team from BAIR is developing an ongoing research project called “Dexterity Network” (Dex-Net), which attempts to train robots to grasp objects using synthetic data sets, analytic robustness models, stochastic sampling and deep learning techniques. The overall goal is to train a deep network that can predict whether a grasp attempt on an object will succeed. Their system uses depth images from an RGB-D sensor such as the Intel® RealSense™ D400 series depth cameras. A grasp is specified as the planar position, angle and depth of a gripper relative to the RGB-D sensor, and the network is used to work out which grasps will successfully pick the object up.
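To make the idea concrete, here is a minimal, hypothetical sketch of how such a planar grasp might be represented and ranked by a grasp-quality model. The class and function names, and the toy stand-in for the trained network, are illustrative assumptions, not the actual Dex-Net code.

```python
import numpy as np

class GraspCandidate:
    """A parallel-jaw grasp expressed relative to the depth camera."""
    def __init__(self, center_px, angle_rad, depth_m):
        self.center_px = center_px   # (row, col) pixel location of the grasp center
        self.angle_rad = angle_rad   # in-plane rotation of the gripper
        self.depth_m = depth_m       # gripper depth along the camera axis

def extract_patch(depth_image, center_px, size=32):
    """Crop a square depth patch around the grasp point (clamped at image edges)."""
    r, c = center_px
    half = size // 2
    patch = np.zeros((size, size), dtype=np.float32)
    r0, c0 = max(r - half, 0), max(c - half, 0)
    r1, c1 = min(r + half, depth_image.shape[0]), min(c + half, depth_image.shape[1])
    patch[:r1 - r0, :c1 - c0] = depth_image[r0:r1, c0:c1]
    return patch

def rank_grasps(depth_image, candidates, quality_fn):
    """Return the candidate that the quality model scores highest."""
    scores = [quality_fn(extract_patch(depth_image, g.center_px), g) for g in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Stand-in for a trained grasp-quality network: for demonstration only,
# prefer grasp points that sit closer to the camera (on top of the object).
toy_quality = lambda patch, g: -float(patch.mean())

depth = np.random.uniform(0.4, 0.8, size=(480, 640)).astype(np.float32)
grasps = [GraspCandidate((240, 320), 0.0, 0.55), GraspCandidate((100, 100), 1.2, 0.70)]
best, score = rank_grasps(depth, grasps, toy_quality)
print(best.center_px, score)
```

In a real system the `toy_quality` lambda would be replaced by the trained deep network, but the surrounding loop – crop a depth patch per candidate, score it, execute the highest-scoring grasp – is the essential shape of the approach.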
In order to train such a system, a lot of data needs to be generated. In the case of Dex-Net, a large number of digital models of objects from a variety of sources are used. The 3D meshes are then subjected to a variety of parallel-jaw grasps at different angles and in different places. Each grasp is evaluated for its likely success by looking at the pose, friction, mass and external forces such as gravity acting on the object. Each object and grasp combination also has a simulated depth map generated for it, so that once the network has been trained on this variety of grips, it can compare a real depth map from the camera mounted on the robot arm against what it has learned, and choose the grip with the highest chance of success. This avoids having to manually train the robot with physically executed grips – an approach that is faster, cheaper and more robust across a large variety of objects.
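The data-generation loop described above might be structured roughly as follows. This is a hedged sketch: in the real Dex-Net pipeline these stages use 3D meshes, an analytic robustness metric and a depth renderer, whereas here each stage is a trivial stand-in so the structure runs end to end.

```python
import numpy as np

def sample_parallel_jaw_grasps(mesh_id, n=100, rng=None):
    """Stand-in: sample grasp poses (x, y, z, angle) on the object surface."""
    rng = rng or np.random.default_rng(mesh_id)
    return rng.uniform(-1.0, 1.0, size=(n, 4))

def robustness_label(grasp, friction=0.5):
    """Stand-in for the analytic model: 1 if the grasp is predicted to hold
    under pose/friction uncertainty and gravity, else 0."""
    return float(np.linalg.norm(grasp[:3]) < friction + 0.3)

def render_depth(mesh_id, grasp_index):
    """Stand-in for rendering a simulated depth image of the object and grasp."""
    rng = np.random.default_rng(mesh_id * 1000 + grasp_index)
    return rng.uniform(0.4, 0.8, size=(32, 32)).astype(np.float32)

dataset = []
for mesh_id in range(3):                               # thousands of meshes in practice
    for i, grasp in enumerate(sample_parallel_jaw_grasps(mesh_id, n=5)):
        dataset.append({
            "depth_patch": render_depth(mesh_id, i),   # what the network sees
            "grasp": grasp,                            # the candidate being evaluated
            "success": robustness_label(grasp),        # the training label
        })

print(len(dataset), "synthetic training examples")
```

The point of the structure is that every training example pairs a simulated depth observation with a grasp and an analytically computed success label, so no physical robot time is needed to build the dataset.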
Making the bed
In our future automated homes, perhaps one day even the most complex of tasks will be performed by robots. A task that is suited to the state of robotics today would be bed making – unlike something more complex like cooking, bed making isn’t time sensitive, and any errors are forgivable.
The bed making task in BAIR’s research was defined as one in which the robot identifies the corners of the blanket in order to pull them to the correct locations. Initially, they used white blankets with marked red corners so the network could easily identify the corners, training on the depth camera data in combination with the RGB sensor data. From there, a second deep network was trained using the depth data only, in the hope that this would allow the system to generalize to all blanket colors and types.
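As an illustration of why the marked red corners make supervision easy, here is a minimal sketch of one way such corners could be located automatically in the RGB frame to produce labels, assuming simple color thresholding. This is an assumption for illustration, not necessarily how BAIR generated their labels.

```python
import numpy as np

def find_red_corner(rgb, red_min=180, other_max=80):
    """Return (row, col) of the centroid of strongly red pixels, or None if absent."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mask = (r > red_min) & (g < other_max) & (b < other_max)
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return int(rows.mean()), int(cols.mean())

# The (row, col) label can then be paired with the aligned depth frame, so the
# second network learns to predict the corner location from depth alone.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[300:310, 500:510] = (255, 20, 20)        # synthetic red marker
print(find_red_corner(frame))                  # -> (304, 504)
```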
In this situation, an additional advantage of using a depth camera is that it allows the bed surface to be separated from the background. By assuming that the corners of the blanket will be found in a specific area on or around the surface of the bed, anything else within view of the camera can be ignored: a virtual bounding box is created around the bed, and everything outside it is blacked out.
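A minimal sketch of that masking step is shown below, assuming a depth frame in metres and a hand-measured bounding box around the bed. The box coordinates and the depth threshold are illustrative values, not taken from BAIR’s code.

```python
import numpy as np

def mask_to_bed(depth, bed_box, max_depth_m=1.5):
    """Zero out every pixel outside the bed's bounding box or beyond the bed surface."""
    r0, r1, c0, c1 = bed_box
    masked = np.zeros_like(depth)
    region = depth[r0:r1, c0:c1]
    # Keep only points close enough to belong to the bed surface and blanket;
    # the floor and far walls fall outside this depth range and are discarded.
    masked[r0:r1, c0:c1] = np.where(region < max_depth_m, region, 0.0)
    return masked

depth_frame = np.random.uniform(0.5, 3.0, size=(480, 640)).astype(np.float32)
bed_only = mask_to_bed(depth_frame, bed_box=(100, 400, 150, 500))
```

Because the mask relies only on geometry, it works regardless of the blanket’s color or pattern – exactly the property the depth-only network is meant to exploit.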
The testing showed that despite being trained on white blankets with red corners, the depth-enabled system could successfully generalize to other blanket colors, whereas using the same data with only the RGB sensor could not. This suggests that for any commercial solution, depth is a vital part of the combination.
Read the original post in depth here.