Over the years, many traits have tried to capture the psyche of human beings, but none has done a better job than our knack for improving at a consistent pace. That drive has enabled the world to clock some huge milestones, with technology emerging as a major member of the stated group. The reason we hold technology in such high regard is, by and large, predicated upon its skill set, which has guided us toward a reality that nobody could have imagined otherwise. Nevertheless, if we look beyond the surface for one hot second, it becomes abundantly clear that the whole run was also very much inspired by the way we applied those skills across real-world environments. That latter component, in fact, did a lot to give the creation a spectrum-wide presence and, as a result, initiated a full-blown tech revolution. This revolution went on to scale up the human experience through some genuinely unique avenues, yet even after achieving a feat so notable, technology continues to bring forth the right goods. That has only become more evident in recent times, and assuming one new discovery ends up having the desired impact, it will put the trend on an even higher pedestal moving forward.
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a system called Feature Fields for Robotic Manipulation (F3RM), which is designed to blend 2D images with foundation model features and create 3D scenes that help robots identify and grasp nearby items. According to certain reports, F3RM comes decked up with an ability to interpret open-ended language prompts from humans, an ability that makes it extremely useful in real-world environments containing thousands of objects, such as warehouses and households. But how does the system work on a more granular level? Well, the proceedings begin with F3RM taking pictures using a camera mounted on a selfie stick. Mounted on the stated stick, the camera snaps 50 images at different poses, enabling the system to build a neural radiance field (NeRF), a deep learning method that takes 2D images to construct a 3D scene. The resulting collage of RGB photos creates a “digital twin” of the surroundings, a 360-degree representation of what’s nearby. Apart from the highly detailed neural radiance field, F3RM also builds a feature field to augment geometry with semantic information. You see, it uses CLIP, a vision foundation model trained on hundreds of millions of images to efficiently learn visual concepts. By reconstructing the 2D CLIP features from the captured images in this enhanced form, F3RM effectively lifts the 2D features into a 3D representation.
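To get an intuition for that lifting step, here is a minimal, hypothetical sketch of the idea: per-pixel 2D features (as a CLIP-like encoder might produce) are aggregated into the 3D cells their camera rays pass through. The real F3RM distills features into a continuous neural field alongside the NeRF; a dense voxel grid and the function name `lift_features` are simplifications invented for illustration.

```python
import numpy as np

def lift_features(pixel_feats, voxel_ids, grid_size, feat_dim):
    """Average per-pixel 2D features into the 3D voxels their rays hit.

    pixel_feats: (N, feat_dim) features, e.g. from a CLIP-like encoder
    voxel_ids:   (N,) index of the voxel each pixel's ray terminates in
    """
    grid = np.zeros((grid_size, feat_dim))
    counts = np.zeros(grid_size)
    for feat, vid in zip(pixel_feats, voxel_ids):
        grid[vid] += feat      # accumulate features per voxel
        counts[vid] += 1
    nonzero = counts > 0
    grid[nonzero] /= counts[nonzero, None]  # average over contributing views
    return grid

# Toy example: pixels from two different views land in the same voxel,
# so that voxel's feature becomes the average of both observations.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vox = np.array([0, 0, 2])
field = lift_features(feats, vox, grid_size=3, feat_dim=2)
print(field[0])  # → [0.5 0.5], the average of the first two features
```

The key property, mirrored from the article, is that semantic information observed in 2D ends up attached to 3D locations, so the robot can query the scene geometrically and semantically at once.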
“Visual perception was defined by David Marr as the problem of knowing ‘what is where by looking,'” said Phillip Isola, senior author on the study, MIT associate professor of electrical engineering and computer science, and CSAIL principal investigator. “Recent foundation models have gotten really good at knowing what they are looking at; they can recognize thousands of object categories and provide detailed text descriptions of images. At the same time, radiance fields have gotten really good at representing where stuff is in a scene. The combination of these two approaches can create a representation of what is where in 3D.”
Having lifted the 2D features into a 3D representation, it’s time for us to understand how the new system helps in actually controlling objects. Basically, after receiving a few demonstrations, the robot applies what it knows about geometry and semantics to grasp objects it has never encountered before. As a result, when a user submits a text query, the robot searches through the space of possible grasps to identify those most likely to succeed in picking up the object requested by the user. During this selection process, each potential option is scored based on its relevance to the prompt, its similarity to the demonstrations the robot has been trained on, and whether it causes any collisions. Based on those scores, the robot picks the most suitable course of action. In case that somehow wasn’t impressive enough, F3RM further enables users to specify which object they want the robot to handle at different levels of linguistic detail. To explain it better, if there is a metal mug and a glass mug, the user can ask the robot for the “glass mug.” Hold on, we aren’t done, as even if both mugs are made of glass, with one filled with coffee and the other with juice, the user can ask the robot for the “glass mug with coffee.”
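The selection loop described above can be sketched in a few lines. This is a hedged illustration, not F3RM's actual implementation: the feature vectors, the `rank_grasps` function, and the additive scoring rule are assumptions standing in for the real system's learned scoring, but the three criteria (prompt relevance, demo similarity, collision-freedom) follow the article.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_grasps(grasps, query_feat, demo_feats):
    """Pick the candidate grasp scoring highest on relevance to the text
    prompt and similarity to demonstrations, skipping colliding grasps."""
    best, best_score = None, -np.inf
    for g in grasps:
        if g["collides"]:  # colliding grasps are ruled out entirely
            continue
        relevance = cosine(g["feat"], query_feat)
        demo_sim = max(cosine(g["feat"], d) for d in demo_feats)
        score = relevance + demo_sim
        if score > best_score:
            best, best_score = g, score
    return best

# Toy scene: g0 matches the query but collides; g1 matches the query and a
# demonstration; g2 matches neither.
grasps = [
    {"name": "g0", "feat": np.array([1.0, 0.0]), "collides": True},
    {"name": "g1", "feat": np.array([0.9, 0.1]), "collides": False},
    {"name": "g2", "feat": np.array([0.0, 1.0]), "collides": False},
]
query = np.array([1.0, 0.0])       # e.g. embedding of "glass mug"
demos = [np.array([1.0, 0.2])]     # features from a taught grasp
print(rank_grasps(grasps, query, demos)["name"])  # → g1
```

Because the grasp features and the text query live in the same CLIP-style embedding space, "relevance to the prompt" reduces to a similarity lookup, which is what lets the robot handle open-ended phrases like "glass mug with coffee."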
“If I showed a person how to pick up a mug by the lip, they could easily transfer that knowledge to pick up objects with similar geometries such as bowls, measuring beakers, or even rolls of tape. For robots, achieving this level of adaptability has been quite challenging,” said William Shen, Ph.D. student at MIT, and co-lead author on the study. “F3RM combines geometric understanding with semantics from foundation models trained on internet-scale data to enable this level of aggressive generalization from just a small number of demonstrations.”
In a test conducted on the system’s ability to interpret open-ended requests from humans, the researchers prompted the robot to pick up Baymax, a character from Disney’s “Big Hero 6.” Interestingly, F3RM had never been directly trained to pick up a toy of the cartoon superhero, but the robot was successful in leveraging its spatial awareness and vision-language features from the foundation models to decide which object to grasp and how to pick it up.
“Making robots that can actually generalize in the real world is incredibly hard,” said Ge Yang, postdoc at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions. “We really want to figure out how to do that, so with this project, we try to push for an aggressive level of generalization, from just three or four objects to anything we find in MIT’s Stata Center. We wanted to learn how to make robots as flexible as ourselves, since we can grasp and place objects even though we’ve never seen them before.”