The research we’re talking about here is from a paper titled, “Robot Learning Manipulation Action Plans by ‘Watching’ Unconstrained Videos
from the World Wide Web.” The paper is really about visual processing: watching a human interacting with objects in a video, figuring out what that human is doing and how they’re doing it, and then replicating those actions using the manipulation capabilities of a robot (Baxter, in this case).

The University of Michigan has a dataset called YouCook, which consists of 88 open-source third-person YouTube cooking videos. Each video was given a set of unconstrained natural language descriptions by humans, and each video also has frame-by-frame object and action annotations. Using these data, the UMD researchers developed two convolutional neural networks: one to recognize and classify the objects in the videos, and the other to recognize and classify the grasps that the human is using.
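To make that two-network setup concrete, here is a minimal sketch of what a frame-level grasp-type classifier could look like, written in PyTorch. This is not the authors’ network: the grasp labels, the input crop size, and the layer choices are all illustrative assumptions, and the paper’s object-recognition CNN would be a structurally similar classifier with object categories instead of grasp types.

```python
# A minimal, hypothetical sketch of a frame-level grasp-type classifier.
# Not the paper's architecture: class names, input size, and layers are
# illustrative assumptions for a network trained on annotated video frames.
import torch
import torch.nn as nn

GRASP_CLASSES = ["power-grasp", "precision-grasp", "rest"]  # hypothetical labels

class GraspCNN(nn.Module):
    """Small CNN that maps an RGB crop around the hand to a grasp type."""

    def __init__(self, num_classes: int = len(GRASP_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),  # 128x128 -> 64x64
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                        # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                        # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                                # global average pool
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x).flatten(1)
        return self.classifier(feats)

if __name__ == "__main__":
    model = GraspCNN()
    frame_crop = torch.randn(1, 3, 128, 128)  # stand-in for a cropped hand region
    logits = model(frame_crop)
    print("predicted grasp type:", GRASP_CLASSES[logits.argmax(dim=1).item()])
```

In the system described above, the per-frame predictions from the two classifiers (objects and grasps) are what downstream processing turns into an action plan the robot can execute.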
via IEEE Spectrum
Image: University of Maryland