Photographers struggling to find the perfect angle for a group shot have often relied on clumsy tripods, clunky self-timers, or, worst of all, stepping out of the frame to take the photo themselves. Enter PhotoBot, a robot photographer that promises to capture a good shot, taking instructions and using a reference photo to find the ideal composition.
“We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer,” the researchers explain. “We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery.”
“It was a really fun project,” PhotoBot co-creator and researcher Oliver Limoyo tells IEEE Spectrum. Limoyo developed the project while at Samsung alongside his manager and co-author Jimmy Li.
Say cheese! We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. PhD candidate @OliverLimoyo will present this work at #IROS2024!
Paper: https://t.co/DHGFvfOKJf pic.twitter.com/BPrxDkMxlD — STARS Laboratory (@utiasSTARS) October 3, 2024
Limoyo and Li were already working on a robot that could take pictures when they saw the Getty Image Challenge during COVID lockdowns. This challenge tasked people with recreating their favorite artworks using only three objects they found around their homes. It was a fun, exciting way to keep people engaged and connected during the early days of the pandemic.
Beyond serving that worthwhile purpose, Getty’s competition also inspired Limoyo and Li to have PhotoBot use a reference image to inform its own photo captures. As IEEE Spectrum explains, they then had to engineer a way for PhotoBot to accurately reproduce a reference photo by adjusting its camera to match that image.
It is even more sophisticated in practice than it sounds. PhotoBot starts with a written description of the type of photo a person wants. The robot then analyzes its environment, identifying the people and objects within its line of sight. Next, a large language model (LLM) compares the user’s text input and the detected objects against the labeled photos in PhotoBot’s database to select appropriate reference photographs.
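A minimal sketch of that selection step might look like the following, with the LLM’s ranking approximated by simple label overlap. The gallery data, scoring weights, and function names here are all illustrative assumptions, not PhotoBot’s actual implementation.

```python
# Hypothetical sketch: rank captioned gallery photos against the user's
# request and the object labels detected in the scene. A real system would
# ask an LLM to do this comparison; label overlap stands in for it here.

def score_reference(request_words, scene_labels, ref_labels):
    """Score a gallery photo by how well its labels cover the
    user's request and the objects visible in the scene."""
    ref = set(ref_labels)
    request_hits = len(ref & set(request_words))
    scene_hits = len(ref & set(scene_labels))
    return 2 * request_hits + scene_hits  # weight the user's request higher

def select_reference(request, scene_labels, gallery):
    """Return the id of the best-matching reference photo."""
    request_words = request.lower().split()
    return max(
        gallery,
        key=lambda pid: score_reference(request_words, scene_labels, gallery[pid]),
    )

# Toy gallery: photo id -> labels extracted from its caption (illustrative).
gallery = {
    "ref_01": ["person", "smiling", "flowers", "table"],
    "ref_02": ["dog", "park", "ball"],
    "ref_03": ["person", "pizza", "friends", "happy"],
}

best = select_reference("happy group with pizza", ["person", "pizza", "flowers"], gallery)
print(best)  # ref_03 matches both the request ("happy", "pizza") and the scene
```

In this toy example the gallery photo tagged with “happy” and “pizza” wins because it overlaps both the request and the detected scene objects, which mirrors the double comparison the article describes.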
Suppose a person wants a picture of themselves looking happy while surrounded by a few friends, some flowers in a vase, and maybe a pizza. PhotoBot will take all of this in, label the people and objects, and then find the photos in its database that best match the requested shot and contain similar components.
Once the user selects the reference shot they like best, PhotoBot will adjust its camera to match the framing and perspective of the reference image. Again, this is a more complex situation than it initially seems, as PhotoBot operates within a three-dimensional space but is trying to match the look of a two-dimensional reference photo.
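To illustrate the geometry involved, the toy sketch below searches over a single camera rotation so that a 3D scene point projects to the pixel column it occupies in the 2D reference image. The real system solves a far richer version of this problem with many feature correspondences and full camera motion; the focal length, point coordinates, and one-axis search here are illustrative assumptions only.

```python
import math

# Illustrative sketch of viewpoint matching: rotate the camera (yaw only)
# until a 3D point projects where the corresponding point sits in the
# reference photo. A pinhole camera model is assumed.

FOCAL = 500.0  # assumed focal length in pixels (illustrative)

def project_x(point, yaw):
    """Horizontal pixel coordinate of a 3D point (x, y, z) after
    rotating the camera by `yaw` radians about the vertical axis."""
    x, _, z = point
    xr = math.cos(yaw) * x - math.sin(yaw) * z
    zr = math.sin(yaw) * x + math.cos(yaw) * z
    return FOCAL * xr / zr

def align_yaw(point, target_x, steps=2001, span=math.pi / 4):
    """Grid-search the yaw that makes `point` project closest to the
    reference photo's pixel column `target_x`."""
    best_yaw, best_err = 0.0, float("inf")
    for i in range(steps):
        yaw = -span + 2 * span * i / (steps - 1)
        err = abs(project_x(point, yaw) - target_x)
        if err < best_err:
            best_yaw, best_err = yaw, err
    return best_yaw, best_err

# A point half a meter right of center, four meters away, should sit at
# the image center (column 0) in the reference: turn until it does.
yaw, err = align_yaw((0.5, 0.0, 4.0), target_x=0.0)
print(f"turn {math.degrees(yaw):.1f} degrees, residual {err:.2f} px")
```

Even this one-dimensional version shows why the problem is nontrivial: pixel positions respond nonlinearly to camera motion, so the robot must iteratively search for a pose whose 2D projection matches the reference.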
As for how good PhotoBot is at its job, photographers shouldn’t necessarily panic about the impending reality of a robot photographer. Still, PhotoBot held its own: respondents preferred its shots over those taken by eight human photographers about two-thirds of the time.
Li and the rest of the team are no longer working on PhotoBot, but Li believes their work has possible applications in smartphone photo-assistant apps.
“Imagine right on your phone, you see a reference photo. But you also see what the phone is seeing right now, and then that allows you to move around and align,” Li remarks.
Image credits: Photos from the research paper, ‘PhotoBot: Reference-Guided Interactive Photography via Natural Language,’ by Limoyo, Li, Rivkin, Kelly, and Dudek.