Imagine2Servo: Intelligent Visual Servoing with Diffusion-Driven Goal Generation for Robotic Tasks

1Robotics Research Center, IIIT Hyderabad, India 2Carnegie Mellon University, USA 3Intel Labs

Abstract

Visual servoing, the method of controlling robot motion through feedback from visual sensors, has seen significant advancements with the integration of optical flow-based methods. However, its application remains limited by inherent challenges such as the necessity for a target image at test time, the requirement of substantial overlap between initial and target images, and the reliance on feedback from a single camera. This paper introduces Imagine2Servo, an innovative approach leveraging diffusion-based image editing techniques to enhance visual servoing algorithms by generating intermediate goal images. This methodology extends visual servoing beyond its traditional constraints, enabling tasks like long-range navigation and manipulation without predefined goal images. We propose a pipeline that synthesizes subgoal images grounded in the task at hand, facilitating servoing in scenarios with minimal initial and target image overlap and integrating multi-camera feedback for comprehensive task execution. Our contributions demonstrate a novel application of image generation to robotic control, significantly broadening the capabilities of visual servoing systems. Real-world experiments validate the effectiveness and versatility of the Imagine2Servo framework in accomplishing a variety of tasks, marking a notable advancement in the field of visual servoing.
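The pipeline described in the abstract can be sketched as a simple two-level control loop: an outer loop that imagines the next subgoal image with a diffusion image-editing model conditioned on the current observation and the task instruction, and an inner optical-flow servo loop that drives the robot toward that subgoal. The sketch below is illustrative only, assuming a hypothetical API: the function names (`imagine_subgoal`, `servo_step`), the toy 2-D "images", and the convergence test are stand-ins, not the paper's actual models or interfaces.

```python
import numpy as np

def imagine2servo(obs, instruction, imagine_subgoal, servo_step,
                  max_subgoals=8, servo_iters=50, done_eps=1e-2):
    """Hedged sketch of one Imagine2Servo episode (hypothetical API):
    imagine a subgoal from the current view and the instruction, servo
    toward it, and repeat; stop when the imagined subgoal no longer
    differs from the current observation (task presumed complete)."""
    for _ in range(max_subgoals):
        # Outer loop: diffusion image-editing model proposes the next subgoal
        # image, kept within "overlap" range of the current observation.
        subgoal = imagine_subgoal(obs, instruction)
        if np.linalg.norm(subgoal - obs) < done_eps:
            return obs, True                          # subgoal ≈ current view
        # Inner loop: optical-flow-style servo controller closes the gap.
        for _ in range(servo_iters):
            obs = servo_step(obs, subgoal)
            if np.linalg.norm(subgoal - obs) < done_eps:
                break
    return obs, False

# Toy stand-ins so the sketch runs: "images" are 2-vectors, the "editor"
# steps a bounded distance toward a fixed target, the "servo" applies a
# proportional flow-like correction.
target = np.array([10.0, 5.0])

def imagine_subgoal(obs, instruction):
    d = target - obs
    n = np.linalg.norm(d)
    return obs + (d if n < 3.0 else 3.0 * d / n)  # subgoal near current view

def servo_step(obs, subgoal):
    return obs + 0.5 * (subgoal - obs)            # proportional correction

final, done = imagine2servo(np.zeros(2), "reach the target",
                            imagine_subgoal, servo_step)
```

The key design point the sketch captures is that each imagined subgoal stays close to the current view, so the servo controller never needs large overlap between the initial and final images.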

Architecture

Pipeline Image

Results

PyBullet

Cross the door

Even under large initial translation and rotation offsets, where the door is barely visible, our method models the door and generates appropriate sub-goals.

Move to the front of the door

Imagine2Servo formulates relevant sub-goals and plans trajectories that navigate safely around the door, avoiding collisions.

RLBench

Reach the charger

The end-effector reaches the precise pose required to unplug the charger.

Reach the moon-shaped object

Our model accurately identifies the object specified in the text instruction among multiple objects in the scene.

Reach the window​

Although the door handle is occluded, our model generates sub-goals and takes the actions needed to accomplish the task.

Real-world (xARM7)

Put the hexagon in the shape sorter

Our model generates relevant sub-goals and precise trajectories to place the hexagon into the shape sorter.

Stack on the red square

Our model correctly identifies the color and shape specified in the text prompt, even when other objects in the scene share the same color or shape.

Stack on the blue circle

Our model precisely identifies the blue circle and stacks the triangle on it.