Implement model cursor for visual feedback

### Feature request

Update: see https://github.com/OpenAdaptAI/OpenAdapt/issues/760#issuecomment-2347337901 for the latest requirements.

We want to give the model the ability to:
1. paint a red dot on its suggested target location,
2. look at the screenshot with the dot on it, and
3. optionally self-correct.

Thank you @LunjunZhang for the suggestion 🙏 

This involves creating a CursorReplayStrategy (based on the [VanillaReplayStrategy](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py)) that implements the required prompting.
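To make the paint / look / self-correct loop concrete, here is a rough sketch. Everything in it is an assumption for illustration, not existing OpenAdapt code: `paint_dot`, `refine_target`, and the `propose` / `review` callables are hypothetical names. The screenshot is a plain nested list of RGB tuples so the snippet runs without dependencies; a real strategy would presumably draw on the actual screenshot image and implement `propose` / `review` as model prompts.

```python
from typing import Callable, List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; stand-in for a real screenshot
Point = Tuple[int, int]    # (x, y) in pixel coordinates

RED: Pixel = (255, 0, 0)


def paint_dot(image: Image, center: Point, radius: int = 3) -> Image:
    """Return a copy of `image` with a filled red dot at `center`."""
    cx, cy = center
    marked = [row[:] for row in image]  # copy rows so the original is untouched
    # Only visit the bounding box of the dot, clipped to the image edges.
    for y in range(max(0, cy - radius), min(len(marked), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(marked[y]), cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                marked[y][x] = RED
    return marked


def refine_target(
    screenshot: Image,
    propose: Callable[[Image], Point],
    review: Callable[[Image, Point], Optional[Point]],
    max_rounds: int = 3,
) -> Point:
    """Propose a target, show it to the model as a red dot, let it self-correct.

    `propose` and `review` are hypothetical wrappers around model prompts.
    `review` returns None to accept the dot, or a corrected point.
    """
    target = propose(screenshot)
    for _ in range(max_rounds):
        marked = paint_dot(screenshot, target)  # 1. paint the dot
        correction = review(marked, target)     # 2. model looks at the marked image
        if correction is None:                  # model accepts the current target
            break
        target = correction                     # 3. self-correct and try again
    return target
```

The `max_rounds` cap is a design choice of this sketch: it bounds the number of model calls, and if the model is still correcting when the cap is hit, the last correction is returned without another review.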

### Motivation

Let the model notice and correct its own errors, e.g. missed segmentations.

Possibly related: https://arxiv.org/abs/2406.09403:

> Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn.
...
> Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in [this https URL](https://visualsketchpad.github.io/).
