macos-computer-use

/home/avalon/.hermes/skills/apple/macos-computer-use/SKILL.md

Drive the macOS desktop in the background (screenshots, mouse, keyboard, scroll, drag) without stealing the user's cursor, keyboard focus, or Space. Works with any tool-capable model. Load this skill whenever the computer_use tool is available.

version: 1.0.0
platforms: [macos]
metadata:
  hermes:
    tags: [computer-use, macos, desktop, automation, gui]
    category: desktop
    related_skills: [browser]


macOS Computer Use (universal, any-model)

You have a computer_use tool that drives the Mac in the background. Your actions do NOT move the user's cursor, steal keyboard focus, or switch Spaces. The user can keep typing in their editor while you click around in Safari in another Space. This is the opposite of pyautogui-style automation.

Everything here works with any tool-capable model — Claude, GPT, Gemini, or an open model running through a local OpenAI-compatible endpoint. There is no Anthropic-native schema to learn.

The canonical workflow

Step 1 — Capture first. Almost every task starts with:

computer_use(action="capture", mode="som", app="Safari")

This returns a screenshot with numbered overlays on every interactable element and an AX (accessibility) tree index like:

#1  AXButton 'Back' @ (12, 80, 28, 28) [Safari]
#2  AXTextField 'Address and Search' @ (80, 80, 900, 32) [Safari]
#7  AXLink 'Sign In' @ (900, 420, 80, 24) [Safari]
...
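
If you want to pick an element by its label instead of reading the index by eye, the listing above can be parsed mechanically. The following is a minimal sketch: the line shape is inferred only from the three sample lines shown here, and `AXElement`, `parse_ax_index`, and `find_by_label` are hypothetical helper names, not part of computer_use. The four numbers in parentheses are kept in the order they appear, without assuming what each one means.

```python
import re
from dataclasses import dataclass

# Assumed line shape, inferred from the sample listing only:
#   #<index>  <role> '<label>' @ (<four integers>) [<app>]
_LINE = re.compile(
    r"#(?P<index>\d+)\s+(?P<role>\w+)\s+'(?P<label>.*)'\s+@\s+"
    r"\((?P<frame>\d+(?:,\s*\d+){3})\)\s+\[(?P<app>[^\]]+)\]"
)


@dataclass
class AXElement:
    index: int
    role: str
    label: str
    frame: tuple  # the four integers, in the order they appear
    app: str


def parse_ax_index(text):
    """Parse every line that matches the assumed shape; skip the rest."""
    elements = []
    for line in text.splitlines():
        m = _LINE.search(line)
        if m is None:
            continue  # e.g. a trailing "..." line
        frame = tuple(int(n) for n in m.group("frame").split(","))
        elements.append(
            AXElement(int(m.group("index")), m.group("role"),
                      m.group("label"), frame, m.group("app"))
        )
    return elements


def find_by_label(elements, label, role=None):
    """Return the index of the first element with this label, else None."""
    for el in elements:
        if el.label == label and (role is None or el.role == role):
            return el.index
    return None
```

The value returned by `find_by_label` is what you would pass as `element=` in the next call. A result of `None` means the label is not in the current capture, which is usually a sign to re-capture or scroll.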

Step 2 — Click by element index. This is the single most important habit:

computer_use(action="click", element=7)

Element indices are much more reliable than pixel coordinates for every model. Claude was trained on both; other models are often reliable only with indices.

Step 3 — Verify. After any state-changing action, re-capture. You can save a round-trip by asking for the post-action capture inline:

computer_use(action="click", element=7, capture_after=True)

Capture modes

mode            Returns                                     Best for
som (default)   Screenshot + numbered overlays + AX index   Vision models; preferred default
vision          Plain screenshot                            When the SOM overlay interferes with what you want to verify
ax              AX tree only, no image                      Text-only models, or when you don't need to see pixels
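
The choice between the three modes can be written as a small decision function. This is only a sketch of the "Best for" column above; `choose_capture_mode` and its parameter names are hypothetical and not part of the tool.

```python
def choose_capture_mode(model_has_vision, need_pixels=True, overlay_in_the_way=False):
    """Pick a capture mode following the 'Best for' column of the table."""
    if not model_has_vision or not need_pixels:
        return "ax"      # AX tree only, no image
    if overlay_in_the_way:
        return "vision"  # plain screenshot, no numbered overlays
    return "som"         # preferred default


mode = choose_capture_mode(model_has_vision=True)
# The result is what you would pass as mode=... in a capture call.
```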

Actions

capture           mode=som|vision|ax   app=…  (default: current app)
click             element=N     OR     coordinate=[x, y]
double_click      element=N     OR     coordinate=[x, y]
right_click       element=N     OR     coordinate=[x, y]
middle_click      element=N     OR     coordinate=[x, y]
drag              from_element=N, to_element=M        (or from/to_coordinate)
scroll            direction=up|down|left|right   amount=3 (ticks)   element=N  OR  coordinate=[x, y]
type              text="…"
key               keys="cmd+s" | "return" | "escape" | "ctrl+alt+t"
wait              seconds=0.5
list_apps
focus_app         app="Safari"  raise_window=false   (default: don't raise)

All actions accept an optional capture_after=True, which returns a follow-up screenshot in the same tool call.

All actions that target an element accept modifiers=["cmd","shift"] for held keys.
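
Before sending a call, it can help to check its target arguments against the listing above. The sketch below is a hypothetical pre-flight check, not something the tool provides. It assumes that "element=N OR coordinate=[x, y]" means exactly one of the two, which the listing suggests but does not state outright.

```python
# Actions whose listing shows "element=N OR coordinate=[x, y]".
POINT_ACTIONS = {"click", "double_click", "right_click", "middle_click"}


def check_target(action, **args):
    """Return a list of problems with the target arguments (empty if fine)."""
    problems = []
    if action in POINT_ACTIONS:
        has_element = "element" in args
        has_coordinate = "coordinate" in args
        # Assumption: the two targeting styles are mutually exclusive.
        if has_element == has_coordinate:
            problems.append(f"{action}: pass exactly one of element or coordinate")
    coordinate = args.get("coordinate")
    if coordinate is not None and len(coordinate) != 2:
        problems.append("coordinate must be [x, y]")
    return problems
```

An empty list means the arguments match the listing; anything else is worth fixing before the call, since a mis-targeted click is harder to diagnose after the fact.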

Background rules (the whole point)

  1. Never pass raise_window=True unless the user explicitly asked you to bring a window to the front. Input routing works without raising.
  2. Scope captures to an app (app="Safari") — less noisy, fewer elements, doesn't leak other windows the user has open.
  3. Don't switch Spaces. cua-driver drives elements on any Space regardless of which one is visible.
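
The first two rules are mechanical enough to check in code. The following is a hedged sketch of such a check: `background_rule_warnings` and the `user_asked_to_raise` flag are hypothetical names, and the third rule (not switching Spaces) is left out because nothing in the action arguments expresses it.

```python
def background_rule_warnings(args, user_asked_to_raise=False):
    """Flag tool-call arguments that go against background rules 1 and 2."""
    warnings = []
    # Rule 1: do not raise a window unless the user explicitly asked for it.
    if args.get("raise_window") is True and not user_asked_to_raise:
        warnings.append("raise_window=True without an explicit user request")
    # Rule 2: scope captures to an app.
    if args.get("action") == "capture" and "app" not in args:
        warnings.append("capture is not scoped to an app")
    return warnings
```

For example, `background_rule_warnings({"action": "capture", "mode": "som"})` reports the missing app scope, while the same call with `"app": "Safari"` added produces no warnings.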

Text input patterns

Drag & drop

Prefer element indices:

computer_use(action="drag", from_element=3, to_element=17)

For a rubber-band selection on empty canvas, use coordinates:

computer_use(action="drag",
             from_coordinate=[100, 200],
             to_coordinate=[400, 500])

Scroll

Scroll the viewport under an element (most common):

computer_use(action="scroll", direction="down", amount=5, element=12)

Or at a specific point:

computer_use(action="scroll", direction="down", amount=3, coordinate=[500, 400])

Managing what's focused

list_apps returns running apps with bundle IDs, PIDs, and window counts. focus_app routes input to an app without raising it. You rarely need to focus explicitly — passing app=... to capture / click / type will target that app's frontmost window automatically.

Delivering screenshots to the user

When the user is on a messaging platform (Telegram, Discord, etc.) and you took a screenshot they should see, save it somewhere durable and use MEDIA:/absolute/path.png in your reply. cua-driver's screenshots are PNG bytes; write them out with write_file or the terminal (base64 -d).
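
If you go through the terminal rather than write_file, the save step might look like the sketch below. It assumes the screenshot reaches you as a base64 string (suggested by the base64 -d hint above, but not confirmed here); `save_screenshot` is a hypothetical helper name.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def save_screenshot(b64_png, path):
    """Decode a base64 PNG, write it to an absolute path, return the MEDIA line."""
    data = base64.b64decode(b64_png)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    target = Path(path).expanduser().resolve()  # MEDIA: needs an absolute path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return f"MEDIA:{target}"
```

The returned string can be placed in the reply as-is. Checking the PNG signature first avoids delivering a file that a messaging platform would fail to render.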

On CLI, you can just describe what you see — the screenshot data stays in your conversation context.

Safety — these are hard rules

Failure modes

When NOT to use computer_use