android-cloud-device-agent

Android Cloud Device Agent

Use this when Alex wants a Hermes skill, CLI, prototype, or research plan for controlling persistent Android devices in the cloud. The target pattern is a long-lived Android profile/device with stable state and device identity, controlled by APIs plus ADB/Appium and optionally guided by screenshots/vision.

Scope and Safety

Provider Selection Criteria

Prioritize providers that offer all of:

  1. Persistence: device/profile state survives stop/start and remains available as long as rented/paid.
  2. Programmatic control: ADB or equivalent APIs for screenshots, shell commands, input injection, app install/start/stop, and status.
  3. Device identity: stable Android/device serial, IMEI or equivalent, MAC/Bluetooth metadata, timezone/locale/region metadata.
  4. Monthly/unlimited billing: avoid per-minute billing for 24/7 use unless you are only running a short spike.
  5. Google Play Services support: verify on the exact region/Android model before assuming Play Store flows work.
  6. Webhooks/task status: useful for long-running screenshots, RPA tasks, installs, and login/download operations.
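
One way to keep provider comparisons honest is to record each criterion as an explicit flag and list what is missing. The sketch below is only an illustration: the `ProviderCapabilities` class, its field names, and the sample values are hypothetical and do not describe any real provider.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProviderCapabilities:
    """Hypothetical checklist mirroring the six selection criteria."""
    persistence: bool
    programmatic_control: bool
    device_identity: bool
    monthly_billing: bool
    play_services_verified: bool
    webhooks: bool


def missing_criteria(caps: ProviderCapabilities) -> list[str]:
    """Return the names of the criteria the provider does not satisfy."""
    return [f.name for f in fields(caps) if not getattr(caps, f.name)]


# Made-up candidate: everything confirmed except Google Play Services.
candidate = ProviderCapabilities(
    persistence=True,
    programmatic_control=True,
    device_identity=True,
    monthly_billing=True,
    play_services_verified=False,
    webhooks=True,
)
print(missing_criteria(candidate))
```

An empty list means the provider meets all six criteria; anything else names what still needs verification on the exact region and Android model.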

Default to running Hermes on the VPS/host and controlling the cloud Android device remotely; do not start by trying to run Hermes inside Android/Termux. Treat the Android device as the target environment and the host as the agent brain/tool runner.

Layer provider-specific APIs under stable Hermes primitives:

provider API         -> lifecycle, screenshots, app management, billing/status
ADB                  -> shell, screencap, input tap/swipe/text, uiautomator dump, apk install
MobileRun/DroidRun   -> natural-language mobile-agent tasks over an ADB-connected device
Appium/uiautomator2  -> higher-level element targeting when accessibility tree is usable
LLM vision loop      -> screenshot + UI tree -> decide next action -> execute -> verify

See references/mobile-agent-frameworks.md for session notes on MobileRun/DroidRun, Appium fallback, GeeLark provisioning guidance, and first live verification commands. See references/geelark-account-onboarding.md for GeeLark signup URLs/selectors, Playwright/Xvfb inspection notes, and CAPTCHA-specific pitfalls from account onboarding.

Expose a local CLI or tool server with commands like:

android-device list
android-device create --provider geelark --name test-phone --android "Android 15" --region us --monthly
android-device start <phone_id>
android-device stop <phone_id>
android-device status <phone_id>
android-device screenshot <phone_id> --out /tmp/screen.png
android-device shell <phone_id> "pm list packages"
android-device adb-connect <phone_id>
android-device tap <phone_id> 320 800
android-device swipe <phone_id> 300 1100 300 200 600
android-device text <phone_id> "hello"
android-device install-apk <phone_id> ./app.apk
android-device open-app <phone_id> com.example.app
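
A minimal sketch of how the ADB-backed commands above could map onto real `adb` invocations, assuming the device is already connected and addressed by its ADB serial rather than the provider's phone ID. The function name `build_adb_command` and the action names are hypothetical; the function only builds the argument list and does not run anything, so lifecycle commands (create/start/stop/status) stay with the provider API.

```python
import shlex


def build_adb_command(serial: str, action: str, *params: object) -> list[str]:
    """Translate one device action into an adb argv list (no execution)."""
    base = ["adb", "-s", serial]
    if action == "tap":
        x, y = params
        return base + ["shell", "input", "tap", str(x), str(y)]
    if action == "swipe":
        # Expected order: x1 y1 x2 y2 duration_ms.
        if len(params) != 5:
            raise ValueError("swipe needs x1 y1 x2 y2 duration_ms")
        return base + ["shell", "input", "swipe", *(str(p) for p in params)]
    if action == "text":
        (text,) = params
        # `input text` treats %s as a space; quote the rest for the device shell.
        escaped = shlex.quote(str(text).replace(" ", "%s"))
        return base + ["shell", "input", "text", escaped]
    if action == "shell":
        (command,) = params
        return base + ["shell", str(command)]
    if action == "screenshot":
        # Caller captures stdout as PNG bytes.
        return base + ["exec-out", "screencap", "-p"]
    if action == "install-apk":
        (apk_path,) = params
        return base + ["install", "-r", str(apk_path)]
    if action == "open-app":
        (package,) = params
        # Launch the package's default launcher activity via monkey.
        return base + ["shell", "monkey", "-p", str(package),
                       "-c", "android.intent.category.LAUNCHER", "1"]
    raise ValueError(f"unsupported action: {action}")
```

Returning an argv list keeps the mapping testable without a device; a caller could pass the result to `subprocess.run(..., capture_output=True)` once a phone is connected.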

Vision-Control Loop

  1. Start/verify the device is running.
  2. Capture screenshot with provider screenshot API or adb exec-out screencap -p.
  3. Dump UI hierarchy when possible:
       adb shell uiautomator dump /sdcard/window.xml
       adb pull /sdcard/window.xml /tmp/window.xml
  4. Send screenshot plus UI tree to the model.
  5. Choose one action: tap, swipe, text, back/home, wait, open app, shell command.
  6. Execute via ADB/provider shell API.
  7. Re-screenshot and verify the state changed.
  8. Save a trace of screenshots, action JSON, package name, device id, and errors.
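
The loop above can be sketched with injected callables so it can be exercised without a device. Everything here is an assumption for illustration: the function name `run_vision_loop`, the action dictionary shape, the `"done"` action type, and the use of a screenshot hash as the "state changed" check (a real loop might compare UI trees or foreground package instead).

```python
import hashlib
from typing import Any, Callable


def run_vision_loop(
    capture: Callable[[], bytes],            # returns screenshot bytes
    dump_ui: Callable[[], str],              # returns UI hierarchy XML (may be empty)
    decide: Callable[[bytes, str], dict],    # model call: picks exactly one action
    execute: Callable[[dict], None],         # runs the action via ADB/provider API
    max_steps: int = 10,
) -> list[dict[str, Any]]:
    """Run screenshot -> decide -> execute -> verify, returning a trace."""
    trace: list[dict[str, Any]] = []
    for step in range(max_steps):
        before = capture()
        ui_tree = dump_ui()
        action = decide(before, ui_tree)
        if action.get("type") == "done":
            break
        error = None
        try:
            execute(action)
        except Exception as exc:  # record the failure and keep the trace intact
            error = str(exc)
        after = capture()
        # Crude verification: did the pixels change after the action?
        changed = hashlib.sha256(before).digest() != hashlib.sha256(after).digest()
        trace.append(
            {"step": step, "action": action, "changed": changed, "error": error}
        )
    return trace
```

The returned trace covers the action JSON and errors from step 8; screenshots, package name, and device ID would be added by the caller that owns the real `capture` and `execute` functions.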

Mobile Agent Framework Integration

Prefer this order for turning a GeeLark phone into a Hermes-operated mobile agent:

  1. Direct GeeLark + ADB wrapper first: prove lifecycle, ADB, screenshots, Google packages, and persistence with deterministic commands.
  2. MobileRun/DroidRun second: run droidrun/mobilerun on the Hermes host/VPS against the ADB-connected GeeLark device for natural-language mobile control. This was the primary open-source project identified for “browser-use but mobile.”
  3. Appium/uiautomator2 fallback: use when structured element selection is more reliable than pure screenshot/vision actions.
  4. On-device Hermes/Termux only later: treat running Hermes inside Android as an experimental phase after host-driven control works.

GeeLark Provider Notes

GeeLark is a strong first prototype candidate because official docs expose persistent cloud-phone management, ADB, screenshots, shell execution, app management, and Google RPA endpoints. See references/geelark-cloud-phone-openapi.md for concise endpoint/pricing/payment notes. See references/geelark-account-onboarding.md before automating account signup or code-sending flows.

Recommended GeeLark flow:

  1. Confirm plan/API access and rent one cloud phone on monthly billing, not per-minute billing, for persistent 24/7 testing.
  2. Create/list/select the phone; record cloud phone ID and equipment info.
  3. Start the phone.
  4. Enable ADB, wait about 3 seconds, fetch ADB IP/port/password, then connect.
  5. Verify primitives:
       - status returns started
       - screenshot returns a valid image
       - shell pm list packages works
       - ADB input tap changes UI
       - uiautomator dump returns XML
  6. Test Google Play only with owned/pre-created test accounts.
  7. Stop/restart later and confirm app/account state persists.
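
Step 4 (enable ADB, wait, fetch connection details, connect) is a natural place for a small polling helper. The sketch below assumes a hypothetical `fetch_adb_info` callable that wraps whatever provider endpoint returns the ADB details, and assumes it yields a dict with `ip` and `port` keys; neither the name nor the response shape comes from the GeeLark docs. Any password or login step is provider-specific and is left out.

```python
import time
from typing import Callable, Optional


def wait_for_adb_endpoint(
    fetch_adb_info: Callable[[], Optional[dict]],  # hypothetical provider wrapper
    initial_delay: float = 3.0,   # the "wait about 3 seconds" from step 4
    retry_delay: float = 2.0,
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the provider reports an ADB ip/port; return "ip:port"."""
    sleep(initial_delay)
    for attempt in range(attempts):
        info = fetch_adb_info()
        if info and info.get("ip") and info.get("port"):
            return f"{info['ip']}:{info['port']}"
        if attempt < attempts - 1:
            sleep(retry_delay)
    raise TimeoutError("ADB endpoint not available after polling")
```

The returned string can be passed to `adb connect <ip:port>` and then reused as the `-s` serial for later commands.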

Pricing Rule of Thumb

For persistent workers, calculate 30-day always-on cost before recommending a billing mode:

monthly_minutes = 60 * 24 * 30
payg_cost = per_minute_rate * monthly_minutes

If a monthly rental/unlimited plan exists, it is usually the right choice for long-lived Android agents.
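
The same arithmetic as a small helper, useful when comparing quotes. The function names are hypothetical, and the rates in the usage example are made-up numbers rather than any provider's pricing.

```python
def always_on_payg_cost(per_minute_rate: float, days: int = 30) -> float:
    """Cost of running one device 24/7 on per-minute billing."""
    monthly_minutes = 60 * 24 * days  # 43,200 minutes for 30 days
    return per_minute_rate * monthly_minutes


def recommend_billing(
    per_minute_rate: float, monthly_price: float, days: int = 30
) -> str:
    """Pick the cheaper mode for an always-on device."""
    payg_cost = always_on_payg_cost(per_minute_rate, days)
    return "monthly" if monthly_price <= payg_cost else "pay-as-you-go"


# Illustrative numbers only: 0.002 per minute vs. a 30.00 monthly rental.
print(round(always_on_payg_cost(0.002), 2), recommend_billing(0.002, 30.0))
```

Even a small per-minute rate multiplies by 43,200 over a 30-day month, which is why per-minute billing usually only makes sense for short spikes.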

Implementation Checklist

Common Pitfalls

Verification

A prototype is not proven until all of these pass:

android-device status <id>
android-device screenshot <id> --out /tmp/screen.png
android-device shell <id> "pm list packages"
android-device adb-connect <id>
adb -s <serial> shell input tap 100 100
adb -s <serial> shell uiautomator dump /sdcard/window.xml
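
Exit codes alone can hide failures, so it may help to validate what each command actually returned. The helpers below are a sketch with hypothetical names; they check the PNG signature of a screenshot, parse `package:` lines from `pm list packages`, and confirm that a pulled UI dump parses as XML with a `hierarchy` root element.

```python
import xml.etree.ElementTree as ET

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def is_valid_png(data: bytes) -> bool:
    """True if the screenshot bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def parse_packages(output: str) -> list[str]:
    """Extract package names from `pm list packages` output."""
    return [
        line.split(":", 1)[1].strip()
        for line in output.splitlines()
        if line.startswith("package:")
    ]


def is_ui_dump(xml_text: str) -> bool:
    """True if the text parses as XML with a <hierarchy> root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "hierarchy"
```

Note that `uiautomator dump` writes the XML to the device path given on the command line, so the file still has to be pulled (for example with `adb pull`) before `is_ui_dump` can inspect it.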

Then restart the cloud phone and confirm persistence:

  1. Install/open a test app.
  2. Stop the phone.
  3. Start it again.
  4. Confirm the app and account/session state are still present.
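
The app half of this persistence check can be automated along the following lines. This is a sketch with injected callables: `list_packages`, `stop`, and `start` are hypothetical wrappers around the provider and ADB calls, and `start` is assumed to block until the phone is running again. Account and session state still need an app-specific check that is not covered here.

```python
from typing import Callable, Iterable


def check_app_persistence(
    list_packages: Callable[[], Iterable[str]],  # e.g. parsed `pm list packages`
    stop: Callable[[], None],                    # provider stop call
    start: Callable[[], None],                   # provider start call (blocking)
    expected_package: str,
) -> dict:
    """Restart the phone and report whether installed packages survived."""
    before = set(list_packages())
    if expected_package not in before:
        raise ValueError(f"{expected_package} is not installed before restart")
    stop()
    start()
    after = set(list_packages())
    return {
        "package_present": expected_package in after,
        "lost_packages": sorted(before - after),
    }
```

A non-empty `lost_packages` list suggests the provider reset the image on restart rather than preserving profile state.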