Perseus Pisces captures the screen, asks an LLM what to do, and drives the real mouse and keyboard. Then it records the task as a semantic action graph and replays it deterministically — with drift detection when the UI shifts underneath it. Every action is gated by an auditable whitelist.
The core of Perseus Pisces is a semantic action graph — a recording that matches what the screen means, not where its pixels were. It survives moved windows, restyled themes, and different resolutions.
// One recorded step: matched by meaning, not pixels
{
  "action": { "type": "mouse_click", "reason": "Export report" },
  "target": { "role": "button", "label": "Export", "context": "top-right" },
  "screen_state_before": { "perceptual_hash": "c3e1a0…" }
}

# Replay anywhere: parameters fill the blanks
$ perseus replay f3a1 --param filename=june-2026.csv
Replay finished: 3 steps, drift=false, 0 warnings
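To make "matched by meaning" concrete, here is a minimal Go sketch of how a recorded step like the one above could be decoded and resolved against whatever is on screen at replay time. The JSON field names come from the sample; the Element type and the Resolve function are hypothetical and are not taken from the Perseus Pisces source. The "context" field is decoded but not used for matching in this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Step mirrors the fields of the recorded JSON step shown above.
type Step struct {
	Action struct {
		Type   string `json:"type"`
		Reason string `json:"reason"`
	} `json:"action"`
	Target Target `json:"target"`
}

// Target describes what to act on by meaning rather than by coordinates.
type Target struct {
	Role    string `json:"role"`
	Label   string `json:"label"`
	Context string `json:"context"`
}

// Element is a hypothetical on-screen element detected at replay time.
type Element struct {
	Role, Label string
	X, Y        int
}

// Resolve returns the first element whose role and label match the target,
// ignoring case. Coordinates are an output of matching, never an input.
func Resolve(t Target, screen []Element) (Element, bool) {
	for _, e := range screen {
		if strings.EqualFold(e.Role, t.Role) && strings.EqualFold(e.Label, t.Label) {
			return e, true
		}
	}
	return Element{}, false
}

func main() {
	raw := `{"action":{"type":"mouse_click","reason":"Export report"},
	         "target":{"role":"button","label":"Export","context":"top-right"}}`
	var s Step
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	// Assumed replay-time screen: the same button, but the window has moved.
	screen := []Element{{"button", "Cancel", 900, 40}, {"button", "Export", 1210, 64}}
	if e, ok := Resolve(s.Target, screen); ok {
		fmt.Printf("%s at (%d, %d)\n", s.Action.Type, e.X, e.Y)
	}
}
```

Because the step stores a role and a label instead of a position, the same recording still resolves after the button moves; a real matcher would presumably also weigh the context hint and the perceptual hash.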
Built from twelve focused subsystems — each using only the Go standard library and OS primitives.
Mouse moves along randomized Bézier curves with jitter and overshoot; keystrokes carry natural 40–220 ms delays.
Every action clears a glob-rule whitelist and is written to an append-only NDJSON log before it ever executes.
Capture tasks as a semantic action graph and replay them with perceptual-hash state matching and drift detection.
Anthropic, OpenAI, and GitHub Copilot — all with vision, reasoning controls, and structured action output.
Expose the desktop as Model Context Protocol tools over stdio, so any MCP client can drive it — pure stdlib JSON-RPC.
A dark, precision-tooling control room embedded in the binary via go:embed. Vanilla JS, no CDN, no frameworks.
No go get, no npm, no Docker. GUI primitives are OS syscalls; go.sum stays empty and builds are reproducible.
One codebase compiles to Windows (syscall), macOS (CoreGraphics), and Linux (X11/XTest, Wayland fallback).
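As an illustration of the human-like input subsystem, the following Go sketch generates a mouse path along a cubic Bézier curve with randomly displaced control points and small per-point jitter, plus a keystroke delay in the 40–220 ms range quoted above. It uses only the standard library. The function names, the spread and jitter parameters, and the choice of a cubic curve are assumptions for this sketch; overshoot is left out for brevity.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Point is a screen position in pixels.
type Point struct{ X, Y float64 }

// NewRand returns a seeded generator so paths can be reproduced in tests.
func NewRand(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// bezier evaluates a cubic Bézier curve at t in [0, 1].
func bezier(p0, p1, p2, p3 Point, t float64) Point {
	u := 1 - t
	a, b, c, d := u*u*u, 3*u*u*t, 3*u*t*t, t*t*t
	return Point{
		a*p0.X + b*p1.X + c*p2.X + d*p3.X,
		a*p0.Y + b*p1.Y + c*p2.Y + d*p3.Y,
	}
}

// Path returns steps+1 points from start to end (steps must be > 0).
// Control points are displaced by up to spread pixels and interior points
// by up to jitter pixels; the two endpoints are always exact.
func Path(r *rand.Rand, start, end Point, steps int, spread, jitter float64) []Point {
	off := func(s float64) float64 { return (r.Float64()*2 - 1) * s }
	dx, dy := end.X-start.X, end.Y-start.Y
	c1 := Point{start.X + dx/3 + off(spread), start.Y + dy/3 + off(spread)}
	c2 := Point{start.X + 2*dx/3 + off(spread), start.Y + 2*dy/3 + off(spread)}
	pts := make([]Point, 0, steps+1)
	for i := 0; i <= steps; i++ {
		p := bezier(start, c1, c2, end, float64(i)/float64(steps))
		if i > 0 && i < steps {
			p.X += off(jitter)
			p.Y += off(jitter)
		}
		pts = append(pts, p)
	}
	return pts
}

// KeyDelay picks a pause between keystrokes in the 40–220 ms range.
func KeyDelay(r *rand.Rand) time.Duration {
	return time.Duration(40+r.Intn(181)) * time.Millisecond
}

func main() {
	r := NewRand(time.Now().UnixNano())
	for _, p := range Path(r, Point{100, 100}, Point{640, 380}, 8, 60, 1.5) {
		fmt.Printf("(%.0f, %.0f) ", p.X, p.Y)
	}
	fmt.Println("\nnext key in", KeyDelay(r))
}
```

Each point would then be handed to the platform's pointer primitive with a short sleep between moves; that OS-specific part is not shown here.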
Each iteration mirrors how a person works at a computer — observe, decide, act, verify.
Take a screenshot of the live screen.
The LLM reads the screen and returns a structured action.
The whitelist approves, denies, or pauses for the user.
Human-like mouse and keyboard input executes it.
Screenshot, decision, and result are logged.
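The five stages above can be sketched as a single loop iteration in Go. Everything here is illustrative: the Agent struct, its hook fields, and the Verdict values are hypothetical names chosen for the sketch, and the "pause for the user" case is simplified to a refusal instead of an interactive prompt.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a structured action returned by the model.
type Action struct{ Type, Arg string }

// Verdict is the whitelist's decision for one action.
type Verdict int

const (
	Allow Verdict = iota
	Deny
	Ask
)

// ErrDenied is returned when the gate does not allow an action.
var ErrDenied = errors.New("action not allowed by whitelist")

// Agent wires the stages together; every field is a hypothetical hook.
type Agent struct {
	Observe func() ([]byte, error)                // take a screenshot
	Decide  func(screenshot []byte) (Action, error) // ask the LLM
	Gate    func(Action) Verdict                  // consult the whitelist
	Act     func(Action) error                    // drive mouse and keyboard
	Log     func(a Action, v Verdict, err error)  // record the outcome
}

// Step runs one observe, decide, gate, act, log iteration.
func (ag Agent) Step() error {
	shot, err := ag.Observe()
	if err != nil {
		return err
	}
	act, err := ag.Decide(shot)
	if err != nil {
		return err
	}
	v := ag.Gate(act)
	if v != Allow {
		// Simplification: Ask is treated like Deny instead of prompting.
		ag.Log(act, v, ErrDenied)
		return ErrDenied
	}
	err = ag.Act(act)
	ag.Log(act, v, err)
	return err
}

func main() {
	ag := Agent{
		Observe: func() ([]byte, error) { return []byte("fake screenshot"), nil },
		Decide:  func([]byte) (Action, error) { return Action{"key_press", "Enter"}, nil },
		Gate:    func(Action) Verdict { return Allow },
		Act:     func(a Action) error { fmt.Println("executing", a.Type, a.Arg); return nil },
		Log:     func(a Action, v Verdict, err error) { fmt.Println("logged", a.Type, "err:", err) },
	}
	if err := ag.Step(); err != nil {
		fmt.Println("step failed:", err)
	}
}
```

Keeping the gate between Decide and Act means the model's output is never executed directly, which is the property the loop description relies on.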
The agent reads untrusted pixels and can drive real input and shell commands — so containment is built into every layer, not bolted on.
Every action — replayed, agent-driven, or via MCP — clears an allow / deny / ask glob-rule engine before it executes.
During replay the model only sees the screen at genuine branch points, scoped to the recording — on-screen text can't hijack the run.
Shell actions run as explicit argv, never re-parsed by a shell — removing the classic command-injection surface.
The server binds 127.0.0.1; actions stream to an append-only redacted NDJSON audit; secrets are AES-256-GCM encrypted at rest.
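A rough idea of what an allow / deny / ask glob-rule check might look like, using only path.Match from the Go standard library. The Rule type, the precedence (deny beats ask, ask beats allow), and the default-deny fallback are assumptions made for this sketch; the project's actual rule format and precedence may differ.

```go
package main

import (
	"fmt"
	"path"
)

// Verdict is the outcome of evaluating an action against the rules.
type Verdict string

const (
	Allow Verdict = "allow"
	Deny  Verdict = "deny"
	Ask   Verdict = "ask"
)

// Rule pairs a glob pattern over action names with a verdict.
type Rule struct {
	Pattern string
	Verdict Verdict
}

// Evaluate checks an action name against every rule.
// Assumed precedence: any matching deny wins, then ask, then allow.
// Actions that match no rule are denied, as are malformed patterns.
func Evaluate(rules []Rule, action string) Verdict {
	verdict := Deny
	for _, r := range rules {
		ok, err := path.Match(r.Pattern, action)
		if err != nil {
			return Deny // fail closed on a bad pattern
		}
		if !ok {
			continue
		}
		switch r.Verdict {
		case Deny:
			return Deny
		case Ask:
			verdict = Ask
		case Allow:
			if verdict != Ask {
				verdict = Allow
			}
		}
	}
	return verdict
}

func main() {
	rules := []Rule{
		{"mouse_*", Allow},
		{"type_string", Allow},
		{"shell_command", Ask},
		{"mouse_drag", Deny},
	}
	for _, a := range []string{"mouse_click", "mouse_drag", "shell_command", "key_press"} {
		fmt.Printf("%-14s -> %s\n", a, Evaluate(rules, a))
	}
}
```

Failing closed on unmatched actions and malformed patterns keeps a typo in the rule file from silently widening what the agent may do.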
screenshot, mouse_*, type_string, key_press, shell_command, and more are exposed as MCP tools.
// claude_desktop_config.json
{
  "mcpServers": {
    "perseus": { "command": "perseus", "args": ["mcp"] }
  }
}
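For readers curious what "pure stdlib JSON-RPC over stdio" can look like, here is a heavily simplified Go sketch that reads one JSON-RPC 2.0 message per line from stdin and answers the MCP tools/list method. It is not the project's server: the type and function names are hypothetical, the initialize handshake and tools/call are omitted, and real MCP tool entries also carry a description and an input schema.

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"os"
)

type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Method  string          `json:"method"`
}

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type response struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// Handle processes one request line. ok is false for notifications
// (messages without an id), which must not receive a reply.
func Handle(line []byte, tools []string) (resp response, ok bool) {
	var req request
	if err := json.Unmarshal(line, &req); err != nil {
		return response{JSONRPC: "2.0", ID: json.RawMessage("null"),
			Error: &rpcError{-32700, "parse error"}}, true
	}
	if len(req.ID) == 0 {
		return response{}, false
	}
	resp = response{JSONRPC: "2.0", ID: req.ID}
	switch req.Method {
	case "tools/list":
		list := make([]map[string]any, 0, len(tools))
		for _, name := range tools {
			list = append(list, map[string]any{"name": name})
		}
		resp.Result = map[string]any{"tools": list}
	default:
		resp.Error = &rpcError{-32601, "method not found"}
	}
	return resp, true
}

// Serve reads newline-delimited requests and writes one response per line.
func Serve(in io.Reader, out io.Writer, tools []string) error {
	sc := bufio.NewScanner(in) // default 64 KiB line limit; enough for a sketch
	enc := json.NewEncoder(out) // Encode appends a newline after each value
	for sc.Scan() {
		if resp, ok := Handle(sc.Bytes(), tools); ok {
			if err := enc.Encode(resp); err != nil {
				return err
			}
		}
	}
	return sc.Err()
}

func main() {
	tools := []string{"screenshot", "type_string", "key_press", "shell_command"}
	if err := Serve(os.Stdin, os.Stdout, tools); err != nil {
		os.Exit(1)
	}
}
```

With the config above, the MCP client launches the binary itself and talks to it over stdin and stdout, so no network port is involved in this transport.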
Requires Go 1.22+. No other toolchain, runtime, or package manager.
# Download a prebuilt binary from the releases page,
# or build from source (Go 1.22+, zero dependencies):
git clone https://github.com/DebajyotiSaikia/computer-use
cd computer-use
go build -trimpath -ldflags="-s -w" -o perseus .

# Start the agent + local web UI
./perseus start

# Authenticate a provider
./perseus auth anthropic

# Or expose your desktop to any MCP client
./perseus mcp