--navigator-type
Set the LLM that powers the navigator. The value is a
provider connection string, the same
format the llm formatter uses — a hosted provider, or ollama for a model running
locally. There is no default: --navigate needs a backend, supplied by
--navigator-type or reused from --llm-format-model when only that is set.
zshot --navigate "Click the News link" \
--navigator-type "ollama://localhost:11434?model=gemma4" \
https://example.comzshot --navigate "Click the News link" \
--navigator-type "anthropic://$ANTHROPIC_API_KEY@?model=claude-haiku-4-5" \
https://example.comWith ANTHROPIC_API_KEY already exported, a bare provider name picks the
provider’s default model and reads the key from the environment:
zshot --navigate "Click the News link" \
--navigator-type anthropic \
https://example.comSee Providers for the connection-string format, the supported providers, the per-provider key env vars, and query options.
stdio: drive the navigator from an external agent
The stdio protocol hands the navigator’s decisions to an external agent — a
program or a person at a terminal — instead of an LLM. Each turn zshot writes a
plain-text frame to the agent and reads back the agent’s reply; the agent plays
exactly the role the model otherwise would.
zshot --navigate "Reach the dashboard" \
--navigator-type "stdio" \
-f shot.png https://example.comEndpoints:
stdio— read replies from stdin, write prompts to stdout.stdio://3— one bidirectional fd (e.g. asocketpairthe parent passes in).stdio://?rfd=3&wfd=4— separate inherited read and write fds.stdio://?read=/in.fifo&write=/out.fifo— named pipes (FIFOs). Use FIFOs, not regular files: a plain file gives the reader EOF instead of blocking for the reply.
Frame protocol. zshot writes the rendered prompt, then any screenshot — written
to a per-session temp file and referenced by an <img src="file://…"> line
(vision is on by default; opt out with ?vision=false) — then a <<END>> line.
Referencing the image by path keeps the agent’s context small and lets a
co-located agent open the file only when it needs the pixels; the temp directory
is removed when the session ends. The agent replies with text in the navigator
action language (CLICK 42, FILL …, WAIT 5, READ, DONE, IMPOSSIBLE, RESTART),
terminated by a <<END>> line or a blank line — the blank line makes interactive
use easy: type your action and press Enter on an empty line. zshot sends a short
protocol preamble on the first frame so the agent knows the framing is required.
Limits. stdio is Unix-only and Pro-tier. It is rejected in HTTP server mode.
In MCP server mode it may not use fd 0/1 (reserved for JSON-RPC) — pass an
explicit inherited fd. Bare stdio writes to stdout, so it cannot be combined
with -f -. It requires exactly one URL: a single channel can’t serve concurrent
sessions without interleaving frames. When the navigator and formatter both use
stdio they share one open channel for the run, and each request/reply is one
atomic exchange. ?timeout=<secs> bounds the wait for each reply (default 60s,
0 disables) so a stalled agent fails the step rather than hanging the run.