--navigator-type

Value: CONFIG
License: PRO

Set the LLM that powers the navigator. The value is a provider connection string in the same format the LLM formatter uses: a hosted provider, or ollama for a model running locally. There is no default. --navigate needs a backend, which comes from --navigator-type or, when only --llm-format-model is set, is reused from that.

zshot --navigate "Click the News link" \
  --navigator-type "ollama://localhost:11434?model=gemma4" \
  https://example.com
zshot --navigate "Click the News link" \
  --navigator-type "anthropic://$ANTHROPIC_API_KEY@?model=claude-haiku-4-5" \
  https://example.com

With ANTHROPIC_API_KEY already exported, a bare provider name picks the provider’s default model and reads the key from the environment:

zshot --navigate "Click the News link" \
  --navigator-type anthropic \
  https://example.com
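If the same script should run both with and without a hosted key, a small bash helper can pick the value. This is a sketch, assuming you want the hosted model whenever ANTHROPIC_API_KEY is exported and a local Ollama model otherwise; `navigator_backend` is a hypothetical name, and both connection strings are the ones from the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a --navigator-type value, preferring Anthropic
# when ANTHROPIC_API_KEY is exported and falling back to a local Ollama model.
navigator_backend() {
  if [[ -n ${ANTHROPIC_API_KEY:-} ]]; then
    # Bare provider name: default model, key read from the environment.
    printf 'anthropic\n'
  else
    printf 'ollama://localhost:11434?model=gemma4\n'
  fi
}
```

Usage: zshot --navigate "Click the News link" --navigator-type "$(navigator_backend)" https://example.com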

See Providers for the connection-string format, the supported providers, the per-provider key env vars, and query options.

stdio: drive the navigator from an external agent

The stdio protocol hands the navigator’s decisions to an external agent, either a program or a person at a terminal, instead of an LLM. Each turn, zshot writes a plain-text frame to the agent and reads back its reply; the agent plays exactly the role the model otherwise would.

zshot --navigate "Reach the dashboard" \
  --navigator-type "stdio" \
  -f shot.png https://example.com

Endpoints:

  • stdio — read replies from stdin, write prompts to stdout.
  • stdio://3 — one bidirectional fd (e.g. a socketpair the parent passes in).
  • stdio://?rfd=3&wfd=4 — separate inherited read and write fds.
  • stdio://?read=/in.fifo&write=/out.fifo — named pipes (FIFOs). Use FIFOs, not regular files: a plain file gives the reader EOF instead of blocking for the reply.
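A minimal bash sketch of wiring up the FIFO endpoint on Linux. It assumes, by analogy with bare stdio, that zshot reads replies from the read= pipe and writes prompts to the write= pipe, and that the temp-directory paths need no URL escaping. `make_fifo_pair`, `fifo_endpoint`, `run_with_fifos`, and the agent program passed in are all hypothetical. Since the order in which zshot opens the two pipes is not documented, the sketch opens the agent's reply pipe read-write (`1<>`), which never blocks on Linux, so neither side can deadlock waiting for the other.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the agent program are hypothetical.

# Create the two named pipes in DIR: in.fifo carries replies to zshot,
# out.fifo carries prompts from zshot.
make_fifo_pair() {
  mkfifo "$1/in.fifo" "$1/out.fifo"
}

# Print the --navigator-type value that points zshot at those pipes.
fifo_endpoint() {
  printf 'stdio://?read=%s/in.fifo&write=%s/out.fifo\n' "$1" "$1"
}

# Run one navigation with AGENT answering over FIFOs in a fresh temp dir.
run_with_fifos() {
  local agent=$1 goal=$2 url=$3 dir pid rc
  dir=$(mktemp -d) || return
  make_fifo_pair "$dir" || return
  zshot --navigate "$goal" \
    --navigator-type "$(fifo_endpoint "$dir")" \
    -f shot.png "$url" &
  pid=$!
  # 1<> opens the reply pipe read-write; on Linux this does not block,
  # so it works whichever pipe zshot opens first (POSIX leaves it undefined).
  "$agent" 1<>"$dir/in.fifo" <"$dir/out.fifo"
  wait "$pid"
  rc=$?
  rm -rf "$dir"
  return "$rc"
}
```

Usage: run_with_fifos ./agent "Reach the dashboard" https://example.com, where ./agent is any program that speaks the frame protocol. On a system without Linux's read-write FIFO behavior, open the two pipes in the same order zshot does instead.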

Frame protocol. Each frame zshot writes contains the rendered prompt, then the screenshot if there is one, then a <<END>> line. The screenshot is written to a per-session temp file and referenced by an <img src="file://…"> line; vision is on by default, and ?vision=false turns it off. Referencing the image by path keeps the agent’s context small and lets a co-located agent open the file only when it needs the pixels. The temp directory is removed when the session ends.

The agent replies with text in the navigator action language (CLICK 42, FILL …, WAIT 5, READ, DONE, IMPOSSIBLE, RESTART), terminated by a <<END>> line or a blank line. The blank-line terminator makes interactive use easy: type your action, then press Enter on an empty line. On the first frame, zshot also sends a short protocol preamble so the agent knows the framing is required.
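A minimal agent loop in bash, as a sketch of the framing described above. It assumes each frame ends with a line that is exactly <<END>> and that the screenshot line has the form <img src="file://PATH"> with no double quote inside PATH. `run_agent` and `choose_action` are hypothetical names, and the policy (DONE when the frame mentions "dashboard", IMPOSSIBLE otherwise) is only a placeholder for real decision logic.

```shell
#!/usr/bin/env bash
# Sketch of a stdio agent: read frames until "<<END>>", answer with one action.

# Placeholder policy (hypothetical): give up unless the frame mentions "dashboard".
choose_action() {
  local frame=$1 img=$2
  # A co-located agent could open "$img" here; this sketch only logs it.
  [[ -n $img ]] && printf 'screenshot: %s\n' "$img" >&2
  if [[ $frame == *dashboard* ]]; then
    printf 'DONE\n'
  else
    printf 'IMPOSSIBLE\n'
  fi
}

run_agent() {
  local line frame="" img=""
  local img_re='<img src="file://([^"]+)"'
  while IFS= read -r line; do
    if [[ $line == '<<END>>' ]]; then
      # End of frame: reply, then terminate the reply with our own <<END>>.
      choose_action "$frame" "$img"
      printf '<<END>>\n'
      frame="" img=""
      continue
    fi
    # Remember the screenshot path if this line references one.
    if [[ $line =~ $img_re ]]; then
      img=${BASH_REMATCH[1]}
    fi
    frame+=$line$'\n'
  done
}
```

To use it as the agent program, source the file and call run_agent, or add run_agent as the script's last line; it reads frames on stdin and writes replies on stdout, so it can sit behind any of the endpoints listed above.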

Limits.

  • stdio is Unix-only and requires the Pro tier.
  • It is rejected in HTTP server mode.
  • In MCP server mode it may not use fd 0/1, which are reserved for JSON-RPC; pass an explicit inherited fd instead.
  • Bare stdio writes to stdout, so it cannot be combined with -f -.
  • It requires exactly one URL, because a single channel can’t serve concurrent sessions without interleaving frames.
  • When the navigator and formatter both use stdio, they share one open channel for the run, and each request/reply is one atomic exchange.
  • ?timeout=<secs> bounds the wait for each reply (default 60s; 0 disables it), so a stalled agent fails the step rather than hanging the run.