--archive-body-size-limit | Maximum per-resource body size captured into an archive STD |
--archive-cdx | Path to write a CDXJ index of the WARC's records STD |
--archive-spill-max-bytes | Cap the total bytes a crawl or sitemap walk spills to disk PRO |
--crawl-allow-url | Re-permit a URL that a --crawl-deny-url glob blocked PRO |
--crawl-delay-cap | Ceiling in seconds for a robots.txt Crawl-delay during a sitemap walk PRO |
--crawl-deny-url | Skip URLs matching this glob during a sitemap walk or link crawl PRO |
--crawl-link-depth | Hops to follow from the seed when crawling links PRO |
--crawl-link-selector | Additional CSS selector for crawlable links PRO |
--crawl-links | Crawl links discovered on each captured page into the same WARC PRO |
--crawl-max-links | Cap on total pages fetched by a crawl or sitemap walk PRO |
--crawl-media-sources | Fetch <audio>/<video> source URLs on each walked page so deferred media is archived PRO |
--crawl-page-timeout | Per-page capture budget in seconds for WARC output PRO |
--crawl-sitemap-max-depth | Levels of <sitemapindex> to follow when capturing a sitemap PRO |
--crawl-url-is-sitemap | Treat the target as a manifest of URLs to capture into one WARC PRO |
--har | Path to write a diagnostic HAR (HTTP Archive) file STD |
--har-capture-bodies | HAR will contain response bodies STD |
--har-captures-navigation | HAR will contain the navigator session PRO |
--warc | Path to write a WARC for this request STD |
--warc-captures-navigation | WARC will contain the navigator session STD |
--warc-no-gzip | Disable gzip compression for output WARC STD |
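
The interaction between --crawl-deny-url and --crawl-allow-url described above can be sketched as a two-step check: a URL is skipped if any deny glob matches it, unless an allow entry re-permits it. This is only an illustration, not the tool's implementation. The function name `is_crawlable` is hypothetical, and both the use of shell-style globs via Python's `fnmatch` and the treatment of allow entries as globs are assumptions; the tool's actual glob dialect may differ.

```python
from fnmatch import fnmatchcase


def is_crawlable(url, deny_globs, allow_globs):
    """Hypothetical helper: apply deny globs first, then allow overrides.

    Assumes shell-style globs where `*` also matches `/`; the real tool
    may use a different pattern syntax.
    """
    # Step 1: a URL is blocked only if at least one deny glob matches it.
    denied = any(fnmatchcase(url, pattern) for pattern in deny_globs)
    if not denied:
        return True
    # Step 2: an allow entry re-permits a URL that a deny glob blocked.
    return any(fnmatchcase(url, pattern) for pattern in allow_globs)


if __name__ == "__main__":
    deny = ["https://example.com/private/*"]
    allow = ["https://example.com/private/press.html"]
    for candidate in (
        "https://example.com/about",
        "https://example.com/private/notes.html",
        "https://example.com/private/press.html",
    ):
        print(candidate, is_crawlable(candidate, deny, allow))
```

Under these assumptions, allow entries have no effect on their own: they only matter for URLs that a deny glob has already matched, which mirrors the "re-permit" wording in the table.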