โ† Back to Articles
General3 min read

mastering-web-fetch-in-openclaw

ClawMakers Team

Mastering Web Fetch in OpenClaw: Efficient Web Content Extraction

The web_fetch tool is a powerful utility in OpenClaw for retrieving and extracting readable content from web pages without the overhead of full browser automation. It's designed for lightweight web research, making it perfect for quickly gathering information from articles, documentation, and other text-heavy pages.

What is web_fetch?

web_fetch is a tool that performs HTTP GET requests and extracts the main readable content from HTML pages, converting it into clean markdown or plain text. Unlike browser automation, web_fetch does not execute JavaScript, making it significantly faster and more resource-efficient for simple content extraction tasks.

The tool uses Readability.js for content extraction by default, which intelligently identifies and extracts the primary content of a web page, stripping away navigation, ads, and other non-essential elements. This makes it ideal for creating summaries, archiving articles, or processing documentation.

Basic Usage

To use web_fetch, you simply provide a URL:

await web_fetch({
  url: "https://docs.openclaw.ai/tools/web.md"
});

This returns the extracted content in markdown format by default. You can also specify plain text extraction:

await web_fetch({
  url: "https://docs.openclaw.ai/tools/web.md",
  extractMode: "text"
});

Configuration Options

web_fetch supports several configuration parameters to customize its behavior:

  • url: The HTTP or HTTPS URL to fetch (required)
  • extractMode: Either "markdown" or "text" (default: "markdown")
  • maxChars: Maximum number of characters to return (useful for truncating lengthy pages)
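These parameters compose naturally. The sketch below shows a call combining extractMode and maxChars; web_fetch here is a local stub standing in for the tool the OpenClaw runtime provides, and the `{ url, content }` return shape is an assumption made for illustration:

```javascript
// Stub standing in for the OpenClaw-provided web_fetch tool, so this
// snippet runs on its own; the real tool performs an HTTP GET and
// extracts readable content from the response.
async function web_fetch({ url, extractMode = "markdown", maxChars }) {
  const markdown = `# Example page\n\nContent fetched from ${url} ...`;
  // "text" mode strips markdown heading markers in this toy version.
  const content =
    extractMode === "text" ? markdown.replace(/^#+\s*/gm, "") : markdown;
  return { url, content: maxChars ? content.slice(0, maxChars) : content };
}

// Fetch plain text and cap the result at 500 characters.
const result = await web_fetch({
  url: "https://docs.openclaw.ai/tools/web.md",
  extractMode: "text",
  maxChars: 500,
});
```

Capping with maxChars is cheap insurance when you feed the result into a context-limited pipeline.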

Advanced Configuration

For more granular control, you can configure web_fetch in your OpenClaw configuration:

{
  tools: {
    web: {
      fetch: {
        enabled: true,
        maxChars: 50000,
        maxCharsCap: 50000,
        maxResponseBytes: 2000000,
        timeoutSeconds: 30,
        cacheTtlMinutes: 15,
        maxRedirects: 3,
        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        readability: true,
        firecrawl: {
          enabled: true,
          apiKey: "FIRECRAWL_API_KEY_HERE",
          baseUrl: "https://api.firecrawl.dev",
          onlyMainContent: true,
          maxAgeMs: 86400000,
          timeoutSeconds: 60,
        },
      },
    },
  },
}

Firecrawl Integration

When configured, web_fetch can use Firecrawl as a fallback for pages where Readability fails. Firecrawl provides advanced scraping capabilities, including JavaScript execution and bot circumvention, making it suitable for more complex sites.

To enable Firecrawl, set the API key in your configuration. Firecrawl requests are cached by default, reducing repeated API calls for the same content.
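Conceptually, the fallback behaves like the sketch below. This is not OpenClaw's actual implementation; extractWithReadability and fetchWithFirecrawl are hypothetical names, and both are stubbed so the example is self-contained:

```javascript
// Hypothetical sketch of a Readability-first, Firecrawl-fallback strategy.
async function extractWithReadability(url) {
  // Stub: pretend Readability fails on JavaScript-heavy pages.
  if (url.includes("spa")) throw new Error("no readable content found");
  return `Readable content of ${url}`;
}

async function fetchWithFirecrawl(url) {
  // Stub standing in for a call to the Firecrawl scraping API.
  return `Firecrawl-rendered content of ${url}`;
}

async function fetchContent(url) {
  try {
    return await extractWithReadability(url);
  } catch {
    // Readability could not extract anything; fall back to Firecrawl.
    return await fetchWithFirecrawl(url);
  }
}
```

The try/catch ordering keeps the fast, local Readability path as the default and only spends a Firecrawl API call when extraction genuinely fails.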

When to Use web_fetch vs Browser Automation

Choose web_fetch when:

  • You need to extract content from static HTML pages
  • Performance and speed are critical
  • You're processing multiple articles or documentation pages
  • JavaScript execution is not required to access the content

Use browser automation when:

  • The content is loaded dynamically via JavaScript
  • You need to interact with page elements (click, type, etc.)
  • Authentication or complex navigation is required
  • You're dealing with single-page applications

Best Practices

  1. Check cache first: Results are cached for 15 minutes by default, reducing unnecessary network requests
  2. Set character limits: Use maxChars to prevent memory issues when processing exceptionally long pages
  3. Handle failures gracefully: Some sites may not be parseable; have fallback strategies
  4. Respect robots.txt: Always consider the website's terms of service and crawling policies
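The 15-minute cache mentioned in practice 1 behaves conceptually like a TTL map keyed on URL. A simplified sketch (not OpenClaw's actual cache code; the helper names are illustrative):

```javascript
// Minimal TTL cache sketch: entries expire 15 minutes after they were
// stored, mirroring the default cacheTtlMinutes setting.
const CACHE_TTL_MS = 15 * 60 * 1000;
const cache = new Map();

function putCached(url, content, now = Date.now()) {
  cache.set(url, { content, fetchedAt: now });
}

function getCached(url, now = Date.now()) {
  const entry = cache.get(url);
  // A hit counts only while the entry is younger than the TTL.
  if (entry && now - entry.fetchedAt < CACHE_TTL_MS) return entry.content;
  return null;
}
```

Passing `now` explicitly keeps the sketch testable; a real cache would also evict expired entries to bound memory.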

Troubleshooting

If web_fetch returns an error:

  1. Verify the URL is accessible and uses HTTP/HTTPS
  2. Check if the page requires JavaScript to display content
  3. Test with browser automation as an alternative
  4. Ensure Firecrawl is configured if needed for complex sites

The web_fetch tool strikes an excellent balance between simplicity and functionality, making it an essential component of any OpenClaw workflow that involves web content extraction.
