pi_mcps/docs/wiki/pages/mcp-webscraper.md
Patrick Plate dabdda167f docs(wiki): migrate to git-based workflow with persistent wiki/ clone
- Extract all wiki content from create_wiki_pages.py into docs/wiki/pages/*.md
- Add docs/wiki/deploy_wiki.sh: copies pages to wiki/ repo, commits, pushes
- Add /wiki/ to .gitignore (anchored — does not affect docs/wiki/)
- 12 pages: Home, MCP-Servers-Overview, mcp-image-gen, ComfyUI-Setup,
  mcp-webscraper (8 tools incl. search_hint), BigMind (schema v8),
  Development-Conventions, Java-Projects, Java-wellmann-shop,
  Java-mss-failsafe, Java-Architecture, _Sidebar
- Workflow: edit docs/wiki/pages/*.md → ./docs/wiki/deploy_wiki.sh
2026-04-05 09:48:19 +02:00

🕸️ mcp-webscraper — Web Scraping

mcp-webscraper is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and runs web searches via Brave Search.

Tools

| Tool | Description |
| --- | --- |
| webscraper_fetch(url, max_chars=5000) | Title + full page as Markdown + metadata |
| webscraper_fetch_links(url, deduplicate=True) | All href links found on the page |
| webscraper_fetch_tables(url) | All HTML tables converted to Markdown |
| webscraper_fetch_all(url, max_chars=5000) | Everything in one call (fetch + links + tables + meta) |
| webscraper_fetch_section(url, selector) | Specific CSS selector section only |
| webscraper_fetch_meta(url) | Title, description, Open Graph tags |
| webscraper_fetch_sitemap(url, max_urls=100) | Parse sitemap.xml, return URL list |
| webscraper_search_hint(query, max_results=5) | Brave Search: top URLs + snippets for a query |
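
To illustrate what the deduplicate flag of webscraper_fetch_links might do, here is a minimal sketch of order-preserving link extraction. It uses only the standard library as a stand-in for the BeautifulSoup4 + lxml parser the server uses; LinkCollector, extract_links, and SAMPLE are hypothetical names, and the real tool may resolve or filter links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/docs" into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list:
    """Return all links in page order, optionally without repeats."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    if not deduplicate:
        return collector.links
    # dict.fromkeys keeps the first occurrence of each URL, in page order.
    return list(dict.fromkeys(collector.links))


# Hypothetical sample page: two links resolve to the same URL.
SAMPLE = (
    '<a href="/docs">Docs</a> '
    '<a href="https://example.com/docs">Docs again</a> '
    '<a href="#top">Top</a>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE, "https://example.com/"))
```

Deduplicating after URL resolution means a relative and an absolute link to the same page count as one entry.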

Stack

  • HTTP client: httpx (async, with SSL support, Chrome/Linux User-Agent)
  • HTML parser: BeautifulSoup4 + lxml
  • Markdown converter: html2text
  • Search backend: Brave Search (search.brave.com) — works without CAPTCHA
  • SSL: Custom cert bundle for Fedora 43 compatibility
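
The parse-then-convert pipeline in this stack can be sketched for the table case. The example below is a standard-library approximation, not the server's code: the server uses BeautifulSoup4 + lxml and html2text, while TableCollector, rows_to_markdown, and html_tables_to_markdown are hypothetical names. The sketch assumes reasonably well-formed markup and does not handle nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell texts."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def _close_cell(self) -> None:
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace so each cell fits on one line.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None


def rows_to_markdown(rows: list) -> str:
    """Render rows as a Markdown pipe table; the first row becomes the header."""
    rows = [row for row in rows if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows and escape pipes so cell text cannot break the table.
    padded = [
        [cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def html_tables_to_markdown(html: str) -> list:
    """Return one Markdown table string per non-empty HTML table."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return [rows_to_markdown(table) for table in collector.tables if table]


if __name__ == "__main__":
    page = (
        "<table><tr><th>Tool</th><th>Use</th></tr>"
        "<tr><td>fetch</td><td>a | b</td></tr></table>"
    )
    print(html_tables_to_markdown(page)[0])
```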

Search Hint Strategy

webscraper_search_hint uses Brave Search because the alternatives reject plain HTTP clients:

  • Brave Search returns real results without a CAPTCHA or consent wall
  • Google answers plain HTTP requests with a 302 redirect to a consent page
  • DuckDuckGo blocks such requests with a CAPTCHA

Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.

# Get top 5 results for a query
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
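
One way a caller could enforce the "once per research task" guideline is to memoize search results for the duration of a task. The sketch below is a client-side idea, not part of mcp-webscraper: SearchHintCache and fake_search are hypothetical, and fake_search stands in for whatever function actually invokes webscraper_search_hint.

```python
from typing import Callable


class SearchHintCache:
    """Call the search backend at most once per distinct (query, max_results)."""

    def __init__(self, search: Callable) -> None:
        self._search = search
        self._results = {}

    def __call__(self, query: str, max_results: int = 5):
        # Normalise case and whitespace so trivially different spellings
        # of the same query share one cache entry.
        key = (" ".join(query.lower().split()), max_results)
        if key not in self._results:
            self._results[key] = self._search(query, max_results)
        return self._results[key]


# Hypothetical stand-in backend that records how often it is called.
calls = []


def fake_search(query: str, max_results: int) -> list:
    calls.append(query)
    return [{"url": "https://example.com/", "snippet": query}][:max_results]


search_hint = SearchHintCache(fake_search)
search_hint("FastMCP tool decorator syntax")
search_hint("fastmcp   tool decorator syntax")  # served from the cache

if __name__ == "__main__":
    print(len(calls))
```

The cache lives only as long as the SearchHintCache object, so creating one per research task keeps results from going stale across tasks.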

SSL Note — Fedora 43 Comodo Root CA

Fedora 43 is missing the Comodo AAA Services Root CA needed for Cloudflare-protected sites. The missing certificate is bundled at mcp/webscraper/certs/comodo-aaa-services-root.pem.

The server automatically uses this cert bundle — no manual configuration needed.
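
For readers who want to reproduce this behaviour in their own client, the following is a minimal sketch of adding one extra root CA on top of the system trust store with Python's ssl module. How the server itself wires this up is not shown on this page; build_ssl_context is a hypothetical helper, and skipping a missing file silently is an assumption of this sketch.

```python
import ssl
from pathlib import Path

# Path quoted from the note above, relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Return a verifying context: system trust store plus one extra root CA."""
    context = ssl.create_default_context()  # loads the system CA certificates
    if extra_ca.is_file():
        # Adds to the trust store; the system roots remain trusted.
        context.load_verify_locations(cafile=str(extra_ca))
    return context


context = build_ssl_context()

if __name__ == "__main__":
    # Certificate verification and hostname checks stay enabled.
    print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

httpx accepts an ssl.SSLContext through its verify argument, so a context like this could be passed as httpx.AsyncClient(verify=context).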

Quick Start

cd mcp/webscraper
uv sync
uv run python src/server.py

Run Tests

cd mcp/webscraper
uv run pytest tests/ -v
# 23/23 tests passing

Usage Examples

# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from a Gitea repo page
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Quick search orientation
webscraper_search_hint("Gitea wiki git clone", max_results=3)
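
The examples above do not cover webscraper_fetch_sitemap. As a rough illustration of what parsing sitemap.xml with a max_urls cap involves, here is a standard-library sketch; sitemap_urls and SAMPLE_SITEMAP are hypothetical, and the real tool may treat sitemap index files or extension namespaces differently.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text: str, max_urls: int = 100) -> list:
    """Return up to max_urls <loc> values from a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        if len(urls) >= max_urls:
            break
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap used only for this demonstration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sitemap_urls(SAMPLE_SITEMAP, max_urls=2))
```

Matching on the local tag name also picks up the loc entries of a sitemap index, which point at further sitemaps rather than at pages.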