docs: promote webscraper_search_hint in wiki and mode rules

This commit is contained in:
Patrick Plate
2026-04-05 10:11:33 +02:00
parent 4202094f01
commit 4107b8ede2
3 changed files with 193 additions and 9 deletions
+62 -9
View File
@@ -25,20 +25,70 @@
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
- **SSL:** Custom cert bundle for Fedora 43 compatibility
## Search Hint Strategy
---
`webscraper_search_hint` uses Brave Search because:
## 🔍 Search: The Two-Step Research Pattern
`webscraper_search_hint` is the **entry point for all web research**. The recommended workflow is:
```
Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url) → get full page content
```
This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.
### Why Brave Search?
`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
- ✅ Returns real results without CAPTCHA or consent walls
- ✅ No API key required — works with plain HTTP GET
- ✅ Handles special characters (C++, &, %, etc.) via URL encoding
- ❌ Google blocks plain HTTP with 302 consent redirect
- ❌ DuckDuckGo blocks with CAPTCHA
Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.
### Return Value
The tool returns a structured dict:
```json
{
"query": "FastMCP tool decorator",
"search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
"result_count": 5,
"hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
"results": [
{
"title": "FastMCP Docs",
"url": "https://docs.fastmcp.dev",
"snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
},
...
]
}
```
The `hint` field is a pipe-separated string of `"Title (url): snippet[:120]"` entries — immediately actionable for deciding which URL to fetch next.
### Example: Two-Step Research Flow
```python
# Get top 5 results for a query
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
# Step 1: Orient — what pages exist about this topic?
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."
# Step 2: Deep-dive the most relevant result
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
```
### Known Limitations
- **Reddit / Stack Overflow snippets** may be empty — these platforms block snippet extraction
- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
- **Use sparingly** — once per research task to get oriented, not for every query
---
## SSL Note — Fedora 43 Comodo Root CA
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
@@ -58,13 +108,16 @@ uv run python src/server.py
```bash
cd mcp/webscraper
uv run pytest tests/ -v
# 23/23 tests passing
# 28/28 tests passing
```
## Usage Examples
```python
# Fetch a page as Markdown
# Step 1: Search — get candidate URLs for a topic
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
# Step 2: Deep-dive the most relevant URL
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
# Extract all links from Gitea repo
@@ -79,6 +132,6 @@ webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")
# Quick search orientation
webscraper_search_hint("Gitea wiki git clone", max_results=3)
# Search with special characters (C++, &, % all work)
webscraper_search_hint("C++ std::optional usage", max_results=3)
```