docs: make example of using a Filter with DefaultMarkdownGenerator directly runnable #1465

fardhanrasya wants to merge 1 commit into unclecode:main

Conversation

…directly runnable with asyncio
Walkthrough

Documentation examples were updated to show an asynchronous usage pattern: imports now include asyncio and CacheMode; the logic is wrapped in an async main() invoked via asyncio.run(main()); AsyncWebCrawler is used within an async context manager; and CrawlerRunConfig demonstrates cache_mode=CacheMode.BYPASS. No library APIs changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Script as Script (__main__)
    participant Crawler as AsyncWebCrawler
    participant Site as Target Website
    User->>Script: Run `python script.py`
    Script->>Script: asyncio.run(main())
    Script->>Crawler: async with AsyncWebCrawler()
    Crawler->>Site: Fetch & process content (async)
    Site-->>Crawler: HTML/Content
    Crawler-->>Script: Result (raw/fit markdown)
    Script->>User: Print lengths and exit
```
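The flow in the diagram is plain asyncio; a minimal standalone sketch of the same pattern is below, with a hypothetical fetch() stub standing in for the crawler call (no crawl4ai import needed to see the structure):

```python
import asyncio

async def fetch(url: str) -> str:
    # stand-in for an async crawl; a real script would await crawler.arun(url)
    await asyncio.sleep(0)
    return f"<html>content of {url}</html>"

async def main() -> str:
    # all async work lives inside main(), entered exactly once via asyncio.run()
    return await fetch("https://example.com")

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The key point the PR makes is that without the `asyncio.run(main())` entry point, the coroutine is created but never executed.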
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/docker/c4ai-doc-context.md (1)
4895-4899: Add a success check and guard fit_markdown to avoid None errors

If the crawl fails, or fit_markdown is None, the current prints can raise. A tiny guard improves resilience without distracting from the example.
```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun("https://news.ycombinator.com", config=config)
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        if result.success and result.markdown:
+            print("Raw Markdown length:", len(result.markdown.raw_markdown))
+            fit = result.markdown.fit_markdown or ""
+            print("Fit Markdown length:", len(fit))
+        else:
+            print(f"Error: {result.error_message}")
```
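The suggested guard can be exercised without a network call. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for the crawl result; the attribute names (success, markdown, error_message) are taken from the suggestion above:

```python
from types import SimpleNamespace

def report(result) -> list[str]:
    # guard from the review: check success and tolerate a None fit_markdown
    if result.success and result.markdown:
        fit = result.markdown.fit_markdown or ""
        return [
            f"Raw Markdown length: {len(result.markdown.raw_markdown)}",
            f"Fit Markdown length: {len(fit)}",
        ]
    return [f"Error: {result.error_message}"]

# fit_markdown is None here, but the `or ""` fallback keeps len() safe
ok = SimpleNamespace(
    success=True,
    markdown=SimpleNamespace(raw_markdown="# Hi there", fit_markdown=None),
)
failed = SimpleNamespace(success=False, markdown=None, error_message="timeout")
print(report(ok))
print(report(failed))
```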
📒 Files selected for processing (2)

- deploy/docker/c4ai-doc-context.md (1 hunks)
- docs/md_v2/core/quickstart.md (1 hunks)
🔇 Additional comments (2)
docs/md_v2/core/quickstart.md (1)
62-80: Good async pattern and config usage. Wrapping in async def main() and using asyncio.run(main()) is correct; the imports and the CacheMode.BYPASS example look consistent with prior notes.

deploy/docker/c4ai-doc-context.md (1)

4880-4902: Runnable async example looks good. Importing asyncio, wrapping the logic in async main(), and using asyncio.run(main()) makes the snippet copy-paste runnable. Using CacheMode.BYPASS here is also appropriate for fresh content.
```diff
+import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

-md_generator = DefaultMarkdownGenerator(
-    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
-)
+async def main():
+    md_generator = DefaultMarkdownGenerator(
+        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+    )

-config = CrawlerRunConfig(
-    cache_mode=CacheMode.BYPASS,
-    markdown_generator=md_generator
-)
+    config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        markdown_generator=md_generator
+    )

-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun("https://news.ycombinator.com", config=config)
-    print("Raw Markdown length:", len(result.markdown.raw_markdown))
-    print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        print("Raw Markdown length:", len(result.markdown.raw_markdown))
+        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+
+if __name__ == "__main__":
+    asyncio.run(main())
```
💡 Verification agent
🧩 Analysis chain
Make the snippet resilient to both string and structured result.markdown shapes.
Elsewhere in the doc, result.markdown is treated like a string, but here it’s accessed as an object with raw_markdown/fit_markdown. To avoid version drift and keep this runnable across releases, compute lengths via getattr fallback.
Apply this diff:

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also consider aligning the earlier "Your First Crawl" example to use the same shape (or add a short note clarifying the return type), so readers don't get conflicting guidance. I can open a follow-up PR if you'd like.
🏁 Script executed:

```shell
#!/bin/bash
set -e
echo "Searching for AsyncWebCrawler class definition..."
grep -R "class AsyncWebCrawler" -n .
echo "Searching for arun method definition..."
grep -R "async def arun" -n .
echo "Searching for Result class definitions..."
grep -R "class .*Result" -n .
echo "Searching for .markdown usage in code (in crawl4ai folder if exists)..."
if [ -d "crawl4ai" ]; then
  grep -R "\.markdown" -n crawl4ai || echo "No .markdown in crawl4ai folder"
else
  echo "No crawl4ai directory, searching entire repo..."
  grep -R "\.markdown" -n .
fi
echo "Locating 'Your First Crawl' in docs..."
grep -R "Your First Crawl" -n docs/md_v2/core/quickstart.md
echo "Locating 'result.markdown' in docs..."
grep -R "result.markdown" -n docs/md_v2/core/quickstart.md
```

Length of output: 282964
Make the snippet resilient to both string and structured result.markdown

File: docs/md_v2/core/quickstart.md (lines 100-122) — CrawlResult.markdown can be a plain/str-subclass or a MarkdownGenerationResult object; use a getattr fallback.

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also align the earlier "Your First Crawl" example (or add a short note clarifying the return shape) to avoid conflicting guidance.
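The getattr fallback is easy to verify in isolation. Below is a small sketch; SimpleNamespace is used here as a hypothetical stand-in for MarkdownGenerationResult, so the same helper handles both shapes:

```python
from types import SimpleNamespace

def markdown_lengths(markdown):
    # works whether `markdown` is a plain string or an object exposing
    # raw_markdown / fit_markdown attributes: getattr falls back to the
    # string itself when the attribute is absent
    raw = getattr(markdown, "raw_markdown", markdown)
    fit = getattr(markdown, "fit_markdown", markdown)
    return len(raw), len(fit)

# string shape: both lengths fall back to the string itself
print(markdown_lengths("# Title"))
# structured shape (stand-in for MarkdownGenerationResult)
print(markdown_lengths(SimpleNamespace(raw_markdown="# Title", fit_markdown="Title")))
```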
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        md = result.markdown
        raw = getattr(md, "raw_markdown", md)
        fit = getattr(md, "fit_markdown", md)
        print("Raw Markdown length:", len(raw))
        print("Fit Markdown length:", len(fit))

if __name__ == "__main__":
    asyncio.run(main())
```
🤖 Prompt for AI Agents
In docs/md_v2/core/quickstart.md around lines 100 to 122, the example assumes
result.markdown is always a MarkdownGenerationResult object and fails if it's a
plain string; update the snippet to handle both shapes by using a getattr
fallback (e.g., obtain raw_markdown = getattr(result.markdown, "raw_markdown",
result.markdown) and fit_markdown = getattr(result.markdown, "fit_markdown",
result.markdown)) before printing lengths, and also adjust the earlier “Your
First Crawl” example or add a short note clarifying that CrawlResult.markdown
may be either a string or a MarkdownGenerationResult object so readers aren’t
given conflicting guidance.
Closing as duplicate of #1715. Thanks for the contribution.
Summary

Update the example of using a Filter with DefaultMarkdownGenerator so it can be copy-pasted and run directly. Previously, the snippet was missing import asyncio and asyncio.run(main()).

List of files changed and why

- deploy/docker/c4ai-doc-context.md – fixed the example by wrapping it in a main() function and adding asyncio.run(main()) to make it runnable.
- docs/md_v2/core/quickstart.md – same as above.

How Has This Been Tested?

python example.py.