docs: make example of using a Filter with DefaultMarkdownGenerator directly runnable #1465

fardhanrasya wants to merge 1 commit into unclecode:main

Conversation

…directly runnable with asyncio
Walkthrough

Documentation examples were updated to show an asynchronous usage pattern: imports now include asyncio and CacheMode; the logic is wrapped in an async main() invoked via asyncio.run(main()); AsyncWebCrawler is used within an async context manager; and CrawlerRunConfig demonstrates cache_mode=CacheMode.BYPASS. No library APIs changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Script as Script (__main__)
    participant Crawler as AsyncWebCrawler
    participant Site as Target Website
    User->>Script: Run `python script.py`
    Script->>Script: asyncio.run(main())
    Script->>Crawler: async with AsyncWebCrawler()
    Crawler->>Site: Fetch & process content (async)
    Site-->>Crawler: HTML/Content
    Crawler-->>Script: Result (raw/fit markdown)
    Script->>User: Print lengths and exit
```
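The flow in the diagram is plain asyncio; a minimal standalone sketch of the same pattern is below, with a hypothetical fetch() stub standing in for the crawler call (no crawl4ai import needed to see the structure):

```python
import asyncio

async def fetch(url: str) -> str:
    # stand-in for an async crawl; a real script would await crawler.arun(url)
    await asyncio.sleep(0)
    return f"<html>content of {url}</html>"

async def main() -> str:
    # all async work lives inside main(), entered exactly once via asyncio.run()
    return await fetch("https://example.com")

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The key point the PR makes is that without the `asyncio.run(main())` entry point, the coroutine is created but never executed.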
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/docker/c4ai-doc-context.md (1)
4895-4899: Add a success check and guard fit_markdown to avoid None errors

If the crawl fails, or fit_markdown is None, the current prints can raise. A tiny guard improves resilience without distracting from the example.
```diff
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun("https://news.ycombinator.com", config=config)
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        if result.success and result.markdown:
+            print("Raw Markdown length:", len(result.markdown.raw_markdown))
+            fit = result.markdown.fit_markdown or ""
+            print("Fit Markdown length:", len(fit))
+        else:
+            print(f"Error: {result.error_message}")
```
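The suggested guard can be exercised without a network call. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for the crawl result; the attribute names (success, markdown, error_message) are taken from the suggestion above:

```python
from types import SimpleNamespace

def report(result) -> list[str]:
    # guard from the review: check success and tolerate a None fit_markdown
    if result.success and result.markdown:
        fit = result.markdown.fit_markdown or ""
        return [
            f"Raw Markdown length: {len(result.markdown.raw_markdown)}",
            f"Fit Markdown length: {len(fit)}",
        ]
    return [f"Error: {result.error_message}"]

# fit_markdown is None here, but the `or ""` fallback keeps len() safe
ok = SimpleNamespace(
    success=True,
    markdown=SimpleNamespace(raw_markdown="# Hi there", fit_markdown=None),
)
failed = SimpleNamespace(success=False, markdown=None, error_message="timeout")
print(report(ok))
print(report(failed))
```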
📒 Files selected for processing (2)

- deploy/docker/c4ai-doc-context.md (1 hunks)
- docs/md_v2/core/quickstart.md (1 hunks)
🔇 Additional comments (2)
docs/md_v2/core/quickstart.md (1)
62-80: Good async pattern and config usage. Wrapping in async def main() and using asyncio.run(main()) is correct; the imports and the CacheMode.BYPASS example look consistent with prior notes.

deploy/docker/c4ai-doc-context.md (1)

4880-4902: Runnable async example looks good. Importing asyncio, wrapping the logic in async main(), and using asyncio.run(main()) makes the snippet copy-paste runnable. Using CacheMode.BYPASS here is also appropriate for fresh content.
```diff
+import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

-md_generator = DefaultMarkdownGenerator(
-    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
-)
+async def main():
+    md_generator = DefaultMarkdownGenerator(
+        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+    )

-config = CrawlerRunConfig(
-    cache_mode=CacheMode.BYPASS,
-    markdown_generator=md_generator
-)
+    config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        markdown_generator=md_generator
+    )

-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun("https://news.ycombinator.com", config=config)
-    print("Raw Markdown length:", len(result.markdown.raw_markdown))
-    print("Fit Markdown length:", len(result.markdown.fit_markdown))
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        print("Raw Markdown length:", len(result.markdown.raw_markdown))
+        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+
+if __name__ == "__main__":
+    asyncio.run(main())
```
💡 Verification agent
🧩 Analysis chain
Make the snippet resilient to both string and structured result.markdown shapes.
Elsewhere in the doc, result.markdown is treated like a string, but here it’s accessed as an object with raw_markdown/fit_markdown. To avoid version drift and keep this runnable across releases, compute lengths via getattr fallback.
Apply this diff:

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also consider aligning the earlier "Your First Crawl" example to use the same shape (or add a short note clarifying the return type), so readers don't get conflicting guidance. I can open a follow-up PR if you'd like.
🏁 Script executed:

```shell
#!/bin/bash
set -e
echo "Searching for AsyncWebCrawler class definition..."
grep -R "class AsyncWebCrawler" -n .
echo "Searching for arun method definition..."
grep -R "async def arun" -n .
echo "Searching for Result class definitions..."
grep -R "class .*Result" -n .
echo "Searching for .markdown usage in code (in crawl4ai folder if exists)..."
if [ -d "crawl4ai" ]; then
  grep -R "\.markdown" -n crawl4ai || echo "No .markdown in crawl4ai folder"
else
  echo "No crawl4ai directory, searching entire repo..."
  grep -R "\.markdown" -n .
fi
echo "Locating 'Your First Crawl' in docs..."
grep -R "Your First Crawl" -n docs/md_v2/core/quickstart.md
echo "Locating 'result.markdown' in docs..."
grep -R "result.markdown" -n docs/md_v2/core/quickstart.md
```

Length of output: 282964
Make the snippet resilient to both string and structured result.markdown

File: docs/md_v2/core/quickstart.md (lines 100-122) — CrawlResult.markdown can be a plain/str-subclass or a MarkdownGenerationResult object; use a getattr fallback.

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also align the earlier "Your First Crawl" example (or add a short note clarifying the return shape) to avoid conflicting guidance.
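The getattr fallback is easy to verify in isolation. Below is a small sketch; SimpleNamespace is used here as a hypothetical stand-in for MarkdownGenerationResult, so the same helper handles both shapes:

```python
from types import SimpleNamespace

def markdown_lengths(markdown):
    # works whether `markdown` is a plain string or an object exposing
    # raw_markdown / fit_markdown attributes: getattr falls back to the
    # string itself when the attribute is absent
    raw = getattr(markdown, "raw_markdown", markdown)
    fit = getattr(markdown, "fit_markdown", markdown)
    return len(raw), len(fit)

# string shape: both lengths fall back to the string itself
print(markdown_lengths("# Title"))
# structured shape (stand-in for MarkdownGenerationResult)
print(markdown_lengths(SimpleNamespace(raw_markdown="# Title", fit_markdown="Title")))
```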
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        md = result.markdown
        raw = getattr(md, "raw_markdown", md)
        fit = getattr(md, "fit_markdown", md)
        print("Raw Markdown length:", len(raw))
        print("Fit Markdown length:", len(fit))

if __name__ == "__main__":
    asyncio.run(main())
```
🤖 Prompt for AI Agents
In docs/md_v2/core/quickstart.md around lines 100 to 122, the example assumes
result.markdown is always a MarkdownGenerationResult object and fails if it's a
plain string; update the snippet to handle both shapes by using a getattr
fallback (e.g., obtain raw_markdown = getattr(result.markdown, "raw_markdown",
result.markdown) and fit_markdown = getattr(result.markdown, "fit_markdown",
result.markdown)) before printing lengths, and also adjust the earlier “Your
First Crawl” example or add a short note clarifying that CrawlResult.markdown
may be either a string or a MarkdownGenerationResult object so readers aren’t
given conflicting guidance.
Closing as duplicate of #1715. Thanks for the contribution.
Summary

Update the example of using a Filter with DefaultMarkdownGenerator so it can be copy-pasted and run directly. Previously, the snippet was missing import asyncio and asyncio.run(main()).

List of files changed and why

- deploy/docker/c4ai-doc-context.md – fixed the example by wrapping it in a main() function and adding asyncio.run(main()) to make it runnable.
- docs/md_v2/core/quickstart.md – same as above.

How Has This Been Tested?

python example.py.