Does crawl4AI already have a function to scrape all related URLs from a root URL? #485
Replies: 7 comments 5 replies
-
Hi @QuangTQV. We are currently working on a scraper module that takes a root URL and performs a breadth-first traversal until the configured depth is reached. It's currently under review and testing. If you are interested in trying it out, check this PR; there's a quick-start example in it. Please see if it meets your requirements. We are working on some performance enhancements, after which this will be released.
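For anyone who wants to experiment before that PR is merged, a breadth-first traversal can already be approximated on top of the public `AsyncWebCrawler` API. This is a minimal sketch, not the PR's implementation; `bfs_crawl` is a hypothetical helper, and the `result.links["internal"]` / `href` access assumes a recent crawl4ai version:

```python
import asyncio
from urllib.parse import urljoin, urldefrag

from crawl4ai import AsyncWebCrawler

async def bfs_crawl(root_url: str, max_depth: int = 2) -> list[dict]:
    """Breadth-first crawl from root_url, stopping at max_depth."""
    seen = {root_url}
    queue = [(root_url, 0)]  # (url, depth)
    pages = []
    async with AsyncWebCrawler() as crawler:
        while queue:
            url, depth = queue.pop(0)
            result = await crawler.arun(url=url)
            if not result.success:
                continue
            pages.append({"url": url, "depth": depth, "markdown": result.markdown})
            if depth == max_depth:
                continue
            for link in result.links.get("internal", []):
                # Resolve relative links and drop #fragments before deduplicating
                href = urldefrag(urljoin(url, link["href"]))[0]
                if href not in seen:
                    seen.add(href)
                    queue.append((href, depth + 1))
    return pages

if __name__ == "__main__":
    for page in asyncio.run(bfs_crawl("https://example.com", max_depth=1)):
        print(page["depth"], page["url"])
```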
-
@QuangTQV We are working on the same thing, but our current solution based on crawl4ai is relatively slow. Could we discuss implementation ideas further at some point?😀
-
@1933211129 @QuangTQV The new version 0.4.3 includes a very strong component for parallel crawling and will serve as the core for the "scraper" branch. I plan to merge the scraper branch that @aravindkarnam worked on. In the meantime, check this link to gather some ideas for parallel crawling: https://docs.crawl4ai.com/advanced/multi-url-crawling/
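For context, the parallel crawling described at that link is driven through `arun_many`, which dispatches a list of URLs concurrently and returns one result per URL. A minimal sketch (`crawl_batch` is just an illustrative wrapper):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_batch(urls: list[str]) -> None:
    async with AsyncWebCrawler() as crawler:
        # arun_many crawls the whole list concurrently and
        # returns one CrawlResult per input URL
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(result.url, "ok" if result.success else "failed")

asyncio.run(crawl_batch([
    "https://docs.crawl4ai.com/",
    "https://docs.crawl4ai.com/advanced/multi-url-crawling/",
]))
```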
-
@unclecode @aravindkarnam, is there any update on the "scrape all related URLs from a root URL" feature?
-
Thanks @aravindkarnam, this is really useful - awaiting its release.
-
Will this be exposed through the API? I think this would also be very useful for non-Python users...
-
@aravindkarnam Hello, as far as I can see this feature is not released yet. Any updates? Excellent work, by the way; I'm waiting like a fly rubbing its hands together.
-
Description: Starting from a root URL, it navigates to child URLs, scraping content in parallel while collecting links. It could use a queue or a similar mechanism and should allow setting a maximum-depth parameter. It returns a list of dictionaries, one per URL, each containing the link, content, depth, images, markdown, etc.
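A rough sketch of that return shape, crawling each depth level in parallel with `arun_many`. The `scrape_site` name is hypothetical, and the result fields used here (`cleaned_html`, `media`, `links`) assume a recent crawl4ai version:

```python
import asyncio
from urllib.parse import urljoin, urldefrag

from crawl4ai import AsyncWebCrawler

async def scrape_site(root_url: str, max_depth: int = 2) -> list[dict]:
    """Crawl level by level: each depth is fetched in parallel via arun_many."""
    seen = {root_url}
    level = [root_url]
    pages = []
    async with AsyncWebCrawler() as crawler:
        for depth in range(max_depth + 1):
            if not level:
                break
            results = await crawler.arun_many(urls=level)
            next_level = []
            for result in results:
                if not result.success:
                    continue
                # One dictionary per URL: link, content, depth, images, markdown
                pages.append({
                    "link": result.url,
                    "depth": depth,
                    "content": result.cleaned_html,
                    "markdown": result.markdown,
                    "images": result.media.get("images", []),
                })
                for link in result.links.get("internal", []):
                    href = urldefrag(urljoin(result.url, link["href"]))[0]
                    if href not in seen:
                        seen.add(href)
                        next_level.append(href)
            level = next_level
    return pages
```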