Does crawl4AI already have a function to scrape all related URLs from a root URL? #485
Replies: 7 comments 5 replies
-
Hi @QuangTQV. We are currently working on a scraper module that takes a root URL and performs a breadth-first traversal until the configured depth is reached. It's currently under review and testing. If you are interested in trying it out, check this PR; there's a quick-start example in it. Please see if it meets your requirements. We are working on some performance enhancements, after which this will be released.
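For anyone who wants to experiment before that PR is merged, a breadth-first traversal can already be approximated on top of the public `AsyncWebCrawler` API. This is a minimal sketch, not the PR's implementation; `bfs_crawl` is a hypothetical helper, and the `result.links["internal"]` / `href` access assumes a recent crawl4ai version:

```python
import asyncio
from urllib.parse import urljoin, urldefrag

from crawl4ai import AsyncWebCrawler

async def bfs_crawl(root_url: str, max_depth: int = 2) -> list[dict]:
    """Breadth-first crawl from root_url, stopping at max_depth."""
    seen = {root_url}
    queue = [(root_url, 0)]  # (url, depth)
    pages = []
    async with AsyncWebCrawler() as crawler:
        while queue:
            url, depth = queue.pop(0)
            result = await crawler.arun(url=url)
            if not result.success:
                continue
            pages.append({"url": url, "depth": depth, "markdown": result.markdown})
            if depth == max_depth:
                continue
            for link in result.links.get("internal", []):
                # Resolve relative links and drop #fragments before deduplicating
                href = urldefrag(urljoin(url, link["href"]))[0]
                if href not in seen:
                    seen.add(href)
                    queue.append((href, depth + 1))
    return pages

if __name__ == "__main__":
    for page in asyncio.run(bfs_crawl("https://example.com", max_depth=1)):
        print(page["depth"], page["url"])
```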
-
@QuangTQV We are working on the same thing, but our current solution based on crawl4ai is relatively slow. Could we discuss implementation ideas further at some point?😀
-
@1933211129 @QuangTQV The new version 0.4.3 includes a very strong component for parallel crawling and will serve as the core for the "scraper" branch. I plan to merge the scraper branch that @aravindkarnam worked on. In the meantime, check this link to gather some ideas for parallel crawling: https://docs.crawl4ai.com/advanced/multi-url-crawling/
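For context, the parallel crawling described at that link is driven through `arun_many`, which dispatches a list of URLs concurrently and returns one result per URL. A minimal sketch (`crawl_batch` is just an illustrative wrapper):

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_batch(urls: list[str]) -> None:
    async with AsyncWebCrawler() as crawler:
        # arun_many crawls the whole list concurrently and
        # returns one CrawlResult per input URL
        results = await crawler.arun_many(urls=urls)
        for result in results:
            print(result.url, "ok" if result.success else "failed")

asyncio.run(crawl_batch([
    "https://docs.crawl4ai.com/",
    "https://docs.crawl4ai.com/advanced/multi-url-crawling/",
]))
```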
-
@unclecode @aravindkarnam, is there any update on the "scrape all related URLs from a root URL" feature?
-
Thanks @aravindkarnam, this is really useful - awaiting its release.
-
Will this be exposed through the API? I think this would also be very useful for non-Python users...
-
@aravindkarnam Hello, as far as I can see this feature is not released yet. Any updates? Excellent work, by the way; I'm waiting like a fly rubbing its hands together.
-
Description: Starting from a root URL, it navigates to child URLs, scraping content in parallel while collecting links. It could use a queue or a similar mechanism and should allow setting a maximum-depth parameter. It returns a list of dictionaries, one per URL, each containing the link, content, depth, images, markdown, etc.
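A rough sketch of that return shape, crawling each depth level in parallel with `arun_many`. The `scrape_site` name is hypothetical, and the result fields used here (`cleaned_html`, `media`, `links`) assume a recent crawl4ai version:

```python
import asyncio
from urllib.parse import urljoin, urldefrag

from crawl4ai import AsyncWebCrawler

async def scrape_site(root_url: str, max_depth: int = 2) -> list[dict]:
    """Crawl level by level: each depth is fetched in parallel via arun_many."""
    seen = {root_url}
    level = [root_url]
    pages = []
    async with AsyncWebCrawler() as crawler:
        for depth in range(max_depth + 1):
            if not level:
                break
            results = await crawler.arun_many(urls=level)
            next_level = []
            for result in results:
                if not result.success:
                    continue
                # One dictionary per URL: link, content, depth, images, markdown
                pages.append({
                    "link": result.url,
                    "depth": depth,
                    "content": result.cleaned_html,
                    "markdown": result.markdown,
                    "images": result.media.get("images", []),
                })
                for link in result.links.get("internal", []):
                    href = urldefrag(urljoin(result.url, link["href"]))[0]
                    if href not in seen:
                        seen.add(href)
                        next_level.append(href)
            level = next_level
    return pages
```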