Extracting image or video links #500

Rm1n90 · 2025-01-16T16:38:53Z

Rm1n90
Jan 16, 2025

First of all, thank you for such an amazing repo! It helps me a lot with my work and with automating it.

I need help downloading the media (videos or images) from the following links. In general, Crawl4AI works perfectly for getting media. However, I have 3 links for which I cannot get the media links because Crawl4AI doesn't return them. I tried several options, such as js_code, but without success.

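Here are my links:

https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml
https://www.eurosport.it/sci-alpino/flachau/2024-2025/da-ottava-a-prima-che-gran-vittoria-di-rast-in-slalom-riguarda-la-sua-manche_vid2297313/video.shtml
---> For these links I would like to get the video links, but the crawler cannot find the videos.

https://rhonefm.ch/sports/mikaela-shiffrin-quelques-semaines-au-repos-force-777823 ---> For this link I would like to extract the image link, but I have the same issue.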

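import asyncio
import os
from typing import List

import psutil
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):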
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,   # keep crawler logging quiet
        extra_args=["--disable-dev-shm-usage", "--disable-extensions"],
        user_agent_mode="random",
        light_mode=True,
        viewport_width=1280,
        viewport_height=720,
    )

    # CacheMode.BYPASS --> Fresh data
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS,
                                    exclude_external_images=False, # False --> keep images hosted on other domains
                                    exclude_external_links=True,
                                    # exclude_social_media_links=True,  # skip Twitter, Facebook, etc.
                                    wait_for_images=True,  # ensure images are loaded
                                    word_count_threshold=10,
                                    remove_overlay_elements=True,
                                    magic=True,
                                    simulate_user=True,
                                    override_navigator=True,
                                    excluded_tags=['form', 'header', 'footer', 'nav', 'table'],
                                    scan_full_page=True,
                                    scroll_delay=0.5,
                                    # wait_for="css:.main-loaded",
                                    delay_before_return_html= 0.5,
                                    # exclude_domains=["adtrackers.com", "spammynews.org"],
                                    # exclude_social_media_domains=["facebook.com", "twitter.com"],
                                    process_iframes=True,
                                    )

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):

                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    image_info = result.media.get("images", [])
                    # print(f"Success crawling {url}: {image_info}")
                    print("Images found:", len(image_info))
                    # Iterate directly so the batch index `i` is not shadowed
                    for img in image_info:
                        print(f"  - {img['src']} (alt={img.get('alt', '')}, score={img.get('score', 'N/A')}, group_id={img.get('group_id', '')})")


                    print('<< ======================================================================= >>')

                    video_info = result.media.get("videos", [])
                    # print(f"Success crawling {url}: {video_info}")
                    print("Videos found:", len(video_info))
                    # A video entry may have no src, so read it defensively
                    for vid in video_info:
                        print(f"  - {vid.get('src', '')} (alt={vid.get('alt', '')}, score={vid.get('score', 'N/A')})")

                    success_count += 1
                else:
                    print(f"Failed crawling {url}: {result.error_message}")
                    fail_count += 1

        print("\nSummary:")
        print(f"  - Successfully crawled: {success_count}")
        print(f"  - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()
        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

async def main():
    urls = [
        "https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml",
        "https://www.eurosport.it/sci-alpino/flachau/2024-2025/da-ottava-a-prima-che-gran-vittoria-di-rast-in-slalom-riguarda-la-sua-manche_vid2297313/video.shtml",
        "https://rhonefm.ch/sports/mikaela-shiffrin-quelques-semaines-au-repos-force-777823"
    ]
    await crawl_parallel(urls, max_concurrent=3)

if __name__ == "__main__":
    asyncio.run(main())

Could you please guide me on how to extract the media links from the links mentioned above?

Thank you in advance!

unclecode · 2025-01-20T12:51:05Z

unclecode
Jan 20, 2025
Maintainer

@Rm1n90 Thank you so much for your kind words and for using the library. I'm very happy to see that it helps people build better things.

Regarding your question: when a page loads its content dynamically, like some of the links you shared, there are a few simple tricks you should always apply.

I'll focus on the first link. Pages like this typically lazy-load their images, so you need to apply two techniques. First, ensure that every image on the page has finished loading (a status of completed). Second, scroll the page to the end, which triggers most of the lazy-loaded images. There are several other techniques you can use as well. The good news is that Crawl4AI is designed to encapsulate such difficulties: I update the library almost every week, adding new tricks as I encounter them. As a developer, you just need to set the relevant flags and configuration. The following code shows some of them.

I think I should create a tutorial or an article in our documentation that explores these techniques, @aravindkarnam.

For now, here is a very simple way to get a lot more images. When I run your code on the first link, I retrieve only 2 images; after applying these techniques, the count increases to 29. I hope this helps.

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set run configurations: bypass the cache, wait for images, scroll the full page
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,
        # scroll_delay=0.5,
        # delay_before_return_html=2,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml',
            config=crawl_config
        )
        if result.success:
            print("Media.Images count: ", len(result.media['images']))

if __name__ == "__main__":
    asyncio.run(main())

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://www.eurosport.it/sci-alpino/marcel-hirsche... | Status: True | Time: 5.27s
[SCRAPE].. ◆ Processed https://www.eurosport.it/sci-alpino/marcel-hirsche... | Time: 293ms
[COMPLETE] ● https://www.eurosport.it/sci-alpino/marcel-hirsche... | Status: True | Total: 5.56s
29

Explanations:
I have used only two flags. The first, wait_for_images, is a heuristic that observes the page and its network requests as closely as possible to make sure every image on the page has finished loading.

The second flag, scan_full_page, forces smooth scrolling from the top to the very bottom of the page. Each scroll step covers one viewport height, with a default delay of 200 milliseconds between steps to give images time to load. You can increase this delay with scroll_delay. Additionally, you can add a short delay before the final step, when the data is fetched from the page, with delay_before_return_html. This is especially helpful for pages that use dynamic and lazy loading.
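
To get a feel for what a larger scroll delay costs, here is a rough back-of-the-envelope sketch. It assumes the scan advances one viewport height per step, as described above; the helper name `estimate_scroll` and the 7920 px page height are made up for illustration, and this is not how Crawl4AI itself measures anything.

```python
import math


def estimate_scroll(page_height, viewport_height, scroll_delay=0.2):
    """Return (steps, seconds) for a top-to-bottom scan in viewport-sized steps.

    Hypothetical helper: it only models the "one viewport per step, then wait"
    behaviour described in this thread.
    """
    if viewport_height <= 0:
        raise ValueError("viewport_height must be positive")
    # The first screen is already visible; everything below it needs scrolling.
    remaining = max(page_height - viewport_height, 0)
    steps = math.ceil(remaining / viewport_height)
    return steps, steps * scroll_delay


# 720 px matches the viewport_height used earlier in this thread;
# the page height is an assumed value.
steps_default, secs_default = estimate_scroll(7920, 720)
steps_slow, secs_slow = estimate_scroll(7920, 720, scroll_delay=0.5)
print(f"default delay: {steps_default} steps, about {secs_default:.1f}s")
print(f"0.5s delay:    {steps_slow} steps, about {secs_slow:.1f}s")
```

Under these assumptions, raising the delay from 0.2 to 0.5 seconds adds a few seconds per long page, which is usually an acceptable price when images are otherwise missed.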

Have fun.

0 replies

Rm1n90 · 2025-01-21T10:20:48Z

Rm1n90
Jan 21, 2025
Author

@unclecode Thank you for your explanation and code snippet. However, my issue is not with the images in the first two links I mentioned. With my original code, Crawl4AI already extracts the image links, but not the videos. If you open the link you tried, you will see there are 2 videos: one at the beginning of the page and one at the end. Neither code snippet returns the video links, only images. Is there any way to extract the video links?

Thanks!

0 replies

Auth0rM0rgan · 2025-01-24T15:12:57Z

Auth0rM0rgan
Jan 24, 2025

@Rm1n90 Did you find any solution for extracting video links? I'm facing the same issue :|

@unclecode @aravindkarnam I have the same problem with extracting video links. Do I need to set a specific option to get the video links? It's just returning the images and ignoring the videos.

0 replies

Rm1n90 · 2025-01-27T10:40:35Z

Rm1n90
Jan 27, 2025
Author

@Auth0rM0rgan No success in extracting video links! @unclecode Is there any way to get the video links?

0 replies

unclecode · 2025-01-27T15:21:55Z

unclecode
Jan 27, 2025
Maintainer

@Rm1n90 @Auth0rM0rgan I just tried the link. The library returns the video tag, but it doesn't have any source. I then dug into the page and confirmed that this video tag has no src; after a while it starts streaming from the server. It's a blob object, which is why you don't get anything. I also had trouble loading the page, so I had to use a VPN to change my location. This is a very unusual case.
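
One way to see this in the crawl output is to separate video entries that carry a plain http(s) URL from those that are empty or point at a blob. The sketch below assumes each entry is a dict with an optional "src" key, matching how the snippets in this thread read result.media["videos"]; the function name `split_video_entries` and the sample data are hypothetical.

```python
from urllib.parse import urlparse


def split_video_entries(videos):
    """Split media entries into directly fetchable URLs and everything else.

    `videos` is assumed to be a list of dicts with an optional "src" key.
    """
    direct, unresolved = [], []
    for entry in videos:
        src = (entry.get("src") or "").strip()
        scheme = urlparse(src).scheme.lower()
        if scheme in ("http", "https"):
            direct.append(src)
        else:
            # Empty src, blob: URLs, data: URIs and relative paths end up here;
            # none of them can be downloaded as-is.
            unresolved.append(src or "<no src>")
    return direct, unresolved


# Made-up sample entries, not real crawl output.
sample = [
    {"src": "https://example.com/clip.mp4"},
    {"src": "blob:https://example.com/6f1c2d3e"},
    {"src": ""},
    {},
]
direct, unresolved = split_video_entries(sample)
print("direct:", direct)
print("unresolved:", unresolved)
```

A blob: URL only has meaning inside the browser tab that created it, so entries in the second list cannot be downloaded by URL alone.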

0 replies

Rm1n90 · 2025-01-28T10:05:07Z

Rm1n90
Jan 28, 2025
Author

@unclecode, when I open the link https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml, I can see there are 2 videos: one at the top of the page and one at the bottom (which is not a blob object). Is there any way to get these links?

Thanks

0 replies

unclecode · 2025-01-28T15:52:09Z

unclecode
Jan 28, 2025
Maintainer

@Rm1n90 Please take a look at this image. Is this different from what you see? Look at the dev tools: for me it's a blob.

0 replies
