Extracting image or video links #500
Replies: 7 comments
-
@Rm1n90 Thank you so much for your kind words and for using the library. I'm very happy to see that it helps people build better things.

Regarding your question: when the content of a page loads dynamically, like some of the links you shared, a few simple tricks usually help. I focused on the first link. Pages like this typically lazy-load their images, so you need to apply two techniques. First, ensure that all the images on the page have finished loading (a status of completed). Second, scroll the page to the end, which triggers most of the remaining images to load. There are several other techniques you can use as well.

The good news is that Crawl4ai is designed to encapsulate these difficulties. I update the library almost every week, adding new tricks as I encounter them, so as a developer you just need to set the relevant flags and configuration. I think I should create a tutorial or an article in our documentation to explore these techniques @aravindkarnam. For now, the code below shows a very simple way to get a lot more images. When I run your code on the first link, I retrieve only 2 images; after applying these techniques, the count increases to 29. I hope this helps you.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set the run configuration: bypass the cache, wait for images, scroll the full page
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,
        # scroll_delay=0.5,
        # delay_before_return_html=2,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml',
            config=crawl_config
        )
        if result.success:
            print("Media.Images count: ", len(result.media['images']))

if __name__ == "__main__":
    asyncio.run(main())
Explanations: The second flag, scan_full_page, forces smooth scrolling from the top to the very bottom of the page. Each scroll step moves the page by one viewport height, and the default delay between steps is 200 milliseconds so that images have time to load. You can increase this delay with scroll_delay. You can also add a short delay before the final step, when the data is fetched from the page, with delay_before_return_html. This is especially helpful for pages that use dynamic and lazy loading. Have fun.
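As a rough illustration of the scrolling arithmetic described above, here is a minimal sketch that estimates how many scroll steps a page needs and how long the scan takes. It assumes one step per viewport height and a fixed delay per step; the function name and the pixel values are hypothetical and are not part of Crawl4ai.

```python
import math


def estimate_scroll(page_height_px, viewport_height_px, scroll_delay_s=0.2):
    """Estimate (steps, total_seconds) for scrolling one viewport at a time.

    Assumption: each step advances by exactly one viewport height and is
    followed by a fixed delay. Real pages may grow while they are scrolled.
    """
    if page_height_px <= 0 or viewport_height_px <= 0:
        raise ValueError("heights must be positive")
    # Round up so a partial last viewport still counts as a step.
    steps = math.ceil(page_height_px / viewport_height_px)
    return steps, steps * scroll_delay_s


if __name__ == "__main__":
    for delay in (0.2, 0.5):
        steps, seconds = estimate_scroll(6000, 800, delay)
        print(f"delay={delay}s -> {steps} steps, about {seconds:.1f}s of scrolling")
```

The estimate only covers the waiting time between steps, so it is a lower bound on the real scan time. It can still help when deciding whether a larger scroll delay is affordable on long pages.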
-
@unclecode Thank you for your explanation and code snippet. However, my issue is not about the images in the first two links I mentioned. With my original Crawl4ai code I can extract the image links, but not the videos. If you open the link you tried, you will see there are 2 videos: one at the beginning of the page and one at the end. Neither code snippet returns the video links, only the images. Is there any way to extract the video links? Thanks!
-
@Rm1n90 Did you find any solution for extracting video links? I'm facing the same issue :| @unclecode @aravindkarnam I have the same problem with extracting video links. Do I need to set a specific option to get the video links? Right now it just returns the images and ignores the videos.
-
@Auth0rM0rgan No success in extracting video links! @unclecode Is there any way to get the video links?
-
@Rm1n90 @Auth0rM0rgan I just tried the link, and the library returns the video tag, but without any source. I then dug into the page and confirmed that the video tag itself does not have a source at first. After a while it starts to stream from the server. The source is a blob object, which is why you don't get anything. I also had an issue loading the page, so I had to use a VPN to change my location. This is a very unusual case.
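To make the blob point concrete: a `blob:` URL refers to data held inside the browser session, so it cannot be downloaded as a normal link from outside that session. The following is a minimal sketch, using only the Python standard library, that separates blob sources from directly addressable ones in a page's HTML. The class and function names are hypothetical and are not part of Crawl4ai, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class VideoSourceParser(HTMLParser):
    """Collect src attributes from <video> tags and their <source> children."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self._in_video = True
            if attributes.get("src"):
                self.sources.append(attributes["src"])
        elif tag == "source" and self._in_video and attributes.get("src"):
            self.sources.append(attributes["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def classify_video_sources(html):
    """Split video sources into blob: URLs and directly addressable URLs."""
    parser = VideoSourceParser()
    parser.feed(html)
    blobs = [s for s in parser.sources if s.startswith("blob:")]
    direct = [s for s in parser.sources if not s.startswith("blob:")]
    return {"direct": direct, "blob": blobs}


if __name__ == "__main__":
    sample = (
        '<video src="blob:https://example.com/1234"></video>'
        '<video><source src="https://example.com/clip.mp4" type="video/mp4"></video>'
    )
    print(classify_video_sources(sample))
```

Assuming the crawl result exposes the rendered page HTML (for example as `result.html`), that string could be passed to `classify_video_sources` to check whether a page offers any direct video URL at all. A page whose only sources are blobs would need a different approach, such as inspecting network requests.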
-
@unclecode, when I open the link https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml, I can see there are 2 videos: one at the top of the page and one at the bottom (which is not a blob object). Is there any way to get these links? Thanks
-
@Rm1n90 Please take a look at this image. Is this different from what you see? Look at the dev tools: for me it's a blob.
-
Hi @unclecode,
First of all, thank you for such an amazing repo! It helps me a lot with my work and its automation.
I need help downloading the media (videos or images) from the following links. In general, Crawl4ai works perfectly for getting media. However, I have 3 links for which I cannot get the media links, because Crawl4ai doesn't return them. I tried several options, like js_code, but with no success.
Here are my links:
https://www.eurosport.it/sci-alpino/marcel-hirscher-grave-infortunio-in-allenamento-rottura-del-legamento-crociato-a-reiteralm-stagione-finita_sto20059526/story.shtml
https://www.eurosport.it/sci-alpino/flachau/2024-2025/da-ottava-a-prima-che-gran-vittoria-di-rast-in-slalom-riguarda-la-sua-manche_vid2297313/video.shtml
---> For these links I would like to get the video links, but the crawler cannot find the videos.
Could you please guide me on how to extract the media links from the links mentioned above?
Thank you in advance!