Data-Scraping-from-PDF-Files

This project has been developed for data scraping from monthly added PDF reports on the Mortgage Finance Forecast Archives website (https://www.mba.org/news-and-research/forecasts-and-commentary/mortgage-finance-forecast-archives). The data scraping framework developed in this project converts the tabular data in PDF files to ".csv" format. The developed data scraping framework integrates two different tasks such as web scraping and data extraction. The code uses three different function named as extract_table_columns_name, extract_pdf_data, and get_pdf_files. The get_pdf_files function downloads the pdf files from the archive, and then extracts tabular data and attribute names using extract_pdf_data and extract_table_columns_name functions. The entire structure of the data scraping framework checks out the archive manually for newly added reports and works as a fully automated manner. Besides, the regex expressions in the framework are also compatible with pattern formats that vary in different table structures. In this project, two different packages used for data extraction from PDF files: pdfplumber and tika-python. At the stage of web scraping, the url links of the pdf files are parsed using lxml feature of the BeautifulSoup package. When the code is run for the first time, it creates two files named as archievefiles, archievecsvfiles and it also creates a ".csv" file named as archievelinks in cwd. When the code is run for the second time, it only downloads the pdf files in the newly added url links to the archive and performs other operations.

Feel free to contact me for support.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
LICENSE		LICENSE
README.md		README.md
data_scraping_from_pdf_files.ipynb		data_scraping_from_pdf_files.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data-Scraping-from-PDF-Files

About

Uh oh!

Releases

Packages

Languages

License

sinanercan/Data-Scraping-from-PDF-Files-

Folders and files

Latest commit

History

Repository files navigation

Data-Scraping-from-PDF-Files

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages