A Comparison of Python Libraries for Extracting Text, Images, and Tables from PDFs
Motivation
Extracting data from PDF files is a common task in many data processing and analysis workflows. Python provides several libraries that facilitate the extraction of text, images, and tables from PDF documents. In this article, we explore and compare some popular Python libraries for PDF data extraction, considering their capabilities, execution speed, and ease of use. By the end, you should be able to select the right library for your use case.
But why PDF?
We chose PDF because it is one of the most widely used document formats across domains and industries. PDFs can contain text, images, tables, and graphs (which are often just images themselves). With the advent of large language models such as the GPT-x models behind ChatGPT, there is increasing demand for the ability to ask questions whose answers are hidden in these PDFs, whether in the text, the tables, or the images. As a result, there are many blogs, articles, and videos on how to query your documents (most likely PDFs, given their dominance) with ChatGPT or other LLMs. Broadly speaking, answering questions from documents (PDFs, for now) involves two non-trivial tasks:
- PDF extraction
- Send extracted content (with the query) to the LLM for answers.
In this article, we focus on the first task: processing or extracting PDF data in Python.
This article is not novel. The aim is to consolidate information on PDF extraction and to provide a checklist of selection criteria for the existing libraries.
PDFs are non-structured
Several Python libraries can extract text, images, and tables from PDFs, and the fact that so many exist hints at how non-trivial PDF processing is. To the naked eye, a PDF appears to hold structured data: text, tables, images, and even an index at the start or end of the document. To a program or script, however, none of that structure is directly available. PDFs are so unstructured that sometimes even clearly visible text cannot be extracted as text and comes out as binary content instead.
A common misunderstanding is that PDF is a document format like MS Word, HTML, or XML. It is not: the closest thing to a PDF is a graphics format, and it is nowhere near a structured format such as HTML or XML.
Here we look at some of these libraries and see what each has to offer with respect to the following aspects:
- Extraction capabilities (text, images, tables)
- Ease of implementation
- Execution speed
The following Python libraries are compared on their text, image, and table extraction capabilities, speed of execution, and ease of use. Each entry also points to the library’s GitHub page for further information.
PyPDF2:
GitHub: [PyPDF2]
PyPDF2 is a pure-Python library that allows reading and manipulating PDF files. While it primarily focuses on text extraction, it also provides limited support for image extraction. However, table extraction is not a built-in feature. PyPDF2 is relatively easy to use and is widely adopted due to its simplicity and extensive documentation.
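To make the "simple to use" point concrete, here is a minimal sketch of page-by-page text extraction. It assumes PyPDF2 2.x or later, which exposes `PdfReader`; the function names `extract_pages_pypdf2` and `join_pages` are made up for this illustration and are not part of the library.

```python
from typing import List


def extract_pages_pypdf2(pdf_path: str) -> List[str]:
    """Return the text of each page as a list of strings."""
    # Imported inside the function so the sketch loads even if PyPDF2 is absent.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Pages without a text layer (for example scans) may yield nothing,
    # so fall back to an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(pages: List[str]) -> str:
    """Join page texts with a form feed so page boundaries stay visible."""
    return "\f".join(text.strip() for text in pages)
```

Calling `join_pages(extract_pages_pypdf2("report.pdf"))`, where `report.pdf` is a placeholder file name, would give one string for the whole document.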
pdfminer.six:
GitHub: [pdfminer.six]
pdfminer.six is a community-maintained Python library based on the original PDFMiner project and designed specifically for text extraction from PDF files. It offers advanced text extraction capabilities, including the ability to extract text layout information. However, it does not provide direct support for image or table extraction. pdfminer.six is known for its accuracy in extracting text, but it can be more complex to use than other libraries. Use pdfminer.six with Python 3; the original pdfminer package targeted older versions of Python.
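The layout information mentioned above can be sketched roughly as follows. This example uses `extract_pages` and `LTTextContainer` from pdfminer.six to collect each text block with its bounding box. The `TextBox` alias and both function names are illustrative assumptions, and the top-to-bottom sort is only one simple way to approximate reading order.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text) for one text block. PDF coordinates start at the
# bottom-left corner of the page, so larger y values are higher on the page.
TextBox = Tuple[float, float, float, float, str]


def extract_text_boxes(pdf_path: str) -> List[List[TextBox]]:
    """Return, for each page, the text containers with their bounding boxes."""
    # Imported lazily so the sketch loads without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages: List[List[TextBox]] = []
    for page_layout in extract_pages(pdf_path):
        boxes = [
            (*element.bbox, element.get_text())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        pages.append(boxes)
    return pages


def reading_order(boxes: List[TextBox]) -> List[TextBox]:
    """Sort one page's boxes top to bottom, then left to right."""
    return sorted(boxes, key=lambda box: (-box[3], box[0]))
```

If only the plain text is needed, `pdfminer.high_level.extract_text(pdf_path)` is the shorter route.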
Tabula-py:
GitHub: [tabula-py]
Tabula-py is a Python wrapper around the Java-based Tabula library and specializes in extracting tabular data from PDF files. Because it wraps a Java library, a Java runtime must be installed. Although it focuses on tables, it can also extract some textual content. Image extraction is not supported. Tabula-py stands out when working with structured data in tabular form and provides a user-friendly interface for extracting tables from PDFs.
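As a hedged illustration of that interface, the sketch below calls `tabula.read_pdf`, which returns a list of pandas DataFrames when `multiple_tables=True`. The wrapper `extract_tables_tabula` and the helper `table_shapes` are hypothetical names introduced here, not part of tabula-py.

```python
from typing import List, Tuple


def extract_tables_tabula(pdf_path: str, pages: str = "all") -> list:
    """Return a list of pandas DataFrames, one per table Tabula detects."""
    # Imported lazily: tabula-py also needs a Java runtime on the machine.
    import tabula

    # pages accepts "all", a single page such as "2", or a range such as "1-3".
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def table_shapes(tables: list) -> List[Tuple[int, int]]:
    """Report (rows, columns) for each table as a quick sanity check."""
    return [tuple(table.shape) for table in tables]
```

Checking the shapes first is a cheap way to spot pages where a table was split or merged incorrectly before processing the data further.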
PyMuPDF:
GitHub: [PyMuPDF]
PyMuPDF is a Python binding for the MuPDF library, which is known for its high-performance rendering and parsing capabilities. PyMuPDF offers extensive features for extracting both text and images from PDFs. While it does not provide built-in table extraction, it offers a solid foundation for implementing custom table extraction algorithms. PyMuPDF is well documented and provides a rich set of functionalities, but it may have a steeper learning curve than other libraries. PyMuPDF documentation can be found here.
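A minimal sketch of combined text and image extraction with PyMuPDF might look like this. It relies on `page.get_text()`, `page.get_images(full=True)`, and `doc.extract_image(xref)`. The output directory `images`, the file naming scheme, and the function names are assumptions made for the example.

```python
from pathlib import Path
from typing import List


def image_filename(page_number: int, index: int, ext: str) -> str:
    """Build a predictable file name such as 'page003_img02.png'."""
    return f"page{page_number:03d}_img{index:02d}.{ext}"


def extract_text_and_images(pdf_path: str, out_dir: str = "images") -> List[str]:
    """Save every embedded image to out_dir and return the text of each page."""
    # Imported lazily so the sketch loads without PyMuPDF installed.
    import fitz  # PyMuPDF's import name

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    page_texts: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            page_texts.append(page.get_text())
            for index, img in enumerate(page.get_images(full=True), start=1):
                xref = img[0]  # the first item of each entry is the image xref
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                name = image_filename(page_number, index, info["ext"])
                (Path(out_dir) / name).write_bytes(info["image"])
    return page_texts
```

An image that is reused on several pages would be saved once per page with this approach; tracking the xref values already seen would avoid the duplicates.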
Camelot:
GitHub: [Camelot]
Camelot is a Python library that excels in extracting tabular data from PDF files. It offers both a command-line interface and a Python API for extracting tables. Camelot uses advanced algorithms to detect and extract tables accurately from complex PDF layouts. It supports multiple output formats such as CSV, Excel, and JSON. Camelot’s documentation can be found [here]. Setting up Camelot is slightly cumbersome because it requires dependencies such as opencv-python and Ghostscript (an OS-level installation).
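The Python API can be sketched as below, assuming Camelot and its dependencies are already installed. `camelot.read_pdf` accepts a `pages` string and a `flavor` of either `"lattice"` or `"stream"`. The helper `page_spec` and the wrapper `extract_tables_camelot` are illustrative names, not part of Camelot.

```python
def page_spec(first: int, last: int) -> str:
    """Format a page range for Camelot's pages argument, e.g. '2-5'."""
    if first < 1 or last < first:
        raise ValueError("pages are 1-based and last must not precede first")
    return str(first) if first == last else f"{first}-{last}"


def extract_tables_camelot(pdf_path: str, first: int, last: int,
                           flavor: str = "lattice"):
    """Return Camelot's table list for the given page range."""
    # Imported lazily so the sketch loads without Camelot installed.
    import camelot

    # "lattice" relies on ruling lines between cells, "stream" on whitespace.
    return camelot.read_pdf(pdf_path, pages=page_spec(first, last), flavor=flavor)
```

Each returned table exposes a pandas DataFrame as `.df`, and the whole list can be written out with `tables.export("tables.csv", f="csv")`, where the file name is a placeholder.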
Comparison:
Now, let’s compare the libraries based on their extraction capabilities, speed of execution, and ease of use.
- Text Extraction:
— PyPDF2: Good support for text extraction.
— pdfminer.six: Excellent support with advanced layout information extraction.
— Tabula-py: Limited support, mainly focused on tables.
— PyMuPDF: Strong text extraction capabilities.
— Camelot: Primarily designed for tabular data extraction and may not provide advanced text extraction capabilities for answering questions from the content.
- Image Extraction:
— PyPDF2: Limited support.
— pdfminer.six: Limited or no support.
— Tabula-py: No built-in support.
— PyMuPDF: Strong image extraction capabilities.
— Camelot: No built-in support for image extraction.
- Table Extraction:
— PyPDF2: No built-in support.
— pdfminer.six: No built-in support.
— Tabula-py: Excellent support for table extraction.
— PyMuPDF: Custom implementation required, but provides a foundation for table extraction.
— Camelot: Excels at extracting tabular data from PDFs, which can be useful for answering questions based on structured information.
- Speed of Execution:
— PyPDF2: Moderate speed; large PDF files may take longer to process.
— pdfminer.six: Moderate speed, depending on the complexity of the PDF.
— Tabula-py: Varies depending on the size and complexity of the tables.
— PyMuPDF: Known for its high-performance rendering and parsing.
— Camelot: Execution speed is impressive, thanks to its efficient table extraction algorithms.
- Ease of Use:
— PyPDF2: Simple and easy to use.
— pdfminer.six: More complex compared to other libraries.
— Tabula-py: User-friendly interface, especially for table extraction.
— PyMuPDF: Provides a rich set of functionalities but may have a steeper learning curve.
— Camelot: Initial setup is tricky, but the documentation is good for specific use cases. It offers the flexibility to specify page numbers and page ranges from which to extract tables.
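The comparison above can be turned into a small selection checklist. The sketch below encodes the ratings as numbers and filters libraries by the content types you need. The 0-3 scale and the exact score given to each phrase are assumptions made for this illustration, not measurements.

```python
from typing import Dict, List, Set

# Hypothetical 0-3 encoding of the ratings in the comparison:
# 0 = no built-in support, 1 = limited or custom work needed,
# 2 = good or strong, 3 = excellent.
CAPABILITIES: Dict[str, Dict[str, int]] = {
    "PyPDF2":       {"text": 2, "images": 1, "tables": 0},
    "pdfminer.six": {"text": 3, "images": 0, "tables": 0},
    "Tabula-py":    {"text": 1, "images": 0, "tables": 3},
    "PyMuPDF":      {"text": 2, "images": 2, "tables": 1},
    "Camelot":      {"text": 1, "images": 0, "tables": 3},
}


def shortlist(needs: Set[str], minimum: int = 2) -> List[str]:
    """Return libraries scoring at least `minimum` on every requested type."""
    return [
        name
        for name, scores in CAPABILITIES.items()
        if all(scores[need] >= minimum for need in needs)
    ]
```

For example, `shortlist({"text", "images"})` leaves only PyMuPDF, while `shortlist({"tables"})` leaves Tabula-py and Camelot.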
Retaining original formatting in the text
`pdfminer.six` takes the structure of the PDF document into account and attempts to retain the line breaks and formatting of the original document during extraction. This makes it well suited to applications where preserving the formatting and structure of the text is crucial, such as data analysis, document processing, or content extraction.
Compared to other libraries like PyPDF2, which may not fully preserve the original formatting, `pdfminer.six` offers a more reliable and accurate extraction of text content while minimizing the loss of information.
`PyMuPDF` can also preserve the original formatting of the text during extraction, including newline characters, line breaks, and other textual formatting elements. Like `pdfminer.six`, it analyzes the structure of the PDF document and considers the positioning and layout of text elements to recreate the original structure in the extracted text.
Therefore, if your objective is to preserve the original formatting, including newline characters, and also to extract images, `PyMuPDF` is a good choice. It offers robust text extraction while striving to maintain the integrity and structure of the extracted content.
It’s important to note that the preservation of formatting can vary depending on the complexity of the PDF file and the specific formatting techniques used. Some PDFs may have complex layouts or unconventional formatting that can pose challenges to any extraction library. It’s always recommended to test the library with your specific PDF files to ensure the desired outcome.
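One rough way to run such a test is to compare the line structure of the text produced by two libraries for the same file. The sketch below uses only the standard library (`difflib`). The function names and the idea of scoring by matching lines are assumptions for illustration, not an established metric.

```python
import difflib
from typing import List


def normalized_lines(text: str) -> List[str]:
    """Strip trailing spaces and drop blank lines so real line breaks remain."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def layout_similarity(text_a: str, text_b: str) -> float:
    """Ratio in [0, 1] of matching lines between two extraction results."""
    matcher = difflib.SequenceMatcher(
        None, normalized_lines(text_a), normalized_lines(text_b)
    )
    return matcher.ratio()
```

A score well below 1.0 suggests the two libraries break lines differently on that file, which is a cue to inspect the outputs by eye.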
Both `pdfminer.six` and `PyMuPDF` are reliable options for preserving the original formatting during PDF text extraction. The choice between the two will depend on your specific requirements, preferences, and the overall features and functionalities provided by each library.
Conclusion:
Each Python library has its strengths and focuses on different aspects of PDF data extraction. If you primarily need text extraction, pdfminer.six is the best choice, as it strives to preserve the original formatting of the text, including carriage return and newline characters, as closely as possible. If you only need tables, Camelot is the best solution thanks to its specialized capabilities and fast execution. It is designed to differentiate tables from other content in a PDF document, using techniques such as table boundary detection, line and cell recognition, and table structure analysis to identify and extract tables accurately. It can handle a variety of table layouts, including simple and complex structures, merged cells, and varying table sizes. If you need comprehensive capabilities for both text and image extraction, PyMuPDF is a powerful option, albeit with a potentially steeper learning curve.
Consider the specific requirements of your project and the types of data you need to extract when choosing the appropriate library. It is also advisable to explore the official documentation and examples provided by each library to gain a deeper understanding of their capabilities and how to use them effectively. Remember that the speed of execution may vary depending on the complexity and size of the PDF files you are working with. However, PyMuPDF is known for its impressive execution speed, making it suitable for large-scale PDF processing.
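Because speed depends so much on the files involved, it can help to time the candidates on your own documents. The following is a minimal timing sketch using only the standard library. The function name `time_extractors` and the best-of-N approach are choices made for this example.

```python
import time
from typing import Callable, Dict


def time_extractors(
    extractors: Dict[str, Callable[[str], object]],
    pdf_path: str,
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each extractor."""
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            extract(pdf_path)  # the result is discarded, only time is measured
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

Each value in `extractors` is any function that takes a file path, so thin wrappers around the libraries discussed here can be passed in side by side.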
Considering the requirements of extracting text, images, and tables for answering questions from PDF content, PyMuPDF emerges as a strong contender. It provides comprehensive text and image extraction capabilities, allowing you to retrieve relevant information from the PDF. While it doesn’t offer dedicated table extraction features, it can still be used to analyze the PDF structure and extract tabular data with additional processing steps if needed.
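As one example of such additional processing, the sketch below groups positioned words into rows by their vertical position. It assumes the input has the shape of the first five fields returned by PyMuPDF's `page.get_text("words")`, that is `(x0, y0, x1, y1, text)` with y growing downwards. The function name and the 3-point tolerance are arbitrary choices for illustration.

```python
from typing import List, Sequence, Tuple

# (x0, y0, x1, y1, text): the first five fields of PyMuPDF's "words" tuples.
Word = Tuple[float, float, float, float, str]


def group_into_rows(words: Sequence[Word], y_tolerance: float = 3.0) -> List[List[str]]:
    """Cluster words with similar top edges into rows, ordered left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # Compare against the first word of the current row to decide
        # whether this word still belongs to it.
        if rows and abs(word[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[4] for w in sorted(row, key=lambda w: w[0])] for row in rows]
```

With PyMuPDF this could be fed by `[w[:5] for w in page.get_text("words")]`. The result is rows of words rather than true cells, so multi-word cells would still need to be merged, for instance by looking at the horizontal gaps between words.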