PDF Processing with Python: All You Need to Know

Most readers are probably familiar with the Portable Document Format (PDF), one of the most important and widely used digital document formats. Files in this format carry the .pdf extension and are designed for reliably presenting and exchanging documents, independent of the software, hardware, or operating system used to view them.

The PDF file format was created by the software company Adobe. In 1991, Adobe co-founder Dr. John Warnock started the effort under the name Camelot Project, and about a year later, in 1992, the Camelot project became the PDF file format we know today. Of course, the format has gone through many changes since then. One popular application for PDF files is Soda PDF, which is widely used for creating, editing, and converting PDF files to other formats.

A pre-existing PDF file can be processed and manipulated with a few simple Python scripts.

In this article, we will discuss how to run some simple operations on PDF files with short Python scripts, such as extracting text from a PDF file and rotating its pages. The same library can also combine several PDF files and add watermarks.

Let’s get started.

What is Python?

Python is a high-level, multipurpose programming language. It is popular for its versatility and relatively easy syntax, which makes it a good choice for those with little or no prior programming experience.

There are many Python libraries for processing and manipulating PDF files. PDFMiner, pyPdf, PyPDF2, PyPDF4, PDFQuery, and xPDF are some of the most common tools used from Python to work with various kinds of PDF documents.

For this article, we will be using the PyPDF2 library, as it is one of the most capable PDF toolkits for the Python programming language.

PyPDF2 vs PyPDF4

The original parent of both packages was released in 2005 under the name pyPdf. A few years later, a company called Phaseit sponsored the development of PyPDF2 as a fork of pyPdf. PyPDF2 worked quite well for several years, with its last update in 2016.

Some time later, a package named PyPDF3 was released, which was later renamed PyPDF4.

PyPDF2 and PyPDF4 do nearly the same things: both let you work with PDF documents in many ways, and both support Python 3.

However, PyPDF2 appeared to be abandoned at the time of writing, meaning it was not receiving new updates, so you may need to move to PyPDF4 in the future. In this article we use PyPDF2, but most of the code should also work with PyPDF4, so give it a try.

Installing PyPDF2

To start working with PDF documents in Python, you first need to install a library that makes it easy. In this article, we use PyPDF2 to process our PDF files.

You most probably know how to install a Python library, and installing PyPDF2 is no different. You can install it with pip if you use regular Python, or with conda if you use Anaconda.

Here is the command to install PyPDF2 with pip:

$ pip install pypdf2

The installation should be very quick, as the PyPDF2 library does not have any dependencies.

Once the installation is complete, let's look at how you can use this library to process your PDF documents.
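One quick way to confirm that the installation worked is to ask Python whether the module can be found. The sketch below uses only the standard library; the helper name is_installed is our own and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; try running: pip install pypdf2")
```

Note that the name passed to is_installed is the import name (PyPDF2), which is what your scripts will use.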

Extracting Text and other Data from PDF Files via Python

You can use PyPDF2 to extract metadata and text from a PDF document. In the current version, you can extract the author, creator, producer, subject, title, and the number of pages.

In the walkthrough that follows, we will use a Python script to find the number of pages in a PDF document and extract the text of one page.

Start by importing the module. Remember that the PyPDF2 module name is case-sensitive, so make sure that the “y” is lowercase and every other letter is uppercase.
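The metadata fields listed above can be collected into a dictionary, as in the minimal sketch below. It assumes the legacy PyPDF2 1.x method names used in this article (getDocumentInfo() and numPages; newer releases renamed them), and the helper name describe_pdf is our own.

```python
def describe_pdf(reader):
    """Collect basic metadata from a PyPDF2 PdfFileReader-like object."""
    # getDocumentInfo() may return None when the PDF carries no metadata.
    info = reader.getDocumentInfo()
    fields = ("author", "creator", "producer", "subject", "title")
    summary = {
        name: (getattr(info, name, None) if info is not None else None)
        for name in fields
    }
    # numPages is a property on the reader, not part of the document info.
    summary["pages"] = reader.numPages
    return summary
```

Given a PdfFileReader object for an open PDF file, describe_pdf(reader) would return a dictionary with one entry per field, using None for any field the document does not define.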

import PyPDF2

After importing the module, open the PDF file you want to work with. For this article, we are using examplePython.pdf. Open it in binary read mode and assign the file object to pdfFileEx with the code below:

pdfFileEx = open('examplePython.pdf', 'rb')

Next, we create an object of the module's PdfFileReader class, which gives us the PDF reader object.

pdfReaderObj = PyPDF2.PdfFileReader(pdfFileEx)

The numPages property gives the number of pages in the PDF document.

print(pdfReaderObj.numPages)

Now create an object of the PageObject class. The getPage() function takes a page number (starting from index 0) as an argument and returns the page object.

pageObj = pdfReaderObj.getPage(0)

Now, call the extractText() function on the page object to extract the text from the PDF page:

print(pageObj.extractText())

Finally, we close the PDF file object with its close() method.

pdfFileEx.close()
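The steps above can also be gathered into one function that uses a with block, so the file is closed even if an error occurs. This is only a sketch under the same assumption as the rest of the article (the legacy PyPDF2 1.x names PdfFileReader, numPages, getPage(), and extractText()). The function name and the reader_factory parameter are hypothetical additions that make it easy to swap in another reader class.

```python
def page_count_and_text(path, page_index=0, reader_factory=None):
    """Return (number_of_pages, text_of_one_page) for the PDF at path."""
    if reader_factory is None:
        # Imported lazily so that PyPDF2 is only required when it is used.
        import PyPDF2
        reader_factory = PyPDF2.PdfFileReader
    # The with block closes the file automatically, replacing close().
    with open(path, "rb") as pdf_file:
        reader = reader_factory(pdf_file)
        page = reader.getPage(page_index)
        # Read everything we need while the file is still open.
        return reader.numPages, page.extractText()
```

With PyPDF2 installed, page_count_and_text('examplePython.pdf') would return the page count together with the text of the first page.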

Rotating PDF Pages

To rotate the pages of a PDF document, we first create a PDF reader object right after importing the module. Each page is then rotated and added to a PDF writer object, and the rotated pages are written to a new PDF.

import PyPDF2

def PDFrotate(origFileName, newFileName, rotation):
    # Open the original PDF in binary read mode; the with block closes it for us
    with open(origFileName, 'rb') as pdfFileObj:
        # Create a pdf Reader object and a pdf Writer object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pdfWriter = PyPDF2.PdfFileWriter()
        # Rotate each and every page of the PDF (rotation in degrees, a multiple of 90)
        for page in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(page)
            pageObj.rotateClockwise(rotation)
            pdfWriter.addPage(pageObj)
        # Write the rotated pages to the new PDF while the original is still open
        with open(newFileName, 'wb') as newFile:
            pdfWriter.write(newFile)
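Rotation angles for PDF pages are expected to be multiples of 90 degrees, so it may be worth checking the angle before passing it to PDFrotate. The helper below is a small sketch of such a check; the name normalize_rotation is our own and is not part of PyPDF2.

```python
def normalize_rotation(angle):
    """Validate a rotation angle and reduce it to 0, 90, 180, or 270."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Python's % always returns a non-negative result for a positive
    # divisor, so counter-clockwise (negative) angles are handled too.
    return angle % 360
```

For example, normalize_rotation(450) gives 90, so PDFrotate('examplePython.pdf', 'rotated.pdf', normalize_rotation(450)) would rotate every page by a quarter turn clockwise. The output file name rotated.pdf is only an illustration.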

Conclusion

PDF is now an open standard managed by the International Organization for Standardization (ISO). Early versions of the format were far more limited, but nowadays PDFs can contain links, buttons, audio, video, form fields, and much more. Hopefully, you now have a good idea of how to work with PDF files in Python.

HedgeThink

HedgeThink.com is the fund industry’s leading news, research and analysis source for individual and institutional accredited investors and professionals
