Thursday, 22 December 2016

Recognize checkboxes when parsing/mining a PDF in Python

Python 3

I am trying to write a script that extracts the text from a PDF and also recognizes whether a checkbox was checked. When I use PyPDF2, it recognizes the text but not the checkboxes. I found a sample script for pdfminer. It does better in that it recognizes the checkboxes, but in the output it is impossible to tell which labels the checkboxes refer to, because everything is scrambled.

Sample image of the PDF

Sample code I tried:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # The context manager closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

Current output:

PTN CPN UN MN Other

Even looking at the whitespace, there is no particular pattern that would accurately indicate which label each checkmark belongs to.
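To make the difficulty concrete: flat text output discards positions, so one way to relate marks to labels would be to keep each piece of text together with its bounding box and pair every checkbox glyph with the nearest label on the same line. The sketch below is only an illustration under several assumptions: the `(text, x0, y0, x1, y1)` tuples are hypothetical input that would have to come from a layout-aware extraction step, the glyphs in `CHECKED` and `UNCHECKED` are guesses at how the boxes might appear in the extracted text, and each box is assumed to sit to the left of its label. The function name `match_checkboxes` and the sample coordinates are made up for the example.

```python
# Hypothetical sketch: pair each checkbox glyph with the nearest label to its right.
# The glyph sets below are assumptions, not what any particular PDF produces.
CHECKED = {"☒", "☑"}
UNCHECKED = {"☐"}


def match_checkboxes(items, line_tolerance=3.0):
    """Return {label: is_checked} for every checkbox glyph found in `items`.

    `items` is a list of (text, x0, y0, x1, y1) tuples in page coordinates.
    """
    marks = CHECKED | UNCHECKED
    boxes = [it for it in items if it[0] in marks]
    labels = [it for it in items if it[0] not in marks]

    result = {}
    for mark, bx0, by0, bx1, by1 in boxes:
        # Candidates share roughly the same baseline and start at or after
        # the right edge of the box.
        candidates = [
            (lx0 - bx1, text)
            for text, lx0, ly0, lx1, ly1 in labels
            if abs(ly0 - by0) <= line_tolerance and lx0 >= bx1
        ]
        if candidates:
            _, label = min(candidates)  # smallest horizontal gap wins
            result[label] = mark in CHECKED
    return result


# Made-up coordinates for three boxes on one line; only "CPN" is ticked.
items = [
    ("☐", 50, 700, 58, 708), ("PTN", 60, 700, 80, 708),
    ("☒", 100, 700, 108, 708), ("CPN", 110, 700, 130, 708),
    ("☐", 150, 700, 158, 708), ("UN", 160, 700, 172, 708),
]
print(match_checkboxes(items))
```

Whether this works on a real form depends on how the checkboxes are encoded. If they are drawn as vector graphics or stored as form fields rather than as text glyphs, they would not show up as text items at all, and a different source of positions would be needed.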



