Python 3
I am trying to write a script that extracts the text from a PDF and also recognizes whether a checkbox was checked. When I use PyPDF2, it recognizes the text but not the checkboxes. I found a sample script for pdfminer. It does better in that it recognizes the checkboxes, but in the output it is impossible to tell which labels the checkmarks refer to because everything is scrambled.
Sample code I tried:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
Current output:
✔
✔
PTN CPN UN MN Other
Even looking at the whitespace, there is no consistent pattern that would reliably tell me which label each checkmark belongs to.
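One direction that might help is to work with positions instead of the flattened text. pdfminer's layout objects carry a bounding box (`bbox`) when a layout-aware device such as `PDFPageAggregator` is used, so each checkmark could be paired with the label next to it on the page. The sketch below only shows the pairing step on plain tuples. The function name `checked_labels`, the `y_tolerance` parameter, and the sample coordinates are all made up for illustration, and it assumes each checkbox sits immediately to the left of its label on the same row, which may not hold for every form.

```python
CHECKMARK = "\u2714"  # the check glyph that shows up in the extracted text


def checked_labels(items, y_tolerance=3.0):
    """Return the label to the right of each checkmark, in reading order.

    items: list of (text, (x0, y0, x1, y1)) tuples, one per text fragment.
    Assumption: each checkbox sits immediately to the left of its label on
    the same row. Fragments whose vertical centers differ by more than
    y_tolerance points are treated as being on different rows.
    """
    marks = [bbox for text, bbox in items if text.strip() == CHECKMARK]
    labels = [(text.strip(), bbox) for text, bbox in items
              if text.strip() and text.strip() != CHECKMARK]

    result = []
    # Visit checkmarks top to bottom (PDF y grows upward), then left to right.
    for mx0, my0, mx1, my1 in sorted(marks, key=lambda b: (-b[1], b[0])):
        mark_mid_y = (my0 + my1) / 2
        best_label, best_gap = None, None
        for text, (x0, y0, x1, y1) in labels:
            same_row = abs((y0 + y1) / 2 - mark_mid_y) <= y_tolerance
            gap = x0 - mx1  # horizontal distance from the mark to the label start
            if same_row and gap >= 0 and (best_gap is None or gap < best_gap):
                best_label, best_gap = text, gap
        result.append(best_label)  # None if no label was found on that row
    return result


# Hypothetical fragments for the row "PTN CPN UN MN Other" (made-up coordinates).
sample_items = [
    ("PTN", (60, 700, 80, 712)),
    ("\u2714", (105, 700, 115, 712)),
    ("CPN", (120, 700, 140, 712)),
    ("UN", (180, 700, 195, 712)),
    ("MN", (240, 700, 255, 712)),
    ("\u2714", (285, 700, 295, 712)),
    ("Other", (300, 700, 330, 712)),
]

print(checked_labels(sample_items))  # ['CPN', 'Other']
```

If the boxes in the real form sit to the right of the labels, the gap would need to be measured in the other direction (`mx0 - x1`). In practice the tuples would come from the PDF's layout objects rather than a hand-written list.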
Can you tell me where I should add my checkbox image in the code?