Tuesday, January 12, 2021

How to read legacy checkbox states from .docx files using docxtractr

I have data collected with standardized forms designed in Word. Free-text entries were typed into legacy text form fields, and there are also legacy checkboxes, which I am having trouble scraping. Here is a sample file that mimics the data.

I have been harvesting the text with the docxtractr package for R and regular expressions, but the checkboxes look the same whether checked or unchecked, so I can't harvest that part of my data:

library(docxtractr)
k <- read_docx("Sample file here.docx")
l <- as.character(docx_extract_all_tbls(k, 1))[1]
print(l)
# "list(Text.boxes...FORMTEXT.Test.Entry.No1Legacy.checkboxes...FORMCHECKBOX..Yes..FORMCHECKBOX..No = character(0))"

# Harvesting the text data:
t <- gsub("\\.", " ", l) # getting rid of the ugly dots
text_data <- regmatches(t, regexec("Text boxes   FORMTEXT \\s*(.*?)\\s*(Legacy checkboxes|$)", 
                               text=t))[[1]][2]  
print(text_data)
# "Test Entry No1"

Every legacy checkbox appears as "FORMCHECKBOX", whether it is checked or unchecked.

Non-legacy ("modern") checkboxes can easily be identified as checked or unchecked because they appear as their Unicode code point:

l <- as.character(docx_extract_all_tbls(k, 1))[2]
print(l)
# "list(Modern.text.form.field..Test.entry.No2Modern.Checkbox...U.2612..Yes..U.2610..No = character(0))"
t <- gsub("\\.", " ", l)
cbox <- regmatches(t, regexec("Modern Checkbox   U \\s*(.*?)\\s*(  Yes  U |$)", 
                               text=t))[[1]][2]  
ifelse(cbox=="2612", "Yes", "No")

However, this does not help me, because my Word files contain legacy checkboxes rather than modern ones. How can I detect the checked or unchecked state of the legacy checkboxes in my Word documents?
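
One direction that might be worth exploring is to read the document XML directly instead of the extracted table text. As far as I understand WordprocessingML, a legacy checkbox is stored as a w:ffData element containing w:checkBox, where a w:checked child records the current state and w:default is the fallback when w:checked is absent. The sketch below assumes the xml2 package is installed; legacy_checkbox_states is a hypothetical helper name, not part of docxtractr, and the inline XML is a hand-written stand-in for a real word/document.xml.

```r
library(xml2)

# Hypothetical helper (not part of docxtractr): one row per legacy checkbox.
legacy_checkbox_states <- function(doc) {
  ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")
  boxes <- xml_find_all(doc, "//w:ffData[w:checkBox]", ns)

  # OOXML on/off elements count as "on" when w:val is missing.
  is_on <- function(node) {
    val <- xml_attr(node, "val")
    is.na(val) || !(val %in% c("0", "false", "off"))
  }

  checked <- vapply(boxes, function(ff) {
    # Prefer the current state; fall back to the default state.
    state <- xml_find_first(ff, "w:checkBox/w:checked", ns)
    if (inherits(state, "xml_missing")) {
      state <- xml_find_first(ff, "w:checkBox/w:default", ns)
    }
    if (inherits(state, "xml_missing")) FALSE else is_on(state)
  }, logical(1))

  data.frame(
    name = xml_attr(xml_find_first(boxes, "w:name", ns), "val"),
    checked = checked,
    stringsAsFactors = FALSE
  )
}

# Hand-written fragment standing in for word/document.xml (assumption).
sample_xml <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/',
  'wordprocessingml/2006/main"><w:body><w:p>',
  '<w:r><w:fldChar w:fldCharType="begin"><w:ffData>',
  '<w:name w:val="Check1"/><w:checkBox><w:sizeAuto/>',
  '<w:default w:val="0"/><w:checked/></w:checkBox>',
  '</w:ffData></w:fldChar></w:r>',
  '<w:r><w:fldChar w:fldCharType="begin"><w:ffData>',
  '<w:name w:val="Check2"/><w:checkBox><w:sizeAuto/>',
  '<w:default w:val="0"/></w:checkBox>',
  '</w:ffData></w:fldChar></w:r>',
  '</w:p></w:body></w:document>'
)

states <- legacy_checkbox_states(read_xml(sample_xml))
print(states)
```

For a real file, the same function could be applied after extracting word/document.xml from the .docx archive, for example with unzip("Sample file here.docx", files = "word/document.xml", exdir = tempdir()) followed by read_xml() on the extracted file. The rows come back in document order, so matching each checkbox to its "Yes" or "No" label would still rely on position or on the form field names.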



