checkbox: How to read legacy checkbox states from .docx files using docxtractr

Tuesday, 12 January 2021

I have data collected with standardized forms designed in Word. The forms contain free-text entries in legacy text form fields as well as legacy checkboxes, and I am having trouble scraping the checkboxes. Here is a sample file that mimics the data.

I have been harvesting the text with the docxtractr package for R plus regular expressions, but the checkboxes look identical in the extracted text whether they are checked or unchecked, so I cannot recover their state:

library(docxtractr)  # provides read_docx() and docx_extract_all_tbls()

k <- read_docx("Sample file here.docx")
l <- as.character(docx_extract_all_tbls(k, 1))[1]
print(l)
# "list(Text.boxes...FORMTEXT.Test.Entry.No1Legacy.checkboxes...FORMCHECKBOX..Yes..FORMCHECKBOX..No = character(0))"

# Harvesting the text data:
t <- gsub("\\.", " ", l) # getting rid of the ugly dots
text_data <- regmatches(t, regexec("Text boxes   FORMTEXT \\s*(.*?)\\s*(Legacy checkboxes|$)", 
                               text=t))[[1]][2]  
print(text_data)
# "Test Entry No1"

All legacy checkboxes appear as "FORMCHECKBOX", regardless of checked or unchecked state.

Non-legacy ("modern") checkboxes can easily be identified as checked or unchecked because they appear as their Unicode code points (U+2612 for a checked box, U+2610 for an unchecked one):

l <- as.character(docx_extract_all_tbls(k, 1))[2]
print(l)
# "list(Modern.text.form.field..Test.entry.No2Modern.Checkbox...U.2612..Yes..U.2610..No = character(0))"
t <- gsub("\\.", " ", l)
cbox <- regmatches(t, regexec("Modern Checkbox   U \\s*(.*?)\\s*(  Yes  U |$)", 
                               text=t))[[1]][2]  
ifelse(cbox=="2612", "Yes", "No")

However, this does not help me, because my Word files contain legacy checkboxes rather than modern ones. How can I detect the checked or unchecked state of the legacy checkboxes in my Word documents?
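One possible direction, offered as a sketch rather than a confirmed solution: in the underlying WordprocessingML, a legacy checkbox is stored as a `w:checkBox` element inside `w:ffData`, with its state in a `w:checked` child (falling back to `w:default` when `w:checked` is absent). The extracted table text never sees these elements, which would explain why both states look the same. The example below uses the xml2 package on a hypothetical inline fragment, not the sample file; the field names `Check1` and `Check2` and the helper `is_checked()` are made up for illustration.

```r
library(xml2)

# Hypothetical fragment imitating word/document.xml with two legacy
# checkboxes: Check1 is ticked, Check2 is not.
xml_txt <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p>',
  '<w:r><w:fldChar w:fldCharType="begin"><w:ffData><w:name w:val="Check1"/>',
  '<w:checkBox><w:sizeAuto/><w:default w:val="0"/><w:checked/></w:checkBox>',
  '</w:ffData></w:fldChar></w:r><w:r><w:t>Yes</w:t></w:r>',
  '<w:r><w:fldChar w:fldCharType="begin"><w:ffData><w:name w:val="Check2"/>',
  '<w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox>',
  '</w:ffData></w:fldChar></w:r><w:r><w:t>No</w:t></w:r>',
  '</w:p></w:body></w:document>'
)

doc <- read_xml(xml_txt)
ns  <- xml_ns(doc)

# Every legacy checkbox definition in the document
boxes <- xml_find_all(doc, ".//w:ffData/w:checkBox", ns)

# Illustrative helper: prefer <w:checked>, fall back to <w:default>.
# A missing w:val attribute on either element is read as "true".
is_checked <- function(box, ns) {
  state <- xml_find_first(box, "./w:checked", ns)
  if (inherits(state, "xml_missing")) {
    state <- xml_find_first(box, "./w:default", ns)
  }
  if (inherits(state, "xml_missing")) {
    return(FALSE)
  }
  val <- xml_attr(state, "w:val", ns)
  is.na(val) || !(val %in% c("0", "false", "off"))
}

# Field names stored next to each checkbox, and their states
box_names <- xml_attr(xml_find_first(boxes, "../w:name", ns), "w:val", ns)
checked   <- vapply(seq_along(boxes),
                    function(i) is_checked(boxes[[i]], ns),
                    logical(1))

result <- data.frame(name = box_names, checked = checked)
print(result)
```

For a real file, the same XPath could be run on the document part itself, for example `read_xml(unz("Sample file here.docx", "word/document.xml"))`. The object returned by `read_docx()` also appears to hold the parsed XML and namespaces as `k$docx` and `k$ns`, but that is an assumption worth checking against your installed docxtractr version. Matching each checkbox to its "Yes"/"No" label would still require walking the neighbouring runs, which this sketch does not attempt.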
