I have data collected with standardized forms designed in Word. The forms contain free-text entries in legacy text form fields as well as legacy checkboxes, and I am having trouble scraping the checkboxes. Here is a sample file that mimics the data.
I have been harvesting the text using the docxtractr package for R together with regular expressions, but the checkboxes look the same whether they are checked or unchecked, so I cannot harvest that part of my data:
library(docxtractr)
k <- read_docx("Sample file here.docx")
l <- as.character(docx_extract_all_tbls(k, 1))[1]
print(l)
# "list(Text.boxes...FORMTEXT.Test.Entry.No1Legacy.checkboxes...FORMCHECKBOX..Yes..FORMCHECKBOX..No = character(0))"
# Harvesting the text data:
t <- gsub("\\.+", " ", l) # collapse each run of dots into one space so the pattern below can match
text_data <- regmatches(t, regexec("Text boxes FORMTEXT \\s*(.*?)\\s*(Legacy checkboxes|$)",
text=t))[[1]][2]
print(text_data)
# "Test Entry No1"
All legacy checkboxes appear as "FORMCHECKBOX", regardless of checked or unchecked state.
Non-legacy ("modern") checkboxes can easily be identified as checked or unchecked because they appear with their Unicode code point (U+2612 for a checked box, U+2610 for an unchecked one):
l <- as.character(docx_extract_all_tbls(k, 1))[2]
print(l)
# "list(Modern.text.form.field..Test.entry.No2Modern.Checkbox...U.2612..Yes..U.2610..No = character(0))"
t <- gsub("\\.+", " ", l) # collapse each run of dots into one space so the pattern below can match
cbox <- regmatches(t, regexec("Modern Checkbox U \\s*(.*?)\\s*( Yes U |$)",
text=t))[[1]][2]
ifelse(cbox=="2612", "Yes", "No")
However, this does not work for me because my Word files contain legacy checkboxes rather than modern ones. How can I detect the checked or unchecked state of the legacy checkboxes in my Word documents?
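One direction that may be worth exploring is to read the raw XML instead of the extracted table text. The sketch below assumes the usual WordprocessingML layout, in which each legacy checkbox is stored as a `w:checkBox` element inside `w:ffData`, with an optional `w:checked` child and a `w:default` child as fallback. It uses the xml2 package, and the helper names `is_on` and `legacy_checkbox_states` are hypothetical. The embedded XML is a hand-written stand-in for `word/document.xml`, not the contents of the sample file.

```r
library(xml2)

# Treat a WordprocessingML on/off element as TRUE unless w:val says otherwise.
# Assumption: a missing w:val attribute means "on".
is_on <- function(node) {
  val <- xml_attr(node, "val", default = "1")
  !(tolower(val) %in% c("0", "false", "off"))
}

# Hypothetical helper: one logical value per legacy checkbox, in document order.
legacy_checkbox_states <- function(doc) {
  ns <- xml_ns(doc)  # assumes the document declares the "w" prefix
  boxes <- xml_find_all(doc, "//w:checkBox", ns)
  vapply(seq_along(boxes), function(i) {
    # An explicit w:checked element takes precedence over w:default.
    checked <- xml_find_first(boxes[[i]], "./w:checked", ns)
    if (!inherits(checked, "xml_missing")) return(is_on(checked))
    default <- xml_find_first(boxes[[i]], "./w:default", ns)
    if (!inherits(default, "xml_missing")) return(is_on(default))
    FALSE
  }, logical(1))
}

# Build one paragraph holding a legacy checkbox with the given child elements.
box <- function(inner) paste0(
  '<w:p><w:r><w:fldChar w:fldCharType="begin"><w:ffData><w:checkBox>',
  inner,
  '</w:checkBox></w:ffData></w:fldChar></w:r></w:p>'
)

sample_xml <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>',
  box('<w:sizeAuto/><w:default w:val="0"/><w:checked/>'),            # ticked
  box('<w:sizeAuto/><w:default w:val="0"/>'),                        # never ticked
  box('<w:sizeAuto/><w:default w:val="1"/><w:checked w:val="0"/>'),  # default on, then unticked
  '</w:body></w:document>'
)

doc <- read_xml(sample_xml)
states <- legacy_checkbox_states(doc)
print(ifelse(states, "checked", "unchecked"))
```

For a real file, the same function could be applied after extracting the main document part, for example with `unzip("Sample file here.docx", files = "word/document.xml", exdir = tempdir())` followed by `read_xml(file.path(tempdir(), "word", "document.xml"))`. Matching each state to its "Yes" or "No" label would still require walking the surrounding runs, which this sketch does not attempt.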