Parsing PDFs using Alteryx (and a little R)

Motivation
Suppose we have a PDF which contains a table and we would like to extract that table.
[Image: table in PDF]
The R package pdftools can extract text from PDFs, and Alteryx, which is a visually intuitive drag-and-drop data analysis tool, makes it very easy for R novices to include R code snippets as part of a workflow.
Step-by-step guide
To build an Alteryx workflow which can extract text from PDFs, first install the R packages pdftools and Rcpp. To do this, right-click on the version of R that was installed with Alteryx and select “Run as administrator”.
[Image: Run as administrator]
Now run the commands below to install/update the required packages.

install.packages("Rcpp")
install.packages("pdftools")

[Image: run as administrator and install]
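To confirm that both packages are available to the R installation, a quick check like the one below can be run in the same R session. This is a minimal sketch using base R only; the variable names are illustrative and not part of the original workflow.

```r
# Packages the workflow depends on (taken from the install step above).
required <- c("Rcpp", "pdftools")

# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise; quietly = TRUE suppresses the message for a missing package.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one TRUE/FALSE entry per package.
print(installed)
```

If either entry is FALSE, the install step probably did not have write access to the library folder, which is what “Run as administrator” is meant to address.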
The Alteryx workflow starts with a Text Input tool which contains the full path of the PDF file in a field called FullPath.
[Image: Text Input tool]
Next, the R tool is used to extract the text from the PDF. The code used in the R tool is below.

# read in the PDF file location which must
# be in a field called FullPath
data <- read.Alteryx("#1",mode="data.frame")

# Use pdf_text() function to return a character vector
# containing the text for each page of the PDF
txt <- pdftools::pdf_text(file.path(data$FullPath))

# convert the character vector to a data frame
df_txt <- data.frame(txt)

# output the data frame in stream 1
write.Alteryx(df_txt, 1)

[Image: R tool code]
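Because pdf_text() returns one string per page, the data frame has one row per page. For multi-page PDFs it can help to keep the page number next to the text. The sketch below uses a made-up character vector in place of the pdf_text() result so that it runs outside Alteryx; the page column is an addition, not part of the original code.

```r
# Stand-in for the result of pdftools::pdf_text(): one string per page.
# The content is invented sample data.
txt <- c("Name    Qty\nApple    3\n", "Name    Qty\nPear    5\n")

# Keep the page number alongside the text so that rows can be traced
# back to the page they came from.
df_txt <- data.frame(
  page = seq_along(txt),
  txt = txt,
  stringsAsFactors = FALSE
)

print(df_txt$page)
```

Inside the R tool, df_txt would then be passed to write.Alteryx(df_txt, 1) exactly as before.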
The output from the R tool is one string per page of the PDF (a single string for a one-page PDF), each containing the extracted text of that page. To parse this text, use the Text to Columns tool in Alteryx to split each string into rows at the newline (\n) character.
[Image: Text to Columns - Newline]
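For readers who want to see the same step outside Alteryx, splitting at the newline can be sketched in base R. This is only an illustration of the logic, not what the Text to Columns tool runs internally, and the sample text is invented.

```r
# Hypothetical sample standing in for one page of extracted text.
page_text <- "Name    Qty    Price\nApple    3    1.20\nPear    12    0.80\n"

# Split at the newline character. strsplit() returns a list with one
# element per input string, so take the first element.
lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]

# Drop lines that are empty or contain only whitespace.
lines <- lines[nzchar(trimws(lines))]

print(lines)
```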
To split out the table columns, use the Regex tool to replace each sequence of two or more spaces with a pipe delimiter (|).
[Image: Regex tool]
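The replacement can be illustrated in base R with gsub(). The pattern " {2,}" (a space repeated two or more times) is an assumption about what the Regex tool is configured with, since the exact expression is not given here. The leading and trailing whitespace is also trimmed first, which is an addition that avoids an empty first column.

```r
# Invented sample lines, as they might look after splitting at newlines.
lines <- c("Name    Qty    Price",
           "Apple    3    1.20",
           "Green pear   12    0.80")

# Trim outer whitespace, then replace every run of two or more spaces
# with a single pipe. Single spaces inside a value are left alone.
delimited <- gsub(" {2,}", "|", trimws(lines))

print(delimited)
```

Note that a value such as "Green pear" survives intact because it contains only a single space. A table whose columns are separated by just one space would not be split by this approach.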
Finally, use the Text to Columns tool to split the strings at the pipe delimiter, and use the Dynamic Rename tool to take the column names from the first row.
[Image: full workflow]
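The last two steps, splitting at the pipe and promoting the first row to column names, can be sketched in base R as follows. This is an illustration of the logic rather than a replacement for the Alteryx tools. It assumes that every line has the same number of fields, and the sample data is invented.

```r
# Invented pipe-delimited lines; the first one holds the column names.
delimited <- c("Name|Qty|Price", "Apple|3|1.20", "Green pear|12|0.80")

# Split each line at the literal pipe character.
parts <- strsplit(delimited, "|", fixed = TRUE)

# Stack the data rows (everything except the first line) into a matrix.
# Assumes all lines have the same number of fields.
rows <- do.call(rbind, parts[-1])

# Convert to a data frame and take the column names from the first line.
tbl <- as.data.frame(rows, stringsAsFactors = FALSE)
names(tbl) <- parts[[1]]

print(tbl)
```

All columns come out as character. Converting Qty and Price to numbers would be a separate step, much like setting data types with a Select tool in Alteryx.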
Reference
Alteryx Community | PDF Parsing in Alteryx using R
