Motivation
Suppose we have a PDF which contains a table and we would like to extract that table.
The R package pdftools can extract text from PDFs, and Alteryx, a visual drag-and-drop data analysis tool, makes it easy for R novices to include R code snippets in a workflow.
Step-by-step guide
To build an Alteryx workflow which can extract text from PDFs, first install the packages pdftools and Rcpp. To do this, right-click on the R version which was installed with Alteryx and select “Run as administrator”.
Now run the commands below to install/update the required packages.
install.packages("Rcpp")
install.packages("pdftools")
The Alteryx workflow starts with a Text Input tool which contains the full path of the PDF file.
Next the R tool is used to extract the text from the PDF. The code used in the R tool is below.
# Read in the PDF file location, which must be in a field called FullPath
data <- read.Alteryx("#1", mode = "data.frame")
# Use pdf_text() to return a character vector containing the text of each page of the PDF
txt <- pdftools::pdf_text(file.path(data$FullPath))
# Convert the character vector to a data frame
df_txt <- data.frame(txt)
# Output the data frame on stream 1
write.Alteryx(df_txt, 1)
The output from the R tool is one string per page of the PDF, containing the text extracted from that page. To parse it, use the Text to Columns tool in Alteryx to split each string into rows at the newline (\n) character.
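For readers who would rather see the row split expressed in R, the following is a minimal base R sketch of the same idea. The variable page_text is a made-up sample standing in for one page of pdf_text() output; it is not taken from a real PDF.

```r
# Hypothetical sample standing in for one element of the pdf_text() result
page_text <- "Name      Qty   Price\nApple     3     1.20\nPear      10    0.85\n"

# Split the page string into one element per line at the newline character
rows <- strsplit(page_text, "\n", fixed = TRUE)[[1]]

# Drop any lines that are empty or contain only whitespace
rows <- rows[nzchar(trimws(rows))]

print(rows)
```

Each element of rows corresponds to one row produced by the Text to Columns tool when it splits to rows.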
To split out the table columns, use the Regex tool to replace each sequence of two or more spaces with a pipe delimiter (|).
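As a rough illustration of the replacement, the sketch below applies the same pattern with base R's gsub(). The sample rows are invented, and the call to trimws() is an added assumption: without it, leading spaces on a line would become a leading pipe.

```r
# Hypothetical sample rows, as they might look after splitting on newlines
rows <- c("Name      Qty   Price",
          "Apple     3     1.20",
          "Green pear  10    0.85")

# Trim outer whitespace (an assumption, not part of the original workflow),
# then replace every run of two or more spaces with a pipe
delimited <- gsub(" {2,}", "|", trimws(rows))

print(delimited)
# [1] "Name|Qty|Price"     "Apple|3|1.20"       "Green pear|10|0.85"
```

Single spaces inside a value, as in "Green pear", are left alone, which is why the pattern requires at least two spaces. A table whose columns are separated by only one space would not be split correctly by this approach.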
Finally, use the Text to Columns tool to split the strings at the pipe delimiter, and use the Dynamic Rename tool to take the column names from the first row.
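The final split and rename can be sketched in base R as follows. The input is invented sample data, and the sketch assumes every row contains the same number of pipe-delimited fields; ragged rows would need extra handling.

```r
# Hypothetical pipe-delimited rows; the first row holds the column names
delimited <- c("Name|Qty|Price", "Apple|3|1.20", "Green pear|10|0.85")

# Split each row at the literal pipe character
parts <- strsplit(delimited, "|", fixed = TRUE)

# Stack the pieces into a character matrix, one row per input line
cells <- do.call(rbind, parts)

# Use the first row as column names and the remaining rows as data
table_df <- as.data.frame(cells[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(table_df) <- cells[1, ]

print(table_df)
```

All columns come out as character, so numeric fields such as Qty would still need converting, for example with as.numeric().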
Reference
Alteryx Community | PDF Parsing in Alteryx using R