Reading data from PDF files into R

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDFs? Or should I leave that to a command-line tool?

The reports were made in Excel and then converted to PDF, so they have a regular structure, but many blank "cells".

Answers


Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.
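A quick way to check for this before investing in parsing is to see whether text extraction returns anything besides whitespace. The sketch below is only an illustration: `has_text_layer` is a made-up helper name, and it assumes you already have one string per page (for example from a text-extraction call such as `pdftools::pdf_text()`).

```r
# Sketch: decide whether extracted page text suggests a real text layer.
# `pages` is a character vector with one element per page
# (hypothetical input; obtain it with whatever extractor you use).
has_text_layer <- function(pages) {
  # TRUE if at least one page contains a non-whitespace character
  any(nzchar(trimws(pages)))
}

has_text_layer(c("", "  \n "))          # FALSE: nothing but whitespace
has_text_layer(c("", "Height Weight"))  # TRUE: at least one page has text
```

If this comes back FALSE for a whole document, the pages are most likely scanned images and OCR is the only route.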

On top of that, in my sad experience there's no guarantee that apps which create PDF docs all behave the same, so the data in your table may or may not be read out in the desired order (as a result of the way the doc was built). Be cautious.

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)


So... this gets me close even on a fairly complex table.

Download a sample PDF (the BMI table, saved here as bmi_tbl.pdf), then:

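library(tm)

# Build a reader that calls pdftotext with -layout to preserve column alignment.
# (Newer tm releases take this option as control = list(text = "-layout").)
pdf <- readPDF(PdftotextOptions = "-layout")

dat <- pdf(elem = list(uri='bmi_tbl.pdf'), language='en', id='id1')

# Collapse runs of spaces into commas, then parse as CSV.
# Caveat: blank cells collapse too, so columns can shift on sparse rows.
dat <- gsub(' +', ',', dat)
out <- read.csv(textConnection(dat), header=FALSE)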
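Since the question mentions many blank "cells", it may be worth cutting the layout-preserved text by character position instead of by runs of spaces. The following is a minimal sketch under stated assumptions: the sample lines, the column boundaries in `starts`/`ends`, and the helper `split_fixed` are all invented for illustration, and real reports would need their boundaries worked out by inspecting the extracted text.

```r
# Hypothetical stand-in for "-layout" output: columns aligned by position,
# with a blank middle cell in the last row. sprintf pads each cell to a
# fixed width so the alignment is exact.
lines <- c(
  sprintf("%-8s%-8s%-4s", "Height", "Weight", "BMI"),
  sprintf("%-8s%-8s%-4s", "60", "100", "19.5"),
  sprintf("%-8s%-8s%-4s", "61", "", "20.1")
)

# Assumed column boundaries (1-based character positions).
starts <- c(1, 9, 17)
ends   <- c(8, 16, 20)

# Cut one line into cells by position and trim the padding.
split_fixed <- function(line, starts, ends) {
  trimws(substring(line, starts, ends))
}

# One row per input line, one column per field.
cells <- t(vapply(lines, split_fixed, character(length(starts)),
                  starts = starts, ends = ends, USE.NAMES = FALSE))

# First row is the header; empty cells become NA instead of vanishing.
out <- as.data.frame(cells[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(out) <- cells[1, ]
out[out == ""] <- NA
```

Base R's `read.fwf()` does much the same job from a file or connection if you prefer to pass column widths directly.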

The current package du jour for getting text out of PDFs is pdftools (successor to Rpoppler, mentioned in another answer). It works great on Linux, Windows and OSX:

install.packages("pdftools")
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
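pdf_text() returns one string per page with embedded newlines, so a typical next step is to break the pages into individual lines before parsing. A minimal sketch, using made-up sample strings in place of real output (`pages_demo` and `page_lines` are illustrative names, not part of pdftools):

```r
# Stand-in for pdf_text() output: one string per page, lines joined by "\n".
pages_demo <- c("Title line\nrow 1\nrow 2\n", "row 3\n\nrow 4")

# Split every page on newlines and drop lines that are empty or whitespace-only.
page_lines <- function(pages) {
  lines <- unlist(strsplit(pages, "\n", fixed = TRUE))
  lines[nzchar(trimws(lines))]
}

page_lines(pages_demo)
```

The resulting character vector can then be fed to whatever line-level parsing suits the report layout.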

You can also (now) use the new (2015-07) Rpoppler package:

Rpoppler::PDF_text(file)

It includes 3 functions (4, really, but one just gets you a ptr to the PDF object):

  • PDF_fonts PDF font information
  • PDF_info PDF document information
  • PDF_text PDF text extraction

(posting as an answer to help new searchers find the package).


Per zx8754, the following works on Windows 7 with pdftotext.exe in the working directory:

library(tm)
uri = 'bmi_tbl.pdf'
pdf = readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                language = "en", id = "id1")   
