In about six weeks the Record will release a new open-source program to help journalists turn PDF files into structured data. The new software will enable reporters to take an image containing data — say a scanned campaign finance return — and turn that into a spreadsheet.
This may sound boring, but it’s a problem that we at the Record have been trying to overcome for more than two years. The story started with Wake County campaign finance returns. The returns are filed as paper, and staff at the Wake County Board of Elections scan them in and put the images online. The problem is, the only way to view the data is to look at it page by page, and the only way to analyze it is to go through by hand and enter the data into a spreadsheet one row at a time.
We’re a small news organization; we don’t have the staff to do data entry for hundreds of pages of campaign finance information. We also don’t have the budget to hire some unfortunate college students to do it for us.
Edward Duncan, my brother and a full-time programmer, and I have been thinking about how to tackle this problem since 2010. We had been kicking ideas back and forth until Edward stumbled across this solution last summer.
The new program aims to pull the data from the documents and put it into a spreadsheet.
It’s called DocHive, and here’s how it works: the program uses a template written in XML, a markup language for describing structured data, to break a page up into smaller sections.
For example, in the campaign finance documents, it will make separate sections for donor name, occupation, donation amount and all the other fields. Then, it will take each of those sections and turn it into a separate image file. The software takes that small image and uses optical character recognition technology, known by the acronym OCR, to read the few words or numbers it contains and insert them into a text file.
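To make the idea concrete, here is a rough sketch in Python of how a template-driven approach like this could look. It is not DocHive's actual code: the XML element and attribute names, the pixel coordinates and the `load_regions` and `extract_fields` helpers are all invented for illustration, and the cropping and OCR steps are passed in as functions so the sketch does not depend on any particular library.

```python
import xml.etree.ElementTree as ET

# Hypothetical template. The article does not show DocHive's real XML schema,
# so these element names, attribute names and coordinates are assumptions.
TEMPLATE = """
<template name="campaign-finance-return">
  <field name="donor_name" left="40" top="200" right="400" bottom="230"/>
  <field name="occupation" left="410" top="200" right="640" bottom="230"/>
  <field name="amount" left="650" top="200" right="760" bottom="230"/>
</template>
"""


def load_regions(template_xml):
    """Return {field name: (left, top, right, bottom)} read from the template."""
    root = ET.fromstring(template_xml)
    regions = {}
    for field in root.iter("field"):
        box = tuple(
            int(field.get(edge)) for edge in ("left", "top", "right", "bottom")
        )
        regions[field.get("name")] = box
    return regions


def extract_fields(page, regions, crop, ocr):
    """Cut each region out of the page and run OCR on that small piece alone.

    `crop(page, box)` and `ocr(image)` are supplied by the caller, so this
    function makes no assumption about which imaging or OCR tool is used.
    """
    return {name: ocr(crop(page, box)).strip() for name, box in regions.items()}
```

With the Pillow and pytesseract packages installed, one plausible way to wire this up would be `crop=lambda img, box: img.crop(box)` and `ocr=pytesseract.image_to_string`, where `page` is an image opened with Pillow. That pairing is a suggestion, not a description of what DocHive uses.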
This method works with county-level campaign finance returns in North Carolina, but it can also work with almost any other standardized document format. OCR is a great technology for scanning something in and reading it, but it’s hard to turn an entire page into text you can cut and paste from accurately. A full page is just too much for most OCR programs to handle, and you still have the problem of turning the result into a spreadsheet.
The new program works so well because it is able to break the page down into its component parts and run OCR on each much smaller image. Each page could be broken down into as many as two hundred smaller images to be processed into a spreadsheet.
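As a rough illustration of how one page can turn into a couple hundred small images and then into spreadsheet rows, here is a Python sketch. It assumes a form laid out as evenly spaced rows with fixed column positions, which is our simplification rather than a description of DocHive's internals. The `row_boxes` and `rows_to_csv` helpers, the coordinates and the sample donors are all made up.

```python
import csv
import io


def row_boxes(columns, first_top, row_height, row_count):
    """Yield one {field: (left, top, right, bottom)} dict per form row.

    `columns` maps a field name to its (left, right) pixel edges. Rows are
    assumed to be evenly spaced, which is an illustrative simplification.
    """
    for i in range(row_count):
        top = first_top + i * row_height
        yield {
            name: (left, top, right, top + row_height)
            for name, (left, right) in columns.items()
        }


def rows_to_csv(rows, columns):
    """Write one spreadsheet row per dict of OCR'd field values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # A field the OCR step missed becomes an empty cell, not an error.
        writer.writerow({col: row.get(col, "") for col in columns})
    return buffer.getvalue()


# 25 rows of 8 fields each would give 200 small images for a single page.
layout = {"donor_name": (40, 400), "amount": (650, 760)}
boxes = list(row_boxes(layout, first_top=200, row_height=30, row_count=3))

# Made-up values standing in for the OCR output of those small images.
rows = [
    {"donor_name": "Jane Doe", "occupation": "Teacher", "amount": "250.00"},
    {"donor_name": "John Roe", "amount": "100.00"},
]
print(rows_to_csv(rows, ["donor_name", "occupation", "amount"]))
```

Keeping the layout math separate from the CSV step means the same spreadsheet writer can be reused for any document type, with only the layout changing from one template to the next.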
We are currently finishing the core functions of the program and creating a user interface so anybody can create a template. Each type of document needs its own template, and right now that’s done by writing the XML by hand.
The Record will release the beta version of DocHive at the NICAR conference Feb. 28 in Louisville, Ky. Development has been made possible by a grant from Raleigh’s own Beehive Collective (hence the DocHive name) and the kind folks at Reporters’ Lab, housed in the DeWitt Wallace Center for Media and Democracy at Duke University.
Let us know if you’ve got any tricky document sets we can use to test DocHive or want to help test or prepare the new program for release.