The Knight Foundation announced today that it will give Raleigh Public Record a $50,000 grant to support a six-month push to develop docHive, a new program to turn PDF image files with structured data into a spreadsheet-friendly format.
We released a command-line version of the program earlier today, but it requires some programming knowledge and takes some work to get it converting documents properly. This new funding will allow us to continue developing a visual version so anyone can use the software.
Here’s essentially how it works: take a North Carolina county-level campaign finance form, or anything else that holds structured data on a piece of paper or in a scanned image. Think about a phone bill as an everyday example: it lists the number you called, how long you talked and maybe a few other pieces of data. DocHive lets users draw squares around each piece of data, then pulls those regions out as separate images that a computer can read and turn into numbers or letters. Those pieces of data go into a CSV file, essentially a plain-text spreadsheet, and that file can then go into Excel, Access or any other spreadsheet or database program.
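For readers who think in code, the workflow described above can be sketched roughly like this. This is not docHive's actual code: every name here (`crop`, `fake_ocr`, `extract_row`, `to_csv`) is hypothetical, the "page" is a list of text rows standing in for a scanned image, and `fake_ocr` is a placeholder for a real character-recognition step. It only illustrates the idea of a template of boxes, one crop per box, and one CSV column per field.

```python
import csv
import io
from typing import Callable, Dict, List, Sequence, Tuple

# A box is (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]
# Assumption: a "page" is rows of characters, a stand-in for image pixels.
Page = Sequence[str]


def crop(page: Page, box: Box) -> List[str]:
    """Cut the rectangle described by `box` out of the page."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def fake_ocr(region: Sequence[str]) -> str:
    """Placeholder for real OCR: join the non-empty lines of the crop."""
    return " ".join(line.strip() for line in region if line.strip())


def extract_row(
    page: Page,
    template: Dict[str, Box],
    recognize: Callable[[Sequence[str]], str] = fake_ocr,
) -> Dict[str, str]:
    """Apply every box in the template to one page, giving one spreadsheet row."""
    return {field: recognize(crop(page, box)) for field, box in template.items()}


def to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Write the extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# A made-up phone bill and the boxes a user might draw on it.
page = [
    "Number: 919-555-0100",
    "Minutes: 12",
]
template = {
    "number": (8, 0, 20, 1),
    "minutes": (9, 1, 11, 2),
}

rows = [extract_row(page, template)]
csv_text = to_csv(rows, ["number", "minutes"])
print(csv_text)
# number,minutes
# 919-555-0100,12
```

The same template can be reused on every page that shares the form's layout, which is why drawing the squares once pays off across a whole stack of documents.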
This new funding will give us six months for a “code sprint” to get docHive ready for a stable beta release. My brother, Edward Duncan, is the lead software engineer on the project and has essentially worked as a volunteer on this project for the past year or so. We will be able to pay him something closer to a reasonable wage and bring on two other contract programmers to help with some of the big pieces.
We have from the beginning been committed to keeping docHive open source. You can see the code and what we’re working with over at GitHub.
I want to give a big shout-out to Raleigh’s own Beehive Collective, a giving circle here in town. They really got this thing rolling with a $1,500 grant last fall. We received another $5,500 late last year from the Reporters’ Lab at the DeWitt Wallace Center for Media and Democracy at Duke to show proof of concept.
The ultimate dream is to figure out how to use this software to support our day-to-day news operations at the Record. Plenty of companies (think of Raleigh’s own Red Hat) make money off of open source, so maybe we can too, and move toward a truly sustainable business model for public service journalism in Raleigh.