Convert PDF (From Excel) into JSON/CSV

TheNuzziNuzz · September 29, 2021

I need a way to convert PDFs containing data exported from Excel into JSON or CSV.

Backstory:

I'm working on a website for a friend's company. They have thousands (about 7,000) of Excel file databases they've assembled over the years. They exported all of these Excel files via Excel's Save as PDF function to create "Read Only Backups." They then "lost" the original Excel files. I need to make all 7,000 datasheets accessible in an online database in text/CSV format.

The PDFs do actually contain text information. I've been able to use Node.js with PDF.js to extract the text along with its coordinates...but not in a spreadsheet format.
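For anyone starting from that same coordinates-plus-text output, here is a minimal sketch of turning it into rows. It assumes each extracted item has already been reduced to an (x, y, text) tuple in PDF coordinates (y grows upward); the names items_to_rows, rows_to_csv, and y_tolerance are hypothetical, not part of any library. It only groups by baseline and sorts left to right, so empty cells are not preserved as blank columns.

```python
import csv
import io


def items_to_rows(items, y_tolerance=2.0):
    """Group (x, y, text) items into table rows.

    Items whose y differs from the first item of the current row by at
    most y_tolerance are treated as the same row (hypothetical threshold).
    """
    rows = []  # list of (row_y, [(x, text), ...])
    # Larger y is higher on the page, so sort descending for top-to-bottom.
    for x, y, text in sorted(items, key=lambda it: -it[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells left to right by x.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


def rows_to_csv(rows):
    """Serialize rows to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        (50, 700, "Name"), (150, 700.5, "Qty"),
        (150, 680, "3"), (50, 680.4, "Widget"),
    ]
    print(rows_to_csv(items_to_rows(sample)))
```

Real sheets with blank cells would need an extra step that maps x positions to column indexes, which is the part dedicated table extractors handle for you.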

I've also had success with Adobe's PDF convert function, but it takes 10-15 seconds per PDF, and each PDF has to be converted by manually opening the file and clicking convert...far too slow.

mariushm · September 29, 2021

You may have better luck using Adobe Acrobat Pro (or some other tool) to merge multiple PDFs into a single PDF, then export to grayscale images at a reasonable DPI, then load them into OCR software like ABBYY FineReader.

The OCR software is smart enough to detect the tables and may give you the option to save to an Excel sheet. Of course you'd have to proofread, but considering it's just one or a couple of fonts, once you proofread a few pages and the software learns the character shapes, it will autocorrect itself.

TheNuzziNuzz · September 29, 2021

I don't want or need to use OCR. The textual data is stored in the PDF as text. I can use PDF text extractors without OCR to pull out the data...just not in bulk. For example, this old website is perfect and exactly what I need...I just need to be able to run it locally on 7,000 PDFs. I need to figure out how this website does what it does:

https://www.pdf-online.com/osa/extract.aspx

badreg · September 29, 2021

You can use Tabula to do this.

Python example:

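import tabula  # the tabula-py package; it needs a Java runtime installed

# read_pdf returns a list of DataFrames, one per detected table; take the first
f = tabula.read_pdf("example.pdf", pages='all')[0]
# Write the tables found on all pages of the PDF to a single CSV file
tabula.convert_into("example.pdf", "example.csv", output_format="csv", pages='all')
print(f)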
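To run this over a whole folder (the 7,000 PDFs mentioned above), something like the following sketch might work. The folder names and the batch_convert and tabula_convert helpers are hypothetical; only tabula.convert_into is the library's own call. Failures are collected per file so one bad PDF doesn't stop the run, and files that already have an output are skipped so the job can be resumed.

```python
from pathlib import Path


def tabula_convert(pdf_path, out_path, output_format):
    """Default converter: delegates to tabula-py (requires Java)."""
    import tabula  # imported lazily so the loop itself has no hard dependency
    tabula.convert_into(
        str(pdf_path), str(out_path),
        output_format=output_format, pages="all",
    )


def batch_convert(src_dir, dst_dir, output_format="csv", convert=tabula_convert):
    """Convert every *.pdf in src_dir, writing one output file per PDF.

    Returns (done, failed): output file names written on this run, and
    (pdf name, error message) pairs for PDFs that raised an exception.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    done, failed = [], []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / f"{pdf.stem}.{output_format}"
        if out.exists():
            continue  # already converted on a previous run
        try:
            convert(pdf, out, output_format)
            done.append(out.name)
        except Exception as exc:  # keep going; report the failure at the end
            failed.append((pdf.name, str(exc)))
    return done, failed


if __name__ == "__main__":
    # "pdfs" and "csv_out" are placeholder folder names.
    done, failed = batch_convert("pdfs", "csv_out")
    print(f"converted {len(done)}, failed {len(failed)}")
```

Passing output_format="json" should give JSON instead of CSV, since convert_into accepts that format as well. tabula-py also has a convert_into_by_batch helper for directories; the explicit loop is used here only so that failures can be recorded per file.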

TheNuzziNuzz · September 30, 2021

I've literally been looking for something like this for months. Literally it's been months. This is incredible. Thank you so much! It's perfect, does exactly what I need.
