Convert PDF (From Excel) into JSON/CSV

Solved by badreg:

You can use Tabula to do this.

 

Python example:

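import tabula  # provided by the tabula-py package; needs Java installed

# read_pdf returns a list of DataFrames, one per detected table; [0] is the first
f = tabula.read_pdf("example.pdf", pages='all')[0]
# write the tables found on all pages of the PDF to one CSV file
tabula.convert_into("example.pdf", "example.csv", output_format="csv", pages='all')
print(f)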
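
The snippet above converts a single file. For the roughly 7,000 PDFs in the question, one way to extend it is a loop over a folder. The following is only a sketch under assumptions: the function name `convert_folder`, the folder layout, and the injectable `convert` argument are made up for illustration and are not part of tabula-py. Only `tabula.convert_into` is taken from the answer.

```python
from pathlib import Path


def convert_folder(pdf_dir, csv_dir, convert=None):
    """Convert every *.pdf in pdf_dir to a same-named .csv in csv_dir.

    Returns a list of (pdf_path, exception) pairs for files that failed.
    `convert` is a hypothetical hook: any callable taking (src, dst) paths.
    """
    if convert is None:
        # Imported lazily so the loop itself can be tried without tabula-py/Java.
        import tabula

        def convert(src, dst):
            tabula.convert_into(src, dst, output_format="csv", pages="all")

    out_dir = Path(csv_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".csv")
        if target.exists():
            continue  # converted on an earlier run, so skip it
        try:
            convert(str(pdf), str(target))
        except Exception as exc:
            # One bad PDF should not stop the batch; record it and move on.
            target.unlink(missing_ok=True)  # drop any partial output
            failures.append((pdf, exc))
    return failures
```

Calling `convert_folder("pdfs", "csv")` would then process a folder named `pdfs` (an assumed name) and return the files that need a second look. tabula-py also ships a `convert_into_by_batch` helper for whole directories; the explicit loop is used here only so that failures can be recorded per file and an interrupted run can be resumed.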

 

I need a way to convert PDFs containing exported Excel form data into JSON or CSV.
 
Backstory:
I'm working on a website for a friend's company. They have thousands (about 7,000) of Excel file databases they've assembled over the years. They exported all of these Excel files via Excel's Save as PDF function to create "Read Only Backups." They then "lost" the original Excel files. I need to make all 7,000 datasheets accessible in an online database in TEXT/CSV format.
 
The PDFs do actually contain text information. I've been able to use PDF.js in Node.js to extract it as coordinates and text...but not in a spreadsheet format.
I've also been successful in using Adobe's PDF convert function; however, it takes 10-15 seconds per PDF, and each PDF has to be converted by manually opening the file and clicking convert...far too slow.
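
For readers in the same spot, coordinate-and-text output like that described above can in principle be turned into rows by grouping items with nearly equal vertical positions. The sketch below is an illustration, not what any tool in this thread does: the `(x, y, text)` tuple shape, the function names, and the tolerance value are all assumptions, and it assumes PDF-style coordinates where y grows upwards.

```python
import csv
import io


def items_to_rows(items, y_tolerance=2.0):
    """Group (x, y, text) items into rows of cell strings.

    An item joins the current row if its y is within y_tolerance of the
    row's first item; cells in a row are ordered left to right by x.
    """
    rows = []  # each entry: [row_y, [(x, text), ...]]
    # Assumption: y increases upwards, so descending y reads top to bottom.
    for x, y, text in sorted(items, key=lambda it: -it[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows to CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample data: two rows of two cells, given out of order.
sample = [(50, 700, "Name"), (150, 700.5, "Qty"),
          (150, 680, "3"), (50, 680.4, "Widget")]
print(rows_to_csv(items_to_rows(sample)), end="")
# prints:
# Name,Qty
# Widget,3
```

This does not align columns, so an empty cell shifts the cells after it to the left. Handling that means also clustering the x positions, which is the kind of work a dedicated table extractor takes care of.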

Computers r fun

https://linustechtips.com/topic/1377086-convert-pdf-from-excel-into-jsoncsv/

You may have better luck using Adobe Acrobat Pro (or some other tool) to merge multiple PDFs into a single PDF, then export it to grayscale images at a reasonable DPI, then load them into some OCR software like ABBYY FineReader.

The OCR software is smart enough to detect the tables and may give you the option to save to an Excel sheet. Of course you'd have to proofread, but considering it's just one or a couple of fonts, once you proofread a few pages and the software learns the character shapes, it will autocorrect itself.


12 minutes ago, mariushm said:

You may have better luck using Adobe Acrobat Pro (or some other tool) to merge multiple PDFs into a single PDF, then export it to grayscale images at a reasonable DPI, then load them into some OCR software like ABBYY FineReader.

The OCR software is smart enough to detect the tables and may give you the option to save to an Excel sheet. Of course you'd have to proofread, but considering it's just one or a couple of fonts, once you proofread a few pages and the software learns the character shapes, it will autocorrect itself.

 

 

I don't want or need to use OCR. The textual data is stored in the PDF as text. I can use PDF text extractors without OCR to pull out the data...just not in bulk. For example, this old website is perfect and exactly what I need; I just need to be able to run it locally on 7,000 PDFs. I need to figure out how this website does what it does:

 

https://www.pdf-online.com/osa/extract.aspx


7 hours ago, badreg said:

You can use Tabula to do this.

 

Python example:


import tabula

f = tabula.read_pdf("example.pdf", pages='all')[0]
tabula.convert_into("example.pdf", "example.csv", output_format="csv", pages='all')
print(f)

 

I've literally been looking for something like this for months. Literally it's been months. This is incredible. Thank you so much! It's perfect, does exactly what I need.
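
Since the thread title also asks for JSON, here is a possible follow-up step using only the standard library: turning a CSV such as the one written by `convert_into` into a JSON array of records. This is a sketch under assumptions. The function name `csv_to_json` is made up, and it assumes the first CSV row holds the column headers, which may not be true for every exported sheet.

```python
import csv
import json


def csv_to_json(csv_path, json_path):
    """Read a CSV with a header row and write it as a JSON array of objects.

    Returns the number of records written. All values stay strings.
    """
    with open(csv_path, newline="", encoding="utf-8") as src:
        records = list(csv.DictReader(src))
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, indent=2)
    return len(records)
```

For instance, `csv_to_json("example.csv", "example.json")` would turn a row `Widget,3` under headers `name,qty` into `{"name": "Widget", "qty": "3"}`. tabula-py's `convert_into` also accepts `output_format="json"`, though the layout of that output may differ from the plain records produced here.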
