
Is it possible?

CookieMaster

This is a far-fetched question; I'm really just asking whether it could be done:

Program something to read PDFs.

Retain certain info from each PDF (for example, if it was a quote or invoice: the number, amount, and name).

Auto-draft an email in a format that displays that data, then auto-send it with the PDF attached.

If it is possible, what would you even code it in?

And no, I'm not looking for the code.


Well, I mean, it may be possible to create a program that auto-emails the PDF and makes the subject line the title of the document, but that is the furthest it could go (I think).

 



Depends. If the PDF contains actual text it might be possible, but an image-only PDF would be nearly impossible.


1 minute ago, Eclipsefang said:

Well, I mean, it may be possible to create a program that auto-emails the PDF and makes the subject line the title of the document, but that is the furthest it could go (I think).

 

Very true, and if the PDF had a relevant title, that would do it.

 

1 minute ago, RGProductions said:

Depends. If the PDF contains actual text it might be possible, but an image-only PDF would be nearly impossible.

Hmm, if you could convert it to a Word file (assuming it converted), would it be easier to pull the info out of that?


Just now, CookieMaster said:

Very true, and if the PDF had a relevant title, that would do it.

Hmm, if you could convert it to a Word file (assuming it converted), would it be easier to pull the info out of that?

Depends. If you can highlight words with your cursor when viewing it, this may be possible, but AFAIK software that can read text in images doesn't exist yet.


Totally possible. You need some programming smarts though.


1 minute ago, RGProductions said:

Depends. If you can highlight words with your cursor when viewing it, this may be possible, but AFAIK software that can read text in images doesn't exist yet.

OCR?

 

Also, I did an internship at a really big, famous company that has developed OCR software able to recognize where key data such as name and address sit in forms with completely different layouts, without predefined form templates. The software finds all the necessary fields by itself. And to make things more complicated, it supports handwritten Chinese characters, kana, and Roman letters.

Aaand it took experienced developers, plus PhD guys working on the algorithms, months to develop.

So, it is possible, but not trivial at all. 


Possible. Of course. Anything is possible in programming. There are literally hundreds of programs that grab text and content from PDFs. Then you just have to sort through the data. Even if it's an image, it's possible. Instagram has nipple recognition, and text is much easier than nipples.


You can just use Ruby with the pdf-reader gem for PDFs that contain text. The gem has some good functions you can use, but it works much like opening the file and iterating through all the lines.

For PDFs where the text is stored as images you will need OCR, which for most people is not worth the effort and time when developing solo.

Ruby, Python, PHP... it does not really matter which language. But it takes effort on your side to get the correct data out of a PDF.


It is possible (and done in some companies). You need OCR, and you need to tell the program which fields in the text you want to keep. Then it's just a matter of creating the email with the fields you want and sending it.



If they are not just PDF images (like from a scanner) and they all share a similar format, then it's quite easy: you can pipe the output of a tool like pdftotext into awk or sed, then into PHP or Python, and have this done pretty quickly.
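A minimal Python sketch of that pipeline, under some assumptions: the `pdftotext` command-line tool (from Poppler) is installed, the PDFs contain real text, and each field sits on a line of the form `Label: value`. The function names and the sample labels are made up for illustration.

```python
import re
import subprocess
from typing import Optional


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    # "-layout" tries to preserve the visual layout; "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def value_after_label(text: str, label: str) -> Optional[str]:
    """Return what follows 'label:' on the first matching line (awk-style)."""
    pattern = re.compile(rf"\s*{re.escape(label)}\s*:\s*(.+)", re.IGNORECASE)
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1).strip()
    return None  # label not found anywhere in the text
```

Usage would look something like `value_after_label(pdf_to_text("invoice.pdf"), "Invoice No")`, where the file name and label are placeholders. Scanned PDFs would come back as empty or near-empty text, which is where OCR comes in.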


14 hours ago, Dzzope said:

Totally possible. You need some programming smarts though.

Going with this: yes, you can. My guess is that it would work something like handwriting recognition. If the text is neatly laid out, it should be easy. CAPTCHAs have lines drawn through the text precisely so that programs can't read them, so programs reading text from images is evidently possible.


Reading PDFs is very possible, but it can be a pain in the ass due to there being a range of different PDF encodings, and some actually requiring Acrobat to decode (otherwise they are unreadable).

 

Pulling information out of the PDF falls into the realm of Information Retrieval and might be, far and away, the most difficult step, depending on the PDFs you're dealing with.  If you're dealing with a finite, known set of possible inputs, you can just write rules that explicitly deal with each one.  E.g., if you're dealing only with invoices, and you know the invoices can only be formatted in one of several ways, you can check what the layout is and go to the appropriate processing steps.  If you're dealing with more general and unstructured data (e.g., raw text), then you'll need to delve deep into information retrieval and possibly natural language processing/understanding.  Things can get messy and complicated there if you're building your own IR engine, and even if you're borrowing someone else's you'll need to do a lot of quality assurance/testing to make sure it works in your specific circumstances.
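For the "finite, known set of layouts" case, the rule-based approach might look roughly like the Python sketch below. Everything in `LAYOUTS` is hypothetical: the supplier names, the marker strings used to recognise each layout, and the regular expressions would all have to be written against your real documents.

```python
import re

# Hypothetical layouts: a marker string that identifies the layout, plus one
# regex per field, each with a single capture group for the value.
LAYOUTS = {
    "supplier_a": {
        "marker": "ACME Invoice",
        "fields": {
            "number": r"Invoice No:\s*(\S+)",
            "amount": r"Total Due:\s*\$?([\d,]+\.\d{2})",
            "name": r"Bill To:\s*(.+)",
        },
    },
    "supplier_b": {
        "marker": "QUOTATION",
        "fields": {
            "number": r"Quote #\s*(\S+)",
            "amount": r"Amount\s+([\d,]+\.\d{2})",
            "name": r"Customer\s+(.+)",
        },
    },
}


def detect_layout(text):
    """Return the name of the first layout whose marker appears in the text."""
    for name, layout in LAYOUTS.items():
        if layout["marker"] in text:
            return name
    return None


def extract_fields(text):
    """Apply the matching layout's rules; unmatched fields come back as None."""
    layout_name = detect_layout(text)
    if layout_name is None:
        raise ValueError("unrecognised layout")
    fields = {}
    for field, pattern in LAYOUTS[layout_name]["fields"].items():
        match = re.search(pattern, text)
        fields[field] = match.group(1).strip() if match else None
    return fields
```

Raising on an unknown layout, rather than guessing, keeps a document the rules were never written for from silently producing a wrong email.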

 

Drafting and sending e-mails is pretty straightforward.  That's just text formatting.  You could have some form letters and plop the relevant information from the PDFs into the letters.  As for sending the e-mails, I don't personally know how you'd do that, but that's absolutely a thing you can do, and it's probably not all that difficult either.
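As one possible illustration of the form-letter idea, here is a Python sketch using only the standard library (`email.message` and `smtplib`). The template wording, the field names (`name`, `number`, `amount`), and all addresses and server details are assumptions; `send_email` also assumes an SMTP server that accepts STARTTLS and a login.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path

# Hypothetical form letter; the placeholders match the extracted field names.
TEMPLATE = (
    "Hello {name},\n\n"
    "Please find attached invoice {number} for {amount}.\n\n"
    "Regards,\nAccounts"
)


def build_email(fields, pdf_path, sender, recipient):
    """Fill the form letter from the extracted fields and attach the PDF."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Invoice {fields['number']}"
    msg.set_content(TEMPLATE.format(**fields))
    pdf = Path(pdf_path)
    msg.add_attachment(
        pdf.read_bytes(),
        maintype="application",
        subtype="pdf",
        filename=pdf.name,
    )
    return msg


def send_email(msg, host, port, user, password):
    """Send the message; assumes the server supports STARTTLS and login."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

Keeping drafting and sending in separate functions means the drafted message can be inspected (or printed) before anything actually goes out.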


You can use PDFMiner to get the text from PDFs in Python.

 

http://www.unixuser.org/~euske/python/pdfminer/index.html




Well, in any Turing-complete language it is theoretically possible to solve any problem (of this type). So I'm going to go ahead and say yes, it is definitely possible. There are going to be a multitude of ways to accomplish this. I would probably use C# and the Microsoft Office programming features to take the PDF, convert it to a .docx file, read the "document" OOXML file into a DOM tree, and then search the DOM tree for every "run" surrounded by whatever signifies the quote or invoice. In fact, this is explicitly one of the prime reasons that Office Open XML was made.

Having a well-formatted file will make this much easier.
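To give a feel for the "search the runs" step, here is a rough sketch in Python rather than C#, using only the standard library. It assumes the PDF has already been converted to a .docx by some other tool, and that the value you want follows a known label in the same paragraph; `find_after` and the label are illustrative, not part of any Office API.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of each paragraph in a .docx (path or file-like)."""
    # A .docx is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs (w:r), each holding w:t nodes.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def find_after(paragraphs, label):
    """Return the text following `label` in the first paragraph starting with it."""
    for text in paragraphs:
        if text.startswith(label):
            return text[len(label):].strip()
    return None
```

Joining the runs per paragraph matters because Word often splits a single visible line into several runs, so a label and its value may not share a run.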


Sorry that I'm not replying to anyone individually.

Based on feedback:

It's possible.

It requires OCR (for image-based PDFs).

If using OCR, I would have to code it.

I already know how tricky an OCR program would be.

Possible solutions (coming from me):

Use Adobe to convert the PDF to an Excel worksheet, Word file, or text file.

Read the data off that file, line by line.

Profit?

Edit: or use other programs that seem to be free to convert to another format that might be better.
