Open pdf as txt file and then copy - why it doesn't work

Adonis4000
Solved by Kilrah.

I was thinking that since I am able to change the extension at the end of a .pdf to .txt and then back to .pdf without any loss of data, in theory I should be able to copy the text from the pdf file to another text file and then change that file to .pdf.

 

The reason for doing this would be to easily copy files using python, which there probably is a better way of going about (I have yet to do proper research on it), but I am genuinely curious as to why it doesn't work.

When copying manually, I can see the copy is always a bit different from the original.

When copying with python, I get the error: 'charmap' codec can't decode byte 0x9c in position 470: character maps to <undefined>

 

Any help to understand why this doesn't work is appreciated.

 

31 minutes ago, Adonis4000 said:

I was thinking that since I am able to change the extension at the end of a .pdf to .txt and then back to .pdf without any loss of data, in theory I should be able to copy the text from the pdf file to another text file and then change that file to .pdf.

Changing the name of the file doesn't touch the file itself. The name has nothing to do with the format; some OSes (and users) just use it to decide which program will open the file by default, and that program will interpret the file contents however it needs to.

 

The content of a PDF file is binary, not text. If you copy the contents with a text editor, or with Python in text mode, some bytes will be mangled or lost because they are not valid text characters. You need to open the file as binary.
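
To make the "open the file as binary" part concrete, here is a minimal sketch. The function name copy_binary and the chunk size are just illustrative choices, not anything from the thread. The "rb" and "wb" modes tell Python to hand the bytes over untouched instead of decoding them as text.

```python
def copy_binary(src, dst, chunk_size=64 * 1024):
    """Copy src to dst byte for byte, without decoding anything as text.

    copy_binary and chunk_size are hypothetical names for this sketch.
    """
    # "rb"/"wb" = read/write raw bytes; no encoding is involved at all.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            fout.write(chunk)
```

Something like copy_binary("original.pdf", "copy.pdf") should then give a copy that opens normally, since no byte is ever interpreted as a character.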

 

31 minutes ago, Adonis4000 said:

copy files using python, which there probably is a better way of going about

shutil has file copy functions.
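
For reference, a short sketch of what that could look like. The wrapper name copy_pdf is made up for this example; shutil.copyfile itself is part of the standard library, copies the contents byte for byte and returns the destination path.

```python
import shutil


def copy_pdf(src, dst):
    """Copy a file's contents with shutil (copy_pdf is a hypothetical helper)."""
    # shutil.copyfile copies only the file contents.
    # shutil.copy2(src, dst) is an alternative that also tries to keep
    # metadata such as timestamps.
    return shutil.copyfile(src, dst)
```

The file extension plays no part here: the same call works for a .pdf, a .txt or anything else.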


There are two problems here.

 

Firstly, the file name extension (.txt, .pdf etc) is a description of what should be in the file, but changing it doesn't actually affect what is in the file. Think of it as having a box of apples with "apples" written on it. I could cross out "apples" and write "oranges" instead, but doing that doesn't change apples into oranges. In this case, when you change the file extension from PDF to TXT, the file itself is still in PDF format.
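
If you want to see the apples/oranges point for yourself, here is a rough sketch. It writes a small stand-in file (not a real PDF, just some bytes starting with the "%PDF-" marker that PDF files begin with), renames it from .pdf to .txt, and compares the first bytes before and after. The helper names are made up for the example.

```python
import os
import tempfile


def first_bytes(path, n=5):
    """Return the first n raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read(n)


def demo_rename():
    """Rename a stand-in .pdf to .txt and report its leading bytes both times."""
    with tempfile.TemporaryDirectory() as d:
        pdf_path = os.path.join(d, "example.pdf")
        txt_path = os.path.join(d, "example.txt")
        # Stand-in content for illustration only, not a valid PDF document.
        with open(pdf_path, "wb") as f:
            f.write(b"%PDF-1.4\nplaceholder bytes\n")
        before = first_bytes(pdf_path)
        os.rename(pdf_path, txt_path)  # only the label on the box changes
        after = first_bytes(txt_path)
        return before, after
```

Both values come back identical, because renaming never touches what is stored in the file.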

 

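Secondly, PDF files aren't readable as text, which is why Python is throwing errors. In text mode it tries to decode the file's bytes into characters using a text encoding (the 'charmap' codec named in your error message), and some of the bytes in the file, such as the 0x9c at position 470, don't correspond to any character in that encoding.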
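
Here is a small sketch that reproduces the same kind of error without needing a PDF. It is an illustration under assumptions: I don't know which encoding your system actually used, so this uses cp1252 (a common Windows code page that Python reports as 'charmap') and the byte 0x81, which has no character assigned in that code page. The helper name is made up.

```python
def decode_or_report(data: bytes, encoding: str) -> str:
    """Decode data as text, or return the error message if that fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# Stand-in bytes: a PDF-style header followed by one non-text byte.
sample = b"%PDF-1.4\n\x81"

# Assumption: cp1252 stands in for whatever encoding the original system used.
report = decode_or_report(sample, "cp1252")

# Telling Python to skip undecodable bytes avoids the error,
# but the skipped bytes are silently dropped from the result.
lossy = sample.decode("cp1252", errors="ignore").encode("cp1252")
```

Here report holds a 'charmap' decode error like the one in the question, and lossy is one byte shorter than sample, which is the kind of damage you see when a binary file goes through a text copy.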

 

There are Python libraries available that allow you to read PDF files, though, so I suggest you use one of them.

 

If all you want to do is copy the files, use the built-in shutil module:

 https://docs.python.org/3/library/shutil.html

pythonmegapixel
