Open pdf as txt file and then copy - why it doesn't work

Adonis4000 · February 15, 2022

I was thinking that since I am able to change the extension at the end of a .pdf to .txt and then back to .pdf without any loss of data, in theory I should be able to copy the text from the pdf file to another text file and then change that file to .pdf.

The reason for doing this would be to easily copy files using Python. There is probably a better way of going about it (I have yet to do proper research), but I am genuinely curious as to why it doesn't work.

When copying manually, I can see the copy is always a bit different than the original.

When copying with python, I get the error: 'charmap' codec can't decode byte 0x9c in position 470: character maps to <undefined>

Any help to understand why this doesn't work is appreciated.

Kilrah · February 15, 2022

31 minutes ago, Adonis4000 said:

I was thinking that since I am able to change the extension at the end of a .pdf to .txt and then back to .pdf without any loss of data, in theory I should be able to copy the text from the pdf file to another text file and then change that file to .pdf.

Changing the name of the file doesn't touch the file itself. The name has nothing to do with the format; it's just used by some OSes (and users) to decide which program opens the file by default, and that program interprets the file contents however it needs to.

The content of a pdf file is binary, not text. If you copy the contents with a text editor or with Python in text mode, some bytes will be mangled or lost because they are not valid text characters. You need to open the file in binary mode.
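For illustration, opening the file as binary could look like the sketch below. The function name copy_binary is made up for this example, and it reads the whole file into memory at once, which is fine for ordinary PDFs but not ideal for very large files.

```python
def copy_binary(src, dst):
    """Copy src to dst byte for byte; "rb"/"wb" skip all text decoding."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # read() returns a bytes object, so nothing is interpreted as characters
        fout.write(fin.read())
```

Because no decoding happens in binary mode, a byte like 0x9c is written out exactly as it was read.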

31 minutes ago, Adonis4000 said:

copy files using python, which there probably is a better way of going about

shutil has file copy functions.
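A minimal sketch of that, assuming you just want an identical copy. The wrapper name copy_file is hypothetical; shutil.copy2 is the standard-library call that does the work.

```python
import shutil

def copy_file(src, dst):
    """Copy a file of any type; the extension plays no role in the copy."""
    # copy2 copies the contents and also tries to preserve metadata
    # such as modification times; it returns the destination path
    return shutil.copy2(src, dst)
```

shutil.copyfile(src, dst) is the alternative if only the contents matter and the metadata does not.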

pythonmegapixel · February 15, 2022

There are two problems here.

Firstly, the file name extension (.txt, .pdf etc) is a description of what should be in the file, but changing it doesn't actually affect what is in the file. Think of it as having a box of apples with "apples" written on it. I could cross out "apples" and write "oranges" instead, but doing that doesn't change apples into oranges. In this case, when you change the file extension from PDF to TXT, the file itself is still in PDF format.
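You can check this for yourself with a small sketch like the one below. The function name and the file name sample.pdf are made up for the example, and it works in a temporary directory so it doesn't touch any real files.

```python
import tempfile
from pathlib import Path

def bytes_survive_rename(data: bytes) -> bool:
    """Write data to a .pdf file, rename it to .txt, and compare the bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = Path(tmp) / "sample.pdf"
        pdf_path.write_bytes(data)
        txt_path = pdf_path.with_suffix(".txt")
        pdf_path.rename(txt_path)  # only the label on the box changes
        return txt_path.read_bytes() == data
```

The comparison comes out equal for any input, because renaming never rewrites the contents.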

Secondly, PDF files aren't readable as text, which is why Python is throwing errors: in text mode it tries to decode the bytes into characters using a text encoding, and some bytes in the file (such as 0x9c at position 470 in your error) don't map to any character in that encoding.
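Here is a rough sketch of what goes wrong. The sample bytes are made up, and it uses UTF-8 as a stand-in encoding; the 'charmap' codec in the original error is a different encoding, but the failure is the same kind.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

header = b"%PDF-1.4\n"        # plain ASCII, decodes fine
sample = header + b"\x9c"     # made-up binary byte appended

print(can_decode(header, "utf-8"))  # True
print(can_decode(sample, "utf-8"))  # False
```

Opening a file in text mode performs this same decode step on everything it reads, so a single unmappable byte is enough to raise the error.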

There are Python libraries available that allow you to read PDF files, though, so I suggest you use one of them if you need the text itself.

If all you want to do is copy the files, use the shutil module from the standard library:

https://docs.python.org/3/library/shutil.html