I think I've found a way to look at data density and therefore its non-random compressibility

Poet129

Google CoLab

You should be able to upload any file of moderate size and rename it to 'Test.bin' (or change the filename in the code, whichever you prefer), then run the cells in order and look at the graph at the end.

I've pre-run AMillionRandomDigits.bin (that was the most recent file, so its results are the ones currently recorded) and a few other files to verify the output looks correct.

The graph can be read pretty easily: the smaller the range of the data, the more compressible it should be. However, if the file has just a few outlier spots, it should still remain mostly compressible.
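
Roughly, the core of it looks something like this (a minimal sketch of the idea, not the exact notebook cells; reading the chunks as big-endian unsigned integers is an assumption on my part):

```python
# Sketch: read 'Test.bin' in 10-byte chunks, interpret each chunk as an
# unsigned integer, and plot the values to see how spread out they are.
import matplotlib.pyplot as plt

CHUNK_SIZE = 10  # bytes per integer

values = []
with open("Test.bin", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        values.append(int.from_bytes(chunk, "big"))

plt.plot(values, ".", markersize=1)
plt.xlabel("chunk index")
plt.ylabel("chunk value (10-byte integer)")
plt.title("Test.bin interpreted as 10-byte integers")
plt.show()
```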

I've tested the sample files included in Colab, and their zlib compressibility seemed to match the above statement.
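
The zlib check is nothing fancy; it just compares the compressed size against the original size, along these lines (again a sketch, not the exact cell):

```python
# Sketch: zlib compression ratio of a file at the default level.
# Smaller ratios mean the file compressed better.
import zlib

def zlib_ratio(path):
    with open(path, "rb") as f:
        data = f.read()
    return len(zlib.compress(data)) / len(data)

print(zlib_ratio("Test.bin"))
```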

This is not a compression algorithm in itself.

 

Not really sure what the applications for this are, if any, which is why I'm posting here.

9 hours ago, Poet129 said:

Not really sure what the applications for this are, if any, which is why I'm posting here.

I'd say... none? What do you think this does? You're just interpreting a bunch of bytes as integers and plotting them on a graph... for that matter why use 10 bytes for an integer? Seems pretty arbitrary... why not 2 or 50 or 1000? Why use integers at all?

9 hours ago, Poet129 said:

The graph can be read pretty easily: the smaller the range of the data, the more compressible it should be.

Do you have any citation or data to back up this claim, or did you just make it up? You can't just write 3 lines of code doing something with a dataset, notice a pattern among 4 sets, and then assume this is a fundamentally true property of all datasets. If you have a theory about a mathematical property of data, you need to show some sort of mathematical proof that this is the case, or at least a very significant correlation with a sufficiently large amount of input data. Compressibility in general is a pretty ill-defined concept anyway.
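
As a hedged sketch of what such a check could look like (the 'corpus/' directory and the range-of-10-byte-integers metric below are placeholders for illustration, not code from the notebook):

```python
# Sketch: for many files, compute (a) the spread of the 10-byte integer
# values and (b) the zlib compression ratio, then see how strongly the
# two actually correlate across the whole set of files.
import glob
import os
import zlib
import numpy as np

def chunk_range(data, size=10):
    vals = [int.from_bytes(data[i:i + size], "big")
            for i in range(0, len(data) - size + 1, size)]
    return max(vals) - min(vals)

def zlib_ratio(data):
    return len(zlib.compress(data)) / len(data)

ranges, ratios = [], []
for path in glob.glob("corpus/*"):  # placeholder directory of test files
    if not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 10:
        continue
    ranges.append(chunk_range(data))
    ratios.append(zlib_ratio(data))

print("Pearson r:", np.corrcoef(ranges, ratios)[0, 1])
```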

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

2 hours ago, Sauron said:

I'd say... none? What do you think this does? You're just interpreting a bunch of bytes as integers and plotting them on a graph... for that matter why use 10 bytes for an integer? Seems pretty arbitrary... why not 2 or 50 or 1000? Why use integers at all?

I've split the data merely to speed up the process. It does get rid of the smaller occurrences; however, I've yet to see a compressor that goes that deep into the file looking for unused space.

 

2 hours ago, Sauron said:

Do you have any citation or data to back up this claim, or did you just make it up? You can't just write 3 lines of code doing something with a dataset, notice a pattern among 4 sets, and then assume this is a fundamentally true property of all datasets. If you have a theory about a mathematical property of data, you need to show some sort of mathematical proof that this is the case, or at least a very significant correlation with a sufficiently large amount of input data. Compressibility in general is a pretty ill-defined concept anyway.

You are correct: I don't think such a mathematical proof exists, and if it does, I don't know it. However, I've changed a few things to make it easier to look at and run, and I've run 37 files from the Canterbury Corpus plus AMillionRandomDigits.bin. The first two points should be ignored; they are only there for scale. The results are included along with the data. I realize this is still a fairly small sample and will look into adding more files later.

Can you explain what this code measures and how that would relate to compressibility?

17 hours ago, Poet129 said:

the smaller the range of the data, the more compressible it should be

Isn't this sort of known? At first glance it sounds similar to why e.g. random noise is (nigh) incompressible.
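
As a quick illustration of that point (a hedged sketch, not anything from the notebook): uniformly random bytes barely shrink under zlib, while bytes confined to a narrow range compress noticeably better.

```python
# Compare zlib ratios for full-range random bytes vs. bytes limited to 0-15.
import os
import random
import zlib

random.seed(0)

random_bytes = os.urandom(1_000_000)                                   # full 0-255 range
narrow_bytes = bytes(random.randrange(16) for _ in range(1_000_000))   # values 0-15 only

for name, data in [("full range", random_bytes), ("narrow range", narrow_bytes)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{name}: {ratio:.3f}")
```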

Crystal: CPU: i7 7700K | Motherboard: Asus ROG Strix Z270F | RAM: GSkill 16 GB@3200MHz | GPU: Nvidia GTX 1080 Ti FE | Case: Corsair Crystal 570X (black) | PSU: EVGA Supernova G2 1000W | Monitor: Asus VG248QE 24"

Laptop: Dell XPS 13 9370 | CPU: i5 10510U | RAM: 16 GB

Server: CPU: i5 4690k | RAM: 16 GB | Case: Corsair Graphite 760T White | Storage: 19 TB

Not sure what you are trying to get at with this topic?

 

You strictly can't see the "un-compressibility" of data (humans are really bad at recognizing decently complex visual randomness).

3735928559 - Beware of the dead beef
