I think I've found a way to look at data density and therefore its non-random compressibility

Poet129

Google CoLab

You should be able to upload any file of moderate size and rename it to 'Test.bin' (or change the filename in the code, whichever you prefer), then run the cells in order and look at the graph at the end.

I've pre-run AMillionRandomDigits.bin (that was the most recent file, so its results are the ones currently recorded) and a few other files to verify the output looks correct.

The graph can be read pretty easily: the smaller the range of the data, the more compressible it should be. However, if the file has just a few outlier spots, it should still remain mostly compressible.
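
Roughly, the core of it looks something like this (a minimal sketch of the idea, not the exact notebook cells; reading the chunks as big-endian unsigned integers is an assumption on my part):

```python
# Sketch: read 'Test.bin' in 10-byte chunks, interpret each chunk as an
# unsigned integer, and plot the values to see how spread out they are.
import matplotlib.pyplot as plt

CHUNK_SIZE = 10  # bytes per integer

values = []
with open("Test.bin", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        values.append(int.from_bytes(chunk, "big"))

plt.plot(values, ".", markersize=1)
plt.xlabel("chunk index")
plt.ylabel("chunk value (10-byte integer)")
plt.title("Test.bin interpreted as 10-byte integers")
plt.show()
```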

I've tested the sample files included in Colab, and their zlib compressibility seemed to match the above statement.
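
The zlib check is nothing fancy; it just compares the compressed size against the original size, along these lines (again a sketch, not the exact cell):

```python
# Sketch: zlib compression ratio of a file at the default level.
# Smaller ratios mean the file compressed better.
import zlib

def zlib_ratio(path):
    with open(path, "rb") as f:
        data = f.read()
    return len(zlib.compress(data)) / len(data)

print(zlib_ratio("Test.bin"))
```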

This is not a compression algorithm in itself.

 

Not really sure what the applications for this are, if any, which is why I'm posting here.

9 hours ago, Poet129 said:

Not really sure what the applications for this are, if any, which is why I'm posting here.

I'd say... none? What do you think this does? You're just interpreting a bunch of bytes as integers and plotting them on a graph... for that matter why use 10 bytes for an integer? Seems pretty arbitrary... why not 2 or 50 or 1000? Why use integers at all?

9 hours ago, Poet129 said:

The graph can be read pretty easily: the smaller the range of the data, the more compressible it should be.

Do you have any citation or data to back up this claim, or did you just make it up? You can't just write 3 lines of code doing something with a dataset, notice a pattern among 4 sets, and then assume this is a fundamentally true property of all datasets. If you have a theory about a mathematical property of data, you need to show some sort of mathematical proof that this is the case, or at least a very significant correlation with a sufficiently large amount of input data. Compressibility in general is a pretty ill-defined concept anyway.
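
As a hedged sketch of what such a check could look like (the 'corpus/' directory and the range-of-10-byte-integers metric below are placeholders for illustration, not code from the notebook):

```python
# Sketch: for many files, compute (a) the spread of the 10-byte integer
# values and (b) the zlib compression ratio, then see how strongly the
# two actually correlate across the whole set of files.
import glob
import os
import zlib
import numpy as np

def chunk_range(data, size=10):
    vals = [int.from_bytes(data[i:i + size], "big")
            for i in range(0, len(data) - size + 1, size)]
    return max(vals) - min(vals)

def zlib_ratio(data):
    return len(zlib.compress(data)) / len(data)

ranges, ratios = [], []
for path in glob.glob("corpus/*"):  # placeholder directory of test files
    if not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 10:
        continue
    ranges.append(chunk_range(data))
    ratios.append(zlib_ratio(data))

print("Pearson r:", np.corrcoef(ranges, ratios)[0, 1])
```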

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

2 hours ago, Sauron said:

I'd say... none? What do you think this does? You're just interpreting a bunch of bytes as integers and plotting them on a graph... for that matter why use 10 bytes for an integer? Seems pretty arbitrary... why not 2 or 50 or 1000? Why use integers at all?

I've split the data merely to speed up the process. It does get rid of the smaller occurrences; however, I've yet to see a compressor that goes that deep into the file looking for unused space.

 

2 hours ago, Sauron said:

Do you have any citation or data to back up this claim, or did you just make it up? You can't just write 3 lines of code doing something with a dataset, notice a pattern among 4 sets, and then assume this is a fundamentally true property of all datasets. If you have a theory about a mathematical property of data, you need to show some sort of mathematical proof that this is the case, or at least a very significant correlation with a sufficiently large amount of input data. Compressibility in general is a pretty ill-defined concept anyway.

You are correct: I don't think such a mathematical proof exists, and if it does, I don't know it. However, I've changed a few things to make it easier to look at and run, and I've run 37 files from the Canterbury Corpus plus AMillionRandomDigits.bin. The first two points should be ignored; they are only there for scale. The results are included along with the data. I realize this is still a fairly small sample and will look into adding more files later.

Can you explain what this code measures and how that would relate to compressibility?

17 hours ago, Poet129 said:

the smaller the range of the data, the more compressible it should be

Isn't this sort of known? At first glance it sounds similar to why e.g. random noise is (nigh) incompressible.
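
As a quick illustration of that point (a hedged sketch, not anything from the notebook): uniformly random bytes barely shrink under zlib, while bytes confined to a narrow range compress noticeably better.

```python
# Compare zlib ratios for full-range random bytes vs. bytes limited to 0-15.
import os
import random
import zlib

random.seed(0)

random_bytes = os.urandom(1_000_000)                                   # full 0-255 range
narrow_bytes = bytes(random.randrange(16) for _ in range(1_000_000))   # values 0-15 only

for name, data in [("full range", random_bytes), ("narrow range", narrow_bytes)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{name}: {ratio:.3f}")
```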

Crystal: CPU: i7 7700K | Motherboard: Asus ROG Strix Z270F | RAM: GSkill 16 GB@3200MHz | GPU: Nvidia GTX 1080 Ti FE | Case: Corsair Crystal 570X (black) | PSU: EVGA Supernova G2 1000W | Monitor: Asus VG248QE 24"

Laptop: Dell XPS 13 9370 | CPU: i5 10510U | RAM: 16 GB

Server: CPU: i5 4690k | RAM: 16 GB | Case: Corsair Graphite 760T White | Storage: 19 TB

Not sure what you are trying to get at with this topic?

 

You strictly can't see the "un-compressibility" of data (humans are really bad at recognizing decently complex visual randomness).

3735928559 - Beware of the dead beef
