Simplest File Compression Algorithm

ClobberXD · June 8, 2016

I'm planning on making a file compression utility. Can someone give me the easiest compression algorithm on Earth, and explain to me how it works?

Thanks!

Kryptyx · June 8, 2016

Check out http://www.piedpiper.com /s

Nineshadow · June 8, 2016

Huffman trees are a very simple and common method of lossless compression.

ClobberXD · June 8, 2016

6 hours ago, Nineshadow said:

Huffman trees are a very simple and common method of lossless compression.

Can you teach me how to do that, in C++?

BTW, you open a file with the <ios::binary> flag for data compression right?

Thanks!

Nineshadow · June 9, 2016

11 hours ago, Anand_Geforce said:

Can you teach me how to do that, in C++?

BTW, you open a file with the <ios::binary> flag for data compression right?

Thanks!

Read more here

: http://www.geeksforgeeks.org/greedy-algorithms-set-3-huffman-coding/

mariushm · June 9, 2016

The absolute easiest compression algorithm (but very inefficient for most file types) would be RLE (run length encoding) .. basically you count the number of characters that repeat in a sequence and instead of writing the characters, you just write the number of times the character repeats and the character. For example in a very simplified way Tessst becomes Te3st

This type of compression was used a lot in bitmap pictures, especially 2 bit (black and white) drawings, because such pictures have long strings of white or black pixels. It's still used in some cases for game textures.

Next, you'd have the LZ family of algorithms : https://en.wikipedia.org/wiki/LZ77_and_LZ78 Basically, the compressor looks for sequences of bytes that repeat in a file and build a dictionary with these sequences and store in the final file just the entries in the dictionary and one copy of each sequence. These algorithms are the base of ZIP and modern compressors.

Read this article, it gives a lot of details and explains some of the first compression algorithms : http://ethw.org/History_of_Lossless_Data_Compression_Algorithms

If you don't want to write the code yourself (for practice or learning) I'd suggest just going with a ready made implementation like the ZLIB library whose output can be written directly to disk as a file with the .gz extension (that's recognized by lots of decompression tools) or optionally can use some built in functions to wrap the output in a ZIP container. The compression method called DEFLATE is also used by web servers to compress html content on the fly when sending it to viewers.

It's very small library, can be incorporated completely in applications as it's open source code and compressed very fast and decent sizes:

If you want really small files after compression you can have a look at 7zip, which offers a library that can be used by programs to compress or decompress files with: http://www.7-zip.org/sdk.html

You can choose to produce larger compressed files using little ram and cpu or you can trade processor usage and a lot of ram to produce the smallest compressed files.

ClobberXD · June 9, 2016

12 hours ago, Nineshadow said:

Read more here

: http://www.geeksforgeeks.org/greedy-algorithms-set-3-huffman-coding/

It looks pretty complex... but I'll try it out sometime!

Thanks!

ClobberXD · June 9, 2016

10 hours ago, mariushm said:

The absolute easiest compression algorithm (but very inefficient for most file types) would be RLE (run length encoding) .. basically you count the number of characters that repeat in a sequence and instead of writing the characters, you just write the number of times the character repeats and the character. For example in a very simplified way Tessst becomes Te3st

This type of compression was used a lot in bitmap pictures, especially 2 bit (black and white) drawings, because such pictures have long strings of white or black pixels. It's still used in some cases for game textures.

Next, you'd have the LZ family of algorithms : https://en.wikipedia.org/wiki/LZ77_and_LZ78 Basically, the compressor looks for sequences of bytes that repeat in a file and build a dictionary with these sequences and store in the final file just the entries in the dictionary and one copy of each sequence. These algorithms are the base of ZIP and modern compressors.

Read this article, it gives a lot of details and explains some of the first compression algorithms : http://ethw.org/History_of_Lossless_Data_Compression_Algorithms

If you don't want to write the code yourself (for practice or learning) I'd suggest just going with a ready made implementation like the ZLIB library whose output can be written directly to disk as a file with the .gz extension (that's recognized by lots of decompression tools) or optionally can use some built in functions to wrap the output in a ZIP container. The compression method called DEFLATE is also used by web servers to compress html content on the fly when sending it to viewers.

It's very small library, can be incorporated completely in applications as it's open source code and compressed very fast and decent sizes:

If you want really small files after compression you can have a look at 7zip, which offers a library that can be used by programs to compress or decompress files with: http://www.7-zip.org/sdk.html

You can choose to produce larger compressed files using little ram and cpu or you can trade processor usage and a lot of ram to produce the smallest compressed files.

I've incidentally created a very similar compression algorithm to RLE. I want to code by myself - so no third-party libs. But I'm kinda stuck... how do I output binary numbers to the output file (opened with mode (ios::out | ios::binary))? I tried ofstream.write(); but it seems to take only *char as parameter, and not binary numbers. And how do you even specify whether you are trying to decimal or binary digits? Your answers will be very helpful.

Thanks!

mariushm · June 10, 2016

I'm not familiar with the stuff you use, ios:out and ios::binary and so on.. not much of a c++ programmer. I tend to code stuff in php or plain C mostly on microcontrollers. You should check the actual documentation, because there you would find functions for writing bytes to such files, not just opening a file in write mode. There should be a function in the fstream class (if that's what you use).

The char data type is usually ONE byte per character, a number that can have values between 0 and 255. Some programming languages use UTF-8 or Unicode to store characters in a string therefore they use TWO bytes or more bytes to store one character but in those cases, the data type is generally something else, char is generally considered 1 byte.

Numbers are stored in data types such as byte, word, integer, long, double, float... the exact memory size these use depends on the programming language and the computer architecture ... byte takes 1 byte of memory, word takes 2 bytes (word is less used these days) , integer takes 4 bytes of memory and so on. Long data type is somewhat of a special case, since it's exactly 32 bit long on 32 bit operating systems ( 4 bytes), four bytes are often used to store memory pointers and stuff specific to 32 bit operating systems - on 64 bit operating systems, as memory pointers are now 64 bit, some programming languages treat the long data type as being 8 bytes long.

For C# here's an example of what data types are available: https://msdn.microsoft.com/en-us/library/cs7y5x0x(v=vs.90).aspx You can see that integer is 32 bit long (4 bytes) and long is 8 bytes long and char is unicode aware, so you really should use the byte data type when creating an array instead of the char data type.. IF you're writing your application in C#

If you have a number (let's say of data type integer, taking up to 4 bytes in memory), the programming language you use SHOULD have some built in functions that would allow you to easily take the 4 bytes and copy them into an array of 4 byte values.

In C#, there's a built in class which has lots of functions that do that.. and you can simply say something like this byte[] the_array = BitConverter.GetBytes(the_number); and you get an array with 4 elements in it, the four bytes that were used to store the number in memory.

Without this class, you could have used memory copying functions

Even without built in functions, you can do it yourself with something like this, i'm writing this straight into the post so it may not be quite valid code but you should understand the idea :

unsigned int the_number;

unsigned byte temp_number;

byte[] the_array;

temp_number = the_number / 0x1000000; // 0x1000000 = 256x256x256 in hexadecimal

the_array[0] = temp_number; the_number = the_number - temp_number * 0x1000000;

temp_number = the_number / 0x10000; // 0x10000 = 256x256 in hexadecimal

the_array[1] = temp_number; the_number = the_number - temp_number * 0x10000;

temp_number = the_number / 0x100; // 0x100 = 256 in hexadecimal

the_array[2] = temp_number; the_number = the_number - temp_number * 0x100;

the_array[3] = the_number; // what's left after dividing by 256^3, 256^2 and 256 is the remainder, a number between 0 and 255

So for example, let's say we have the number 305,419,896 in a 4 byte variable.. which I chose intentionally for this example, because it happens to be 0x12345678 in hexadecimal...so we have :

305,419,896 / 256^3 = 18 ( 0x12 in hexadecimal) and number becomes 305,419,896 - (18x256^3) = 305,419,896 - 301,989,888 = 3,430,008

3,430,008 / 256^3 = 52 ( 0x34 in hexadecimal) and number becomes 3,430,008 - 52 x 256^2 = 22,136

22,136 / 256 = 86 (0x56 in hexadecimal) and number becomes 22,136 - (86x256) = 120 which is 0x78 in hexadecimal

so we have our four bytes without resorting to any built in functions.

Sign In

Simplest File Compression Algorithm

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Link to comment

Share on other sites

Link to post

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Featured Topics

Topics

Latest From Linus Tech Tips:

I KNEW I’d Hate this (I Was Wrong)

Latest From Tech Quickie:

Why Do Speakers Hiss?

Latest From TechLinked:

Yep, it’s an App

Latest From GameLinked:

Bethesda Knows It’s Broken

Latest From ShortCircuit:

14 inches of GREATNESS - HP OMEN Transcend 14

Latest From Mac Address:

Why did you buy an Apple Vision Pro?

Latest From Channel Super Fun:

I Swapped the CEO's Assistant For a Day!