Jump to content

Simplest File Compression Algorithm

ClobberXD

I'm planning on making a file compression utility. Can someone give me the easiest compression algorithm on Earth, and explain to me how it works?

 

Thanks!

Nothing to see here ;)

Link to comment
Share on other sites

Link to post
Share on other sites

Check out http://www.piedpiper.com /s

CPU: AMD 5950X    MB: Asus ROG Crosshair VIII Dark Hero    RAM: HyperX Predator 64GB    GPU: Nvidia RTX 3090 Ti FE    SSD: Seagate FireCuda 530 2TB    
PSU: EVGA 1200w P2    COOLING: EK AIO Elite 360    CASE: Fractal Design Torrent 
   DISPLAY: LG CX48 4k OLED    AUDIO: HIFIMAN Arya SE

Link to comment
Share on other sites

Link to post
Share on other sites

Huffman trees are a very simple and common method of lossless compression.

i5 4670k @ 4.2GHz (Coolermaster Hyper 212 Evo); ASrock Z87 EXTREME4; 8GB Kingston HyperX Beast DDR3 RAM @ 2133MHz; Asus DirectCU GTX 560; Super Flower Golden King 550 Platinum PSU;1TB Seagate Barracuda;Corsair 200r case. 

Link to comment
Share on other sites

Link to post
Share on other sites

6 hours ago, Nineshadow said:

Huffman trees are a very simple and common method of lossless compression.

Can you teach me how to do that, in C++?

 

BTW, you open a file with the <ios::binary> flag for data compression right?

 

Thanks!

Nothing to see here ;)

Link to comment
Share on other sites

Link to post
Share on other sites

11 hours ago, Anand_Geforce said:

Can you teach me how to do that, in C++?

 

BTW, you open a file with the <ios::binary> flag for data compression right?

 

Thanks!

Read more here

http://www.geeksforgeeks.org/greedy-algorithms-set-3-huffman-coding/

i5 4670k @ 4.2GHz (Coolermaster Hyper 212 Evo); ASrock Z87 EXTREME4; 8GB Kingston HyperX Beast DDR3 RAM @ 2133MHz; Asus DirectCU GTX 560; Super Flower Golden King 550 Platinum PSU;1TB Seagate Barracuda;Corsair 200r case. 

Link to comment
Share on other sites

Link to post
Share on other sites

The absolute easiest compression algorithm (but very inefficient for most file types) would be RLE (run length encoding) .. basically you count the number of characters that repeat in a sequence and instead of writing the characters, you just write the number of times the character repeats and the character.  For example in a very simplified way  Tessst  becomes Te3st

This type of compression was used a lot in bitmap pictures, especially 2 bit (black and white) drawings, because such pictures have long strings of white or black pixels. It's still used in some cases for game textures.

 

Next, you'd have the LZ family of algorithms : https://en.wikipedia.org/wiki/LZ77_and_LZ78 Basically, the compressor looks for sequences of bytes that repeat in a file and build a dictionary with these sequences and store in the final file just the entries in the dictionary and one copy of each sequence. These algorithms are the base of ZIP and modern compressors.

 

Read this article, it gives a lot of details and explains some of the first compression algorithms : http://ethw.org/History_of_Lossless_Data_Compression_Algorithms

 

 

If you don't want to write the code yourself (for practice or learning) I'd suggest just going with a ready made implementation like the ZLIB library whose output can be written directly to disk as a file with the .gz extension (that's recognized by lots of decompression tools) or optionally can use some built in functions to wrap the output in a ZIP container. The compression method called DEFLATE is also used by web servers to compress html content on the fly when sending it to viewers.

It's very small library, can be incorporated completely in applications as it's open source code and compressed very fast and decent sizes:

 

If you want really small files after compression you can have a look at 7zip, which offers a library that can be used by programs to compress or decompress files with: http://www.7-zip.org/sdk.html

You can choose to produce larger compressed files using little ram and cpu or you can trade processor usage and a lot of ram to produce the smallest compressed files.

Link to comment
Share on other sites

Link to post
Share on other sites

10 hours ago, mariushm said:

The absolute easiest compression algorithm (but very inefficient for most file types) would be RLE (run length encoding) .. basically you count the number of characters that repeat in a sequence and instead of writing the characters, you just write the number of times the character repeats and the character.  For example in a very simplified way  Tessst  becomes Te3st

This type of compression was used a lot in bitmap pictures, especially 2 bit (black and white) drawings, because such pictures have long strings of white or black pixels. It's still used in some cases for game textures.

 

Next, you'd have the LZ family of algorithms : https://en.wikipedia.org/wiki/LZ77_and_LZ78 Basically, the compressor looks for sequences of bytes that repeat in a file and build a dictionary with these sequences and store in the final file just the entries in the dictionary and one copy of each sequence. These algorithms are the base of ZIP and modern compressors.

 

Read this article, it gives a lot of details and explains some of the first compression algorithms : http://ethw.org/History_of_Lossless_Data_Compression_Algorithms

 

 

If you don't want to write the code yourself (for practice or learning) I'd suggest just going with a ready made implementation like the ZLIB library whose output can be written directly to disk as a file with the .gz extension (that's recognized by lots of decompression tools) or optionally can use some built in functions to wrap the output in a ZIP container. The compression method called DEFLATE is also used by web servers to compress html content on the fly when sending it to viewers.

It's very small library, can be incorporated completely in applications as it's open source code and compressed very fast and decent sizes:

 

If you want really small files after compression you can have a look at 7zip, which offers a library that can be used by programs to compress or decompress files with: http://www.7-zip.org/sdk.html

You can choose to produce larger compressed files using little ram and cpu or you can trade processor usage and a lot of ram to produce the smallest compressed files.

I've incidentally created a very similar compression algorithm to RLE. I want to code by myself - so no third-party libs. But I'm kinda stuck... how do I output binary numbers to the output file (opened with mode (ios::out | ios::binary))? I tried ofstream.write(); but it seems to take only *char as parameter, and not binary numbers. And how do you even specify whether you are trying to decimal or binary digits? Your answers will be very helpful.

 

Thanks!

Nothing to see here ;)

Link to comment
Share on other sites

Link to post
Share on other sites

I'm not familiar with the stuff you use, ios:out and ios::binary and so on.. not much of a c++ programmer. I tend to code stuff in php or plain C mostly on microcontrollers. You should check the actual documentation, because there you would find functions for writing bytes to such files, not just opening a file in write mode. There should be a function in the fstream class (if that's what you use).  

 

The char data type is usually ONE byte per character, a number that can have values between 0 and 255.  Some programming languages use UTF-8 or Unicode to store characters in a string therefore they use TWO bytes or more bytes to store one character but in those cases, the data type is generally something else, char is generally considered 1 byte.

Numbers are stored in data types such as byte, word, integer, long, double, float... the exact memory size these use depends on the programming language and the computer architecture ... byte takes 1 byte of memory, word takes 2 bytes (word is less used these days) , integer takes 4 bytes of memory and so on. Long data type is somewhat of a special case, since it's exactly 32 bit long on 32 bit operating systems ( 4 bytes), four bytes are often used to store memory pointers and stuff specific to 32 bit operating systems - on 64 bit operating systems, as memory pointers are now 64 bit, some programming languages treat the long data type as being 8 bytes long.

 

For C# here's an example of what data types are available: https://msdn.microsoft.com/en-us/library/cs7y5x0x(v=vs.90).aspx You can see that integer is 32 bit long (4 bytes) and long is 8 bytes long and char is unicode aware, so you really should use the byte data type when creating an array instead of the char data type.. IF you're writing your application in C#

 

If you have a number (let's say of data type integer, taking up to 4 bytes in memory), the programming language you use SHOULD have some built in functions that would allow you to easily take the 4 bytes and copy them into an array of 4 byte values.

In C#, there's a built in class which has lots of functions that do that.. and you can simply say something like this  byte[] the_array = BitConverter.GetBytes(the_number);  and you get an array with 4 elements in it, the four bytes that were used to store the number in memory.

Without this class, you could have used memory copying functions

 

Even without built in functions, you can do it yourself with something like this, i'm writing this straight into the post so it may not be quite valid code but you should understand the idea :

 

unsigned int the_number;

unsigned byte temp_number;

byte[] the_array;

temp_number = the_number / 0x1000000;  //  0x1000000 = 256x256x256 in hexadecimal

the_array[0] = temp_number; the_number = the_number - temp_number * 0x1000000;

temp_number = the_number / 0x10000;  //  0x10000 = 256x256 in hexadecimal

the_array[1] = temp_number; the_number = the_number - temp_number * 0x10000;

temp_number = the_number / 0x100;  //  0x100 = 256 in hexadecimal

the_array[2] = temp_number; the_number = the_number - temp_number * 0x100;

the_array[3] = the_number; // what's left after dividing by 256^3, 256^2 and 256 is the remainder, a number between 0 and 255

 

So for example, let's say we have the number  305,419,896  in a 4 byte variable.. which I chose intentionally for this example, because it happens to be 0x12345678 in hexadecimal...so we have :

305,419,896 / 256^3 = 18  ( 0x12 in hexadecimal) and number becomes 305,419,896 - (18x256^3)  = 305,419,896  - 301,989,888 = 3,430,008

3,430,008 / 256^3 =  52 ( 0x34 in hexadecimal) and number becomes   3,430,008  - 52 x 256^2 = 22,136

22,136 / 256 = 86 (0x56 in hexadecimal) and number becomes 22,136 - (86x256) = 120 which is 0x78 in hexadecimal

so we have our four bytes without resorting to any built in functions.

 

 

 

 

 

Link to comment
Share on other sites

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×