Python 3: I have made a program that uses a lot of RAM; in short, I want to reduce how much it uses.

Poet129

I have a program that mostly runs one other, rather large program which uses a lot of RAM - about 4.5 GB per instance. I'm currently running an instance per CPU thread, which works well for my setup. However, I've been working on making it GPU accelerated, which would require running an instance per GPU thread - obviously, that is a LOT more threads. I've already gotten everything working, but I have to limit it to ~5-7 CUDA cores or it uses too much RAM and slowly crashes my PC. My proposed solution would be RAM data deduplication; I'm not worried about security in this case. However, I have no clue how to do this, or whether it would be the best way. BTW, the subprogram's main issue is that it requires a really big DLL, but the DLL never changes, so I was hoping there would be a way to store it only once in RAM, even with the program running multiple times simultaneously. Any ideas?


Can't really help without knowing more about the program.

SPEC LIST:

  • CPU: AMD Ryzen 9 5950X w/ NZXT Kraken Z73 360mm Liquid Cooler
  • GPU: NVIDIA RTX 3090 FE
  • RAM: Corsair Vengeance LPX 32GB (4 x 8GB) 5000MHz CL18
  • Motherboard: MSI MEG X570 Godlike
  • SSD: Samsung 980 Pro PCIe 4.0 1TB (x3)
  • PSU: Corsair AX1600i
  • Case: NZXT H710
  • Monitor: Alienware AW2521H 25-inch 360Hz 1ms

2 minutes ago, cm992 said:

Can't really help without knowing more about the program

It uses PyTorch and generates a set number of random numbers along with using a seed. This is using CUDA acceleration as of now rather than the CPU.


1 minute ago, Poet129 said:

It uses PyTorch and generates a set number of random numbers along with using a seed. This is using CUDA as of now, rather than the CPU.

Corrected grammar.


If you load lots of data as a whole, then all of it has to be stored in RAM. This can be optimized by processing chunks of data at a time - either by streaming it or loading it in batches. Aside from that, it's hard to tell without the code and so on.
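For example, something like this (handle is whatever your per-chunk work is):

```python
def process_in_chunks(path, handle, chunk_size=64 * 1024 * 1024):
    # Read and process the file 64 MB at a time so RAM use stays bounded,
    # instead of pulling the whole thing into memory at once.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            handle(chunk)

# e.g. count the bytes of a huge file without ever loading it whole:
# process_in_chunks("big.bin", lambda chunk: print(len(chunk)))
```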


I'm curious, just HOW big is the DLL?

 

As Sakuriru said, in order to reduce memory duplication (i.e. have threads share resources), you need your threads under the same process.
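A rough Python sketch of the difference (the big resource here is just a stand-in for your DLL):

```python
import concurrent.futures as cf

# Stand-in for the ~4.5 GB resource; in the real program this would be
# the big DLL, loaded exactly once for the whole process.
big_resource = bytes(100_000_000)

def worker(task):
    # Every thread reads the same big_resource object - no copies are made.
    return big_resource[task] + task

# Threads share the process's memory; a ProcessPoolExecutor would instead
# pay for one copy of big_resource per worker process.
with cf.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(worker, range(100)))
```

One caveat: Python threads contend on the GIL for pure-Python work, though heavy C++/CUDA operations (like PyTorch kernels) release it.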


2 hours ago, shadow_ray said:

Hi? I don't get the mention lol

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


Yeah, I have a very hard time following the code.

Don't hardcode "C:\program files" in your code, as running scripts from there requires administrative rights, and not everyone has or is willing to give administrative rights.

Opening lots of processes and creating temporary files for a few bytes is stupid... figure out something better. Worst case scenario, you can make your main program a server (listening on a TCP/IP port) and launch multiple "clients" that request data, do the work, then send it back as a reply to your main application. (Your server would only need 3-4 commands: compress this, uncompress this, wait n seconds, die.)
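Something like this, with zlib standing in for whatever your actual compressor is (the protocol here is made up):

```python
import socketserver
import zlib

class CompressHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Made-up protocol: one line "COMPRESS <n>" or "DECOMPRESS <n>",
        # followed by n bytes of payload; the reply is the processed bytes.
        command, size = self.rfile.readline().split()
        data = self.rfile.read(int(size))
        if command == b"COMPRESS":
            self.wfile.write(zlib.compress(data))
        elif command == b"DECOMPRESS":
            self.wfile.write(zlib.decompress(data))

if __name__ == "__main__":
    # Clients connect, send work, get the result back - no temp files needed.
    with socketserver.TCPServer(("127.0.0.1", 9999), CompressHandler) as server:
        server.serve_forever()
```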

 

I don't get what you're doing with picking the seed based on the number of cores, or what the hell...

 

Also, you make a silly assumption: that the random generator is platform-neutral/independent. Do you know for sure the randomness is the same on Intel and on AMD processors, that it's the same on an old Pentium MMX and a modern Ryzen, or on an ARM CPU?

 

Maybe explain the algorithm for people like me who don't understand it.

Let's say I want to replicate what you do in PHP or Free Pascal or JavaScript... what are the steps?

 

As an example, here are all the steps you need to decompress PALMDOC, which was the compression used in old versions of MOBI files (ebooks). Knowing the below, you can easily figure out how to compress something so that the decompressor would work and recover the original content:

Quote

Read a byte from the compressed stream. If the byte is:

  • 0x00: "1 literal" copy that byte unmodified to the decompressed stream.
  • 0x09 to 0x7f: "1 literal" copy that byte unmodified to the decompressed stream.
  • 0x01 to 0x08: "literals": the byte is interpreted as a count from 1 to 8, and that many literals are copied unmodified from the compressed stream to the decompressed stream.
  • 0x80 to 0xbf: "length, distance" pair: the 2 leftmost bits of this byte ('10') are discarded, and the following 6 bits are combined with the 8 bits of the next byte to make a 14 bit "distance, length" item. Those 14 bits are broken into 11 bits of distance backwards from the current location in the uncompressed text, and 3 bits of length to copy from that point (copying n+3 bytes, 3 to 10 bytes).
  • 0xc0 to 0xff: "byte pair": this byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80.

Repeat from the beginning until there are no more bytes in the compressed file.

PalmDOC data is always divided into 4096 byte blocks and the blocks are acted upon independently.
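To make that concrete, here's a quick Python transcription of the quoted steps (my own untested sketch, but it follows the rules above one-to-one):

```python
def palmdoc_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 0x00 or 0x09 <= b <= 0x7F:
            out.append(b)                # 1 literal, copied unmodified
        elif 0x01 <= b <= 0x08:
            out += data[i:i + b]         # copy the next 1..8 literals
            i += b
        elif 0x80 <= b <= 0xBF:
            # 14-bit "distance, length" item: 11 bits of distance backwards,
            # 3 bits of length (n + 3 = 3..10 bytes to copy)
            pair = ((b & 0x3F) << 8) | data[i]
            i += 1
            distance = pair >> 3
            length = (pair & 0x07) + 3
            for _ in range(length):
                out.append(out[-distance])
        else:  # 0xC0..0xFF: a space plus the byte XORed with 0x80
            out += b"\x20" + bytes([b ^ 0x80])
    return bytes(out)
```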

 

Write something similar to the above for your compression scheme.

 

You're messing around with random and seeds for random, all based on some assumptions that aren't necessarily true.


1 hour ago, mariushm said:

Yeah, I have a very hard time following the code.

 

Don't hardcode "C:\program files" in your code, as running scripts from there requires administrative rights, and not everyone has or is willing to give administrative rights.

 

Opening lots of processes and creating temporary files for a few bytes is stupid... figure out something better. Worst case scenario, you can make your main program a server (listening on a TCP/IP port) and launch multiple "clients" that request data, do the work, then send it back as a reply to your main application. (Your server would only need 3-4 commands: compress this, uncompress this, wait n seconds, die.)

 

Also, you make a silly assumption: that the random generator is platform-neutral/independent. Do you know for sure the randomness is the same on Intel and on AMD processors, that it's the same on an old Pentium MMX and a modern Ryzen, or on an ARM CPU? You're messing around with random and seeds for random, all based on some assumptions that aren't necessarily true.

 

I don't get what you're doing with picking the seed based on the number of cores, or what the hell... Maybe explain the algorithm for people like me who don't understand it.

Ok.

 

What would you recommend instead of "C:\Program Files" that everyone will have?

 

There will only ever be one temp file, which I will explain in a moment.

 

This assumption is not entirely baseless; I've tested with a few different computers and configurations. However, you are correct in one case: if you want to use a GPU to compress and a CPU to decompress, you are out of luck, as the algorithm will give you a different number. But the different CPUs I've tested work fine, and different GPUs likewise give the same result as each other. They are just not intercompatible between GPUs and CPUs.
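You can see the CPU/GPU split for yourself with something like this (seed 42 is arbitrary):

```python
import torch

# Same seed on a CPU generator: reproducible run after run
cpu_gen = torch.Generator(device="cpu").manual_seed(42)
print(torch.randint(0, 10, (8,), generator=cpu_gen))

# The same seed on a CUDA generator uses a different algorithm entirely,
# so this stream will NOT match the CPU one:
if torch.cuda.is_available():
    gpu_gen = torch.Generator(device="cuda").manual_seed(42)
    print(torch.randint(0, 10, (8,), device="cuda", generator=gpu_gen))
```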

 

It goes something like this....

User inputs file.

First, you get the binary-counted version of the file - that is, the whole file read as one big number. (This is one of the first things you are taught in basic CS classes: binary counting.)

Second, we find the string length of that number - so for 1,234,567 it would be 7.

Third, we find the first seed (starting at 0 and counting up by 1) that generates that same number, given that you generate as many digits as the number is long (the length from the previous step).

Then, store the seed along with the length of the number.

 

The reason this has multiple subprocesses is to allow for multithreading; otherwise it would take literally forever. The workload is automatically split up across all CPU or GPU threads. The first thread to respond (the first to have a correct seed) writes the seed to that temp file from before; when the host program detects this, it kills all the other subprograms, writes out the seed along with the string length of the number, and deletes the temp file.
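In simplified, single-threaded form the search is roughly this (the real version splits the seed range across all CPU or GPU threads, and the exact torch call here is illustrative):

```python
import torch

def find_seed(digits: str, max_seed: int = 10**7):
    # Find the first seed whose generated digit stream equals the target.
    n = len(digits)
    target = torch.tensor([int(d) for d in digits])
    for seed in range(max_seed):
        gen = torch.Generator().manual_seed(seed)
        candidate = torch.randint(0, 10, (n,), generator=gen)
        if torch.equal(candidate, target):
            return seed          # the "compressed" file is just (seed, n)
    return None                  # nothing found within the search budget
```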


3 minutes ago, gabrielcarvfer said:

I don't get this part. You want a seed that generates a number equal to the length of your number?

 

If that is the case, just precompute a ton of the generated values and their seeds in a dictionary and dump it with the pickle module.

 

After that, you can either access the dictionary by the seed and get the generated length, or create a new dictionary mapping each length to a list of seeds (which can also be pickled to prevent further reprocessing).

 

Way faster and easier than reprocessing the same stuff time and time again.

This would be faster; however, that would be one huge pickle file... There would be 2^(byte count × 8) combinations. Have fun with that, man.
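For reference, what gabrielcarvfer is describing would be something like this, just with astronomically more seeds:

```python
import pickle
import torch

# Precompute seed -> generated digits once (tiny toy range here)
seed_to_digits = {}
for seed in range(100_000):
    gen = torch.Generator().manual_seed(seed)
    seed_to_digits[seed] = tuple(torch.randint(0, 10, (8,), generator=gen).tolist())

# Invert it: digits -> smallest seed that produces them
digits_to_seed = {}
for seed, digits in seed_to_digits.items():
    digits_to_seed.setdefault(digits, seed)

# Dump to disk so it never has to be recomputed
with open("seed_cache.pkl", "wb") as f:
    pickle.dump(digits_to_seed, f)
```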


Just now, gabrielcarvfer said:

If I understood things correctly, you're actually trying to compress the input file by replacing a bytestream with the equivalent seed that generates the same bytestream. Is that correct?

 

If that is the case, isn't it easier to find the seed by doing the inverse operation of the generator?

You are correct; however, I have no idea where to find out how to do that. Obviously it's in the PyTorch source code, but I mean in a way that will consistently give the smallest exact seed without any guessing.


2 minutes ago, gabrielcarvfer said:

Hmm, it depends on the pseudorandom generator. 

In your case, you're trying to find your way by using the Mersenne Twister generator built into Python (randint). I don't think it is guaranteed to work for large strings, but I can see it working for smaller ones. I've never looked into the details, so I'm not sure if it has reversible operations.

Does PyTorch's torch.randint just equate to random.randint? I thought they were different.


8 hours ago, gabrielcarvfer said:

Oh, my bad. They seem to be different. The PyTorch code is massive, but it seems like they use NumPy's randint as a backend. They moved from the Mersenne Twister to PCG-64.

But this generator isn't really that important. The important thing is that you can choose whatever pseudorandom generator you want, as long as it can produce the series of results your file contains with a consistent set of operations.

E.g. I think you could inject 10 numbers into an LFSR, and use the generated number + the last injected number to work backwards. That way you could compress 10 numbers into 2. The problem is guaranteeing things are reversible.

Sorry, but I think I was overthinking and you will end up requiring a ton of computation to find the polynomials that describe the operations anyway. Guess I need some sleep. 😴

 

Memoization is the way to go. Dump all you can to the disk and continue what you're doing.

Okay, hey, thanks for trying to help; you are one of the first not to call this stupid...

 

After calculating the math, we would have to store all combinations for the byte size; we couldn't do just a few... And even just to store every four-byte combination we would have to store ~18 GB of dictionary files. Also keep in mind four bytes won't compress at all, as it is already too small. So I think we are going to need to stick to calculation rather than storage.

 

The main issue with computation right now, at least for the people who are okay using their CUDA GPUs, is that to my understanding CUDA is Nvidia-only, and that currently we can't use all the CUDA cores anyway, since we need to lower how much memory is in use. And CPUs just don't have enough threads right now, at least for most people.

 

We could achieve this with deduplication on, say, a page file that is stored in RAM. That way we only have to store the ~5 GB of DLLs in memory once. However, I'm not sure how we could do this in Windows... I think I have a way we could do it on Linux, though...

 

First we would create a ~10 GB RAM disk, put an OpenDedup volume on it, then just put a regular page file on that. The reason this wouldn't work for Windows is that most of the time we can't "hot add" a page file in Windows like we can in Linux.

 

However, even that has issues: we would need some kind of setup for it, and it assumes the user has 10 GB of RAM to spare. We could use regular storage instead, which would work on both Linux and Windows, but then that set of DLLs becomes really slow to access. And I still don't know if OpenDedup is even "live" anyway - what I mean is whether it only stores the compressed version on disk, or whether it also keeps a decompressed copy for access beyond a single in-memory instance.


25 minutes ago, gabrielcarvfer said:

Trying new things is not stupid by definition.

 

The only thing I can say in that sense is that your approach seems too naïve, in that it tries to brute-force the solution.

 

I would start by looking at pseudorandom generators and at what kinds of sequences of numbers they can generate, because you might be trying to do something that is inherently impossible.

You may need to perform some analysis on the file before even choosing the generator. Or you can go the autoencoder/decoder route: overfit the network so that it can reproduce your file from only the intermediate weights and the decoding neural network (usually appropriate for signal compression, which can be lossy - not for data compression, which must be lossless).

Okay.

 

This is just the only way I could think of that should theoretically work every time without many issues, speed being the main one. For my scenario that can be fixed with multithreading, which is easy to do. I just need to figure out how to store those DLLs only once, or find a better method overall to make this work.

 

Sounds interesting; I will look into it.
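From a quick look, that route would be something like this toy overfit sketch (my rough reading of gabrielcarvfer's suggestion - and lossy, so not a drop-in replacement):

```python
import torch
import torch.nn as nn

data = torch.rand(1, 256)          # stand-in for a chunk of the file in [0, 1]
model = nn.Sequential(
    nn.Linear(256, 16),            # encoder: 256 values -> 16 numbers
    nn.ReLU(),
    nn.Linear(16, 256),            # decoder: shipped alongside those 16 numbers
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5000):              # deliberately overfit this one sample
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(data), data)
    loss.backward()
    opt.step()
print(loss.item())                 # small, but not exactly zero -> lossy
```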


So ... here's what I was thinking.

 

You can generate a bunch of "dictionaries" with random seeds. 

For example, let's say you initialize your random number generator with seed 1 and generate 16,777,216 numbers, each of them 4 bytes (as they're 32-bit numbers) - or you could generate 16,777,216 numbers that use 1 byte each.

This means that for each seed you'll use 16 MB of memory (for the 1-byte numbers) or 64 MB of memory (for the 4-byte numbers).

 

So you could have 256 dictionaries calculated, each with seed 1, 2, 3, ..., 256 (or whatever seeds you want).

Your compressor will use 256 x 16..64 MB = 4..16 GB of memory.

 

Now you can start reading the input file: take 4 bytes and search your memory to see if you can find this sequence in your 4..16 GB of random data. If you do find it, you write to the output (compressed) file the dictionary id (1 byte, only if it changes), then the offset of where in that seed's dictionary the sequence can be found, and also a length, in case you manage to match a longer sequence.

 

For example, let's say you have this sequence of bytes in the file:

0x04 0xEF 0x65 0x00 0x10 0x7C

 

Let's say positions 1000...1005 in the dictionary with seed 5 hold these numbers: 0x04 0xEF 0x65 0x00 0xAA 0xCC

And positions 5100...5105 in the dictionary with seed 200 hold these numbers: 0x04 0xEF 0x65 0x00 0x10 0xAC

 

You could compress 4 bytes as [dictionary=5][offset=1000][length=4]  OR

You could compress 5 bytes as [dictionary=200][offset=5100][length=5]

 

Because it's highly unlikely you'll find more than, let's say, 10-12 bytes of matching randomly generated data, you could use only 3 bits for the length: 0 means 4 bytes, 7 means 7+4 = 11 bytes match. Then it's up to you how you use the remaining 5 bits of the byte.

For example, one idea would be:

byte 1:

  • bit 7 = seed changes or not
  • bit 6 = number of bytes used for the offset (0 = 2 bytes used, 1 = 3 bytes used)
  • 3 bits = the length: 0..7, +4 = 4..11 bytes to copy from the offset that follows
  • remaining 3 bits = free for whatever you want

 

bytes 2 and 3 (or bytes 2..4): the offset from where to get data in the dictionary

byte 4 or 5 (depending on how many bytes were used for the offset): the dictionary seed (IF the seed changes)

 

So this way you could pack each match into a minimum of 3 bytes or a maximum of 5 bytes (if the seed changes and the offset is also higher than 65,536).
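A toy version of the search side, just to show the mechanics (NumPy standing in for the generator, with sizes shrunk so it runs in seconds):

```python
import numpy as np

DICT_LEN = 1 << 20   # 1 MB of random bytes per seed here; 16 MB works the same

# One precomputed "dictionary" of random bytes per seed
dictionaries = {
    seed: np.random.default_rng(seed)
            .integers(0, 256, DICT_LEN, dtype=np.uint8)
            .tobytes()
    for seed in range(1, 257)
}

def find_match(window: bytes):
    # Return (seed, offset) of the first dictionary containing window, else None
    for seed, table in dictionaries.items():
        offset = table.find(window)
        if offset != -1:
            return seed, offset
    return None

# e.g. for each 4-byte window of the input file:
#   find_match(file_bytes[i:i+4]) -> emit [dictionary id][offset][length]
```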

 

 

