
Compression

Robbertz

My buddies and I are working on a program for ourselves, just for the fun of it, and we realized that we have to parse a crap ton of data from logs. At what point do we need to worry about compressing the huge Excel workbooks and other log files? How large can we let them get before we really notice a performance hit? I know this is kind of ambiguous, but I've never had to deal with compression within a program, so I'd have a lot to learn.

 

Thanks.


@Nuluvius Any tips here? I've not worked with large logs.

                     ¸„»°'´¸„»°'´ Vorticalbox `'°«„¸`'°«„¸
`'°«„¸¸„»°'´¸„»°'´`'°«„¸Scientia Potentia est  ¸„»°'´`'°«„¸`'°«„¸¸„»°'´


8 minutes ago, Robbertz said:


Depends. What do you mean by a "crap ton?"

 

1 MB? 100 MB? 

 

100 MB is a huge amount of text in a file. Hell, 1 MB is a HUGE text file. 

 

I have data files 20,000 lines long and Word/Excel handle them just fine. Of course, I use neither of those programs (unless I'm forced to) because they're awful... but that's beside the point.

 

Get yourself a real data analysis program and never worry about it again.

 

My personal favorite is Igor Pro.


They can be as large as your memory can fit and your processor can process.

 

I worked on a log eater that used a 32-bit version of Python. I found out later that when you feed it a 1.2 GB log, the script crashes because of memory issues. So I broke it down to process the log in chunks. I saw almost no performance difference, but that was probably down to how the log eater works: it searches byte by byte until it finds a signature before processing that section.
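Roughly, that kind of chunked scan looks like this in Python; the file name, chunk size and signature below are made-up placeholders, not the real ones:

SIGNATURE = b"BEGIN_RECORD"        # hypothetical record marker
CHUNK_SIZE = 16 * 1024 * 1024      # read 16 MB at a time so memory stays bounded

def find_signature_offsets(path):
    """Yield the absolute file offset of every occurrence of SIGNATURE."""
    keep = len(SIGNATURE) - 1      # tail bytes to carry over between chunks
    carry = b""
    base = 0                       # absolute offset of the start of `data`
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            data = carry + chunk
            pos = data.find(SIGNATURE)
            while pos != -1:
                yield base + pos
                pos = data.find(SIGNATURE, pos + 1)
            # keep the last few bytes so a marker split across chunks isn't missed
            carry = data[-keep:] if keep else b""
            base += len(data) - len(carry)

for offset in find_signature_offsets("example.log"):
    print("record starts at byte", offset)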

 

So if you have something that already really chugs away on the processor, compression will have a measurable impact. If it's something the processor just putters along on, then compression won't be a problem.
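For what it's worth, if the logs do end up compressed, Python's standard gzip module can stream a compressed file line by line, so the whole thing never sits decompressed in memory or on disk; the only cost is the extra CPU per read. A minimal sketch (the file name and the "ERROR" filter are placeholders):

import gzip

# Stream a gzip-compressed log line by line; only a small buffer is decompressed
# at any moment, so memory stays flat while the CPU pays for the decompression.
with gzip.open("app.log.gz", "rt", encoding="utf-8", errors="replace") as f:
    error_lines = sum(1 for line in f if "ERROR" in line)

print(error_lines, "lines contained ERROR")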


12 minutes ago, vorticalbox said:

@Nuluvius Any tips here? I've not worked with large logs.

Oi, I was heading off to bed; I'm really tired :/

 

It really does depend on what you mean by a 'performance hit'. Are you trying to store all of the data in memory as you are processing it? Or are you worried about disk space consumption?

 

If it's disk space, then it's usual to implement a circular logging mechanism, where configured triggers, such as a certain amount of space or a certain amount of time for instance, cause the logging to wrap around and write over the first file in the sequence. Another mechanism would indeed be to archive the logs, again based on trigger conditions...
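In Python, for instance, the standard logging.handlers module already ships size- and time-based rotation, which gives you that kind of trigger-driven loop without writing it yourself. A minimal sketch (the file name and limits are arbitrary):

import logging
from logging.handlers import RotatingFileHandler

# Size-triggered circular logging: once app.log reaches ~10 MB the handler rolls
# over to app.log.1, app.log.2, ... and the oldest of the 5 backups is overwritten,
# so total disk usage stays bounded.
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("this line goes to the rotating log")

TimedRotatingFileHandler is the time-based equivalent if the trigger should be an interval rather than a size.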

 

If you are worried about memory consumption, then do the IO and processing in chunks... One would not usually expect to load and process a truly massive file all at once. One can also implement threading to help, of course; the producer-consumer pattern comes to mind.
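A bare-bones sketch of the chunked producer-consumer idea with Python's threading and queue modules (chunk size, file name and the 'processing' step are all placeholders):

import queue
import threading

CHUNK_SIZE = 8 * 1024 * 1024       # 8 MB chunks; tune to taste
chunks = queue.Queue(maxsize=4)    # bounded queue keeps memory use capped

def producer(path):
    """Read the file in chunks and hand them to the consumer."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            chunks.put(chunk)
    chunks.put(None)               # sentinel: no more data

def consumer():
    """Process chunks as they arrive; counting bytes stands in for real parsing."""
    total = 0
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        total += len(chunk)        # placeholder for the actual work
    print("processed", total, "bytes")

reader = threading.Thread(target=producer, args=("example.log",))
worker = threading.Thread(target=consumer)
reader.start(); worker.start()
reader.join(); worker.join()

Bear in mind that with CPython's GIL, threads mostly help when the consumer is I/O-bound or calls into C libraries; for heavy pure-Python parsing, multiprocessing is the usual alternative.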

The single biggest problem in communication is the illusion that it has taken place.

