
SQL advice needed.

Heartbreaker3120

I need a tool that will look up all lines from file1 in file2 (something like grep -f file1 file2), but I need it to be really fast for big files. I expect file1 to change for every lookup and to have up to a few million lines; file2 will stay the same, though it might be updated from time to time, and may contain over a few billion lines (around 150 GB file size).

 

file1 format: item
file2 format: item:value

 

I've heard a program with a hashmap would be the fastest option, but apparently I'd need a lot of RAM for it. So I'd like it to be done with a NoSQL database like Apache Cassandra. I'd need a table with two columns, item and value, and some program using the API that would take all the lines from file1, look them all up in the table, and return item:value for every item that matches.

 

I need it to be really fast, so I'd like to know what speeds I can expect. Is disk speed the only important factor?


Oh boy, as if the internet really needed another venue to argue SQL vs. NoSQL.

 

If all you're doing is just key-value lookups, then yes, a hashmap is going to be the fastest way. However, as you mentioned, that would require a system with an equivalent amount of RAM to store all the data in (technically more, for OS overhead and buffer space).
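For reference, the pure in-RAM approach being described is small enough to sketch. This is a minimal, hypothetical version (the inline lists stand in for actually reading file1 and file2 from disk), assuming the formats from the original post:

```python
# Minimal sketch of the in-memory hashmap approach: load file2 ("item:value"
# per line) into a dict, then stream file1 and emit matches as "item:value".
# Only viable when file2 fits in RAM; the inline lists are placeholders for
# reading the real files.

def load_table(lines):
    """Build an item -> value map from 'item:value' lines."""
    table = {}
    for line in lines:
        item, _, value = line.rstrip("\n").partition(":")
        table[item] = value
    return table

def lookup(table, items):
    """Yield 'item:value' for every item present in the table."""
    for item in items:
        key = item.rstrip("\n")
        if key in table:
            yield f"{key}:{table[key]}"

file2 = ["apple:1", "pencil:42", "book:7"]   # stand-in for reading file2
file1 = ["pencil", "missing", "apple"]       # stand-in for reading file1

table = load_table(file2)
print(list(lookup(table, file1)))  # matches come out in file1 order
```

At 150 GB of source data the dict alone would need well over 150 GB of RAM, which is exactly the problem the OP is trying to avoid.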

 

One of the driving forces behind databases is that they enable you to perform operations on a dataset that is far, far larger than your computer's physical ram, usually at the cost of a little extra space for indexes to enable fast lookup of the data.

 

There are multiple databases and types of databases to choose from, each with their own positives and negatives.  I am aware of the resistance in recent years to using SQL databases, and as I mentioned, there are certainly reasons why you might not want to use an SQL database.  Despite that, however, I still recommend starting out with an SQL database like MySQL (which is FOSS).  With all the years of development that have gone into them, they are generally some of the most reliable, feature-rich, and flexible databases you can use.  And that last part is, to me, the most important.  You don't realize just how valuable having a flexible system is until you have some great idea but realize that using it would require making an absolute shit-ton of changes to your existing setup.  Using a higher-performance but less flexible database is, to me, an example of premature optimization.

 

 


8 hours ago, Heartbreaker3120 said:

I need a tool that will look up all lines from file1 in file2 (something like grep -f file1 file2), but I need it to be really fast for big files. I expect file1 to change for every lookup and to have up to a few million lines; file2 will stay the same, though it might be updated from time to time, and may contain over a few billion lines (around 150 GB file size).

 

file1 format: item
file2 format: item:value

 

I've heard a program with a hashmap would be the fastest option, but apparently I'd need a lot of RAM for it. So I'd like it to be done with a NoSQL database like Apache Cassandra. I'd need a table with two columns, item and value, and some program using the API that would take all the lines from file1, look them all up in the table, and return item:value for every item that matches.

 

I need it to be really fast, so I'd like to know what speeds I can expect. Is disk speed the only important factor?

Fast is going to be very, very subjective here (and will depend on what you really consider to be "fast").

 

To quickly answer your disk speed question: no, it is not the only important factor.

I think the most important factor is how exactly your data is stored/organized.

 

file2 is what I would focus on (as it is the large one).  Are items numeric?  Is file2 sorted?  If not, can it be sorted by item?  The reason is that sorted data is a lot easier to search than unsorted data.  (In 1, 4, 5, 8, 10 it is a lot easier to look up values than in 1, 4, 2, 10, 8: in the first case you can do a binary search, so you don't need to scan the entire list.)
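To make the sorted-vs-unsorted point concrete, here's a small sketch using the numbers from that example. With sorted data each lookup is O(log n) via binary search; with unsorted data you're stuck scanning every entry. (In practice you'd binary-search the sorted file via a block or offset index rather than a Python list.)

```python
import bisect

# Why sorting helps: on sorted data each lookup is a binary search,
# roughly log2(n) comparisons instead of a full scan.

sorted_items = [1, 4, 5, 8, 10]      # sorted: binary search works
unsorted_items = [1, 4, 2, 10, 8]    # unsorted: must scan every entry

def contains_sorted(items, target):
    """Binary search over a sorted list: ~log2(n) comparisons."""
    i = bisect.bisect_left(items, target)
    return i < len(items) and items[i] == target

print(contains_sorted(sorted_items, 8))   # True
print(contains_sorted(sorted_items, 7))   # False
print(8 in unsorted_items)                # True, but via a linear O(n) scan
```

For a few billion rows that's the difference between ~32 comparisons per lookup and billions.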

 

Truthfully though, if you really wanted a quick answer: just load MySQL onto a computer and create two tables.  Load them with the data from file1 and file2, run a join, and see how long it takes, to find out whether it's a reasonable time.
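That two-tables-and-a-join experiment can be sketched end to end. This hedged example uses SQLite (via Python's stdlib) as a stand-in for MySQL, with made-up table and column names; the important part is that `file2.item` is a primary key, so the join does an indexed lookup per file1 row rather than a scan:

```python
import sqlite3

# Sketch of the "load both files into tables and join" experiment.
# SQLite stands in for MySQL; the PRIMARY KEY on file2.item is what
# makes each join probe an index lookup instead of a table scan.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file2 (item TEXT PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE file1 (item TEXT)")

conn.executemany("INSERT INTO file2 VALUES (?, ?)",
                 [("apple", "1"), ("pencil", "42"), ("book", "7")])
conn.executemany("INSERT INTO file1 VALUES (?)",
                 [("pencil",), ("missing",), ("apple",)])

rows = conn.execute("""
    SELECT f2.item, f2.value
    FROM file1 AS f1
    JOIN file2 AS f2 ON f2.item = f1.item
""").fetchall()
print(rows)
```

For the real 150 GB table you'd load file2 once with a bulk import, then for each new file1 truncate and reload only the small table before re-running the join.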

3735928559 - Beware of the dead beef


Considering that the file is gonna be so large, the biggest performance gain would probably come just from placing it on an NVMe SSD and streaming data from there... unless you can get a server with loads of memory to cache the file.

 

Also, considering it's gonna change so rarely, I would work on optimizing that big file... for example, use the length of the "item" text (number of bytes) and the first 1 or 2 bytes/letters of the name to group items into "blocks" of data, then sort the records in each block and store in each block only the records which fit its requirements.

For example, let's say you use 64 MB blocks of data: 150 GB worth of records gives you a minimum of around 2,400 such blocks.

If you want to look up "pencil", for example, you know it's a 6-byte-long text that starts with "pe", so you check which of those 2,400+ blocks contain records that are 6 bytes long and start with "pe", and only look in those blocks to see if you have an item called "pencil" and what value it has.
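The grouping scheme described above can be sketched with an in-memory dict of buckets standing in for the on-disk blocks (all names and data here are illustrative):

```python
from collections import defaultdict

# Sketch of the blocking idea: group file2 records by (item length,
# first two characters), so a lookup only scans the one bucket that
# could possibly contain the key. On disk these buckets would be
# fixed-size block regions; a dict of lists stands in for them here.

records = [("pencil", "42"), ("pen", "7"), ("peach", "3"), ("apple", "1")]

blocks = defaultdict(list)
for item, value in records:
    blocks[(len(item), item[:2])].append((item, value))

def lookup(item):
    """Scan only the bucket matching the item's length and 2-char prefix."""
    for key, value in blocks.get((len(item), item[:2]), []):
        if key == item:
            return value
    return None

print(lookup("pencil"))  # "42": only the (6, "pe") bucket is scanned
print(lookup("pear"))    # None: no (4, "pe") bucket even exists here
```

The win is that most blocks are ruled out before any data is read; the cost is re-running the grouping whenever file2 changes, which the poster argues is rare enough not to matter.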


21 hours ago, Heartbreaker3120 said:

I need a tool that will look up all lines from file1 in file2 (something like grep -f file1 file2), but I need it to be really fast for big files. I expect file1 to change for every lookup and to have up to a few million lines; file2 will stay the same, though it might be updated from time to time, and may contain over a few billion lines (around 150 GB file size).

 

file1 format: item
file2 format: item:value

 

I've heard a program with a hashmap would be the fastest option, but apparently I'd need a lot of RAM for it. So I'd like it to be done with a NoSQL database like Apache Cassandra. I'd need a table with two columns, item and value, and some program using the API that would take all the lines from file1, look them all up in the table, and return item:value for every item that matches.

 

I need it to be really fast, so I'd like to know what speeds I can expect. Is disk speed the only important factor?

A non-engine-based database will be limited entirely by your disk speed, plus whatever overhead it adds. If you plan to go that route, get the fastest storage possible. For a file that large, with seeking, you're looking at RAID SSD or RAID NVMe at minimum. There's really no use case where a file that large needs to be a single file, but if it must be, that's what you need.

 

The best case for a non-engine DB is to have your file split into more reasonable sizes and use a FAT32 drive. No security descriptors or other extra headers for file access, which makes it faster by A LOT.

 

Using C or C++ would make file handling much faster.

 

The best option of all is to use a database engine and stream the file data into the DB live. You have a process that checks the file for modifications, takes the new data, puts it into the DB, and flushes the file. This keeps the file small, and some DB engines can also use RAM tables (e.g. MSSQL) and be fast enough to query a couple million records per second.
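The ingest side of that setup can be sketched briefly. This hypothetical version uses SQLite as a stand-in for an engine like MSSQL, and takes the new lines directly rather than watching a real file; an UPSERT means a re-ingested item simply overwrites its old value:

```python
import sqlite3

# Sketch of "stream the file's new data into the DB and keep the file
# small". SQLite stands in for an engine like MSSQL. ingest() would be
# called by a watcher process whenever the source file grows; here it
# takes the new lines directly, and flushing the file is left out.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (item TEXT PRIMARY KEY, value TEXT)")

def ingest(new_lines):
    """Upsert 'item:value' lines so re-ingested items get the new value."""
    rows = [line.split(":", 1) for line in new_lines]
    conn.executemany(
        "INSERT INTO kv VALUES (?, ?) "
        "ON CONFLICT(item) DO UPDATE SET value = excluded.value",
        rows)

ingest(["apple:1", "pencil:42"])
ingest(["apple:9"])              # a later update overwrites apple's value

print(conn.execute("SELECT value FROM kv WHERE item = 'apple'").fetchone())
```

Once the data lives in the engine, each file1 batch becomes a join against an indexed table instead of a scan over a 150 GB file.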


If some pre-processing time is an option for getting both files into a workable format, something like a Bloom filter may work; if that's usable I'd recommend it over a plain hashmap, and if not, something like Redis may work. Parsing the files into in-memory data for fast lookups is your main challenge here. I would also recommend Rust or C++ for this.
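For anyone unfamiliar with the Bloom filter suggestion: it's a small bit array plus k hash functions that answers "definitely not present" or "probably present", so it can cheaply reject most file1 items before any expensive lookup into file2. A toy sketch (sizes are illustrative; a real filter derives the bit count and hash count from the expected item count and target false-positive rate, and you'd likely write it in Rust or C++ as suggested):

```python
import hashlib

# Toy Bloom filter: M-bit array, K hashes per item. A membership test
# that returns False is definitive; True means "probably present" and
# still needs a real lookup to confirm.

M, K = 1024, 3  # bits in the filter, hashes per item (toy values)
bits = [False] * M

def _positions(item):
    """Derive K bit positions for an item from salted SHA-256 digests."""
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def add(item):
    for p in _positions(item):
        bits[p] = True

def might_contain(item):
    """False means definitely absent; True means probably present."""
    return all(bits[p] for p in _positions(item))

for word in ["apple", "pencil", "book"]:
    add(word)

print(might_contain("pencil"))   # True
print(might_contain("zebra"))    # False, with very high probability
```

The appeal for this workload: the filter for billions of keys can fit in a few GB of RAM (far less than a full hashmap), and every file1 item it rejects never touches the disk at all.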

