Working with BIG data (how to?)

asheenlevrai · May 30, 2023

Hey there

I will need to start working with very large files / file sets: mean size 3-4TB.

Transferring these is thus inconvenient due to the time it takes (my infrastructure doesn't support 10GbE yet).
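
For a rough sense of scale, the transfer time can be estimated from the file size and the link speed. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an ideal link running at full line rate with no protocol or disk overhead, so real transfers will be slower. `transfer_hours` is a name made up for this illustration.

```python
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Ideal time in hours to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # full line rate, no overhead (assumption)
    return seconds / 3600


for size in (3, 4):
    for speed in (1, 10):
        hours = transfer_hours(size, speed)
        print(f"{size} TB over {speed} GbE: ~{hours:.1f} h")
```

Under these assumptions a 3-4 TB set needs roughly 6.7 to 8.9 hours at 1GbE and well under an hour at 10GbE.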

A few years ago I looked into this issue and how people doing the same kind of work deal with that limitation. The answer was that you move the worker rather than the data: the worker goes and analyzes the data where they are produced, because moving them is so inconvenient and time-consuming.

Do you know if any major breakthrough happened since then and if any alternative option would now be available?

Thank you very much in advance for your help.

Best,

-a-

adm0n · May 30, 2023

Well, if you just have 1GbE, you can easily have VMs set up with a direct (or near-direct) connection to the data. That way people can remote into those workstations. I think that would be the most elegant solution here.

cmndr · May 31, 2023

Can the worker remote into the system with the data on it?

igormp · May 31, 2023

First of all, where is your data stored? Usually companies have this kind of stuff in the cloud on something like S3, so your workers can just fetch the required bits of data in parallel.
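
As a small illustration of "fetching the required bits in parallel": object stores such as S3 accept HTTP `Range` requests, so each worker can ask for its own byte range of one large object. The sketch below only computes the `Range` header values; the actual download calls are left out, and `byte_ranges` and the part size are hypothetical names and numbers chosen for the example.

```python
from typing import List


def byte_ranges(total_size: int, part_size: int) -> List[str]:
    """Split an object of total_size bytes into HTTP Range header values."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, total_size, part_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


# Tiny example: a 10-byte object split into parts of at most 4 bytes.
print(byte_ranges(10, 4))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each string can be handed to a separate worker, which then only pulls the part of the file it needs.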

Otherwise, if you're doing it locally, are you planning on doing it on a single machine or distributed with something like Spark? You could totally do this on a single machine that's capable enough (even if you need to chunk the data), so I wouldn't really consider those datasets to actually be "big data".
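
To show what chunking can look like on a single machine, here is a minimal sketch that streams a file in fixed-size pieces so memory use stays bounded no matter how large the file is. The per-chunk work here (a byte count and a SHA-256 checksum) is only a stand-in for whatever the real analysis does; `process_in_chunks` and the 64 MiB default are assumptions for the example.

```python
import hashlib
import os
import tempfile


def process_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Stream a file chunk by chunk; return (total_bytes, sha256 hex digest)."""
    digest = hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes in memory
            if not chunk:               # empty read means end of file
                break
            total += len(chunk)
            digest.update(chunk)        # stand-in for the real per-chunk work
    return total, digest.hexdigest()


# Demo on a small temporary file, read 256 bytes at a time.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000)
    demo_path = tmp.name
try:
    size, checksum = process_in_chunks(demo_path, chunk_size=256)
    print(size, checksum[:12])
finally:
    os.remove(demo_path)
```

This only works as-is when the analysis can be expressed as a running result over chunks; anything that needs random access to the whole set at once is a different story.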

asheenlevrai · June 1, 2023

Right now, the data are stored on a local machine (next to the instruments that generate them). They're also archived on a local server AFAIK.

Workers physically go to the local machine (same city) but I guess working remotely shouldn't be an issue.

I'm not sure VMs (so that multiple users can work in parallel on the remote PC) are an option: I'm not an admin on that PC, and the analysis app is probably quite resource-hungry.

cmndr · June 1, 2023

Data on a drive -> mail?

Just be warned you usually need A LOT of server to handle a 3TB working set. Think 10TB RAM.
If it's a case where the work can be broken into pieces then that's less of a concern.
