
Hey there 🙂

 

I will need to start working with very large files / file sets: mean size 3-4TB.

 

Transferring these is thus inconvenient due to the time it takes (my infrastructure doesn't support 10GbE yet).
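For a rough sense of scale, here is a back-of-the-envelope sketch of the transfer time. The 80% link efficiency is an assumption, not a measurement, and `transfer_hours` is just an illustrative helper name.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move size_tb (decimal terabytes) over a link_gbps link.

    efficiency is an assumed fraction of the line rate that is actually
    achieved after protocol overhead, disk limits, etc.
    """
    size_bits = size_tb * 1e12 * 8            # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency
    return size_bits / effective_bps / 3600   # seconds -> hours


for size_tb in (3, 4):
    for link_gbps in (1, 10):
        hours = transfer_hours(size_tb, link_gbps)
        print(f"{size_tb} TB over {link_gbps} GbE: ~{hours:.1f} h")
```

Under that assumption, a 3-4TB set takes roughly 8 to 11 hours over 1GbE, versus about an hour over 10GbE.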

A few years ago I looked into how people doing the same kind of work deal with this limitation. The answer was that you move the worker rather than the data: the worker goes to analyze the data where they are produced, because moving them is so inconvenient and time-consuming.

 

Do you know if any major breakthrough has happened since then, and if any alternative is now available?

 

Thank you very much in advance for your help.

Best,

-a-

 


Can the worker remote into the system with the data on it?

 

 


On 5/30/2023 at 6:48 AM, asheenlevrai said:

"I will need to start working with very large files / file sets: mean size 3-4TB. [...] Do you know if any major breakthrough happened since then and if any alternative option would now be available?"

First of all, where is your data stored? Usually companies have this kind of stuff in the cloud on something like S3, so your workers can just fetch the required bits of data in parallel.

 

Otherwise, if you're doing it locally, are you planning on doing it on a single machine or distributed with something like Spark? You could totally do this on a single machine that's capable enough (even if you need to chunk the data), so I wouldn't really consider those datasets to actually be "big data".
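To illustrate what chunking the data on a single machine could look like, here is a minimal sketch. It assumes, purely for illustration, a file of raw native-endian float64 values whose size is a multiple of 8 bytes. The real instrument format is not known from this thread, and `iter_chunks` and `streaming_mean` are hypothetical helper names.

```python
import array
import os
import tempfile
from typing import Iterator


def iter_chunks(path: str, chunk_bytes: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield the file piece by piece so only one chunk is in memory at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk


def streaming_mean(path: str, chunk_bytes: int = 64 * 1024 * 1024) -> float:
    """Mean of a file of raw float64 values (assumed format), chunk by chunk."""
    if chunk_bytes % 8 != 0:
        raise ValueError("chunk_bytes must be a multiple of 8 (one float64)")
    total, count = 0.0, 0
    for chunk in iter_chunks(path, chunk_bytes):
        values = array.array("d")
        values.frombytes(chunk)   # raises ValueError if the chunk is not whole float64s
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


if __name__ == "__main__":
    # Tiny demo file standing in for a multi-terabyte dataset.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
        tmp.write(array.array("d", range(1000)).tobytes())
    try:
        print(streaming_mean(tmp.name, chunk_bytes=800))  # prints 499.5
    finally:
        os.remove(tmp.name)
```

Memory use is bounded by `chunk_bytes` rather than by the file size, which is the point of the approach. Whether a given analysis can be expressed as a running computation like this depends on the app.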


Right now, the data are stored on a local machine (next to the instruments that generate them). They're also archived on a local server AFAIK.

 

Workers physically go to the local machine (same city) but I guess working remotely shouldn't be an issue.

 

I'm not sure VMs (so that multiple users could work in parallel on the remote PC) are an option: I'm not admin on the PC, and the analysis app is probably quite resource hungry.


Data on a drive -> mail?

Just be warned you usually need A LOT of server to handle a 3TB working set. Think 10TB RAM. 
If it's a case where the work can be broken into pieces then that's less of a concern. 
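For the drive-in-the-mail idea, a rough estimate of what it amounts to as a "network link" might look like the sketch below. The drive speed (500 MB/s) and the 24-hour transit time are assumptions for illustration only, and both function names are hypothetical.

```python
def sneakernet_hours(size_tb: float, drive_mb_per_s: float, transit_hours: float) -> float:
    """Door-to-door hours: copy onto the drive, ship it, copy off again."""
    copy_hours = size_tb * 1e12 / (drive_mb_per_s * 1e6) / 3600
    return 2 * copy_hours + transit_hours


def effective_gbps(size_tb: float, total_hours: float) -> float:
    """Average throughput in Gbit/s of moving size_tb terabytes in total_hours."""
    return size_tb * 1e12 * 8 / (total_hours * 3600) / 1e9


# Assumed numbers: 4 TB, a 500 MB/s drive, 24 h in transit.
hours = sneakernet_hours(4, drive_mb_per_s=500, transit_hours=24)
print(f"~{hours:.1f} h door to door, ~{effective_gbps(4, hours):.2f} Gbit/s effective")
```

With those assumed numbers this comes out at about 28 hours, or roughly 0.31 Gbit/s, so it only wins if the network link is slow or shared. Working directly off the shipped drive would skip the second copy.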

