Master Disaster

Is there an easy way to move 4TB of data from a ZFS Volume to a Btrfs Volume?


Posted · Original Poster

I'm 99% sure I know the answer but here goes anyway...

 

My current NAS is a Ras Pi 4 running OMV with an 8TB USB HDD formatted with ZFS (I'm not 100% sure why I chose ZFS). It has always done the job for me, though I had to do some Linux hackery to get power management working on the USB HDD, which was fine. About 3 months ago I finally jumped from DLNA to Plex because my collection had grown huge and I got sick of having to organise everything manually, and this is where the problems started. It turns out the hackery I did to make my USB drive idle when not in use doesn't play nicely with Plex. (I should add that this is pretty much unfixable, as confirmed by multiple OMV authors and OMV plugin developers, so I'm not looking for help fixing this issue.) Plex doesn't like it when the OS tries to idle drives while it's using them, and since the Plex port for OMV is unofficial there's no integration, so the two can't talk to each other. Basically I either have a Plex server that crashes multiple times a day because the HDD idled, or I have an HDD spinning 24/7 in my bedroom. I don't want either.

 

I just ordered myself a Synology DS218+ with 2 x 4TB NAS HDDs and an 8GB RAM upgrade, all arriving tomorrow. I did my research and chose the Synology because it allows everything I want from it (Apache, PHP, MySQL, Transmission and certificate management), plus I'm assured the Plex port for it fully integrates with the Synology power management so the drives will idle when not in use. It also supports Active Directory, which is a nice bonus.

 

Now I have a conundrum, how the heck do I transfer 4.2TBs worth of data from my old NAS to my new NAS? I was hoping the Synology supported ZFS but everything I am reading says it doesn't so am I really dragging 4TBs of data from one NAS to another over my network? There must be a better way?

 

Any ideas would be greatly appreciated.


Main Rig:-

Ryzen 7 3800X | Asus ROG Strix X570-F Gaming | 16GB Team Group Dark Pro 3600Mhz | Samsung 970 Evo 500GB NVMe | Sapphire 5700 XT Pulse | Corsair H115i Platinum | WD Black 1TB | WD Green 4TB | EVGA SuperNOVA G3 650W | Asus TUF GT501 | Samsung C27HG70 1440p 144hz HDR FreeSync 2 | Windows 10 Pro X64 |

 

Server:-

Raspberry Pi 4 Model B running OMV Arrakis and an 8TB Seagate USB 3.0 external HDD


Not really a better way unless the NAS model you bought supports a USB3/TB connection and you do a 'local disk copy'. Otherwise you're up for a standard rsync-type migration of the data over the network.

Posted · Original Poster
2 minutes ago, leadeater said:

Not really a better way unless the NAS model you bought supports a USB3/TB connection and you do a 'local disk copy'. Otherwise you're up for a standard rsync-type migration of the data over the network.

The NAS does support USB file copy, unfortunately it doesn't support ZFS at all so I'm guessing that if I were to plug the HDD into the NAS it would refuse to access it.

 

Oh well, I guess my free weekend just got fully booked up.


1 minute ago, Master Disaster said:

The NAS does support USB file copy, unfortunately it doesn't support ZFS at all so I'm guessing that if I were to plug the HDD into the NAS it would refuse to access it.

 

Oh well, I guess my free weekend just got fully booked up.

Higher-model NASes support a USB/TB host connection rather than network, so the NAS would be mounted on the server/system with the ZFS disk in it, both being 'local disks'. I don't think the DS218+ supports that.

Posted · Original Poster
1 minute ago, leadeater said:

Higher-model NASes support a USB/TB host connection rather than network, so the NAS would be mounted on the server/system with the ZFS disk in it, both being 'local disks'. I don't think the DS218+ supports that.

Oh, I see what you mean. Yeah that would be a cool feature, means I could start the transfer and forget about it/not worry about babysitting it.



Over gigabit Ethernet it's going to take about 15 hours. I'd just use rsync and give it a day. Is that so bad?
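
As a rough sanity check on the 15 hour figure, here is a minimal sketch of the arithmetic. The throughput values are assumptions (gigabit Ethernet tops out at a theoretical 125 MB/s and real-world sustained rates are usually lower), not measurements from either NAS.

```python
# Rough transfer-time estimate. The rates tried below are assumed values,
# not measurements from the poster's hardware.

def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s MB/s (decimal units)."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return size_mb / rate_mb_s / 3600


if __name__ == "__main__":
    for rate in (80, 100, 112):
        print(f"{rate} MB/s -> {transfer_hours(4.2, rate):.1f} h")
```

Under these assumptions, 15 hours for 4.2 TB corresponds to a sustained rate of roughly 80 MB/s.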

Posted · Original Poster
Just now, Loote said:

Over gigabit Ethernet it's going to take about 15 hours. I'd just use rsync and give it a day. Is that so bad?

rsync wouldn't really work as I need to reorganise where everything is on the drive. The Ras Pi was my first NAS and I learned some valuable lessons about where and how things should be stored. Unless DSM has something akin to a file explorer so I can move files around directly on the NAS? (I wished OMV had this so many times).



Yes DSM has a file explorer.

Just launch a transfer to copy everything across, let it spend its 15 hours then reorganize.

You can also just mount the share on a computer and reorganize from there so as not to need to use the DSM desktop in a browser.

 


Desktop: i7-5960X 4.4GHz, Noctua NH-D14, ASUS Rampage V, 32GB, RTX2080S, 2TB NVMe SSD, 2x16TB HDD RAID0, Corsair HX1200, Thermaltake Overseer RX1, Samsung 4K curved 49" TV, 23" secondary

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB NVMe SSD RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

Dell XPS 2 in 1 2019, 32GB, 1TB, 4K / GPD Win 2


Synology has... nearly everything. After the initial pains you could also just ssh onto the machine and use mc (Midnight Commander) to sort your folders, or write a script that will do the things you want however you want. It's local access to data so it shouldn't be a problem.
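
A minimal sketch of what such a script might look like, in Python. The extension-to-folder mapping and any paths are hypothetical, and whether Python is available on a given DSM install is an assumption, so it may be easier to run this from a computer with the share mounted.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from file extension to destination subfolder.
RULES = {".mkv": "video", ".mp4": "video", ".mp3": "music", ".flac": "music"}


def reorganise(source: Path, dest: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Plan (and optionally perform) moves of files under source into dest/<category>/."""
    moves = []
    # sorted() materialises the listing before anything is moved.
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        category = RULES.get(path.suffix.lower())
        if category is None:
            continue  # leave unrecognised files where they are
        target = dest / category / path.name
        if target.exists():
            continue  # never overwrite; name clashes need manual attention
        moves.append((path, target))
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
    return moves
```

Running it with `dry_run=True` first returns the planned moves without touching anything, which seems the safer default for a one-off reorganisation.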

 

Screen from browser interface on some Synology Server:

[Screenshot of the Synology DSM browser interface]

Posted · Original Poster
1 minute ago, Kilrah said:

Yes DSM has a file explorer.

Just launch a transfer to copy everything across, let it spend its 15 hours then reorganize.

You can also just mount the share on a computer and reorganize from there so as not to need to use the DSM desktop in a browser.

 

 

1 minute ago, Loote said:

Synology has... nearly everything. After the initial pains you could also just ssh onto the machine and use mc (Midnight Commander) to sort your folders, or write a script that will do the things you want however you want. It's local access to data so it shouldn't be a problem.

 

Screen from browser interface on some Synology Server:

[Screenshot of the Synology DSM browser interface]

Excellent, thank you both. Then I guess I will rsync everything and reorganise after everything is copied.



Yeah, this is one of the pains that come with switching file systems. If the Synology uses Btrfs you could possibly set up a server with both ZFS and Btrfs, mount the ZFS pool, set up the new Btrfs pool, copy the files over internally, then unmount the Btrfs pool, move the drives over to the NAS and re-mount.

 

Though you mentioned using USB so I can't see it'd go very quickly. You'd need to shuck them to have it go very quickly. Otherwise I'm pretty sure over the network is your only other option. Have 10Gig at your disposal?


Guides & Tutorials:

How to Format Storage Devices in Windows 10

A How-To: Drive Sharing in Windows 10

VFIO GPU Pass-though w/ Looking Glass KVM on Ubuntu 19.04

A How-To Guide: Building a Rudimentary Disk Enclosure

Three Methods to Resetting a Windows Login Password

A Beginners Guide to Debian CLI Based File Servers

A Beginners Guide to PROXMOX

How to Use Rsync on Microsoft Windows for Cross-platform Automatic Data Replication

 

Guide/Tutorial in Progress:

A Beginners Guide to Servers

 

In the Queue:

[Taking Suggestions]

 

Don't see what you need? Check the Full List or *PM me, if I haven't made it I'll add it to the list.

*NOTE: I'll only add it to the list if the request is something I know I can do.

13 hours ago, Windows7ge said:

Yeah, this is one of the pains that come with switching file systems. If the Synology uses Btrfs you could possibly set up a server with both ZFS and Btrfs, mount the ZFS pool, set up the new Btrfs pool, copy the files over internally, then unmount the Btrfs pool, move the drives over to the NAS and re-mount.

 

Though you mentioned using USB so I can't see it'd go very quickly. You'd need to shuck them to have it go very quickly. Otherwise I'm pretty sure over the network is your only other option. Have 10Gig at your disposal?

I think that you're going to be pretty much stuck and that 10 GbE isn't really going to help you much either because you're going from what I presume to be a single drive to two drives, presumably in a RAID0/striped array/pool.

 

The caution with that, of course, is that you are at risk of losing all of your data if one of the drives fails.

 

But yeah, give it about a day or two to transfer everything with rsync.

 

Sorry.

 

*sidebar*

If your new Synology DS218+ has a little bit more powerful processor, what you MIGHT want to consider doing is packing your files into a 7-zip archive with little to no compression (since if you're moving video files, it's not really going to compress much anyways, and since the old NAS was running off of a RPi, it might have limited computing power to be able to perform the compressions quickly enough anyways).

 

The idea with this is that you can move the 7-zip archive file as one pretty much contiguous block/file so that you can max out your transfer speed longer. A lot of times, when you are moving files over, even with rsync, "opening" and "closing" files means that you aren't able to run at line rate, and the slow down when moving different files can eventually be potentially significant enough that it can add almost an entire day to the transfer operation vs. packing it into an archive file, and moving the whole giant thing at once, and unpacking it at the destination.

 

I deal with this a fair bit when I am writing my data to LTO-8 tape because the tape requires sequential writes, so it's a LOT better for me to pack files into archives and write the archives to tape (beyond tarballs); except that in my case, a) I can have a lot of really small files and b) I can also use my 4-node cluster to help me pack up the data into archives.

 

Just a thought for you as well that might be able to help you speed the transfer up a little bit. It'll take a little bit of time to create the archive that will store the individual files, but it might be worth it for you, in the end. You can run a shorter test to see if that would give you any sort of time savings vs. just rsyncing the data over without it being in an archive.


Making an archive would be stupid. It might speed up the transfer a tiny bit, but then decompressing on the target device will mean reading and writing the entirety of the data from/to the same drives, which will take longer than the transfer itself.


Posted · Original Poster

Update guys, everything arrived about an hour ago (at 7am), NAS is up and running now and as it happens I was worrying over nothing because Disk Station supports mounting CIFS Folders directly. I still have to do things folder by folder (that's a limitation of OMV as you cannot access the root of the pool as a share) but it means I can do it directly on the NAS and not have to leave my PC running for 15 hours.

 

2 hours ago, alpha754293 said:

I think that you're going to be pretty much stuck and that 10 GbE isn't really going to help you much either because you're going from what I presume to be a single drive to two drives, presumably in a RAID0/striped array/pool.

 

The caution with that, of course, is that you are at risk of losing all of your data if one of the drives fails.

 

But yeah, give it about a day or two to transfer everything with rsync.

 

Sorry.

 

*sidebar*

If your new Synology DS218+ has a little bit more powerful processor, what you MIGHT want to consider doing is packing your files into a 7-zip archive with little to no compression (since if you're moving video files, it's not really going to compress much anyways, and since the old NAS was running off of a RPi, it might have limited computing power to be able to perform the compressions quickly enough anyways).

 

The idea with this is that you can move the 7-zip archive file as one pretty much contiguous block/file so that you can max out your transfer speed longer. A lot of times, when you are moving files over, even with rsync, "opening" and "closing" files means that you aren't able to run at line rate, and the slow down when moving different files can eventually be potentially significant enough that it can add almost an entire day to the transfer operation vs. packing it into an archive file, and moving the whole giant thing at once, and unpacking it at the destination.

 

I deal with this a fair bit when I am writing my data to LTO-8 tape because the tape requires sequential writes, so it's a LOT better for me to pack files into archives and write the archives to tape (beyond tarballs); except that in my case, a) I can have a lot of really small files and b) I can also use my 4-node cluster to help me pack up the data into archives.

 

Just a thought for you as well that might be able to help you speed the transfer up a little bit. It'll take a little bit of time to create the archive that will store the individual files, but it might be worth it for you, in the end. You can run a shorter test to see if that would give you any sort of time savings vs. just rsyncing the data over without it being in an archive.

Archiving 4TBs of data on a Ras Pi would take multiple times longer (without the transfer time of the archive included) than just transferring everything uncompressed. Then I would also have to decompress at the other end.

 

Anyway once everything is migrated I'm gonna kill the USB drive, reformat it using Btrfs and create a second pool which will be a backup of my important files since I'm using RAID 0 which has no redundancy.


17 hours ago, Master Disaster said:

Update guys, everything arrived about an hour ago (at 7am), NAS is up and running now and as it happens I was worrying over nothing because Disk Station supports mounting CIFS Folders directly. I still have to do things folder by folder (that's a limitation of OMV as you cannot access the root of the pool as a share) but it means I can do it directly on the NAS and not have to leave my PC running for 15 hours.

 

Archiving 4TBs of data on a Ras Pi would take multiple times longer (without the transfer time of the archive included) than just transferring everything uncompressed. Then I would also have to decompress at the other end.

 

Anyway once everything is migrated I'm gonna kill the USB drive, reformat it using Btrfs and create a second pool which will be a backup of my important files since I'm using RAID 0 which has no redundancy.

Remember, you're not COMPRESSING the data on the RPi4. You're only "storing" the data (i.e. zero or no compression), which if you're using the 7z Linux command line tool, would be akin to:

7z a -t7z -m0=Copy -mx=0 ...

 

That will create a 7-zip archive file, but because you're not actually going to be compressing the data, it will be able to create it about as fast as a RPi4 can, whatever that works out to be.

Like I said, you can test it out to see if it would be of any benefit to you, as a suggestion.

Also like I said, I do this for when I am transferring data to tape because writing a ton of really small files to LTO-8 tape makes it such that instead of writing at say 200-250 MB/s, I would be writing to the tape at like 30 kB/s instead, which, writing 12 TB of data to tape would take FOREVER.

YMMV. You can try it. If it works out that you don't save any time, then you can skip that. But if it can help you save time, again, it's your call whether you want to do that or not.

My main server is about 80 TB, and it was painful to move that much data around when I migrated to my new server. 4 TB isn't as bad, but if there are ways to speed it up, why not?

1 hour ago, alpha754293 said:

Remember, you're not COMPRESSING the data on the RPi4. You're only "storing" the data (i.e. zero or no compression), which if you're using the 7z Linux command line tool, would be akin to:

Doesn't particularly matter here. Most of the files in question sound like video files, so they are already large and will be going as fast as they can for the majority of the time; only an extremely small portion will be metadata/file table updates. If it were a case of millions of small files (less than 10 MB each) it might help, but you're still talking about a 3-step process, so it has to be at least 4 times faster to even bother considering it.
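
To put rough numbers on why large files are already close to the best case, here is a simple model in which every file pays a fixed overhead (open/close, metadata updates) on top of its raw transfer time. The 100 MB/s line rate and 0.05 s per-file overhead are made-up illustrative values, not measurements.

```python
def effective_rate(file_size_mb: float, line_rate_mb_s: float, overhead_s: float) -> float:
    """Average MB/s for one file that pays a fixed per-file overhead."""
    transfer_s = file_size_mb / line_rate_mb_s
    return file_size_mb / (transfer_s + overhead_s)


if __name__ == "__main__":
    LINE_RATE = 100.0  # MB/s, assumed
    OVERHEAD = 0.05    # seconds per file, assumed
    for size in (0.1, 10, 2000):
        rate = effective_rate(size, LINE_RATE, OVERHEAD)
        print(f"{size:>7} MB file -> {rate:.1f} MB/s")
```

In this model a 2 GB video file runs within a fraction of a percent of line rate, while a 100 kB file manages only a couple of MB/s, which is why the file-size mix matters more than the archive step itself.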

11 hours ago, leadeater said:

Doesn't particularly matter here. Most of the files in question sound like video files, so they are already large and will be going as fast as they can for the majority of the time; only an extremely small portion will be metadata/file table updates. If it were a case of millions of small files (less than 10 MB each) it might help, but you're still talking about a 3-step process, so it has to be at least 4 times faster to even bother considering it.

I tend to think of it, a tad, in the opposite way.

 

If the total time for me to send the data over is x for any given y amount of data (be it in bytes, or number of files or both), and storing the files into an archive and sending it over is < x; then it's still a time savings nevertheless.

 

Therefore, if the effort to do that is relatively low (which, again, you can script), then it's still a time savings in the end.

 

Conversely, if you don't really care about how fast you can make this happen, and that say like you don't particularly care whether it's going to take you 47 hours vs. 48 hours to transfer 4.2 TB of data over, then if time is on your side, then sure, you can just let it do its thing.

 

Like I said, in my case, moving 80 TB of data and/or writing 12 TB to tape, the difference is substantial.

 

Some people might think that a time saving is still a time saving.

 

Others might look at 47 hours vs. 48 hours as being identical, and therefore; it doesn't matter to them.

 

To each, their own.

45 minutes ago, alpha754293 said:

If the total time for me to send the data over is x for any given y amount of data (be it in bytes, or number of files or both), and storing the files into an archive and sending it over is < x; then it's still a time savings nevertheless.

But it's not. Creating an archive of 4TB of data not only will need another 4TB free space on the source drive, it will take the time to copy 4TB from a drive to itself, which is more than twice as long as just copying it to another drive...

And that's not to mention the other device also has to be at least twice as big as the amount of data, and will also need to copy all the data to itself.

 

Literally you're talking of more than twice the time on the source, more than twice the time on the destination, plus the copy.

So you take about 5x the time of the intended transfer in total to save maybe 20% on the actual transfer... not to mention the free space requirements. How on earth can you see that as a time saving?
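
The estimate above can be written out as a small model. The 2x same-drive penalty and the 20% transfer saving are the rough figures from this post; treating a drive-to-drive copy as taking about as long as the plain network transfer is an added simplifying assumption.

```python
def sequential_archive_hours(
    direct_hours: float,
    same_drive_penalty: float = 2.0,  # reading and writing on the same drive
    transfer_saving: float = 0.2,     # fraction saved by sending one big file
) -> float:
    """Total hours when packing, sending and unpacking run one after another."""
    pack = direct_hours * same_drive_penalty
    send = direct_hours * (1 - transfer_saving)
    unpack = direct_hours * same_drive_penalty
    return pack + send + unpack


if __name__ == "__main__":
    direct = 15.0  # hours for the plain copy, the figure used earlier in the thread
    total = sequential_archive_hours(direct)
    print(f"direct: {direct:.0f} h, archive route: {total:.0f} h ({total / direct:.1f}x)")
```

With these inputs the archive route comes out at a little under five times the plain copy, in line with the "about 5x" above. It only models the strictly sequential case.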

 

Also.. Your whole point is "saving some time on accessing small files". Guess what, these small files will now have to be accessed twice at different times when archiving and unarchiving instead of simultaneously, which means the loss of time due to them is now twice what it was.

 

Your tape scenario doesn't even match the current case anywhere near significantly.


4 hours ago, alpha754293 said:

If the total time for me to send the data over is x for any given y amount of data (be it in bytes, or number of files or both), and storing the files into an archive and sending it over is < x; then it's still a time savings nevertheless.

Archive creation: 1 hr

Transfer: 1 hr

Decompression: 1 hr

 

Straight copy target time: 3 hrs

 

If the straight copy takes 3 hrs or less then the entire process is pointless and slower than the straight copy, that's the point. What you're suggesting has to actually be quicker. Random example figures used of course, but remember it has to actually be faster, and large video files are already the best case for network transfers and even copy to tape, so there is little need to do anything.

 

Tape isn't being used here.

On 3/1/2020 at 9:58 AM, Kilrah said:

But it's not. Creating an archive of 4TB of data not only will need another 4TB free space on the source drive, it will take the time to copy 4TB from a drive to itself, which is more than twice as long as just copying it to another drive...

And that's not to mention the other device also has to be at least twice as big as the amount of data, and will also need to copy all the data to itself.

 

Literally you're talking of more than twice the time on the source, more than twice the time on the destination, plus the copy.

So you take about 5x the time of the intended transfer in total to save maybe 20% on the actual transfer... not to mention the free space requirements. How on earth can you see that as a time saving?

 

Also.. Your whole point is "saving some time on accessing small files". Guess what, these small files will now have to be accessed twice at different times when archiving and unarchiving instead of simultaneously, which means the loss of time due to them is now twice what it was.

 

Your tape scenario doesn't even match the current case anywhere near significantly.

Your first point is true if and only if you're copying or moving a contiguous 4.2 TB file, which I highly doubt you are.

 

You can literally store, send, and unpack, as a sequence of parallel tasks, with the exception of the very first file.

 

In other words, unless you actually have a contiguous 4.2 TB file that you're moving, ANYTHING other than that, you can store and once that store operation is complete, start sending the data over to your new server and start packing your second (set of) file(s).

 

I don't know where you got the idea that you have to store ALL of the data into a SINGLE file, before transferring it, when you can store it by subfolders, and once the store operation is complete, start sending the data over to your destination, whilst your source is working on storing the next (set of) file(s).

 

I have no idea what and/or how you have set up your folder structure, but suppose that you have four folders where each one is approximately 1.05 TB worth of data.

 

You can store that into an archive, and once that is done, start sending that 1.05 TB archive over before you start storing the next batch of 1.05 TB worth of data and when that's done, start sending that over, and if the first send has already completed, you can start unpacking the data.

 

No one is saying that you have to do this.

 

But I can also tell you that I do this all the time because it's between 8-10 times faster for me to do it this way, and I've moved around 30 TB worth of data over, doing it this way. (Normally, it takes me about 14 hours to copy about 10 TB of data that's been packed into an archive over (at around 200 MB/s). If I DON'T pack my data, then it'll take somewhere around 120+ hours. I've never actually timed it to let it run because taking 5+ days to write the same volume of data is just ridiculous, so I will abort the write task once it has run long enough for me to get some sense of what is the ballpark amount of time that I would be looking at.)

 

No one is saying that you have to do this.

 

If you don't want to, then don't.

It's just an idea that might save you some time and you can parallel task it rather than trying to move all 4.2 TB of data as a single, serial task.

 

Do whatever you want.

 

If your files and the size of your files are such that you can maximize your drive's sustained sequential transfer rates, and they're large enough that your average transfer speed is > 90% of your line rate, then don't do it.

 

But if your files and the probability density function of the size of your files is such that it doesn't lend itself to running at or > 90% of your line rate, then it's up to you to test to see whether this idea will actually work for you. Or don't.

 

Do whatever you want.

 

I've tested it with my systems. I get between a 8-10x time savings with this method. If you tested this idea and it doesn't save you any time because you've actually tested it, that's fine. If you don't want to test it, that's fine too. You don't want to use the idea, then don't.

 

Do whatever you want.

On 3/1/2020 at 2:14 PM, leadeater said:

Archive creation: 1 hr

Transfer: 1 hr

Decompression: 1 hr

 

Straight copy target time: 3 hrs

 

If the straight copy takes 3 hrs or less then the entire process is pointless and slower than the straight copy, that's the point. What you're suggesting has to actually be quicker. Random example figures used of course, but remember it has to actually be faster, and large video files are already the best case for network transfers and even copy to tape, so there is little need to do anything.

 

Tape isn't being used here.

The problem with looking at it in terms of time is because the time-based calculation method assumes that each (set of) task(s) is sequential.

In other words, your inherent assumption is that you're trying to store all 4.2 TB into one file, and then sending over just that one file, and then unpacking only just that one file.

 

But there is nothing that says that you need to nor have to do it that way.

 

The way that I do things, since I have a folder structure and the data isn't just dumped into a single folder, nor am I working with a single, large, contiguous file, is to parallelise the entire process by breaking it up and creating multiple repeats of the same set of tasks by folder.

 

In other words, the ONLY task that I CANNOT parallelise is the storing of the data from the first folder into an archive file. That's it.

 

Once the first folder has been stored into an archive, I can start sending that archive file over to the destination before moving on to storing the second folder into an archive and preparing that for transport.

 

And depending on the transfer rates, processing power, and performance of the hard drive(s), if the transport of the first file is complete before the second file is sent, then I can already start unpacking the first archive file on the destination whilst the third folder is being stored into an archive and the second file is being sent over.

 

All of these operations run in parallel. A summation-based way of accounting for the time, when working out whether this is beneficial to the OP or not, doesn't properly account for the parallel nature of how you are breaking up a "big" problem into a series of "smaller" problems/chunks, so that you can perform each of the smaller problems/chunks in parallel as opposed to working on one large problem, sequentially.

 

This way, I am maximising the resource utilization on the source by having it store a folder into an archive, AND transfer another archive to the destination simultaneously just as I am also maximising the resource utilization on the destination by having it receive an archive file, whilst unpacking a previous file at the same time.
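
The difference between summing the stages and overlapping them can be sketched with a simple three-stage pipeline model. It assumes each stage (pack, send, unpack) handles one chunk at a time and, optimistically, that concurrent stages do not slow each other down, which may not hold when packing and sending both hit the same single drive. The per-chunk durations are made-up example values.

```python
Chunk = tuple[float, float, float]  # (pack_h, send_h, unpack_h) for one folder


def sequential_hours(chunks: list[Chunk]) -> float:
    """Total hours if every step of every chunk runs strictly one after another."""
    return sum(sum(chunk) for chunk in chunks)


def pipelined_hours(chunks: list[Chunk]) -> float:
    """Wall-clock hours when pack, send and unpack overlap across chunks."""
    stage_free = [0.0, 0.0, 0.0]  # time at which each stage next becomes idle
    for chunk in chunks:
        ready = 0.0  # time at which this chunk has left the previous stage
        for stage, duration in enumerate(chunk):
            start = max(ready, stage_free[stage])
            ready = start + duration
            stage_free[stage] = ready
    return stage_free[-1]


if __name__ == "__main__":
    folders = [(1.0, 1.0, 1.0)] * 4  # four folders, one hour per step (assumed)
    print("sequential:", sequential_hours(folders), "h")
    print("pipelined: ", pipelined_hours(folders), "h")
```

Whether the pipelined figure beats a plain copy still depends on the real per-stage rates, so this only illustrates the accounting, not the outcome on any particular hardware.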

 

The biggest question is if you were to do a straight copy, what's the average data transfer rate that the hardware can sustain?

 

If your single drive is bottlenecked to a STR of 40 MB/s, then this probably won't do anything in terms of speeding things up for you.

 

But if your single drive is "bottlenecked" to a STR of 175 MB/s, then it likely will, depending on what data storage rate your RPi4 can sustain.

 

e.g. My Buffalo TeraStation 4 has a crappy processor, so it can't do more than about 30 MB/s (with four drives in RAID5) whilst my Qnap TS-453Be has a better Celeron J3455 processor in it, so it's actually able to sustain around 200 MB/s (via its dual GbE NICs/ports), but the array itself is capable of much more.

 

I'll put it to you this way -- when you have to migrate an 80 TB server, you have to find really creative ways to do that very quickly and efficiently, because in the best case scenario for a straight-up copy over a single GbE NIC, you're looking at around 10 days' worth of just straight, pure transit time. And with a straight copy, I also know that I wouldn't be able to keep the single GbE NIC pegged at max line rate, which is why and how I came up with this idea/method when I migrated my 80 TB server over. (And then I had to essentially repeat almost the same process when I migrated my failover server, which was a 60 TB server.) So between the two, I had moved somewhere around 100-120 TB worth of data.
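As a rough check on the transit-time figure, the sketch below works the arithmetic under two assumptions that are not stated in the post: decimal units (1 TB = 10^12 bytes) and a raw GbE line rate of 125 MB/s with no protocol overhead.

```python
TB = 10**12                 # decimal terabyte in bytes (assumption)
MB = 10**6                  # decimal megabyte in bytes
GBE_LINE_RATE = 10**9 / 8   # 1 Gbit/s expressed in bytes/s, i.e. 125 MB/s


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 86_400


def rate_for_days(total_bytes, days):
    """Constant rate in bytes/s needed to move total_bytes in the given days."""
    return total_bytes / (days * 86_400)


print(round(transfer_days(80 * TB, GBE_LINE_RATE), 1))  # 7.4
print(round(rate_for_days(80 * TB, 10) / MB, 1))        # 92.6
```

So at the theoretical line rate, 80 TB is about 7.4 days, and a 10-day figure corresponds to sustaining roughly 93 MB/s, which is in the range a single GbE link tends to deliver in practice.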

 

I don't use time as the basis of my calculations to see whether this will work, because with parallelisation the sum of the individual task times can be greater than the total wall clock time.

Instead, I use effective transfer rates to calculate the effectiveness (or lack thereof) of something like this.
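As an illustration of what an effective-rate comparison looks like, this sketch uses the figures given earlier in the thread (about 10 TB in about 14 hours when packed, 120+ hours when not). Decimal units are an assumption, and the function name is just for illustration.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)


def effective_rate_mb_s(total_bytes, wall_clock_hours):
    """Effective transfer rate in MB/s: data moved divided by wall-clock time."""
    return total_bytes / (wall_clock_hours * 3600) / 10**6


packed = effective_rate_mb_s(10 * TB, 14)     # archives sent over
unpacked = effective_rate_mb_s(10 * TB, 120)  # loose files sent over

print(round(packed, 1))             # 198.4
print(round(unpacked, 1))           # 23.1
print(round(packed / unpacked, 1))  # 8.6
```

The ratio of about 8.6 lines up with the "8-10 times faster" quoted for that particular setup; it says nothing by itself about what the OP's hardware would achieve.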

Link to post
Share on other sites
32 minutes ago, alpha754293 said:

I don't know where you got the idea that you have to store ALL of the data into a SINGLE file

In your post where you mention packing the files into A 7-zip archive and transferring THE 7-zip archive.

 

32 minutes ago, alpha754293 said:

But I can also tell you that I do this all the time because it's between 8-10 times faster for me to do it this way, and I've moved around 30 TB worth of data over, doing it this way. (Normally, it takes me about 14 hours to copy about 10 TB of data that's been packed into an archive, at around 200 MB/s. If I DON'T pack my data, then it'll take somewhere around 120+ hours.)

Yeah, because your particular scenario has a tape drive that deals horribly with lots of files, and a fast local filesystem. So you've got a system where transferring individual small files to the drive is maybe 100 times slower than your local storage is able to copy files into an archive.

But in a scenario like the OP's, with hard drives on both ends and a gigabit network in between, the balance is literally reversed: the transfer even of small files is faster than a single local copy.

So your experience and recommendation are entirely inapplicable to the scenario at hand. You're basing a recommendation on something that's completely different, and you're not able to see that.


Desktop: i7-5960X 4.4GHz, Noctua NH-D14, ASUS Rampage V, 32GB, RTX2080S, 2TB NVMe SSD, 2x16TB HDD RAID0, Corsair HX1200, Thermaltake Overseer RX1, Samsung 4K curved 49" TV, 23" secondary

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB NVMe SSD RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

Dell XPS 2 in 1 2019, 32GB, 1TB, 4K / GPD Win 2

Link to post
Share on other sites
22 minutes ago, Kilrah said:

In your post where you mention packing the files into A 7-zip archive and transferring THE 7-zip archive.

 

Yeah, because your particular scenario has a tape drive that deals horribly with lots of files, and a fast local filesystem. So you've got a system where transferring individual small files is maybe 100 times slower than your local storage is able to copy files into an archive.

But in a scenario like the OP's, with hard drives on both ends and a gigabit network in between, the balance is literally reversed: the transfer even of small files is faster than a local copy.

So your experience and recommendation are entirely inapplicable to the scenario at hand. You're basing a recommendation on something that's completely different.

 

 

Yes, but this is why I specifically stated storing the data/files into an archive vs. compressing the data/files into the archive, as noted here:

 

On 2/29/2020 at 9:10 PM, alpha754293 said:

Remember, you're not COMPRESSING the data on the RPi4. You're only "storing" the data (i.e. zero or no compression), which if you're using the 7z Linux command line tool, would be akin to:

7z a -t7z -m0=Copy -mx=0 ...

 

You can write data/files into archive files WITHOUT compressing them.

 

I think that in 7-zip (File Manager) and also in WinRAR, that's just called "storing" the data.

 

This means that no CPU cycles are expended on actually trying to compress the data using ANY compression algorithm, at which point, your bottleneck is going to be how fast can your RPi4 store the data/files into the archive without compressing said data/files.

 

If your RPi4 is capable, and your single hard drive has a STR of say 150 MB/s, then so long as your RPi4 can sustain it, you can store the data/files into the 7-zip archive file (extension *.7z) at the 150 MB/s rate. 
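For anyone who would rather script the "store, don't compress" step, here is a minimal Python sketch of the same idea. It is not the 7z command above: it swaps in the standard library's `zipfile` module with `ZIP_STORED`, and the function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path


def store_folder(folder, archive_path):
    """Write every file under `folder` into a zip archive without compressing it.

    `archive_path` should live outside `folder`, otherwise the archive
    would try to include itself.
    """
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Keep paths relative to the folder so the archive unpacks cleanly.
                zf.write(path, arcname=path.relative_to(folder).as_posix())
    return archive_path
```

With `ZIP_STORED` the bytes are copied as-is, so the cost is the extra read and write of the data rather than any compression work.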

 

"But in a scenario like the OP's with hard drives on both ends and a gigabit network inbetween the balance is literally reversed, the transfer even of small files is faster than a local copy."

 

Try it.

 

Create 1 GB worth of files with sizes ranging from 0.1 kB to 100 kB and try copying them over GbE. Note your average transfer rate.

 

"Simple" math (i.e. not taking into account the binary/decimal difference in units) suggests that transferring 1 GB worth of data over a single GbE NIC should take about 10 seconds.

Therefore, if you have 1 GB worth of files with sizes ranging from 0.1 kB to 100 kB, if the total time that it takes to transfer those files is > 10 seconds, then you can see the impact/penalty that trying to copy a bunch of small files will have on the effective transfer rate. It's amazing what you learn when you have to deal with 1.5 MILLION files in just a single 1.3 TB folder alone (which results in a simple arithmetic average file size of 952 kB).
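A minimal sketch for building such a test set is below. The function names, file naming scheme and random contents are all assumptions for illustration. The last file may come out smaller than the minimum so that the total matches exactly.

```python
import os
import random
from pathlib import Path


def make_small_files(dest, total_bytes, min_size=100, max_size=100_000, seed=0):
    """Fill `dest` with random-content files of 0.1 kB to 100 kB until total_bytes is reached."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)
    written = 0
    count = 0
    while written < total_bytes:
        # Trim the final file so the total lands exactly on total_bytes.
        size = min(rng.randint(min_size, max_size), total_bytes - written)
        (dest / f"file_{count:07d}.bin").write_bytes(os.urandom(size))
        written += size
        count += 1
    return count


def ideal_seconds(total_bytes, link_bits_per_second=10**9):
    """Transfer time at the raw link rate, ignoring all overhead."""
    return total_bytes * 8 / link_bits_per_second


# Example (writes about 1 GB to disk, so it is left commented out):
# make_small_files("small_file_test", 10**9)
print(ideal_seconds(10**9))  # 8.0
```

At the raw line rate the ideal figure is 8 seconds, so "about 10 seconds" is a fair ballpark once overhead is allowed for. Time the copy of the generated folder with whatever tool you normally use and compare.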

 

"So your experience and recommendation are entirely inapplicable to the scenario at hand. You're basing a recommendation based on something that's completely different."

 

Does it hurt to try it?

 

If the answer is no, then why wouldn't you?

 

Again, you don't have to do anything that you don't want to.

 

If you want to do things just the one way that you know how, then by all means, do that.

 

But if you're willing to come up with and try new and different ways of doing the same thing, who knows, you might actually see some potential benefit to it (or not). The difference between just doing it the one way that you know how and trying a new or different way is that when someone asks you why it didn't work, you'd have a better response than "I don't know. Didn't try it."

 

Like in the case of my Buffalo TeraStation 441, I CAN'T do this because the processor that's in it is too slow to process the data fast enough to make this idea worthwhile, so I don't do it on that NAS. But on my Qnap NAS units, which have faster processors, I CAN implement this.

 

And also, oh by the way, yes the files are prepped to be written to tape, but it's also stationed/put "on deck" on my cluster headnode where the data is prepared prior to being written/sent to tape.

 

In other words, the data comes from my servers to the cluster headnode and is stationed there prior to being written to tape, which means that I still have the transfer from the server to the cluster headnode to deal with before it even goes on to tape. So my data actually has two hops. (My cluster headnode can read/write to a large scratch disk array (24 TB) at a sustained STR of around 500 MB/s; the bottleneck there is that I don't have enough physical drives to increase the STR.)

Link to post
Share on other sites

Again, unless your target is dead slow with small files like your tape drive, none of this matters, and any archiving, even without compression, only loses time.

Small file overhead isn't that big on HDDs, and write caching, if enabled, nearly eliminates it entirely in any modern OS.

 

  

22 minutes ago, alpha754293 said:

It's amazing what you learn when you have to deal with 1.5 MILLION files in just a single 1.3 TB folder alone

Yeah, but we couldn't care less about it here since the OP has media files, aka a smallish number of big files. Again, your recommendations do not relate in any way to the scenario.

 

  

22 minutes ago, alpha754293 said:

Does it hurt to try it?

 

If the answer is no, then why wouldn't you?

Because there is zero doubt that it's not going to help and will just be a waste of time.



Link to post
Share on other sites
2 hours ago, alpha754293 said:

The problem with looking at it in terms of time is because the time-based calculation method assumes that each (set of) task(s) is sequential.

In other words, your inherent assumption is that you're trying to store all 4.2 TB into one file, and then sending over just that one file, and then unpacking only just that one file.

 

But there is nothing that says that you need to nor have to do it that way.

So try and create 10 zip archives using a script, then copy and then extract: how exactly does this make it faster? You know you can only go as fast as the HDD on the source computer, right?

 

Parallel operations will only go as fast as the hardware allows.

 

It doesn't matter how you slice it, my point is accurate and correct. What you suggest has to actually be faster or it's pointless. It's a 3 step process and multiple archives don't change that; you're just multiplying the number of 3 step processes that you can try and run in parallel, but you will be hardware limited, so the net result is little to no gain.

 

At the end of the day, total completion time is all that matters; nothing else does.

 

This situation was about large video files, hard disk to hard disk, over a network, and finding the most optimal way to do it both time-wise and operationally, and it's been answered. Creating archives doesn't help here at all.

2 hours ago, alpha754293 said:

I'll put it to you this way -- when you have to migrate an 80 TB server

So. I've managed 2500 LTO-5 tapes as well as 8PB of on disk backup storage. Mentioning this has no applicability to the question that was asked.

Link to post
Share on other sites
