queue length and max disk usage

biotoxin · May 29, 2016

I've been running backup storage servers for the oil fields for a few months now, and a new client is overwhelming my system. In the meantime I've limited them while I make adjustments so they don't end up crashing everything, but I don't know how to adapt to their traffic. When they perform their backups on my system they run every single array to 100% disk usage and build disk queue lengths up over 160, which stay there until long after they're done. That causes everything else in the system to hang, and in one case it crashed one of the servers, forcing them over to the secondary. Right now I've limited their traffic to one array of disks. Each array is z2 vdev8 with a 250gb ssd cache drive.

I've considered trying raid0 ssd's as cache, and what basically amounts to a ram drive, in order to accommodate intensive requests, but I'm not even sure those would work yet. I can't determine if it's anything on their end other than pushing too much at once, but I'm also quickly exhausting testing on my side to see where the problem is actually occurring. The files they're sending are roughly 2-4gb each in storage containers (zip, rar, etc), and there's plenty of capacity left in each array I've tested, which was also part of a check to see if any drives might be going. The servers themselves are running CentOS. I've also confirmed it only happens when one specific client is connected, which leads me to believe that it has something to do with how they're transmitting the data, but in simulated transmissions I've tested between the two servers I have up I can't replicate the problem.

I can't make sense of what is going on or why. I've confirmed it happens on all arrays on both servers and continues long after the client has disconnected bringing me back to believing it's on my end somehow. From what I can tell it takes roughly 40 minutes for them to perform a complete backup of all their data to my system and the problem persists anywhere from 30 minutes after they're done to 12 hours in one case, with the average being 2 hours. If I could replicate the problem I'd experiment with solutions like dropping the entire array and re-adding or more forceful tactics in the name of experimenting but until I can duplicate it I can't risk losing their data.

I'd try talking with their guy, but when you can reach him it's either a runaround or talking him through every step. Getting details on their settings was like pulling out teeth and hair at the same time. Do I just need more/better hardware or did I seriously screw something up somewhere? Are the servers secretly trolling me?

Mark77 · May 29, 2016

RAID-5/6 take a pretty significant performance penalty with writing. Certain Unix software will cache data in RAM for a pretty long time until the buffers themselves are flushed. From my work with the 3.18.x and subsequent kernels, I am pretty sure there's a major Linux I/O bug in kernels subsequent to 3.18.16 (I do some moderately high I/O stuff with Linux as well!!) which persists even in the latest Kernel 4.6 code (I really, really need to start harassing some kernel devs!!).

One other question, just how do you have the devices mounted? In some cases, you can try mount options that relax certain metadata collection/updating procedures that probably are irrelevant to a backup server. In ext4, the noatime option is highly recommended as a mount option, for example, in /etc/fstab.
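To see which atime behaviour a mount currently has, the options column of /proc/mounts can be checked. This is only a minimal sketch, assuming the usual Linux layout of that file (device, mount point, type, options); the function name, the /backup mount point and the sample line are made up for illustration.

```python
def atime_mode(mounts_text, mount_point):
    """Return 'noatime', 'relatime', 'strictatime' or 'default' for a mount point.

    Returns None when the mount point is not listed.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        for mode in ("noatime", "relatime", "strictatime"):
            if mode in options:
                return mode
        return "default"
    return None


# Hypothetical sample line; on a real box use open("/proc/mounts").read()
sample = "/dev/sdb1 /backup ext4 rw,relatime,data=ordered 0 0\n"
print(atime_mode(sample, "/backup"))
```

If this reports relatime, switching to noatime in /etc/fstab is the change being suggested above.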

Private Message me if you would like to send more details and/or would like an on-site consultation (I might not actually be all that far from you!). Unix systems knowledge on this forum is quite limited. There's a limit to what I can do without compensation, but if we have a similar issue, maybe the solution is similar.

leadeater · May 29, 2016

What is the client using to send that backup data? Backup software e.g. Commvault or just scheduled file copies?

How many devices is the client using to send data? Sounds more like too many data streams causing too much random disk access rather than a pure throughput problem, which would explain a little bit why you can't reproduce the issue. I know I have to be careful scheduling when backups run at work and limit data streams.

Also I agree with @Mark77, sounds like RAM caching. You must have a fair amount though for it to take an average of 2 hours to clear it.
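One way to check the RAM caching theory is to watch how much data is waiting to be written back while the problem is happening. A rough sketch, assuming the backlog shows up in the Dirty and Writeback lines of /proc/meminfo (it may not if the filesystem keeps its own write cache); the sample numbers and the 60 MiB/s drain rate are invented purely to show the arithmetic.

```python
def dirty_kib(meminfo_text):
    """Return (Dirty, Writeback) in KiB parsed from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values.get("Dirty", 0), values.get("Writeback", 0)


def drain_minutes(backlog_kib, drain_mib_per_s):
    """Rough time to flush a backlog at a steady drain rate."""
    return backlog_kib / 1024 / drain_mib_per_s / 60


# Hypothetical sample; on a real box use open("/proc/meminfo").read()
sample = "MemTotal: 528000000 kB\nDirty: 7372800 kB\nWriteback: 1024 kB\n"
dirty, writeback = dirty_kib(sample)
print(dirty, writeback, drain_minutes(dirty, 60))
```

A backlog that large draining that slowly would point at the disks, not the network, as the thing to look at.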

biotoxin · May 30, 2016

I have dell r910's, and I'd had half of the risers populated with 16gb dimms, well today after a phone call checking between kingston and dell and getting nowhere I've gone full stupid and pulled all the ram from the secondary and put it in the primary to see if that helps, and got lucky on ebay so I've got another 512gb on the way to repopulate the secondary, meaning to say I now have 1tb of ram in the primary up from 512.

dell won't say that the r910's can handle 32gb dimms but they don't adamantly say they don't either, I've got 16gb dimms coming but I can't help but wonder about what a 2tb of ram system would feel like? and what would that be good for anyway?

I've mounted as NFS and kerberos for encryption with sync weekly and scrubs first sunday of the month. Clients are redirected through a separate firewall (also redhat but not centos specific) that handles the redirect of incoming data and access but otherwise doesn't interfere with legit traffic. I basically tried to make it as simple as possible for any clients to set up and use. sshfs didn't seem viable only because I couldn't figure out file locking between users, otherwise I might have tried that.

mount itself is ext4 (rw, nouser, noexec, async, auto, relatime, uid=gid=#, sec=krb5p, data=ordered, port=#)

I had rsize and wsize when this first started but they were my first guesses in trying to figure this out, and given that people don't need the extra speed for a backup I've decided to leave them out. Tried sync instead of async but had to add soft and an error timer for timeouts, and after a dozen calls I realized that if this somehow solves my problem I've got bigger problems. Might try without specifying at all? The firewall handles most preventative security measures like stopping users from setting their uid or executing anything, so it may be redundant to mount with things like noexec, but that shouldn't impact performance, much less cause my problem. I've also considered dropping data=ordered but haven't actually tested yet. I can try noatime instead of relatime though not sure what to expect?

In terms of location I'm in the northwestern part of North Dakota officially, but I go as far east as Bismarck on calls for my second job, Fargo if the pay is good enough. I'm pretty much at the Montana/Canada border though.

no clue how the client is sending the data though, gave them access information and setup their permissions, then whoever I worked with must've gotten fired or quit because I haven't heard from him since only this goober with apparently no server experience. I imagine they are sending from a centralized server given the information is a single container, but if they're sending it in super small packets that would make sense, I'll break out the network testing tools to see if I can get a better report on exactly what they're doing. I'll see if I can reach them for details on how many systems are sending data if I can't determine on my end.

Though I repeat that they're sending a single file, not a bunch of files, like a single dvd sized iso, which I can confirm as being entirely written to disk when they disconnect.

can't confirm anything like contents for obvious reasons but I will say most of my data backup is for surveyors telemetry and system reports for other companies, not that I try to specialize but I'm oil fields and that happens to be what they need to backup. I've guessed that it could be a camera feed dumping its recording for a rewrite disk, but I wouldn't know of course. This is clearly the oddball of my clients, if only for the fact they're putting files into a container file before sending them.

leadeater · May 30, 2016

2 hours ago, biotoxin said:

Though I repeat that they're sending a single file, not a bunch of files, like a single dvd sized iso, which I can confirm as being entirely written to disk when they disconnect.

Thanks for the clarification, wasn't sure if it was multiple files during their backup or not. Would be interested in what the issue was when you find it.

Personally I would be looking at some network QoS/bandwidth shaping rules, not so much to solve this issue but just as a general quality assurance measure. Plus you'll be able to limit the total input data rate to no more than the server can actually safely ingest.
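For anyone unfamiliar with how rate limiting of this kind usually works, the common building block is a token bucket. The sketch below is only an illustration of the idea in Python, not what a firewall or switch actually runs; the class name and the numbers are made up, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow `rate` bytes/s on average, with bursts of up to `capacity` bytes."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0

    def allow(self, nbytes, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


bucket = TokenBucket(rate=100, capacity=200)
print(bucket.allow(200, now=0.0))  # burst fits in the full bucket
print(bucket.allow(50, now=0.0))   # bucket is empty, request is refused
print(bucket.allow(50, now=1.0))   # one second later 100 bytes have refilled
```

Setting the rate to what the arrays can sustainably write, not what the network can carry, is the point of the suggestion.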

biotoxin · May 30, 2016

1 minute ago, leadeater said:

Thanks for the clarification, wasn't sure if it was multiple files during their backup or not. Would be interested in what the issue was when you find it.

Personally I would be looking at some network QoS/bandwidth shaping rules, not so much to solve this issue but just as a general quality assurance measure. Plus you'll be able to limit the total input data rate to no more than the server can actually safely ingest.

bandwidth hasn't really been a problem, gigabit network and gear, internally even higher than that. In real tests I've had my servers move around 800mb/s uncached, and if I exclusively move cached data I've seen 2gb/s. I imagine if I really tweaked and went full throttle I could push much higher with proper utilization between arrays, just for the sake of testing and saying I can do it. (maybe challenge linus in a video?)

what's the maximum you can squeeze from an intel x710? (4x 10gbe = 40gbe???) I'd have to check the switch to make sure it could handle it, and of course buy a few more connectors and lines but that won't be till july when I setup my next server.

I bet it's the nic come to think of it, I'll find my nic's that came with the bugger and swap out to see if that does it. Servers just do not like upgrades. Mean time I'm going to try the embedded for that specific client and see if it makes a difference.

biotoxin · May 30, 2016

did some numbers, came back

I'll totally throw down a challenge like that if I get lucky with the new server, going to need a server with pcie gen3 though

the nic and switch and such are all capable of more, but the interface between the nic and storage is pcie and in this case the nic being x8 means I'll quite literally hit the maximum data cap long before anything else.

right now pcie2x8 is still like 4gb/s (~roughly) so no actual problem for me there with what I have, but if I wanted to do a 40gbe showdown I'm going to need more
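The numbers behind that, as a rough sketch: PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding and PCIe 3.0 at 8 GT/s with 128b/130b. The function name is made up, and protocol overhead above the line encoding is ignored, so real throughput will be somewhat lower than what this prints.

```python
def pcie_gbit_per_s(gen, lanes):
    """Usable line rate in Gbit/s after encoding overhead only."""
    rate_gt, encoding = {2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}[gen]
    return rate_gt * encoding * lanes


for gen in (2, 3):
    gbit = pcie_gbit_per_s(gen, 8)
    print(f"PCIe {gen}.0 x8: {gbit:.1f} Gbit/s, {gbit / 8:.2f} GB/s, "
          f"enough for 40GbE: {gbit >= 40}")
```

So a gen2 x8 slot tops out around 32 Gbit/s, short of 40GbE, while a gen3 x8 slot has headroom.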

I've got to get into some high end data center and have my mind blown by these 100+gbe setups, I just have a hard time actually imagining what they're used for, even if I already know it's still incomprehensible

I bet google has some fancy setup where they test in tb/s and laugh at how slow it is

leadeater · May 30, 2016

2 hours ago, biotoxin said:

did some numbers, came back

I'll totally throw down a challenge like that if I get lucky with the new server, going to need a server with pcie gen3 though

the nic and switch and such are all capable of more, but the interface between the nic and storage is pcie and in this case the nic being x8 means I'll quite literally hit the maximum data cap long before anything else.

right now pcie2x8 is still like 4gb/s (~roughly) so no actual problem for me there with what I have, but if I wanted to do a 40gbe showdown I'm going to need more

I've got to get into some high end data center and have my mind blown by these 100+gbe setups, I just have a hard time actually imagining what they're used for, even if I already know it's still incomprehensible

I bet google has some fancy setup where they test in tb/s and laugh at how slow it is

When you're talking 40Gb networking, getting disk arrays, even SSD cached, to go that quickly is not a simple task. At work our distribution layer is 40Gb and so are our WAN links, but all our server connections are still 10Gb, 40Gb uplinks from TOR though.

During backups we pull about 2.5GB/s off our Netapp 8040 4 node cluster at our main site and that is basically the max it can do, controllers can do more but the underlying storage can't.

If you can sustain 40Gb for a long period of time I'll be impressed, single server of course.

biotoxin · May 30, 2016

1 hour ago, leadeater said:

When you're talking 40Gb networking, getting disk arrays, even SSD cached, to go that quickly is not a simple task. At work our distribution layer is 40Gb and so are our WAN links, but all our server connections are still 10Gb, 40Gb uplinks from TOR though.

During backups we pull about 2.5GB/s off our Netapp 8040 4 node cluster at our main site and that is basically the max it can do, controllers can do more but the underlying storage can't.

If you can sustain 40Gb for a long period of time I'll be impressed, single server of course.

now it really sounds like an exciting proposition to attempt

pm me with target goals, because with just simulated loads from ram cache I imagine it could be done

as to the problem at hand I've turned the system against itself when idle, it's now using 3/4 of all resources to try and bomb the remaining 1/4 in as many respects as possible, running all kinds of sims, I'll see how it's doing when I get in tomorrow morning to see if any reports show significant spikes or lags, looking for utilization, queue depth, timings, etc and if so under what conditions and settings
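For the queue depth part, the average queue length over an interval can be worked out from two snapshots of /proc/diskstats. A minimal sketch, assuming the standard field layout where the last of the first eleven per-device counters is the weighted time spent doing I/O in milliseconds; the function names and the sample lines for sda are invented.

```python
def weighted_ms(diskstats_text, device):
    """Return the weighted I/O time counter (ms) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # major, minor, name, then at least 11 counters
        if len(fields) >= 14 and fields[2] == device:
            return int(fields[13])
    raise KeyError(device)


def avg_queue_length(before, after, device, interval_s):
    """Average number of requests queued or in flight over the interval."""
    delta = weighted_ms(after, device) - weighted_ms(before, device)
    return delta / (interval_s * 1000.0)


# Hypothetical snapshots taken one second apart
before = "   8       0 sda 100 0 800 50 200 0 1600 900 3 400 1000\n"
after = "   8       0 sda 100 0 800 50 900 0 7200 150000 160 1400 161000\n"
print(avg_queue_length(before, after, "sda", interval_s=1.0))
```

Logging this per device every few seconds overnight would show which drives build the queue and under what load.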

still have to find the nic's, they weren't where they're supposed to be so I'm hoping they're at least where I hope they are.

if the embedded solves this or an original nic does then it's just a matter of settings and I'll be set.

I'm really really hoping this isn't a deep problem.

biotoxin · June 2, 2016

Been fiddling for a couple days with more settings and hardware, new ram showed up so putting that in the other server and bringing it back up today.

I'm going to try one more thing with some spare drives I have. I noticed most of my arrays are running same drive setups and I hadn't thought to check if maybe it's inherent to the drives, which it shouldn't be given they're enterprise drives designed for servers but who knows right? I've got a handful of premium HGST drives I wasn't planning on using just yet but I've got to know. Wish me luck.

leadeater · June 2, 2016

11 hours ago, biotoxin said:

Been fiddling for a couple days with more settings and hardware, new ram showed up so putting that in the other server and bringing it back up today.

I'm going to try one more thing with some spare drives I have. I noticed most of my arrays are running same drive setups and I hadn't thought to check if maybe it's inherent to the drives, which it shouldn't be given they're enterprise drives designed for servers but who knows right? I've got a handful of premium HGST drives I wasn't planning on using just yet but I've got to know. Wish me luck.

You don't have mixed sector sizes in the pool by any chance? 512n, 512e, 4k. I know in Windows Storage Spaces if you mix 512n with 512e without specifying that the pool should use 512b sector size things can be very bad, I don't use ZFS so not sure if that is a thing or not.

http://jeffgraves.me/2014/06/03/ssds-on-storage-spaces-are-killing-your-vms-performance/

Quick google and it seems there is at least something similar.

http://wiki.illumos.org/display/illumos/ZFS+and+Advanced+Format+disks

http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/

I don't think you have this problem but hey just in case plus it's interesting to know, I had this when I first started using storage spaces with SSD tiering.
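On Linux the two sizes for each drive are exposed in /sys/block/<dev>/queue/logical_block_size and /sys/block/<dev>/queue/physical_block_size, so a mixed pool is easy to spot. A small sketch of the classification, with made-up function names and a hypothetical set of drives:

```python
def sector_format(logical, physical):
    """Classify a drive from its logical/physical sector sizes in bytes."""
    if (logical, physical) == (512, 512):
        return "512n"
    if (logical, physical) == (512, 4096):
        return "512e"
    if (logical, physical) == (4096, 4096):
        return "4Kn"
    return f"other ({logical}/{physical})"


def is_mixed(drives):
    """True when the drives do not all share one sector format."""
    return len({sector_format(l, p) for l, p in drives.values()}) > 1


# Hypothetical values; read the two sysfs files per drive on a real system
drives = {"sda": (512, 512), "sdb": (512, 4096), "sdc": (512, 4096)}
for name, sizes in drives.items():
    print(name, sector_format(*sizes))
print("mixed:", is_mixed(drives))
```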

biotoxin · June 3, 2016

10 hours ago, leadeater said:

You don't have mixed sector sizes in the pool by any chance? 512n, 512e, 4k. I know in Windows Storage Spaces if you mix 512n with 512e without specifying that the pool should use 512b sector size things can be very bad, I don't use ZFS so not sure if that is a thing or not.

http://jeffgraves.me/2014/06/03/ssds-on-storage-spaces-are-killing-your-vms-performance/

Quick google and it seems there is at least something similar.

http://wiki.illumos.org/display/illumos/ZFS+and+Advanced+Format+disks

http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/

I don't think you have this problem but hey just in case plus it's interesting to know, I had this when I first started using storage spaces with SSD tiering.

lots of progress made yesterday

hadn't thought about cluster size but I know I had to go to like 8k or 16k for some reason, I'll have to see if I can pull it back to 4k but I think it's something to do with volume size limitations?

but I spent the day trying to get some serious assistance from seagate and they have a known issue which may or may not apply after I evaluate

I'm about to go through the records for last night to see how the HGST worked out and I have some WD's I'll add as well since apparently there's also a corruption problem or so my scrub is telling me, something must've happened.

I'm hoping one giant fix for everything but it might be things coming to a boil, we'll see.

biotoxin · June 4, 2016

so not sure if it's fixed or if I found the actual problem

corruption was actually the exact files they'd been sending me

from what I can tell they were sending corrupt files

so apparently my system was freaking out over their problem?

of course now it's the weekend and I won't see anyone till monday but I think that might be part of the problem

though the system didn't hang with the other drives now, so maybe it was the drives?

something doesn't add up somewhere and at least now I feel like I'm making progress.

leadeater · June 5, 2016

4 hours ago, biotoxin said:

so not sure if it's fixed or if I found the actual problem

corruption was actually the exact files they'd been sending me

from what I can tell they were sending corrupt files

so apparently my system was freaking out over their problem?

of course now it's the weekend and I won't see anyone till monday but I think that might be part of the problem

though the system didn't hang with the other drives now, so maybe it was the drives?

something doesn't add up somewhere and at least now I feel like I'm making progress.

I wouldn't have thought your system would care if they were corrupt files or not, or would even be able to tell. ZFS stops files getting corrupt after being stored, putting corrupt files in should mean they stay like that?

Unless some smart logic is being applied to the file meta data and when it tries to validate the file it's picking up the corruption and somehow trying to fix it? Could explain the hours and hours of disk activity?
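To illustrate the first point: a checksum taken when the data is written only proves the data hasn't changed since then. A toy sketch of that idea, with SHA-256 standing in for whatever checksum the pool is actually configured with, and with made-up function names and sample bytes:

```python
import hashlib


def store(block):
    """Pretend to write a block: keep the bytes plus a checksum taken at write time."""
    return {"data": block, "checksum": hashlib.sha256(block).hexdigest()}


def verify(stored):
    """Re-hash the stored bytes and compare with the write-time checksum."""
    return hashlib.sha256(stored["data"]).hexdigest() == stored["checksum"]


# An archive that was already damaged before it reached the server
already_corrupt = b"PK\x03\x04 truncated archive"
record = store(already_corrupt)
print(verify(record))  # passes: the bytes match what was written

# Simulate damage that happens after the write
record["data"] = record["data"][:-1] + b"\x00"
print(verify(record))  # fails: the bytes changed after storage
```

So a scrub flagging those files would suggest something changed after they landed on disk, which fits the drives better than the client.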

biotoxin · June 5, 2016

12 minutes ago, leadeater said:

I wouldn't have thought your system would care if they were corrupt files or not, or would even be able to tell. ZFS stops files getting corrupt after being stored, putting corrupt files in should mean they stay like that?

Unless some smart logic is being applied to the file meta data and when it tries to validate the file it's picking up the corruption and somehow trying to fix it? Could explain the hours and hours of disk activity?

I wouldn't have thought so either, I figured it didn't analyze content and simply confirmed data integrity? I started looking into details surrounding it, smart money is still on the drives themselves but at this point I'm happy to have 3 potential answers compared to the 0 I had before.

hopefully come monday I can start getting solutions

I am however going to try and get them to do something different on their end and try that first because if I get around it on my end and the data being stored is worthless anyway I could see it coming back to bite me like "why didn't you tell us"

I just find it interesting that the only thing that threw errors during regular maintenance was their data

but granted it's not like I did QA before throwing brand new HGST drives in an array and putting them to work, so maybe the fancy firmware or drivers solved the problem and in turn caused the corruption in doing so.

so have them test two backups, the newest one on the hgst drives, and one from the seagates

I'll also move another client to the new drives and test how they behave that way

also get a second sample of backup from the new client on the new drives to see if corruption is somehow recurring, maybe do mirrored setup to see if anything is different between the drive arrays? check behavior in that scenario as well.

I have a massive to-do list.

biotoxin · June 9, 2016

here we are thursday and the answers I have make no sense

the backups worked fine thankfully, no problems have occurred since

the HGST drives were fine and not causing any problems I can find

moved everyone back to normal setup, and not a single problem at all of any kind since

I literally can't recreate the scenario that was happening every day, so either I fixed it permanently or it fixed itself or the gremlins are trolling me

I'm hoping it was their end this whole time and they fixed it

only thing I can think of on my end left was a bad sector finally getting detected by the drive and no more attempts at putting data on it? IDK

thanks for the assist, I'll revive if it ever happens again.

leadeater · June 9, 2016

Little disappointing you didn't get a concrete answer for the problem, if I was to put money on it I'd say the client fixed it on their end.
