Crazy NAS/data/file server build

58 minutes ago, Electronics Wizardy said:

NVMe drives should show up as /dev/nvd0 and so on in FreeBSD (the controller itself is /dev/nvme0), and work like any other block device.

 

Used drives are your friends; you can find these cheap on eBay and they're very good value. I'd get something like this: https://www.ebay.com/itm/Samsung-PM983-2-5-3-84TB-SSD-PCIe-Gen3/123572508794?hash=item1cc57ed87a:g:QyMAAOSwfvtcUITI

 

 

Maybe FreeBSD has improved on that since the last time I tried running ZFS on it. I didn't know what their nomenclature mechanism was for conventional SATA drives.

 

That's an interesting proposal.

 

Random write is only 50k IOPS though.

 

And their sequential write speeds are also terrible.

 

206 MB/s.

 

(Source: https://webcache.googleusercontent.com/search?q=cache:mpRkkuXBtWcJ:https://www.storagereview.com/samsung_983_dct_nvme_ssd_review+&cd=1&hl=en&ct=clnk&gl=us)

 

It was an interesting proposal, but unfortunately, 206 MB/s just isn't going to cut it.

 

Thank you, though, for helping me do a little bit of the research.

 

That is greatly appreciated!

IB >>> ETH


On 6/26/2019 at 5:19 PM, alpha754293 said:

However, enterprise drives are also a lot more expensive (relatively speaking) than consumer grade drives with, oftentimes, lower performance. The real difference is their MTBF and write endurance. But as far as performance goes (which is what this build will be geared towards), enterprise grade SSDs don't really help in this regard.

The enterprise ones do have higher performance; there are two scenarios where you'd see them performing worse. 1.) It's a read optimized model (you don't want these) or 2.) The testing was carried out after the drive had been write conditioned to remove the burst performance so you can see true performance (because in a server use case the burst can almost never be utilized, as the drive is always in a write saturated condition).

 

Consumer SSDs after write conditioning perform at or below an enterprise read optimized SSD, Samsung Pros are one of the few that can do slightly better though but still nowhere near a Mixed-Use or Write Intensive SSD.

 

As for the used options on ebay you'll have to go through Samsung's/Intel's/Micron's back catalog of products then check their pricing on ebay, used prices are excellent and something I keep an eye on regularly. Personally I'd take a used, good, server SSD over a Samsung Pro. 


7 hours ago, leadeater said:

The enterprise ones do have higher performance; there are two scenarios where you'd see them performing worse. 1.) It's a read optimized model (you don't want these) or 2.) The testing was carried out after the drive had been write conditioned to remove the burst performance so you can see true performance (because in a server use case the burst can almost never be utilized, as the drive is always in a write saturated condition).

 

Consumer SSDs after write conditioning perform at or below an enterprise read optimized SSD, Samsung Pros are one of the few that can do slightly better though but still nowhere near a Mixed-Use or Write Intensive SSD.

 

As for the used options on ebay you'll have to go through Samsung's/Intel's/Micron's back catalog of products then check their pricing on ebay, used prices are excellent and something I keep an eye on regularly. Personally I'd take a used, good, server SSD over a Samsung Pro. 

Again, I have to respectfully disagree.

 

The Samsung 983 DCT that's in the eBay link only has up to 52k IOPS on 4 kB, QD32 random writes (for example). (Source: https://webcache.googleusercontent.com/search?q=cache:mpRkkuXBtWcJ:https://www.storagereview.com/samsung_983_dct_nvme_ssd_review+&cd=1&hl=en&ct=clnk&gl=us)

By comparison, even Samsung's 860 Pros have up to 90k IOPS (4 kB, QD32 random writes) (Source: https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/860pro/) and that's their SATA drive. That isn't even their NVMe drive.

Samsung's 970 Pro NVMe drive has up to 500k IOPS (4 kB, QD32 random writes) (Source: https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/970pro/).

The only metrics where the Samsung 983 DCT surpasses the others are random read IOPS (580k IOPS vs. 100k IOPS and up to 500k IOPS respectively) and, if you compare SATA against NVMe, the sequential read/write speeds. But comparing NVMe to NVMe, the Samsung 970 Pro takes both sequential read and write speeds as well.

In fact, in the review by StorageReview.com about the Samsung 983 DCT, they write:

"As the SSD market looks to further segment products, the Samsung 983 DCT comes in as an NVMe product offering 0.8 DWPD, slightly under competing products that focus on the 1 DWPD mark in their read-intensive positioned drives. As such, it wasn't shocking to see lower write performance from the 983 DCT. Instead the drive carries a larger emphasis on read-performance. Here it was able to offer lower initial latency in small and large-block transfers. Overall the 983 DCT will do good work in read-heavy environments that lean toward a more value-oriented NVMe drive."

The Intels don't do much better in the benchmarks that I've read for the DC P4510 and P4610, but the Microns (which are even MORE expensive than the Intels) fare quite a lot better (Source: https://www.micron.com/-/media/client/global/documents/products/product-flyer/9300_ssd_product_brief.pdf?la=en).

 

So it looks like, if I want to achieve what I want to achieve, it will have to be with a dual EPYC system (128 PCIe lanes usable across the two sockets) and a Supermicro A+ Server 2123US-TN24R25M, because it looks like that will actually be able to give each NVMe drive the full x4 interface speed without a PCIe switch.

Downside: with the smallest capacity Micron 9300 NVMe SSD, the total system cost will be somewhere in the neighbourhood of $30k-35k by the time all's said and done.

 

It's such a shame that a) based on the random write IOPS number, I'd need a minimum of 11 or 12 drives to be able to absorb 100 Gbps worth of data coming in (if it is purely random data) and b) the consumer grade drives aren't anywhere close to that.
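As a rough sanity check on drive counts like the one above, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate: it assumes every write is a 4 KiB random write, that per-drive IOPS add up linearly across the array (no RAID or filesystem overhead), and it plugs in the spec-sheet figures quoted earlier in the thread. The function name is purely illustrative.

```python
import math

def drives_needed(link_gbps, block_bytes, iops_per_drive):
    """Drives required to absorb a link's worth of small random writes.

    Assumes IOPS scale linearly with drive count (an idealisation).
    """
    bytes_per_sec = link_gbps * 1e9 / 8          # Gbps -> bytes/s
    required_iops = bytes_per_sec / block_bytes  # writes per second needed
    return math.ceil(required_iops / iops_per_drive)

# Random-write IOPS figures as quoted in this thread (spec-sheet values).
quoted = {"983 DCT": 52_000, "860 Pro": 90_000, "970 Pro": 500_000}
for name, iops in quoted.items():
    print(name, drives_needed(100, 4096, iops))
```

Working backwards with the same assumptions, 11 or 12 drives corresponds to roughly 250k to 280k sustained random-write IOPS per drive. Keep in mind the consumer numbers are burst figures, so the counts for those drives are optimistic.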

 

It'd be cheaper, but I'd also get about half the performance with 6 NVMe drives in NVMe RAID based on AMD's own testing. And having 12 Micron 3.2 TB 9300 MAX drives will only take the total down from around $33k to $23k, which is still a lot of money.

 

*sigh...*

IB >>> ETH


2 hours ago, alpha754293 said:

Again, I have to respectfully disagree.

 

The Samsung 983 DCT that's in the eBay link only has up to 52k IOPS on 4 kB, QD32 random writes (for example). (Source: https://webcache.googleusercontent.com/search?q=cache:mpRkkuXBtWcJ:https://www.storagereview.com/samsung_983_dct_nvme_ssd_review+&cd=1&hl=en&ct=clnk&gl=us)

I wasn't talking about that specific model, Samsung makes many and that one is a read optimized one, like I mentioned not one you want. Samsung alone has a back catalog of more than 30 enterprise SSDs. You won't easily find these on their websites because they are EOL.

 

There are 3 distinct categories of enterprise/server SSDs: Read Intensive, Mixed-Use and Write Intensive. Most say what they are but you can easily tell by the DWPD specification. Read = 0.8 and below, Mixed = 0.8-1.5, Write = 2-10+.

 

There are no consumer SSDs with more than 0.8 DWPD, Samsung 970 Pro is 0.6 and the 970 EVO is 0.3.

 

I need to mention point 2 again: enterprise SSDs are tested after write conditioning (sustained writes to the drive for a long time, sometimes hours) before the testing is carried out. Consumer testing does not do this, because the drives would have horrific performance under that condition. Samsung does not make their enterprise SSDs slower than their consumer SSDs, this is not a thing. Samsung also uses the same NAND flash chips and DRAM in all their SSDs and in many cases uses the same SSD controller in both enterprise and consumer; the difference comes in with physical NAND over provisioning, firmware optimization and power loss protection.

 

This is why even consumer Samsung SSDs are so good, they are very alike to their enterprise counterparts but lack the DWPD which means lower write performance once write conditioned. You do actually need to consider that because it's likely the SSDs will become write saturated while you are using them, my Samsung 840 Pros only do 30-35 MB/s in that state, 850 Pro only does slightly better (well twice as good but still bad).


On 6/27/2019 at 4:39 PM, leadeater said:

I wasn't talking about that specific model, Samsung makes many and that one is a read optimized one, like I mentioned not one you want. Samsung alone has a back catalog of more than 30 enterprise SSDs. You won't easily find these on their websites because they are EOL.

 

There are 3 distinct categories of enterprise/server SSDs: Read Intensive, Mixed-Use and Write Intensive. Most say what they are but you can easily tell by the DWPD specification. Read = 0.8 and below, Mixed = 0.8-1.5, Write = 2-10+.

 

There are no consumer SSDs with more than 0.8 DWPD, Samsung 970 Pro is 0.6 and the 970 EVO is 0.3.

 

I need to mention point 2 again: enterprise SSDs are tested after write conditioning (sustained writes to the drive for a long time, sometimes hours) before the testing is carried out. Consumer testing does not do this, because the drives would have horrific performance under that condition. Samsung does not make their enterprise SSDs slower than their consumer SSDs, this is not a thing. Samsung also uses the same NAND flash chips and DRAM in all their SSDs and in many cases uses the same SSD controller in both enterprise and consumer; the difference comes in with physical NAND over provisioning, firmware optimization and power loss protection.

 

This is why even consumer Samsung SSDs are so good, they are very alike to their enterprise counterparts but lack the DWPD which means lower write performance once write conditioned. You do actually need to consider that because it's likely the SSDs will become write saturated while you are using them, my Samsung 840 Pros only do 30-35 MB/s in that state, 850 Pro only does slightly better (well twice as good but still bad).

Yeah, I'm not sure if I'm quite at 1.0 DWPD yet.

 

I'll have to do some testing/benchmarking to find out, in some limited way. Again, that's only because my current SSDs are all SATA, which is going to limit my ability to reach a certain DWPD level, just because the drive and the interface aren't capable of pushing more data onto/off of it.

 

Again, this is why I am going to have to play chicken-and-the-egg with some of this stuff only because it takes quite some time to conduct some of these tests before I would have an answer/enough information to then know what I need to look/shop for.

 

Right now, I just don't have the requisite data that I can use/leverage in order to guide the decision making matrix for the capex acquisition.

IB >>> ETH


8 minutes ago, alpha754293 said:

Yeah, I'm not sure if I'm quite at 1.0 DWPD yet.

I doubt you would be but you don't actually need to write the entire drive worth of data to hit write saturated state, just have enough write rate long enough for that. If there is enough time between each large write session of your workload for TRIM and GC to clean up you'll be fine and this will never be a problem, even on less write resilient SSDs. Would just suck if you start off with excellent performance you're looking for then it slowly degrades and you have to figure out why.


16 hours ago, leadeater said:

I doubt you would be but you don't actually need to write the entire drive worth of data to hit write saturated state, just have enough write rate long enough for that. If there is enough time between each large write session of your workload for TRIM and GC to clean up you'll be fine and this will never be a problem, even on less write resilient SSDs. Would just suck if you start off with excellent performance you're looking for then it slowly degrades and you have to figure out why.

Yes, this is true.

 

It's just a pity that, of course, enterprise grade SSDs aren't benchmarked, in the public domain, as much nor as often compared to consumer grade drives for obvious reasons.

 

And as a consequence of that, it also makes trying to find out the sustained performance/endurance vastly more difficult, which means that again, when I go ahead with this build, it's going to be quite the catch-22/chicken-and-egg conundrum, because I would need to acquire the hardware to be able to conduct my own testing in order to inform the choice of hardware that I need to purchase as a solution to this need/problem.

 

The other issue, for example, is that with thinkmate.com -- the NVMe drives that they offer with the build is whatever they offer.

 

The Supermicro A+ servers, for example, are sold only as a complete system (i.e. the systems integrators/builders aren't allowed to sell those barebones) which means that even if I were able to find some Samsung or Micron or Intel NVMe SSD with a higher write endurance, I would still have to buy the system with at least ONE of the lower write endurance NVMe drives to get Thinkmate to ship the system out the door, only for me to then yank and dump the drive and pop in the NVMe SSDs that have the higher write endurance. (At least per the online configurator tool that they have.)

 

*sigh...*

 

Even the lowest spec'd model carries a price tag of $7673 with only ONE of their cheapest NVMe SSDs (Intel P4510 1 TB), and only 128 GB of RAM, with dual AMD EPYC 7251s. You add the price back in for the higher write endurance NVMe SSDs, you're looking at well north of $30k, and quite possibly starting to push into or almost into $40k.

 

Conversely, my originally proposed build comes in at $6k. Even knowing that six Samsung 970 Pros in RAID0 will only be able to achieve less than HALF of the total bandwidth the drives have been given, the array might still be able to accept at least ONE single 100 Gbps 4x EDR IB stream coming into it, so it might still be useful in that regard.
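For a feel of how close that is, here is a minimal sketch comparing the PCIe interface bandwidth of six x4 drives against one 100 Gbps stream. It assumes PCIe 3.0 links (8 GT/s per lane with 128b/130b encoding) and ignores protocol overhead, the drives' own NAND limits, and any RAID cost; the helper names are made up for illustration.

```python
def pcie3_lane_gb_per_s():
    """Approximate usable bandwidth of one PCIe 3.0 lane in GB/s."""
    # 8 GT/s line rate, 128b/130b encoding, 8 bits per byte
    return 8e9 * (128 / 130) / 8 / 1e9

def array_interface_gb_per_s(drives, lanes_per_drive=4):
    """Aggregate interface bandwidth if every drive gets its full link."""
    return drives * lanes_per_drive * pcie3_lane_gb_per_s()

def link_gb_per_s(gbps):
    """Convert a network data rate in Gbps to GB/s."""
    return gbps / 8

total = array_interface_gb_per_s(6)
print(f"six x4 drives: {total:.1f} GB/s, half of that: {total / 2:.1f} GB/s")
print(f"one 100 Gbps stream: {link_gb_per_s(100):.1f} GB/s")
```

On these assumptions, half of the six-drive interface bandwidth sits in the same ballpark as a single 100 Gbps stream, so whether the array keeps up would depend on how far below half the real-world figure lands.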

 

And of course, it's quite the price difference between $6k vs. $30k-40k.

IB >>> ETH


  • 3 weeks later...

@leadeater

Question for you - if a drive endurance performance value is given in "TBW in PB", and then a number -- does that number mean the number of drive writes per day or TB written per day for the length of the warranty or what does that number mean? How do I read/interpret a number like that?

 

(cf. https://www.micron.com/~/media/documents/products/product-flyer/9200_ssd_product_brief.pdf)

 

Thank you.

IB >>> ETH


6 minutes ago, alpha754293 said:

@leadeater

Question for you - if a drive endurance performance value is given in "TBW in PB", and then a number -- does that number mean the number of drive writes per day or TB written per day for the length of the warranty or what does that number mean? How do I read/interpret a number like that?

 

(cf. https://www.micron.com/~/media/documents/products/product-flyer/9200_ssd_product_brief.pdf)

 

Thank you.

https://www.micron.com/-/media/client/global/documents/products/data-sheet/ssd/9200_u_2_pcie_ssd.pdf.

 

9200 Eco: below 1 DWPD

9200 Pro: ~1 DWPD

9200 Max: 3 DWPD

 

Conversion can be tricky if you have to do it; I happened to find a Micron doc with the DWPD figures. Looking at the 1.92TB Pro it would be 1867 days or about 5 years to write 3.5PB at 1 DWPD. These SSDs have a 5 year warranty so that lines up with that math. That's how I've worked it out in the past when no DWPD figure could be found, and it's not been greatly wrong, but it's hard to know if it's actually the case. Some SSDs have a shorter warranty so the math can be wrong.
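The conversion described above can be sketched roughly like this. The 3.5 PB, 1.92 TB and 5 year figures come from the post; treating 1 PB as 1024 TB is an assumption made here because it reproduces the 1867 day figure (with 1 PB = 1000 TB it comes out nearer 1823 days). The function names are illustrative only.

```python
def days_to_write(tbw_tb, capacity_tb, dwpd=1.0):
    """Days needed to write tbw_tb at a given number of drive writes per day."""
    return tbw_tb / (capacity_tb * dwpd)

def dwpd_from_tbw(tbw_tb, capacity_tb, warranty_years):
    """Implied DWPD if the rated TBW is spread evenly over the warranty."""
    return tbw_tb / (capacity_tb * warranty_years * 365)

tbw_tb = 3.5 * 1024  # 3.5 PB expressed in TB (assumes 1 PB = 1024 TB)
print(round(days_to_write(tbw_tb, 1.92)))        # days at 1 DWPD
print(round(dwpd_from_tbw(tbw_tb, 1.92, 5), 2))  # implied DWPD over 5 years
```

As the post says, this only holds if the manufacturer rated the endurance over the same period as the warranty, so treat the result as an estimate when no official DWPD figure is published.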


14 hours ago, leadeater said:

https://www.micron.com/-/media/client/global/documents/products/data-sheet/ssd/9200_u_2_pcie_ssd.pdf.

 

9200 Eco: below 1 DWPD

9200 Pro: ~1 DWPD

9200 Max: 3 DWPD

 

Conversion can be tricky if you have to do it; I happened to find a Micron doc with the DWPD figures. Looking at the 1.92TB Pro it would be 1867 days or about 5 years to write 3.5PB at 1 DWPD. These SSDs have a 5 year warranty so that lines up with that math. That's how I've worked it out in the past when no DWPD figure could be found, and it's not been greatly wrong, but it's hard to know if it's actually the case. Some SSDs have a shorter warranty so the math can be wrong.

Thank you for your help.

 

So I'm doing some math and the ONLY, single, PCIe NVMe AIC SSD that I have recently "died" (went into read-only mode) because I exhausted the drive write endurance limit on it.

 

(It was only good for about 0.175 DWPD and it was a 0.4 TB (400 GB) drive.)

 

Calculating my actual usage of that drive (keeping in mind that it's only 400 GB, so there were some limitations in terms of what I could and couldn't do with it), it worked out that I was at about 0.513 DWPD with that SSD.

 

So, in anticipation of that, for my next build: even though it's going to have more SSDs overall, I also want it to last longer than this one has, and with the larger capacity I reckon that I am going to use it more. So I'd be targeting upwards of 17.5 PB written per drive (at 3.2 TB, 3 DWPD, for 5 years) * at least 8 drives, which should hopefully give me enough endurance for 10 years' worth of writes.

 

(Host writes total 252.2 TB over just 3.36 power-on years. NAND writes worked out to be 847.67 TB in the same amount of time.)
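For anyone wanting to repeat this kind of calculation from their own SMART data, a minimal sketch follows. The inputs are the numbers given in this post; the 365-day year and the helper names are assumptions of the sketch, so the results land close to, rather than exactly on, the figures above.

```python
def observed_dwpd(host_writes_tb, capacity_tb, power_on_years):
    """Average drive writes per day actually seen over the drive's life."""
    return host_writes_tb / (capacity_tb * power_on_years * 365)

def write_amplification(nand_writes_tb, host_writes_tb):
    """Ratio of data written to NAND versus data the host asked to write."""
    return nand_writes_tb / host_writes_tb

def rated_tbw(capacity_tb, dwpd, years):
    """Total TB written implied by a DWPD rating sustained for some years."""
    return capacity_tb * dwpd * years * 365

print(round(observed_dwpd(252.2, 0.4, 3.36), 3))     # roughly 0.51 DWPD
print(round(write_amplification(847.67, 252.2), 2))  # NAND vs host writes
print(rated_tbw(3.2, 3, 5) / 1000, "PB")             # per-drive target
```

The write amplification figure is a reminder that the NAND saw noticeably more wear than the host write counter alone suggests.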

 

Again, this is with a relatively "small" SSD.

 

Unfortunately, this new data actually tells me (and yes, you mentioned this to me before, but I didn't have the data to support it until my current drive's failure) that I actually need to pay attention to the drive endurance, because the Samsung 970 Pro 1 TB can only do 0.6575 DWPD (on a 1 TB drive) for 5 years, or about 0.32875 DWPD if I want it to last twice as long, and my current data from my "small" drive already exceeds that.

 

And while yes, I would have had 8 drives to help spread out some of the load and wear, I think that I will actually likely exhaust the drive endurance/write limit of these drives, which means that I HAVE to go with the enterprise SSD drive solution for this reason, almost alone. In other words, it's very unlikely that it will happen just due to the sheer cost of it.

 

Since spinning rust doesn't have this issue, it looks like the plan now would be to try and get as many mechanically rotating hard drives as I can, and have the array run as fast as it can so that it would still be a cost effective solution (relatively speaking), because the probability that I will LITERALLY wear through consumer grade drives is actually VERY high now, given that I've already done so with one drive. (The faster the drive setup that I have, the more likely I am to use it more due to its immense and impressive speeds, which will only wear it out even sooner.)

 

*sigh...*

IB >>> ETH


32 minutes ago, alpha754293 said:

So I'm doing some math and the ONLY, single, PCIe NVMe AIC SSD that I have recently "died" (went into read-only mode) because I exhausted the drive write endurance limit on it.

 

(It was only good for about 0.175 DWPD and it was a 0.4 TB (400 GB) drive.)

 

Calculating my actual usage of that drive (keeping in mind that it's only 400 GB, so there were some limitations in terms of what I could and couldn't do with it), it worked out that I was at about 0.513 DWPD with that SSD.

Bigger capacity SSDs actually don't require as high a DWPD because there are more NAND cells available for wear leveling and GC; endurance isn't quite linear, which is nice.

 

32 minutes ago, alpha754293 said:

Samsung 970 Pro 1 TB can only do 0.6575 DWPD (on a 1 TB drive) for 5 years, or about 0.32875 DWPD if I want it to last twice as long and my current data from my "small" drive already exceeds that.

Hmm that seems a bit low, law of averages I guess and didn't get the better side of it. Samsung had a 128GB 850 Pro still running after 8PB writes, honestly that is very high and I wouldn't count on getting that but it's possible even on these smaller capacity SSDs.


On 7/18/2019 at 1:14 AM, leadeater said:

Bigger capacity SSDs actually don't require as high a DWPD because there are more NAND cells available for wear leveling and GC; endurance isn't quite linear, which is nice.

 

Hmm that seems a bit low, law of averages I guess and didn't get the better side of it. Samsung had a 128GB 850 Pro still running after 8PB writes, honestly that is very high and I wouldn't count on getting that but it's possible even on these smaller capacity SSDs.

Samsung's specs for the 970 Pro NVMe SSD state that the limited warranty covers a total of 1200 TB written.

 

So working backwards, that's how I got that number.

 

In regards to the larger capacities and the write endurance, I agree with you, but then there's also the associated cost with it.

 

With the current data that I have with only a 0.4 TB NVMe AIC SSD, I was able to calculate what was my current writes-per-day in TB and then added a safety margin of 5 to that number, in order to try and predict/project what my future write needs will be if I have faster and larger capacity drives.

 

The lowest capacity Micron 9300 Max is 3.2 TB, so I went with that, because it's also the least expensive option as well.

 

Despite this, at $875 a pop, and with me likely needing a minimum of 8 of them in order to sustain the write speeds, that will already run me $7000 for the drives alone.

 

Beyond that, if I still wanted to stick with consumer grade drives because they're cheaper (albeit also lower capacity, though I don't think that I'm really capacity bound, not if it is just used for transient data processing and not long term, hyper speed data storage as well), then I MIGHT be able to push the Samsung 970 Pro 1 TB to its write endurance limit and then leverage the fact that I plan on running it in a RAID0 array to help distribute the wear load. (And of course, the more drives I add to it, the better, but at that point I would be looking at getting M.2 to U.2 NVMe adapters so that I can mount them into a 2U NVMe rackmount, and some of those use PCIe switches and plexing, which will reduce the per-drive performance, but will also limit my ability to wear through the write endurance at full speed/wear rate.)

 

So I'm still trying to decide the best strategy, but it looks like that, based on the wear data that I have from my one and only NVMe PCIe AIC SSD, it's guiding my decision making processes with this additional data.

 

It looks like that cost is really the real limiter/barrier for me. 

 

What's also interesting to me is that this is what it takes to be able to keep up with an incoming data feed into the write server at an estimated rate of 100 Gbps, let alone the 400 Gbps that I was originally targeting.

IB >>> ETH


  • 4 weeks later...

I've just had my second through fifth consumer grade SSD failure so as a result of that, all SSDs, especially consumer grade SSDs, have been eliminated from my production environment compute cluster. (Four more drives just failed today - wore through the endurance limit again, and this time, I managed to accomplish that task/feat in a little over two years.)

 

Enterprise grade SSDs are also prohibited due to the cost and the need for something that has a SUBSTANTIALLY higher write endurance, which again, comes in at a price point that is not financially feasible for me.

 

This project is now officially declared DOA.

IB >>> ETH


  • 6 months later...

Hello guys!

 

made an account just because of this thread. I’m a developer for the DANOS project at AT&T. I’ve worked with squashfs and imaging snapshots for the past few years amongst other dataplane related projects. 
 

I recently acquired myself a new PC for doing some storage projects on. Here’s the gist (I’ll post more specifics later, on my phone right now):


Desktop:

ryzen 3960x

3 x 1TB Sabrent nvme 4

asrock trx40 creator

 

Portable:

lenovo x1 carbon 5th gen

512gb intel pro nvme

 

anyways the goal is to create an end-all-be-all file system for me to use (for the rest of my life). The solution I’ve been experimenting with is:

 

Desktop:

HW:

- 3 x 1 TB nvme 4 

- uefi chipset

SW:

- md raid 5 for a total of 2tb workable

- Luks encryption

- btrfs with multiple subvolumes for home/kernel files

 

My hope is to use nvme over fabrics to mount my desktop home partition as my home directory on my portable. However I’ve been encountering sync issues. My question to you all: do you know how file flushes behave with nvmeof? For example, if I have a network mounted nvmeof+ext4 file system and I write a file to this partition, will other clients that have the file system mounted simultaneously over the fabric see the write? As far as I can tell, writes to the actual hardware don’t happen till the client unmounts. Other clients also need to re-mount in order to see the file changes. Has anyone encountered something like this, or could I be using a dated set up?

 

will be posting more exact config/details later on today but thought I’d get this rolling. 
 

thanks

erik

