
So I have been tasked with something that is a bit large for me.

I am used to handling some fairly big projects on my own, and normally I handle this sort of project from the ground up: all the research, implementation, etc.

But this one has brought some interesting challenges and some very time-intensive research.

A friend thought I should bring it here, and honestly, I thought it would indeed be interesting to see what sort of recommendations I get from this online community.

 

I have been tasked at work with looking into what it will take to start providing an online, cloud-style storage service for public use.

This means a system that can handle storage of files for clients, most likely backup files and things of that nature; mostly cold-to-warm storage.

Since this will be a startup, cost is a high-priority item, but the main priorities are as follows, in that order:

Reliability

Performance

Capacity

 

Now, I have come up with something that might be a viable solution and will post it upon request. But since I am looking for recommendations, I don't want to taint them by posting what I am already considering.

And I don't intend for this to become a massive argument over whether I should or shouldn't implement my current plan.

 

Instead, I would like others to offer up solutions that they might recommend, and maybe some pros and cons of each.

I will be looking into all of them and comparing them to see what best fits all of our requirements.

 

Mainly what I am looking for is some sort of SDS (Software Defined Storage) solution.

I am not too interested in looking into proprietary hardware storage solutions like Dell EqualLogic, as the costs can climb very quickly.

But still, I will hear out any argument if you want to make a case for what you believe will work and why, regardless of whether it's hardware-based or not.

 

Ideally, this system will be used for storing backups and things of that nature.

I don't intend for people to use it as some sort of pool for data-intensive software such as a SQL database, although flexibility is always nice.

 

I am primarily interested in block and/or file storage.

My main focus is throughput and reliability.

IOPS is not at the bottom of this focus list, but it's not the highest priority either.

I expect customers to store data for the long haul, most likely as large single files.

I am thinking something like 3x replication on this system to ensure maximum reliability. 

 

I don't expect anyone to hold my hand through any of this, as I will gladly put in the legwork to figure out how a particular solution ticks.

But I have hit a point where I have my mind set on a certain system, and I have a sneaking suspicion that I haven't explored as deeply as I could have and that there may be a contender I have missed.

 

But ultimately I am tasked with taking on this venture and seeing how it can be done, what the start-up costs are, etc.

Eventually, if this plays out, it will scale into a full-fledged data center.

 

Like I said, I can post what I am currently looking into upon request; otherwise, I am curious to see where your minds wander.


What kind of IOPS and bandwidth do you need? What network speed are you thinking?

 

How much budget are we thinking? I'd say 2k USD is your minimum.

 

I'd probably get a few systems like a Dell R610, then get SAS expanders like a NetApp DS4243. You can also get servers like a Dell R510 that hold more drives, but a SAS expander will allow for a lower cost per drive and easier expansion.

 

Then I'd run something like CentOS on every system, use ZFS for the filesystem and RAID, then use Gluster to build the cluster. Gluster is great for this use case. Ceph and LizardFS are also options. Ceph is more for volume storage, which you don't seem to want. LizardFS is slow in my experience, but it allows for easy drive adding.

 

Get 10GbE between all systems.

 

The config above is about 2k USD for three bays, enclosures, and a switch on eBay; then fill it with drives.
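
If it helps, here's a back-of-the-envelope tally in Python of where that ~2k lands; every price below is an assumed used/eBay ballpark, not a quote.

# Back-of-the-envelope tally for the suggested 3-node used-hardware build.
# Every price below is an assumed eBay-style ballpark, not a quote.
used_server_usd = 200      # assumed price per Dell R610-class server
disk_shelf_usd = 300       # assumed price per NetApp DS4243-class SAS shelf
switch_10gbe_usd = 400     # assumed price for a used 10GbE switch
nodes = 3

chassis_total = nodes * (used_server_usd + disk_shelf_usd) + switch_10gbe_usd
print(f"Chassis-level total before drives: ~${chassis_total}")  # ~$1,900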

 

 


I am going to add this to the main post, but budget-wise I am in the 10-50k region. There is flexibility to a point; it all depends on how I sell it and whether I can prove the system's reliability and that it is worth the price tag.

And I am hoping to end up with 100TB+ or so to start out with, if possible.

But I have realized that can be easy or difficult to achieve at this price depending on how I deliver the final product, i.e. what type of system I use.

Networking would be at least 10GbE for intercommunication between the nodes.

 

Also, I believe you can do file, block, and object storage with Ceph.

I had looked into something like OpenStack + Ceph as an option.

 


2 minutes ago, ElSeniorTaco said:

I am going to add this to the main post, but budget-wise I am in the 10-50k region. There is flexibility to a point; it all depends on how I sell it and whether I can prove the system's reliability and that it is worth the price tag.

And I am hoping to end up with 100TB+ or so to start out with, if possible.

Networking would be at least 10GbE for intercommunication between the nodes.

 

How much experience do you have with Linux and clustered systems?

 

Do you care about Windows vs. Linux?

 

Do you need support?

 

Do you mind used hardware?

 

For that budget, I'd start with R720s for the servers and used 40GbE NICs between systems. Get something like 128 or 256GB of RAM per server.

 

Then, with 100TB of usable storage, we're talking about 400TB of actual drives (allowing for parity losses and room for 3-way clustering). New drives are about 40-50 dollars per TB for enterprise/NAS quality, so that's about 16k in drives there. I'd also get some SSDs for cache: probably 1k per server for an Optane SLOG (if using ZFS), then another 2k per server for some NVMe drives for L2ARC.
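
Rough sanity check on that math in Python; the $/TB figures are the estimate above, and the headroom factor is just an assumption to get from 300TB to roughly 400TB raw.

# Sanity check on the drive-cost math above: ~100TB usable at 3-way
# replication plus some assumed headroom for parity/rebuild space.
usable_tb = 100
replication = 3
headroom = 1.33                      # assumed extra margin, to land near 400TB raw
raw_tb = usable_tb * replication * headroom

for usd_per_tb in (40, 50):          # enterprise/NAS drive estimate from above
    print(f"{raw_tb:.0f}TB raw at ${usd_per_tb}/TB ≈ ${raw_tb * usd_per_tb:,.0f}")
# -> roughly $16,000-$20,000 in spinning disks before SSDs and chassis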


I edited the above post before you responded and added the following:

Quote

 

But I have realized that can be easy or difficult to achieve at this price depending on how I deliver the final product, i.e. what type of system I use.

 

Also, I believe you can do file, block, and object storage with Ceph.

I had looked into something like OpenStack + Ceph as an option.


And I am fairly familiar with Linux and comfortable implementing a solution that way; actually, most of my solutions end up on some sort of Linux-based OS.

Also, I am totally fine starting with used hardware for everything but the drives, and probably the battery backups for RAID cards if caching is used.

 

For Ceph, my solution was:

 

Red Hat OS 

Used servers, each with:

1 core per drive, preferably on a single socket

1GB of RAM per TB of storage

1 SSD for every 5 drives for journaling purposes (new, of course)

1 SAS/SATA card for every 6 drives

2 network cards with 10Gb ports each

Dual redundant power supplies

and 10 total 10TB or 6TB WD Gold or Seagate Exos drives (new as well)

Per Node

 

And something like 4-7 nodes to start.

I believe this went over budget when I went to 7 nodes, but it was interesting and there are lots of pros to this system.

It's at the top of the list of contenders for this project so far; a rough sizing sketch of that spec is below.
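
Here's that sizing sketch; it's just the rules of thumb above turned into arithmetic, nothing Ceph-specific is assumed.

# Rough sizing sketch of the proposed Ceph nodes, using the rules of thumb
# listed above (1 core per drive, 1GB RAM per TB, 1 SSD per 5 drives,
# 1 SAS/SATA card per 6 drives) with 10x 10TB drives per node.
import math

drives_per_node = 10
tb_per_drive = 10
replication = 3

cores = drives_per_node                           # 1 core per OSD drive
ram_gb = drives_per_node * tb_per_drive           # 1GB RAM per TB of storage
journal_ssds = math.ceil(drives_per_node / 5)     # 1 SSD per 5 drives
hba_cards = math.ceil(drives_per_node / 6)        # 1 SAS/SATA card per 6 drives

print(f"Per node: {cores} cores, {ram_gb}GB RAM, {journal_ssds} journal SSDs, {hba_cards} HBAs")

for nodes in (4, 7):
    raw_tb = nodes * drives_per_node * tb_per_drive
    print(f"{nodes} nodes: {raw_tb}TB raw, ~{raw_tb / replication:.0f}TB usable at 3x replication")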

 

I had thrown Gluster out because of the split-brain issues I was reading about, but I never confirmed whether that issue has been reliably resolved.

It could be that I was reading an old article.


2 minutes ago, ElSeniorTaco said:

And something like 4-7 nodes to start.

I believe this went over budget when I went to 7 nodes, but it was interesting and there are lots of pros to this system.

It's at the top of the list of contenders for this project so far.

That sounds about like what I'd do, but based on Ceph instead of ZFS + Gluster.

 

I'd do fewer nodes and more drives per node using SAS expanders. This allows for a cheaper cost per drive, but lower performance; it's optimized more for lots of space than for speed.

 

4 minutes ago, ElSeniorTaco said:

I had thrown Gluster out because of the split-brain issues I was reading about, but I never confirmed whether that issue has been reliably resolved.

This shouldn't be a problem, as it's mostly an issue with 2 systems. With 3+ systems it uses the data on more than 2 of the systems, so an error on one won't cause an issue. You can cause these issues if you intentionally break lots of systems, but if you have the majority of your nodes failing at the same time, you have other problems.
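
A tiny, generic sketch of that majority rule (plain Python, not tied to any particular Gluster setting):

# Tiny illustration of why split-brain is mostly a 2-node problem:
# with 3+ copies a strict majority still exists when one copy disagrees
# or drops out, while with 2 copies there is never a tie-breaker.
def quorum(copies: int) -> int:
    """Smallest number of agreeing copies that forms a strict majority."""
    return copies // 2 + 1

for copies in (2, 3, 5):
    tolerated = copies - quorum(copies)
    print(f"{copies} copies: majority = {quorum(copies)}, "
          f"can lose or disagree with {tolerated} and still resolve")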

 

 


17 minutes ago, Electronics Wizardy said:

That sounds about like what I'd do, but based on Ceph instead of ZFS + Gluster.

 

I'd do fewer nodes and more drives per node using SAS expanders. This allows for a cheaper cost per drive, but lower performance; it's optimized more for lots of space than for speed.

 

I will look into this; I had sort of built my spec around maximum performance, as well as recovery speed for a failed node or drive.

But I am curious how much of a trade-off I can make by fattening the nodes up, and what that does to performance.

Either way, I agree with this: fatter nodes can lower costs by having fewer core components, with more drives managed by each of them.

I am going to look deeper into this when I get to work on Monday. Trying to figure out recovery time and performance for these nodes has been very tricky, as there isn't a ton of solid information.

You can find plenty of speculation and individual cases, but it has been difficult to hunt down solid, reliable information on these subjects.

However, I haven't had enough time to keep digging, and I think I will just need to continue putting in the legwork until I come up with an algorithm of sorts that I can rely on to gauge what my end result will be.

 

17 minutes ago, Electronics Wizardy said:

This shouldn't be a problem, as it's mostly an issue with 2 systems. With 3+ systems it uses the data on more than 2 of the systems, so an error on one won't cause an issue. You can cause these issues if you intentionally break lots of systems, but if you have the majority of your nodes failing at the same time, you have other problems.

I don't think I gave Gluster enough of a chance, then. I started reading horror stories, and it got worse and worse. :)

I will also look much deeper into this option. I am curious whether you can squeeze more space out of a Gluster system without sacrificing reliability.

And how it can be implemented, and what features I could offer a client with a system like this.

So I will certainly look back into this option as well.

 

 

I am going to hit the bed, as it's 2:18 am.

But I am looking forward to seeing what else gets thrown on the table, and what other opinions can be given on the two options listed so far.

 


Just now, ElSeniorTaco said:

I am curious whether you can squeeze more space out of a Gluster system without sacrificing reliability.

It's about the same as Ceph or any other system. You have x copies or parity, and you lose that much space.

 

Also, what do you have for backup? I'd look into tape here.

1 minute ago, ElSeniorTaco said:

But I am curious how much of a trade-off I can make by fattening the nodes up, and what that does to performance.

Do you have an IOPS figure that you need? Speed shouldn't take that much of a hit here, but it really depends on the network, drives, workload, and client config.

 

 


9 hours ago, ElSeniorTaco said:

Mainly what I am looking for is some sort of SDS (Software Defined Storage) solution.

Ceph, Gluster, and SwiftStack all spring to mind for this.

 

Gluster: Simple and easy to set up; not feature-rich natively, but you can add on as required with third-party software (S3, NFS gateways, etc.).

Ceph: More single-site and performance-oriented. The most feature- and protocol-rich option I know of.

Swift: Designed more for cool/cold storage and multiple sites, so it's excellent for backup targets.

 

The big difference between Ceph and Swift when talking about multiple sites is that Swift uses an eventually consistent model: data is written locally and acknowledged, then storage rules push the data to other sites to meet the policy as required. Ceph, on the other hand, is always consistent, so all replicas must acknowledge a write before the writing client gets an acknowledgement. This means it's a bad idea to stretch cluster nodes over sites (use replication or another method to write to two separate Ceph clusters instead, e.g. in backup software policies).
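
As a rough illustration of why stretching an always-consistent cluster hurts (the latency numbers below are made-up placeholders, not measurements):

# Toy model of write-acknowledgement latency when copies are spread over
# two sites. The millisecond figures are illustrative assumptions only.
local_rtt_ms = 0.5       # assumed round trip to copies in the same site
remote_rtt_ms = 40.0     # assumed round trip to a copy in another site

# Always-consistent (Ceph-style): the client ack waits for every copy,
# so the slowest copy - the remote site - sets the write latency.
strong_ack_ms = max(local_rtt_ms, remote_rtt_ms)

# Eventually consistent (Swift-style): ack after the local write;
# remote copies are pushed asynchronously afterwards.
eventual_ack_ms = local_rtt_ms

print(f"Stretched always-consistent write ack:  ~{strong_ack_ms} ms")
print(f"Eventually consistent local write ack:  ~{eventual_ack_ms} ms")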


8 hours ago, ElSeniorTaco said:

I am thinking something like 3x replication on this system to ensure maximum reliability. 

Not erasure coding? 3x replication is sort of an expensive way to do cold data storage, but I guess it depends on how important the data is and the storage-efficiency difference between an erasure-coding rule set that would meet those requirements vs. 3x replication.
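
For a feel of the gap, a quick efficiency comparison; the 4+2 and 8+3 profiles below are just common examples, not a recommendation for this workload.

# Storage-efficiency comparison: 3x replication vs. a couple of common
# erasure-coding profiles (k data chunks + m coding chunks).
# The 4+2 and 8+3 profiles are just examples, not a recommendation.
usable_tb = 100

schemes = {
    "3x replication": 3.0,          # raw bytes stored per usable byte
    "EC 4+2": (4 + 2) / 4,          # 1.5x overhead
    "EC 8+3": (8 + 3) / 8,          # ~1.4x overhead
}

for name, overhead in schemes.items():
    print(f"{name}: {usable_tb * overhead:.0f}TB raw for {usable_tb}TB usable "
          f"({1 / overhead:.0%} efficiency)")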


8 hours ago, ElSeniorTaco said:

Also, I am totally fine starting with used hardware for everything but the drives, and probably the battery backups for RAID cards if caching is used.

If Ceph, then use HBAs, not hardware RAID cards; you can increase cluster latency and limit throughput if you use RAID write-back cache. SDS generally wants full, direct control of the disks, so RAID/abstraction is a big no for almost all of them. Gluster is an exception to that; Gluster + ZFS live very well together.

 

8 hours ago, ElSeniorTaco said:

1 SSD for every 5 drives for journaling purposes (new, of course)

Potentially not required if you go with BlueStore OSDs instead of FileStore OSDs. BlueStore doesn't need an SSD journal nearly as much as FileStore did, and a lot of disks in a server will perform well; if you need performance, consider a flash cache tier instead.


Also, FYI, I've got 7 DL380 Gen8 servers with 5 SSDs per server and a bunch of 10K SAS disks in them, plus 4 DL380 Gen8 LFF servers with 4TB HDDs in them, that I'm trialing different Ceph configurations on right now. If you want me to try anything out or give perf numbers for specific tests, let me know.

 

Spec-wise the servers are mostly 2x E5-2667, 64GB RAM, 1x 10Gb (a switch-port limit here, as they actually have 4x 10Gb); just a bunch of decommissioned servers set aside to build a cluster with before dumping actual money into it. Some servers are higher spec, some are lower, but not by much.

