
Ceph vs OpenIO

leadeater

So, as the title says, I got interested in trying two Object Based Storage systems that easily scale out and up, and that not only support the typical object storage protocols (S3/Swift) but also NFS/SMB/RBD.

 

Ceph: Widely used and very popular

OpenIO: Much newer, supported by BackBlaze as a storage tier

 

The main purpose I'll be trying them out for is serving NFS and iSCSI storage to my ESXi hosts and to my desktop, where I already host my Steam library on a VHD mounted from an SMB3 share. It goes without saying that I can't currently play any games since my storage is down :P.

 

The existing setup was a Windows Server 2016 TP5 VM with a couple of HBAs passed through, one with 6 Samsung SSDs and the other with 3 3TB WD Reds. I used Storage Spaces with multi-resilient virtual disks to tier the data across the SSDs and HDDs. Performance was excellent, but scaling wasn't so great; what I really wanted was Storage Spaces Direct, but I don't currently have the hardware for it.

 

Storage Spaces benchmark for reference:

  • Read: 2.5GB/s
  • Write: 1.5GB/s
  • IOPS: Forgot to gather (whoops, too late); I have some screenshots from old tests which should be fine

 

Ceph Configuration:

  • 6 VMs: 3 Monitor/Gateway/Metadata nodes and 3 OSD nodes
  • 2 vCPU, 2GB RAM per VM (will change depending on benchmark results)
  • Each OSD node will have 2 virtual SSDs and 1 HDD; the SSD virtual disks are on dedicated SSD datastores and thick eager zeroed (very important for Samsung SSDs)
  • Two pools will be used, cold-data and hot-data
  • The pools will be tiered, with the SSDs used as a hot-data/write-back cache (see the command sketch below)
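
For reference, a rough sketch of how I expect the tiering to be wired up with the standard Ceph CLI (pool names are from the list above; the PG counts and target_max_bytes are placeholder values, not tuned):

# create the base (cold) and cache (hot) pools - pg_num values are placeholders
ceph osd pool create cold-data 128 128 erasure
ceph osd pool create hot-data 128 128 replicated

# put hot-data in front of cold-data as a write-back cache tier
ceph osd tier add cold-data hot-data
ceph osd tier cache-mode hot-data writeback
ceph osd tier set-overlay cold-data hot-data

# the cache tier needs a hit set and a size limit so it knows when to flush/evict
ceph osd pool set hot-data hit_set_type bloom
ceph osd pool set hot-data target_max_bytes 200000000000   # ~200GB, placeholder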

OpenIO Configuration

  • TBD (Not started yet)

 

What I'll be looking at is not only performance but also ease of configuration. I'll update as I progress; future plans are to deploy something at a larger scale. I've already ordered 3 more Intel S5520HCs and 6 Intel L5630s to go along with my existing IBM x3500 M4 and Intel S5520HC/E5520 servers. The current Intel server is off-site at a friend's house, and I'll be using it as a replica for Ceph/OpenIO after the initial testing.

 

@brwainer tagging you since you seemed interested last time I mentioned this.


I should also explain why Object Based Storage is good and how it differs from, say, ZFS. Object Based Storage has the same end-to-end data integrity as ZFS but is a true scale-out parallel system, so as you add storage nodes both capacity and performance increase; with ZFS you either make each node bigger and faster, or deploy multiple nodes and manually balance load across them.

 

I should also point out you can put Object Based Storage on top of ZFS if you so wish.

 

The downside to Object Based Storage is that these systems are not really designed for transactional workloads or for hosting file shares over NFS/SMB. The reason is that while they have very good throughput, tens of GB/s, they don't typically have good I/O response times. This is totally dependent on the Object Based Storage system you are using, however.

 

Object Based Storage systems are great for archive/backup storage, massive data dumps from supercomputers (parallel file systems such as Lustre are more typically used there), or as a way to distribute your data across servers and even cities for data resiliency.

 

Every cloud storage provider uses Object Based Storage and hosting VMs inside of one isn't a new idea either.


 

Ceph

 

So I've probably had enough tinkering with Ceph to make some initial assessments. I must preface this with the fact that my hardware configuration is not ideal and is only suitable for hands-on training. I'm not going to cover anything that's easily found in the many existing install guides, nor is this an install guide.

 

Positives

  • Easy to install/deploy Ceph onto nodes
  • Simple, easy-to-understand configuration, but you can go very deep into it
  • HDD write performance much better than expected

Negatives

  • Not all features are supported on erasure-coded pools, e.g. CephFS (workaround: front the pool with a replicated write-back cache pool)
  • VERY CPU demanding
  • HDD read performance less than expected (non-issue at larger scale)
  • Web interface for configuration and monitoring (Calamari) not yet fully integrated as a single solution

There are some key fundamentals to understand when starting out:

  • CRUSH maps are critical; understand them at all costs
  • Limit the use of pools. Create pools to apply policies to, not for something like a pool per tenant (that leads to too many PGs); see the quick PG math below
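
As a rough rule of thumb (the usual guidance from the Ceph docs), the total PG count to aim for across all pools is:

total PGs ≈ (number of OSDs × 100) / replica count, rounded to a nearby power of two

For my 9 OSDs with the default 3 copies that works out to about 300, so something in the 256-512 range shared across every pool. You can check what a pool ended up with using (pool name is a placeholder):

ceph osd pool get <pool-name> pg_num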

 

Hardware:

Server: IBM x3500 M4

CPU: E5-2620v1

RAM: 128GB

OS: ESXi 6.0

SSD: 2 Samsung 512GB 840 Pro, 4 Samsung 512GB 850 Pro

HDD: 3 Seagate 3TB ST3000NM0025

 

Ceph Configuration:

Spoiler

CL01-CEPHADM01

  • Admin Node
  • Metadata
  • CephFS
  • iSCSITarget

CL01-CEPHMON01

  • Monitor Node

CL01-CEPHOSD01

  • OSD Node
  • 4 vCPU 4GB RAM
  • 2 SSD, 1 HDD
  • OSD per disk

CL01-CEPHOSD02

  • OSD Node
  • 4 vCPU 4GB RAM
  • 2 SSD, 1 HDD
  • OSD per disk

CL01-CEPHOSD03

  • OSD Node
  • 4 vCPU 4GB RAM
  • 2 SSD, 1 HDD
  • OSD per disk

 

Ceph CRUSH Map

Spoiler

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host CL01-CEPHOSD01_SSD {
        id -5           # do not change unnecessarily
        # weight 0.9
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.450
        item osd.1 weight 0.450
}
host CL01-CEPHOSD01_SATA {
        id -6           # do not change unnecessarily
        # weight 1.00
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 1.000
}
host CL01-CEPHOSD01 {
        id -2           # do not change unnecessarily
        # weight 1.900
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD01_SSD weight 0.900
        item CL01-CEPHOSD01_SATA weight 1.000
}
host CL01-CEPHOSD02_SSD {
        id -7           # do not change unnecessarily
        # weight 0.900
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.450
        item osd.3 weight 0.450
}
host CL01-CEPHOSD02_SATA {
        id -8           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.7 weight 1.000
}
host CL01-CEPHOSD02 {
        id -3           # do not change unnecessarily
        # weight 1.900
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD02_SSD weight 0.900
        item CL01-CEPHOSD02_SATA weight 1.000
}
host CL01-CEPHOSD03_SSD {
        id -9           # do not change unnecessarily
        # weight 0.900
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 0.450
        item osd.5 weight 0.450
}
host CL01-CEPHOSD03_SATA {
        id -10           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.8 weight 1.000
}
host CL01-CEPHOSD03 {
        id -4           # do not change unnecessarily
        # weight 1.900
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD03_SSD weight 0.900
        item CL01-CEPHOSD03_SATA weight 1.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 5.700
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD01 weight 1.900
        item CL01-CEPHOSD02 weight 1.900
        item CL01-CEPHOSD03 weight 1.900
}
root ssd {
        id -11           # do not change unnecessarily
        # weight 2.700
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD01_SSD weight 0.900
        item CL01-CEPHOSD02_SSD weight 0.900
        item CL01-CEPHOSD03_SSD weight 0.900
}
root sata {
        id -12           # do not change unnecessarily
        # weight 3.00
        alg straw
        hash 0  # rjenkins1
        item CL01-CEPHOSD01_SATA weight 1.000
        item CL01-CEPHOSD02_SATA weight 1.000
        item CL01-CEPHOSD03_SATA weight 1.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule erasure-code {
        ruleset 1
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}
rule hot-storage {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule cold-storage {
        ruleset 3
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take sata
        step chooseleaf indep 0 type host
        step emit
}

# end crush map
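
The rules above don't do anything on their own; each pool has to be pointed at one. A rough sketch of that hook-up on this release (pool names are the ones I'm using, the numbers are the ruleset IDs from the map; newer Ceph releases call the option crush_rule instead of crush_ruleset):

ceph osd crush rule ls
ceph osd pool set hot-data crush_ruleset 2    # hot-storage rule -> root ssd
ceph osd pool set cold-data crush_ruleset 3   # cold-storage rule -> root sata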

 

Ceph OSD Tree

Spoiler

ID  WEIGHT  TYPE NAME                        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12 3.00000 root sata
 -6 1.00000     host CL01-CEPHOSD01_SATA
  6 1.00000         osd.6                         up  1.00000          1.00000
 -8 1.00000     host CL01-CEPHOSD02_SATA
  7 1.00000         osd.7                         up  1.00000          1.00000
-10 1.00000     host CL01-CEPHOSD03_SATA
  8 1.00000         osd.8                         up  1.00000          1.00000
-11 2.69998 root ssd
 -5 0.89999     host CL01-CEPHOSD01_SSD
  0 0.45000         osd.0                         up  1.00000          1.00000
  1 0.45000         osd.1                         up  1.00000          1.00000
 -7 0.89999     host CL01-CEPHOSD02_SSD
  2 0.45000         osd.2                         up  1.00000          1.00000
  3 0.45000         osd.3                         up  1.00000          1.00000
 -9 0.89999     host CL01-CEPHOSD03_SSD
  4 0.45000         osd.4                         up  1.00000          1.00000
  5 0.45000         osd.5                         up  1.00000          1.00000
 -1 5.69998 root default
 -2 1.89999     host CL01-CEPHOSD01
 -5 0.89999         host CL01-CEPHOSD01_SSD
  0 0.45000             osd.0                     up  1.00000          1.00000
  1 0.45000             osd.1                     up  1.00000          1.00000
 -6 1.00000         host CL01-CEPHOSD01_SATA
  6 1.00000             osd.6                     up  1.00000          1.00000
 -3 1.89999     host CL01-CEPHOSD02
 -7 0.89999         host CL01-CEPHOSD02_SSD
  2 0.45000             osd.2                     up  1.00000          1.00000
  3 0.45000             osd.3                     up  1.00000          1.00000
 -8 1.00000         host CL01-CEPHOSD02_SATA
  7 1.00000             osd.7                     up  1.00000          1.00000
 -4 1.89999     host CL01-CEPHOSD03
 -9 0.89999         host CL01-CEPHOSD03_SSD
  4 0.45000             osd.4                     up  1.00000          1.00000
  5 0.45000             osd.5                     up  1.00000          1.00000
-10 1.00000         host CL01-CEPHOSD03_SATA
  8 1.00000             osd.8                     up  1.00000          1.00000

 

Ceph Benchmarks

Spoiler

Cold-Storage

Write:
Total time run:         300.393191
Total writes made:      12258
Write size:             4198176
Object size:            4198176
Bandwidth (MB/sec):     163.377
Stddev Bandwidth:       87.4009
Max bandwidth (MB/sec): 472.436
Min bandwidth (MB/sec): 0
Average IOPS:           40
Stddev IOPS:            21
Max IOPS:               118
Min IOPS:               0
Average Latency(s):     0.392053
Stddev Latency(s):      0.39109
Max latency(s):         4.83727
Min latency(s):         0.0234059

Read:
Total time run:       300.521918
Total reads made:     10156
Read size:            4198176
Object size:          4198176
Bandwidth (MB/sec):   135.303
Average IOPS          33
Stddev IOPS:          3
Max IOPS:             42
Min IOPS:             24
Average Latency(s):   0.472627
Max latency(s):       7.61403
Min latency(s):       0.0382972


Hot-Storage

Write:
Total time run:         300.635140
Total writes made:      10182
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     135.473
Stddev Bandwidth:       55.0168
Max bandwidth (MB/sec): 332
Min bandwidth (MB/sec): 12
Average IOPS:           33
Stddev IOPS:            13
Max IOPS:               83
Min IOPS:               3
Average Latency(s):     0.472376
Stddev Latency(s):      0.48684
Max latency(s):         2.95393
Min latency(s):         0.0372172

Read:
Total time run:       33.690954
Total reads made:     10182
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1208.87
Average IOPS          302
Stddev IOPS:          12
Max IOPS:             318
Min IOPS:             271
Average Latency(s):   0.0516345
Max latency(s):       0.252252
Min latency(s):       0.0206475


Cached Cold-Storage

Write:
Total time run:         301.283878
Total writes made:      9560
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     126.923
Stddev Bandwidth:       56.6323
Max bandwidth (MB/sec): 272
Min bandwidth (MB/sec): 0
Average IOPS:           31
Stddev IOPS:            14
Max IOPS:               68
Min IOPS:               0
Average Latency(s):     0.504234
Stddev Latency(s):      0.48111
Max latency(s):         3.1037
Min latency(s):         0.0377533

Read:
Total time run:       32.562249
Total reads made:     9560
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1174.37
Average IOPS          293
Stddev IOPS:          17
Max IOPS:             310
Min IOPS:             213
Average Latency(s):   0.0531657
Max latency(s):       0.408675
Min latency(s):       0.0152224

 

The above SSD read test used all host CPU cores at 100%, which is clearly not ideal for me since I also host other VMs on this hardware.
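
For anyone wanting to reproduce the numbers above, they come from the built-in rados bench tool; a rough sketch of the invocations (pool names as above, options from memory so treat them as indicative):

rados bench -p cold-data 300 write --no-cleanup
rados bench -p cold-data 300 seq
rados bench -p hot-data 300 write --no-cleanup
rados bench -p hot-data 300 seq    # the read run finishes early once every object has been read
rados -p cold-data cleanup
rados -p hot-data cleanup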

 

OpenIO

 

-Coming Soon-


Very cool! I was actually thinking of switching to Ceph for my Proxmox hosting cluster. I'm eager to see what results you get!



Following. Definitely interested. Just so I'm clear - right now you are testing with multiple VMs as nodes, and once you've tested both, you will be moving to physical nodes?



23 hours ago, brwainer said:

Following. Definitely interested. Just so I'm clear - right now you are testing with multiple VMs as nodes, and once you've tested both, you will be moving to physical nodes?

It'll likely end up as 'hyper-converged' - god I hate that buzzword. Anyway, each physical server will have a Ceph/OpenIO VM with HBAs passed through, plus a Ceph Monitor/Gateway VM for CephFS and iSCSI.

 

It will actually stay as a single host with multiple VMs for a while (1 OSD VM per physical HDD + 2 SSDs, using erasure coding for disk redundancy), since even though I have ordered the core parts of the extra servers I won't be doing that part for a long time - likely not until I have built my house, which will have a proper server room in it for all my crap. I will be replicating the data offsite to a Ceph VM on my current Intel S5520HC system, async if possible so as not to limit IOPS.


22 minutes ago, leadeater said:

-snip-

So the Ceph or OpenIO nodes need to have some storage for the node itself that isn't part of the storage used for the pool, right? I just went and read some getting started materials and that was the impression I got. So are you going to have those on the same disk as the hypervisor itself?



6 minutes ago, brwainer said:

So the Ceph or OpenIO nodes need to have some storage for the node itself that isn't part of the storage used for the pool, right? I just went and read some getting started materials and that was the impression I got. So are you going to have those on the same disk as the hypervisor itself?

Every SSD (6) and HDD (3) is an ESXi datastore, so each of the 6 VMs' files and system disks lives on its own SSD. On top of that I've created 2x 100GB thick eager zeroed SSD disks and 1x thin 1TB HDD disk for each of the 3 OSD nodes. This means if an SSD dies I lose either a Monitor/Gateway or an entire OSD node, depending on which SSD fails. This is what I'd call borderline safe, and I need to make sure I get my erasure coding right. The short-term end goal is just to get protection similar to RAID 5 plus offsite replication; none of my data is important, plus it's backed up atm anyway.

 

Currently bashing my head against OSD pools and erasure profiles.
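
For anyone curious, the sort of thing I'm wrestling with looks roughly like this (profile name and pg_num are placeholders; k=2, m=1 across the 3 OSD hosts gives RAID 5-style protection, surviving one host failure):

# newer Ceph releases use crush-failure-domain instead of ruleset-failure-domain
ceph osd erasure-code-profile set ec-2-1 k=2 m=1 ruleset-failure-domain=host
ceph osd erasure-code-profile get ec-2-1

# the profile is then passed in when creating the erasure-coded pool
ceph osd pool create cold-data 128 128 erasure ec-2-1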


EMC ScaleIO is another option that works in a similar way to these. However, I personally wasn't pleased with the performance due to some hardware limitations I have in my lab. Once I can get a few more disks, I should be in a good position to try it again.


7 minutes ago, Goksha said:

EMC ScaleIO is another option that works in a similar way to these. However, I personally wasn't pleased with the performance due to some hardware limitations I have in my lab. Once I can get a few more disks, I should be in a good position to try it again.

Yeah, I had a look at ScaleIO, but to me it fails the ease-of-use criteria; it's also block storage only, from my understanding.

 

Edit: Looking at the latest info, configuration has MUCH improved; I'll give it a whirl if I can be bothered.


Just now, leadeater said:

Yeah, I had a look at ScaleIO, but to me it fails the ease-of-use criteria; it's also block storage only, from my understanding.

Yes, it's block only. Great as an alternative to VSAN, but if you want file storage you'd need to mount a volume created in ScaleIO on either a new dedicated system or one of the client machines.

 

I've done some work with Ceph, but was tearing down and rebuilding my lab so often that I never got it to a stable, benchmarkable point. I need to look at OpenIO.


Never heard of either of them.

But dude, whatever works for you probably won't work for me, because life.



Cool topic, following! I've played with Ceph in the past; it would be interesting to see how well it performs using VMware, etc.


Hey friends,

Full disclosure: I work at SwiftStack and made an account because I saw this thread.

 

Shameless Plug:

SwiftStack is built on top of OpenStack Swift and makes it easy to build, deploy, monitor, and manage distributed object storage clusters from a couple of TB to hundreds of PB.

Shameless Plug over:

 

On 9/20/2016 at 7:47 AM, leadeater said:

-snip-

 

While I haven't had a chance to play with OpenIO, I have had a chance to play around with Ceph. And my only point of comparison is against Swift.

 

 

Key Differences

 

Architecture:

Ceph is strongly consistent, with functionality aimed at databases and real-time data, and makes sacrifices in geographic scale and availability.

Swift is eventually consistent, with functionality aimed at storing massive amounts of data at 100% availability. The trade-off is that if you make revisions, you might get an older copy because the data hasn't been distributed throughout the cluster.

 

Scalability:

Ceph can scale large inside a single region; other regions are passive in case the primary region fails.

Swift can scale large and was built with multi-region in mind. All regions are also active

 

Use Case:

Ceph is primarily used for applications that require block storage: VMs, DBs, consistent stuff.

Swift is primarily used for storing large amounts of unstructured data: media, backup images, etc.

 

Ease of Use:

Ceph: deployed and managed by 3rd-party tools and GUIs

Swift: deployed and managed by 3rd-party tools and GUIs

SwiftStack: enterprise-grade software product; deployment and management are all handled through the controller software

 

They both do super different things for different workloads. If you have a chance, I'd like to suggest that you play around with our trial software and see if it fits what you're trying to accomplish.

 

Trial Signup/Download


46 minutes ago, SwiftStackGuy said:

-snip-

Thanks for the input. From my prior look into the options, Swift didn't seem to fit my use case as well as Ceph did, but you already pointed that out. SwiftStack would probably be a better fit at work, as a cloud target for Commvault and for archiving all our unused data off primary storage.

 

We're actually looking at using Ceph + NetApp AltaVault + Commvault, which is why it's the first thing I am trying at home. The big drawcards for Ceph are that it's free and its wide protocol support; however, there is definitely something to be said for paying for a decent management interface and product support.


Minor Ceph update: I had everything configured and working with a cold-storage pool and a hot-storage SSD caching pool, with all the required CRUSH maps etc., but I tore it all down to try a different configuration. Samsung 840/850 Pro SSDs really don't like being ESXi datastores; the write speed per SSD starts off at 250MB/s-300MB/s and slowly drops down to 100MB/s, which is pathetic. I was already aware of this issue, but it happened last time when I configured them in hardware RAID with my LSI 9631-8i, so I assumed it had more to do with the RAID card than with ESXi.

 

A quick RBD benchmark I did showed around 250MB/s-300MB/s write performance observed by the client and ~900MB/s of writes across the cluster; the difference is due to object replication (each client write is written again for every replica).

 

I tried to pass through the HBA with all the SSDs to a Linux VM that I was going to use as the Ceph OSD node for the hot-storage, but this just locked up the VM and the HBA, then caused the ESXi host to panic and reboot. WTF?! Passing the same HBA through to a Windows VM works fine.

 

Anyway, I re-created the ESXi SSD datastores, but this time with a 50GB unpartitioned over-provisioning buffer to give garbage collection more headroom; it has stabilized the performance a reasonable amount.

 

Edit: Found a workaround for the HBA passthrough issue: http://www.vm-help.com/esx40i/SATA_RDMs.php - fake RDMs using local disks.
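
The gist of that workaround, for anyone who doesn't want to read the link: create a raw device mapping pointer file for each local disk with vmkfstools and attach it to the OSD VM as an existing disk. Roughly (device and datastore names here are made-up placeholders):

ls -l /vmfs/devices/disks/    # find the local SSD's device identifier
vmkfstools -z /vmfs/devices/disks/<local-ssd-id> /vmfs/volumes/<datastore>/rdm/ssd1-rdm.vmdk
# -z creates a physical compatibility mode RDM; -r would give a virtual mode RDM
# then add ssd1-rdm.vmdk to the OSD VM as an existing hard disk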


All outside of my experience and extremely interesting, will be following.

 

What version of ESXi are you using, 5.5/6.0? I've tried a couple of SSDs with ESXi and haven't had much success either (sub-par speeds), both attached to the motherboard and to my IBM M1015 (just good old trial and error). It's one of my motivations to buy 10Gb cards, so I can stick them in my FreeNAS system in hopes of improvement.


7 minutes ago, Mikensan said:

All outside of my experience and extremely interesting, will be following.

 

What version of ESXi are you using, 5.5/6.0? I've tried a couple of SSDs with ESXi and haven't had much success either (sub-par speeds), both attached to the motherboard and to my IBM M1015 (just good old trial and error). It's one of my motivations to buy 10Gb cards, so I can stick them in my FreeNAS system in hopes of improvement.

ESXi 6.

 

The core of the issue comes from ESXi not supporting TRIM, so consumer SSDs really struggle without it; the current best mitigation is to create the datastore 10%-20% smaller than the disk size to give the flash controller some headroom for GC. This is one of the main differences between server- and desktop-class SSDs: the server ones have much more flash reserved for GC.

 

Edit: You'll see a desktop SSD at 512GB and the equivalent server one at 480GB for reference.


1 minute ago, leadeater said:

ESXi 6.

 

The core of the issue comes from ESXi not supporting TRIM, so consumer SSDs really struggle without it; the current best mitigation is to create the datastore 10%-20% smaller than the disk size to give the flash controller some headroom for GC. This is one of the main differences between server- and desktop-class SSDs: the server ones have much more flash reserved for GC.

GC - garbage collection? I'd definitely enjoy an SLC enterprise SSD, but those price tags >.< . At the moment I have an SSD in my ESXi box just for swap files. There are only 1-2 servers I'd like to dedicate the SSD to; hoping 10Gb + FreeNAS will be a good workaround.

 

Aside from that, I would love to work with enterprise-level storage infrastructure. Maybe my next job will give me the opportunity ^_^.


OSD Nodes: 3

OSDs: 6

Copies: 2 (set per pool, see below)

SSD Disks: 2x 840 Pro, 4x 850 Pro
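
Replica count is just a per-pool setting in Ceph, so the 2 copies above amount to nothing more than (pool name is a placeholder):

ceph osd pool set bench-pool size 2
ceph osd pool set bench-pool min_size 1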

 

Very rough RADOS write benchmark:

Total time run:         200.627719
Total writes made:      8090
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     161.294
Stddev Bandwidth:       60.8764
Max bandwidth (MB/sec): 304
Min bandwidth (MB/sec): 20
Average IOPS:           40
Stddev IOPS:            15
Max IOPS:               76
Min IOPS:               5
Average Latency(s):     0.396649
Stddev Latency(s):      0.374931
Max latency(s):         2.42652
Min latency(s):         0.0338184
 

 

Very rough RADOS read benchmark:

Total time run:       51.233578
Total reads made:     15329
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1196.79
Average IOPS          299
Stddev IOPS:          10
Max IOPS:             317
Min IOPS:             276
Average Latency(s):   0.0521512
Max latency(s):       0.203764
Min latency(s):       0.0215344

 

The above read bench maxed out all 6 cores on the ESXi host, so that's as fast as it can go without finding a way to lower the CPU usage.


I'll probably keep an eye on this out of interest. I'm currently contemplating investigating Ceph, as I'm fighting with Gluster at the moment. Not really Gluster's fault per se; I think it's an OS bug.

 

Anyways good work!



2 minutes ago, Eniqmatic said:

I'll probably keep an eye on this out of interest. I'm currently contemplating investigating Ceph, as I'm fighting with Gluster at the moment. Not really Gluster's fault per se; I think it's an OS bug.

 

Anyways good work!

Yeah, I was also considering Gluster; what's its CPU demand like?


Just now, leadeater said:

Yeah, I was also considering Gluster; what's its CPU demand like?

Seems to be fine, although truth be told I haven't looked into it deeply enough yet, as I'm having an issue mounting on boot. Manual mounting works, so fortunately I can still test Gluster.

 

I did test the performance of small files, as that is important to me (I eventually want to use it to complete my web cluster), and surprisingly got very good results, as I know that has been its downfall in the past. Having said that, the VMs are on the same network, so it should be decent performance - good enough for what I want to do. I can check the CPU usage for you soon if you're interested? What's the hardware you're running on where you're seeing high CPU usage? That might put me off it, as I probably have less powerful hardware than you at a guess, so that may limit me.

 

The very brief tests I ran were to generate 1000 x 1MB files, 10240 x 100KB files and 1 x 1GB file, and to check how long the second node took to update. I'd be interested to know how Ceph handles those quick tests!
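
Something like this is all it takes to generate that test set on the mounted volume (a rough sketch; paths and file names are placeholders):

cd /mnt/gluster-test
for i in $(seq 1 1000); do dd if=/dev/zero of=small_1m_$i bs=1M count=1 status=none; done
for i in $(seq 1 10240); do dd if=/dev/zero of=tiny_100k_$i bs=100K count=1 status=none; done
dd if=/dev/zero of=big_1g bs=1M count=1024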

 

 



5 minutes ago, Eniqmatic said:

-snip-

 

 

A single E5-2620v1; spreading the load over multiple nodes is the answer, but I was surprised just how much CPU it was using. To be fair, I was reading at 1.2GB/s off the hot-storage pool of SSDs. I was also getting terrible write speed to the same SSDs, though I'm blaming ESXi + Samsung for that. I have no doubt that if I installed directly onto the hardware rather than using VMs the odd behaviour wouldn't be there.

 

Sustained long-term write performance to the HDDs was actually better than to my SSDs.

 

I used the inbuilt Ceph/RADOS/RBD benchmark tools, plus I mounted an NFS export on the host and created a virtual disk for a Windows VM, along with presenting an iSCSI RBD to the same VM, and ran some HDTune/ATTO tests, which both gave some really weird results - again, I'm blaming the SSD write issues.

 

I really do like Ceph, but I think I'll have to wait for my extra hardware to arrive to properly performance test it; it is something I think I will continue using at a later time.

 

Going to try Ceph again tomorrow using Hyper-V to see if my hardware setup is happier with that.


1 minute ago, leadeater said:

-snip-

Indeed. ESXi has some terribly infuriating behaviour sometimes, but I had to start using it at home because Hyper-V had terrible support from Microsoft for certain Linux distributions that I wanted/needed to use (big surprise), and their documentation of the LIS was terrible. That was about a year or so ago though, so it might have changed.

 

Your use case and performance expectations, and therefore your hardware, are much higher than mine, so I don't think I would really be able to provide you with relevant stats on Gluster, unfortunately. I will say, however, that setup is extremely easy.


