Ceph – Distributed Software Defined Storage Part 2

By: Steve Horan

In this blog, we will continue to build on what we covered in Part 1. An additional OSD host has been added to our mini cluster, so here is where the configuration stands now. Please keep in mind this is essentially the minimal configuration required to run a Ceph cluster; in production, nothing less than 3 monitor servers and 7 OSD hosts is recommended. Scaling out before scaling up is critical to limiting the impact of any single failure domain in the infancy of a new cluster. While your CAPEX may increase (though not in comparison to traditional storage), costs quickly come down as you scale up, since that only requires adding commodity hard drives. Also, while ARM is growing rapidly and is supported for running Ceph, Raspberry Pis are not in any sense adequate for a cluster under real workload; they are used here simply for demonstration purposes.
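A quick way to see the current layout from the command line (run from any node with an admin keyring) is:

ceph -s          #overall health, monitor quorum and OSD up/in counts
ceph osd tree    #how the OSDs map onto hosts in the CRUSH hierarchy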

To kick off a deeper dive into Ceph, I would like to backtrack to disk creation and how writes to a replicated cluster are performed. By default, every object stored in the cluster is replicated 3 times. Once the primary OSD receives a write request from a client, it is responsible for pushing that write to 2 other OSDs. These transactions only require that the data be written to the journal of each OSD, at which point the write acknowledgment is sent back to the client. Write-ahead journals are very common for making transactions atomic, but you can see that waiting on 3 write acknowledgments can create serious bottlenecks on slow spinning disks. Fortunately, there is a solution.
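Once a pool exists (we will create one shortly), you can actually see this mapping: ceph osd map reports the placement group and the set of OSDs that would service a given object name (the object name below is just a hypothetical example).

ceph osd map testpool some-object
#the "acting" set in the output lists the primary OSD followed by its replicas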


It is almost universally recommended that journals be offloaded onto SSDs. Typically, the SSD is partitioned 4 ways and provides journals for 4 OSDs. BlueStore will even let you separate the metadata and write-ahead log from the slow spinning disks for increased performance. Separating these is as simple as the following command (sda=HDD, sdb=SSD/NVRAM):

ceph-deploy osd prepare --bluestore cephpi03:sda:sdb1
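If you want to confirm where the data and journal/DB devices actually landed, a couple of checks work well with the ceph-deploy-era tooling used in this post (OSD id 0 below is just an example):

#on the OSD host: list disks/partitions and the role each one plays
sudo ceph-disk list
#from an admin node: dump an OSD's metadata, which includes its backing device paths
ceph osd metadata 0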

Now we can look at pools. Pools are just a way of organizing data in a cluster. These pools are highly configurable, from the number of placement groups to replicas to erasure coding. Placement groups themselves are simply collections of objects. Managing millions of objects and how they are replicated would be a daunting task, so grouping those objects, managing just the groups, and allowing CRUSH to determine where to place our data is far more efficient.

When creating a pool, it is very important to calculate the correct number of placement groups so that data is spread across our disks as efficiently as possible. The formula for this calculation is as follows: (# of OSDs * 100) / number of replicas, then rounding that result UP to the nearest power of 2.

I currently have a cluster containing 3 OSDs and am going to create a pool with 2 replicas. Based on that, (3 * 100) / 2 = 150, which rounds up to 256, so our pool would be optimized with 256 placement groups.
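As a quick sanity check of that math, a throwaway shell snippet (purely illustrative) gives the same answer:

#(OSDs * 100) / replicas, rounded up to the next power of 2
osds=3; replicas=2
raw=$(( (osds * 100) / replicas ))
pgs=1; while [ "$pgs" -lt "$raw" ]; do pgs=$(( pgs * 2 )); done
echo "$pgs"    #prints 256 for 3 OSDs and 2 replicas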

Tip: Running “ceph -w” in another terminal is a great way to troubleshoot and see things happening live in the cluster.

#The second 256 sets the number of placement groups for placement (pgp_num), which in general should match the number of placement groups
cephadm@cephmon01:~$ ceph osd pool create testpool 256 256 replicated
pool 'testpool' created
cephadm@cephmon01:~$ ceph osd pool set testpool size 2
set pool 4 size to 2
cephadm@cephmon01:~$ ceph osd pool ls
testpool
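It never hurts to verify that the settings actually took effect:

cephadm@cephmon01:~$ ceph osd pool get testpool size
cephadm@cephmon01:~$ ceph osd pool get testpool pg_num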

You can actually see how the PGs (placement groups, from here on) are placed in the cluster by running the following command:

ceph pg dump
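The full dump is fairly verbose. To inspect a single PG instead, ceph pg map shows its up and acting OSD sets (the pool id 4 below comes from the "set pool 4 size to 2" output above):

ceph pg map 4.0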

Now that we have a pool to place objects in, I am going to use a client to create an RBD (RADOS Block Device) image. Clients will need the cephadm account configured, and the ceph.conf and keyrings pushed to them using the ceph-deploy admin command, as shown below.
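For reference, that push is a one-liner run from the deploy node, after which the keyring needs to be readable by the cephadm user on the client (hostnames match the ones used in this post):

#from the deploy/admin node: copy ceph.conf and the admin keyring to the client
ceph-deploy admin rbdclient
#on the client:
sudo chmod +r /etc/ceph/ceph.client.admin.keyring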

cephadm@rbdclient:~$ rbd -p testpool create testrbd --size 1G
cephadm@rbdclient:~$ rbd -p testpool ls
testrbd
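rbd info is a handy way to double-check the image you just created (its size, object size, and enabled features):

cephadm@rbdclient:~$ rbd -p testpool info testrbd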

Now that we have created the image, we can map it to our client at which point we can treat this as a normal block device.

cephadm@rbdclient:~$ sudo rbd -p testpool map testrbd
/dev/rbd1
cephadm@rbdclient:~$ sudo mkdir -p /opt/rbdtest
cephadm@rbdclient:~$ sudo mkfs.xfs /dev/rbd1
cephadm@rbdclient:~$ sudo mount /dev/rbd1 /opt/rbdtest
cephadm@rbdclient:~$ sudo chown -R cephadm:cephadm /opt/rbdtest
cephadm@rbdclient:~$ touch /opt/rbdtest/test.txt
cephadm@rbdclient:~$ ls /opt/rbdtest
test.txt
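At this point it behaves like any other filesystem; df shows the mounted capacity, and on newer releases rbd du shows how much of the thin-provisioned image is actually in use:

cephadm@rbdclient:~$ df -h /opt/rbdtest
cephadm@rbdclient:~$ rbd -p testpool du testrbd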

Block storage really is that easy to create and map. These images support everything from snapshots to copy-on-write clones, which allows you to spin up many virtual machines almost instantly from a point-in-time snapshot. Because they are copy-on-write clones, each new image starts out as little more than a single 4MB object; only deviations from the parent snapshot are stored. This provides massive storage savings in, for example, a dev environment.
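As a rough sketch of that workflow (the names goldimage and vm01disk are purely illustrative, and the image must have the layering feature enabled), a clone is created by snapshotting the parent, protecting the snapshot, and cloning from it:

cephadm@rbdclient:~$ rbd -p testpool snap create testrbd@goldimage
cephadm@rbdclient:~$ rbd -p testpool snap protect testrbd@goldimage
cephadm@rbdclient:~$ rbd clone testpool/testrbd@goldimage testpool/vm01disk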

Benchmarking

To wrap up this entry, it would not be complete without steps to benchmark your cluster. Ceph has some very neat features built in, as well as fio support. iPerf is your friend: Ceph is extremely network I/O intensive, and if you consider that each write is replicated 2 more times over Ethernet, the wrong configuration can be devastating. Production clusters typically require a 10G cluster network, preferably using 802.3ad (LACP), along with jumbo frames enabled (the default object size is 4MB, far larger than the default 1500-byte MTU).
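Before blaming Ceph for poor numbers, it is worth sanity-checking the network itself. Something along these lines (assuming iperf3 is installed, and reusing an OSD hostname from this post) verifies raw throughput and that jumbo frames actually pass end to end:

#on one OSD host, start a listener
iperf3 -s
#from another host, test throughput to it
iperf3 -c cephpi03
#confirm jumbo frames survive the path (8972 bytes = 9000 MTU minus IP/ICMP headers)
ping -M do -s 8972 cephpi03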

From a disk perspective all of the normal benchmarking methods apply.

Again, these are Raspberry Pis. Do not let these numbers reflect in any way on the actual performance of the product.

cephadm@rbdclient:/opt/rbdtest$ dd if=/dev/zero of=500M.out bs=500M count=1 oflag=direct
1+0 records in
1+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 44.5441 s, 11.8 MB/s

My personal favorite, fio, offers an RBD engine as well. Below is a default template you can use; standard fio rules apply. The only package required is librbd (on some distros, librbd-dev).

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
#rbd engine requires an fio build with librbd support
ioengine=rbd
#cephx user (client.admin) plus the pool and image created earlier
clientname=admin
pool=testpool
rbdname=testrbd
invalidate=0    # mandatory
rw=write
bs=4k

[rbd_iodepth32]
iodepth=32

cephadm@rbdclient:~$ fio rbd.fio

RADOS also has some neat benchmarking tools built in that work out of the box. Add -t to change the number of threads (default 16) and -b to change the object size (default 4MB).

#write with --no-cleanup so the seq and rand read benchmarks below have objects to read
cephadm@rbdclient:~$ rados -p testpool bench 10 write --no-cleanup
cephadm@rbdclient:~$ rados -p testpool bench 10 seq
cephadm@rbdclient:~$ rados -p testpool bench 10 rand
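Because the write bench was left uncleaned for the read tests, remember to remove the benchmark objects when you are done:

cephadm@rbdclient:~$ rados -p testpool cleanup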

Finally, you can also benchmark the individual RBD.

cephadm@rbdclient:~$ rbd bench-write -p testpool testrbd

Reading the output of some of these tools obviously has its drawbacks. Tracking this data for future benchmarking can also present a problem, as can human error. Fortunately, Galileo tracks RBD block devices for us. Running a Galileo agent across all of our OSD nodes, monitors, and clients allows us to quickly find bottlenecks. Ceph does introduce a lot of moving parts, which is potentially its drawback: without proper monitoring in place, troubleshooting large Ceph implementations can be a nightmare. Before even considering a migration to a software-defined storage solution, proper monitoring must be in place.


Ceph provides unified storage for object, block, and file. SUSE brings iSCSI to the enterprise with its iSCSI gateway, which allows for datastores to VMware and disks to Windows, and Ceph also provides a distributed file system in CephFS. I plan on demonstrating the true power of Ceph in a final blog entry, which will cover S3 integration with the RADOS Gateway, creating your own CRUSH map to make your system more resilient based on physical layout, and hopefully metadata servers and CephFS, along with a brief discussion of erasure coding. As always, feel free to comment or email me with any questions or concerns.