ceph barcelona-v-1.2
TRANSCRIPT
Best practices & Performance Tuning OpenStack Cloud Storage with Ceph
OpenStack Summit Barcelona, 25th Oct 2016, 17:05 - 17:45
Room: 118-119
Swami Reddy (RJIL) - OpenStack & Ceph Dev
Pandiyan M (RJIL) - OpenStack Dev
Who are we?
Agenda
• Ceph - Quick Overview
• OpenStack Ceph Integration
• OpenStack - Recommendations.
• Ceph - Recommendations.
• Q & A
• References
Cloud Environment Details
Cloud env with 200 nodes for general-purpose use cases: ~2500 VMs - 40 TB RAM and 5120 cores - on 4 PB storage.
• Average boot volume sizes:
  o Linux VMs - 20 GB
  o Windows VMs - 100 GB
• Average data volume size: 200 GB

Compute (~160 nodes)
• CPU : 2 * 16 @ 2.60 GHz
• RAM : 256 GB
• HDD : 3.6 TB (OS Drive)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Overprovision: CPU - 1:8, RAM - 1:1

Storage (~44 nodes)
• CPU : 2 * 12 @ 2.50 GHz
• RAM : 128 GB
• HDD : 2 * 1 TB (OS Drive)
• OSD : 22 * 3.6 TB
• SSD : 2 * 800 GB (Intel S3700)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Replication : 3
Ceph - Quick Overview
Ceph Overview
Design Goals
• Every component must scale
• No single point of failure
• Open source
• Runs on commodity hardware
• Everything must self-manage

Key Benefits
• Multi-node striping and redundancy
• COW cloning of images to volumes
• Live migration of Ceph-backed VMs
OpenStack - Ceph Integration
[Diagram: OpenStack services (Cinder, Glance, Nova via the Qemu/KVM hypervisor, and Swift) reach the Ceph storage cluster (RADOS) through RBD and RGW.]
OpenStack - Ceph Integration
OpenStack Block storage - RBD flow:
• libvirt
• QEMU
• librbd
• librados
• OSDs and MONs

OpenStack Object storage - RGW flow:
• S3/SWIFT APIs
• RGW
• librados
• OSDs and MONs
[Diagram: OpenStack configures libvirt, which drives QEMU -> librbd -> librados -> RADOS; the S3- and Swift-compatible APIs go through radosgw -> librados -> RADOS.]
OpenStack - Recommendations
Glance Recommendations
• What is Glance?
• Configuration settings: /etc/glance/glance-api.conf
• Use Ceph RBD as the Glance store:
    default_store = rbd
• When booting from volumes, disable the local image cache:
    flavor = keystone        (instead of flavor = keystone+cachemanagement)
• Exposing the image URL saves time, as the image download and copy are NOT required:
    show_image_direct_url = True
    show_multiple_locations = True
# glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
+------------------+-----------------------------------------------------------------+
| Property         | Value                                                           |
+------------------+-----------------------------------------------------------------+
| checksum         | 24bc1b62a77389c083ac7812a08333f2                                |
| container_format | bare                                                            |
| created_at       | 2016-04-19T05:56:46Z                                            |
| description      | Image Updated on 18th April 2016                                |
| direct_url       | rbd://8a0021e6-3788-4cb3-8ada-                                  |
|                  | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap   |
| disk_format      | raw                                                             |
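Pulling the settings above together, a minimal sketch of the rbd-related section of /etc/glance/glance-api.conf (the pool and cephx user names are assumptions; adjust to your deployment):

    [DEFAULT]
    show_image_direct_url = True
    show_multiple_locations = True

    [glance_store]
    stores = rbd
    default_store = rbd
    rbd_store_pool = images              # assumed pool name
    rbd_store_user = glance              # assumed cephx user
    rbd_store_ceph_conf = /etc/ceph/ceph.conf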
Glance Recommendations
Image Format: Use ONLY RAW Images
With QCOW2 images:
• Convert qcow2 to a RAW image
• Get the image UUID

With RAW images (no conversion; saves time):
• Get the image UUID
Image Size (in GB)   Format   VM Boot Time (Approx.)
50 (Windows)         QCOW2    ~45 minutes
50 (Windows)         RAW      ~1 minute
6 (Linux)            QCOW2    ~2 minutes
6 (Linux)            RAW      ~1 minute
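As a sketch of the conversion step, using qemu-img and the glance CLI (file and image names here are placeholders):

    # qemu-img convert -f qcow2 -O raw ubuntu.qcow2 ubuntu.raw
    # glance image-create --name ubuntu-raw --disk-format raw --container-format bare --file ubuntu.raw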
Cinder Recommendations
• What is Cinder?
• Configuration settings: /etc/cinder/cinder.conf
• Enable Ceph as a backend:
    enabled_backends = ceph
• Cinder Backup: Ceph supports incremental backup
    backup_driver = cinder.backup.drivers.ceph
    backup_ceph_conf = /etc/ceph/ceph.conf
    backup_ceph_user = cinder
    backup_ceph_chunk_size = 134217728
    backup_ceph_pool = backups
    backup_ceph_stripe_unit = 0
    backup_ceph_stripe_count = 0
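To complete the backend enabled above, a minimal sketch of the matching RBD backend section in /etc/cinder/cinder.conf (pool name, cephx user and secret UUID are assumptions):

    [ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    volume_backend_name = ceph
    rbd_pool = volumes                   # assumed pool name
    rbd_user = cinder                    # assumed cephx user
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_secret_uuid = <libvirt secret UUID>
    rbd_flatten_volume_from_snapshot = false
    rbd_max_clone_depth = 5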
Nova Recommendations
• What is Nova?
• Configuration settings: /etc/nova/nova.conf
• Use the userspace librbd/librados path (instead of krbd).

[libvirt]
# enable discard support (be careful of perf)
hw_disk_discard = unmap
# disable password injection
inject_password = false
# disable key injection
inject_key = false
# disable partition injection
inject_partition = -2
# make QEMU aware so caching works
disk_cachemodes = "network=writeback"
live_migration_flag = "VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
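The [libvirt] block above is usually accompanied by the RBD ephemeral-disk settings; a minimal sketch (pool name, cephx user and secret UUID are assumptions):

    [libvirt]
    images_type = rbd
    images_rbd_pool = vms                # assumed pool name
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder                    # assumed cephx user
    rbd_secret_uuid = <libvirt secret UUID>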
Ceph - Recommendations
Performance Decision Factors
• What is the required storage capacity (usable/raw)?
• How many IOPS?
  • Aggregated
  • Per VM (min/max)
• Optimize for?
  • Performance
  • Cost
Ceph Cluster Optimization Criteria
IOPS-Optimized
  Properties: lowest cost per IOPS; highest IOPS; meets minimum fault-domain recommendation
  Sample use cases: typically block storage; 3x replication

Throughput-Optimized
  Properties: lowest cost per given unit of throughput; highest throughput; highest throughput per BTU; highest throughput per watt; meets minimum fault-domain recommendation
  Sample use cases: block or object storage; 3x replication for higher read throughput

Capacity-Optimized
  Properties: lowest cost per TB; lowest BTU per TB; lowest watt per TB; meets minimum fault-domain recommendation
  Sample use cases: typically object storage; erasure coding common for maximizing usable capacity
OSD Considerations
• RAM
  o 1 GB of RAM per 1 TB of OSD space
• CPU
  o 0.5 CPU cores / 1 GHz of a core per OSD (2 cores for SSD drives)
• Ceph-mons
  o 1 ceph-mon node per 15-20 OSD nodes
• Network
  o The sum of the total throughput of your OSD hard disks should not exceed the network bandwidth
• Thread count
  o High numbers of OSDs (e.g., > 20) may spawn a lot of threads during recovery and rebalancing
[Diagram: a single host running multiple OSD daemons (OSD.1 - OSD.6)]
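As a rough, illustrative check of these rules against the storage node spec shown earlier: 22 OSDs x 3.6 TB ≈ 79 TB of OSD space per node, i.e. roughly 80 GB of RAM for the OSDs (the 128 GB nodes leave headroom for the OS and page cache), and 22 x 0.5 ≈ 11 CPU cores (covered by the 2 x 12-core CPUs).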
Ceph OSD Journal
• Run operating systems, OSD data and OSD journals on separate drives to maximize overall throughput.
• On-disk journals can halve write throughput.
• Use SSD journals for high write-throughput workloads.
• Performance comparison with/without SSD journal using rados bench:
  o 100% write operations with 4 MB object size (default):
    On-disk journal: 45 MB/s; SSD journal: 80 MB/s
• Note: the above results were measured with a 1:11 SSD:OSD ratio.
• Recommended: use 1 SSD per 4-6 OSDs for better results.
Op Type            No SSD   SSD
Write (MB/s)       45       80
Seq Read (MB/s)    73       140
Rand Read (MB/s)   55       655
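A sketch of the rados bench runs behind a comparison like the one above (the pool name is an assumption; run each test against the same pool):

    # rados bench -p testpool 60 write --no-cleanup   # 60-second write test, keep objects for read tests
    # rados bench -p testpool 60 seq                  # sequential read test
    # rados bench -p testpool 60 rand                 # random read test
    # rados -p testpool cleanup                       # remove the benchmark objects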
OS Considerations
• Kernel: use the latest stable release.
• BIOS: enable HT (Hyper-Threading) and VT (Virtualization Technology).
• Kernel PID max:
    # echo "4194303" > /proc/sys/kernel/pid_max
• Read-ahead: set on all block devices:
    # echo "8192" > /sys/block/sda/queue/read_ahead_kb
• Swappiness:
    # echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
• Disable NUMA balancing: pass numa_balancing=disable to the kernel, or control it via the kernel.numa_balancing sysctl:
    # echo 0 > /proc/sys/kernel/numa_balancing
• CPU tuning: set the "performance" governor so the CPU always runs at 100% frequency:
    # echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
• I/O scheduler:
    SATA/SAS drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
    SSD drives:      # echo "noop"     > /sys/block/sd[x]/queue/scheduler
Ceph Deployment Network
• Each host should have at least two 1 Gbps network interface controllers (NICs).
• Use 10G Ethernet.
• Always use jumbo frames.
• Use high-bandwidth connectivity between TOR switches and spine routers, for example 40 Gbps to 100 Gbps.
• Hardware should have a Baseboard Management Controller (BMC).
• Note: Running three networks in HA mode may seem like overkill.
[Diagram: public network and cluster network carried on separate NICs (NIC-1, NIC-2)]

# ifconfig ethX mtu 9000
# echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethX
Ceph Deployment Network
• NIC bonding - in balance-alb mode, both NICs are used to send and receive traffic.
• Test results with 2 x 10G NICs (see the bond config sketch below):
  • Active-passive bond mode, traffic between 2 nodes:
    Case #1: node-1 to node-2 => BW 4.80 Gb/s
    Case #2: node-1 to node-2 => BW 4.62 Gb/s
    • Speed of one 10Gig NIC.
  • Balance-alb bond mode:
    Case #1: node-1 to node-2 => BW 8.18 Gb/s
    Case #2: node-1 to node-2 => BW 8.37 Gb/s
    • Speed of two 10Gig NICs.
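A minimal sketch of a balance-alb bond using RHEL/CentOS-style network scripts (interface names, IP address and netmask are assumptions):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=balance-alb miimon=100"
    BOOTPROTO=static
    IPADDR=192.168.1.10        # assumed address
    NETMASK=255.255.255.0
    MTU=9000
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes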
Ceph Failure Domains
• A failure domain is any failure that prevents access to one or more OSDs. Weigh the added cost of isolating every potential failure domain against your resiliency requirements.
Failure domains:
• osd
• host
• chassis
• rack
• row
• pdu
• pod
• room
• datacenter
• region
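As a sketch, a CRUSH rule that spreads replicas across racks instead of hosts (rule and pool names are assumptions, and the cluster must already have OSD hosts placed under rack buckets):

    # ceph osd crush rule create-simple replicated_rack default rack
    # ceph osd pool set volumes crush_ruleset 1     # option is named 'crush_rule' on newer releases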
Ceph Ops Recommendations
• Scrub and deep-scrub operations are very I/O consuming and can affect cluster performance.
  o Disable scrub and deep scrub:
    # ceph osd set noscrub
    set noscrub
    # ceph osd set nodeep-scrub
    set nodeep-scrub
  o After setting noscrub and nodeep-scrub, ceph health moves to the WARN state:
    # ceph health
    HEALTH_WARN noscrub, nodeep-scrub flag(s) set
  o Re-enable scrub and deep scrub:
    # ceph osd unset noscrub
    unset noscrub
    # ceph osd unset nodeep-scrub
    unset nodeep-scrub
  o Configure scrub and deep scrub:
    osd_scrub_begin_hour = 0           # begin at this hour
    osd_scrub_end_hour = 24            # end of the allowed scrub window
    osd_scrub_load_threshold = 0.05    # scrub only below this load
    osd_scrub_min_interval = 86400     # not more often than 1 day
    osd_scrub_max_interval = 604800    # not less often than 1 week
    osd_deep_scrub_interval = 604800   # deep-scrub once a week
Ceph Ops Recommendations
• Decreasing recovery and backfilling performance impact.
• Settings for recovery and backfilling:
  'osd max backfills'        - maximum backfills allowed to/from an OSD [default 10]
  'osd recovery max active'  - recovery requests per OSD at one time [default 15]
  'osd recovery threads'     - the number of threads for recovering data [default 1]
  'osd recovery op priority' - priority for recovery ops [default 10]
• Note: Decreasing these values slows down and prolongs the recovery/backfill process but reduces the impact on client I/O; increasing them speeds up recovery/backfill at the cost of client performance.
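A sketch of throttling these options at runtime on a running cluster (the values are illustrative, chosen to favour client I/O over recovery speed):

    # ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # and persisted in ceph.conf:
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1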
Ceph Performance Measurement Guidelines
For best measurement results, follow these rules while testing:
• Change one option at a time.
• Check what is changing.
• Choose the right performance test for the changed option.
• Re-test the changes - at least ten times.
• Run tests for hours, not seconds.
• Trace for any errors.
• Look at the results critically.
• Always try to estimate results and look at the standard deviation to eliminate spikes and false tests.

Tuning:
• Ceph clusters can be parametrized after deployment to better fit the requirements of the workload.
• Some configuration options can affect data redundancy and have significant implications for the stability and safety of data.
• Tuning should be performed on a test environment prior to issuing any commands or configuration changes on production.
Any questions?
Thank You
Swami Reddy | [email protected] | swamireddy @ irc
Satish | [email protected] | satish @ irc
Pandiyan M | [email protected] | maestropandy @ irc
Reference Links
• Ceph documentation
• Previous OpenStack Summit presentations
• Tech Talk: Ceph
• A few blogs on Ceph:
  • https://www.sebastien-han.fr/blog/categories/ceph/
  • https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
Appendix
Ceph H/W Best Practices
OSD host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per 1 TB of OSD storage

MDS host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per daemon

MON host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per daemon
HDD, SDD, Controllers
• Ceph best practice is to run operating systems, OSD data and OSD journals on separate drives.

Hard Disk Drives (HDD)
• Minimum hard disk drive size of 1 terabyte.
• ~1 GB of RAM for 1 TB of storage space.
• NOTE: It is NOT a good idea to run:
  1. multiple OSDs on a single disk.
  2. an OSD and a monitor or metadata server on a single disk.

Solid State Drives (SSD)
• Use SSDs to improve performance.

Controllers
• Disk controllers also have a significant impact on write throughput.
Ceph OSD Journal - Results
• [Chart: write operations, with and without SSD journal]
• [Chart: sequential read operations, with and without SSD journal]
• [Chart: random read operations, with and without SSD journal]