ceph barcelona-v-1.2
TRANSCRIPT
Best practices & Performance Tuning OpenStack Cloud Storage with Ceph
OpenStack Summit Barcelona, 25th Oct 2016, 17:05 - 17:45
Room: 118-119
Swami Reddy (RJIL) - OpenStack & Ceph Dev
Pandiyan M (RJIL) - OpenStack Dev
Who are we?
Agenda
• Ceph - Quick Overview
• OpenStack Ceph Integration
• OpenStack - Recommendations.
• Ceph - Recommendations.
• Q & A
• References
Cloud Environment Details
Cloud env with 200 nodes for general-purpose use cases: ~2500 VMs - 40 TB RAM and 5120 cores - on 4 PB storage.
• Average boot volume sizes:
  o Linux VMs - 20 GB
  o Windows VMs - 100 GB
• Average data volume size: 200 GB

Compute (~160 nodes)
• CPU : 2 * 16 @ 2.60 GHz
• RAM : 256 GB
• HDD : 3.6 TB (OS Drive)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Overprovision: CPU - 1:8, RAM - 1:1

Storage (~44 nodes)
• CPU : 2 * 12 @ 2.50 GHz
• RAM : 128 GB
• HDD : 2 * 1 TB (OS Drive)
• OSD : 22 * 3.6 TB
• SSD : 2 * 800 GB (Intel S3700)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Replication : 3
Ceph - Quick Overview
Ceph Overview
Design Goals
• Every component must scale
• No single point of failure
• Open source
• Runs on commodity hardware
• Everything must self-manage

Key Benefits
• Multi-node striping and redundancy
• COW cloning of images to volumes
• Live migration of Ceph-backed VMs
OpenStack - Ceph Integration
[Diagram: OpenStack services (Cinder, Glance, Nova via the Qemu/KVM hypervisor, and Swift) reach the Ceph storage cluster (RADOS) through RBD and RGW.]
OpenStack - Ceph Integration
OpenStack Block storage - RBD flow:
• libvirt
• QEMU
• librbd
• librados
• OSDs and MONs

OpenStack Object storage - RGW flow:
• S3/SWIFT APIs
• RGW
• librados
• OSDs and MONs
[Diagram: OpenStack configures libvirt, which drives QEMU -> librbd -> librados -> RADOS; the S3- and Swift-compatible APIs go through radosgw -> librados -> RADOS.]
OpenStack - Recommendations
Glance Recommendations
• What is Glance?
• Configuration settings: /etc/glance/glance-api.conf
• Use Ceph RBD as the Glance store:
    default_store = rbd
• When booting from volumes, disable the local image cache:
    flavor = keystone        (instead of flavor = keystone+cachemanagement)
• Exposing the image URL saves time, as the image download and copy are NOT required:
    show_image_direct_url = True
    show_multiple_locations = True
# glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
+------------------+-----------------------------------------------------------------+
| Property         | Value                                                           |
+------------------+-----------------------------------------------------------------+
| checksum         | 24bc1b62a77389c083ac7812a08333f2                                |
| container_format | bare                                                            |
| created_at       | 2016-04-19T05:56:46Z                                            |
| description      | Image Updated on 18th April 2016                                |
| direct_url       | rbd://8a0021e6-3788-4cb3-8ada-                                  |
|                  | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap   |
| disk_format      | raw                                                             |
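Pulling the settings above together, a minimal sketch of the rbd-related section of /etc/glance/glance-api.conf (the pool and cephx user names are assumptions; adjust to your deployment):

    [DEFAULT]
    show_image_direct_url = True
    show_multiple_locations = True

    [glance_store]
    stores = rbd
    default_store = rbd
    rbd_store_pool = images              # assumed pool name
    rbd_store_user = glance              # assumed cephx user
    rbd_store_ceph_conf = /etc/ceph/ceph.conf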
Glance Recommendations
Image Format: Use ONLY RAW Images
With QCOW2 images:
• Convert qcow2 to a RAW image
• Get the image UUID

With RAW images (no conversion; saves time):
• Get the image UUID
Image Size (in GB)   Format   VM Boot Time (Approx.)
50 (Windows)         QCOW2    ~45 minutes
50 (Windows)         RAW      ~1 minute
6 (Linux)            QCOW2    ~2 minutes
6 (Linux)            RAW      ~1 minute
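As a sketch of the conversion step, using qemu-img and the glance CLI (file and image names here are placeholders):

    # qemu-img convert -f qcow2 -O raw ubuntu.qcow2 ubuntu.raw
    # glance image-create --name ubuntu-raw --disk-format raw --container-format bare --file ubuntu.raw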
Cinder Recommendations
• What is Cinder?
• Configuration settings: /etc/cinder/cinder.conf
• Enable Ceph as a backend:
    enabled_backends = ceph
• Cinder Backup: Ceph supports incremental backup
    backup_driver = cinder.backup.drivers.ceph
    backup_ceph_conf = /etc/ceph/ceph.conf
    backup_ceph_user = cinder
    backup_ceph_chunk_size = 134217728
    backup_ceph_pool = backups
    backup_ceph_stripe_unit = 0
    backup_ceph_stripe_count = 0
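To complete the backend enabled above, a minimal sketch of the matching RBD backend section in /etc/cinder/cinder.conf (pool name, cephx user and secret UUID are assumptions):

    [ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    volume_backend_name = ceph
    rbd_pool = volumes                   # assumed pool name
    rbd_user = cinder                    # assumed cephx user
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_secret_uuid = <libvirt secret UUID>
    rbd_flatten_volume_from_snapshot = false
    rbd_max_clone_depth = 5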
Nova Recommendations
• What is Nova?
• Configuration settings: /etc/nova/nova.conf
• Use the userspace librbd/librados path (instead of krbd).

[libvirt]
# enable discard support (be careful of perf)
hw_disk_discard = unmap
# disable password injection
inject_password = false
# disable key injection
inject_key = false
# disable partition injection
inject_partition = -2
# make QEMU aware so caching works
disk_cachemodes = "network=writeback"
live_migration_flag = "VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
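The [libvirt] block above is usually accompanied by the RBD ephemeral-disk settings; a minimal sketch (pool name, cephx user and secret UUID are assumptions):

    [libvirt]
    images_type = rbd
    images_rbd_pool = vms                # assumed pool name
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder                    # assumed cephx user
    rbd_secret_uuid = <libvirt secret UUID>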
Ceph - Recommendations
Performance Decision Factors
• What is the required storage capacity (usable/raw)?
• How many IOPS?
  • Aggregated
  • Per VM (min/max)
• Optimize for?
  • Performance
  • Cost
Ceph Cluster Optimization Criteria
IOPS-Optimized
  Properties: lowest cost per IOPS; highest IOPS; meets minimum fault-domain recommendation
  Sample use cases: typically block storage; 3x replication

Throughput-Optimized
  Properties: lowest cost per given unit of throughput; highest throughput; highest throughput per BTU; highest throughput per watt; meets minimum fault-domain recommendation
  Sample use cases: block or object storage; 3x replication for higher read throughput

Capacity-Optimized
  Properties: lowest cost per TB; lowest BTU per TB; lowest watt per TB; meets minimum fault-domain recommendation
  Sample use cases: typically object storage; erasure coding common for maximizing usable capacity
OSD Considerations
• RAM
  o 1 GB of RAM per 1 TB of OSD space
• CPU
  o 0.5 CPU cores / 1 GHz of a core per OSD (2 cores for SSD drives)
• Ceph-mons
  o 1 ceph-mon node per 15-20 OSD nodes
• Network
  o The sum of the total throughput of your OSD hard disks should not exceed the network bandwidth
• Thread count
  o High numbers of OSDs (e.g., > 20) may spawn a lot of threads during recovery and rebalancing
[Diagram: a single host running multiple OSD daemons (OSD.1 - OSD.6)]
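As a rough, illustrative check of these rules against the storage node spec shown earlier: 22 OSDs x 3.6 TB ≈ 79 TB of OSD space per node, i.e. roughly 80 GB of RAM for the OSDs (the 128 GB nodes leave headroom for the OS and page cache), and 22 x 0.5 ≈ 11 CPU cores (covered by the 2 x 12-core CPUs).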
Ceph OSD Journal
• Run operating systems, OSD data and OSD journals on separate drives to maximize overall throughput.
• On-disk journals can halve write throughput.
• Use SSD journals for high write-throughput workloads.
• Performance comparison with/without SSD journal using rados bench:
  o 100% write operations with 4 MB object size (default):
    On-disk journal: 45 MB/s; SSD journal: 80 MB/s
• Note: the above results were measured with a 1:11 SSD:OSD ratio.
• Recommended: use 1 SSD per 4-6 OSDs for better results.
Op Type            No SSD   SSD
Write (MB/s)       45       80
Seq Read (MB/s)    73       140
Rand Read (MB/s)   55       655
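A sketch of the rados bench runs behind a comparison like the one above (the pool name is an assumption; run each test against the same pool):

    # rados bench -p testpool 60 write --no-cleanup   # 60-second write test, keep objects for read tests
    # rados bench -p testpool 60 seq                  # sequential read test
    # rados bench -p testpool 60 rand                 # random read test
    # rados -p testpool cleanup                       # remove the benchmark objects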
OS Considerations
• Kernel: use the latest stable release.
• BIOS: enable HT (Hyper-Threading) and VT (Virtualization Technology).
• Kernel PID max:
    # echo "4194303" > /proc/sys/kernel/pid_max
• Read-ahead: set on all block devices:
    # echo "8192" > /sys/block/sda/queue/read_ahead_kb
• Swappiness:
    # echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
• Disable NUMA balancing: pass numa_balancing=disable to the kernel, or control it via the kernel.numa_balancing sysctl:
    # echo 0 > /proc/sys/kernel/numa_balancing
• CPU tuning: set the "performance" governor so the CPU always runs at 100% frequency:
    # echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
• I/O scheduler:
    SATA/SAS drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
    SSD drives:      # echo "noop"     > /sys/block/sd[x]/queue/scheduler
Ceph Deployment Network
• Each host should have at least two 1 Gbps network interface controllers (NICs).
• Use 10G Ethernet.
• Always use jumbo frames.
• Use high-bandwidth connectivity between TOR switches and spine routers, for example 40 Gbps to 100 Gbps.
• Hardware should have a Baseboard Management Controller (BMC).
• Note: Running three networks in HA mode may seem like overkill.
[Diagram: public network and cluster network carried on separate NICs (NIC-1, NIC-2)]

# ifconfig ethX mtu 9000
# echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethX
Ceph Deployment Network
• NIC bonding - in balance-alb mode, both NICs are used to send and receive traffic.
• Test results with 2 x 10G NICs (see the bond config sketch below):
  • Active-passive bond mode, traffic between 2 nodes:
    Case #1: node-1 to node-2 => BW 4.80 Gb/s
    Case #2: node-1 to node-2 => BW 4.62 Gb/s
    • Speed of one 10Gig NIC.
  • Balance-alb bond mode:
    Case #1: node-1 to node-2 => BW 8.18 Gb/s
    Case #2: node-1 to node-2 => BW 8.37 Gb/s
    • Speed of two 10Gig NICs.
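A minimal sketch of a balance-alb bond using RHEL/CentOS-style network scripts (interface names, IP address and netmask are assumptions):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=balance-alb miimon=100"
    BOOTPROTO=static
    IPADDR=192.168.1.10        # assumed address
    NETMASK=255.255.255.0
    MTU=9000
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes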
Ceph Failure Domains
• A failure domain is any failure that prevents access to one or more OSDs. Weigh the added cost of isolating every potential failure domain against your resiliency requirements.
Failure domains:
• osd
• host
• chassis
• rack
• row
• pdu
• pod
• room
• datacenter
• region
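As a sketch, a CRUSH rule that spreads replicas across racks instead of hosts (rule and pool names are assumptions, and the cluster must already have OSD hosts placed under rack buckets):

    # ceph osd crush rule create-simple replicated_rack default rack
    # ceph osd pool set volumes crush_ruleset 1     # option is named 'crush_rule' on newer releases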
Ceph Ops Recommendations
• Scrub and deep-scrub operations are very I/O consuming and can affect cluster performance.
  o Disable scrub and deep scrub:
    # ceph osd set noscrub
    set noscrub
    # ceph osd set nodeep-scrub
    set nodeep-scrub
  o After setting noscrub and nodeep-scrub, ceph health moves to the WARN state:
    # ceph health
    HEALTH_WARN noscrub, nodeep-scrub flag(s) set
  o Re-enable scrub and deep scrub:
    # ceph osd unset noscrub
    unset noscrub
    # ceph osd unset nodeep-scrub
    unset nodeep-scrub
  o Configure scrub and deep scrub:
    osd_scrub_begin_hour = 0           # begin at this hour
    osd_scrub_end_hour = 24            # end of the allowed scrub window
    osd_scrub_load_threshold = 0.05    # scrub only below this load
    osd_scrub_min_interval = 86400     # not more often than 1 day
    osd_scrub_max_interval = 604800    # not less often than 1 week
    osd_deep_scrub_interval = 604800   # deep-scrub once a week
Ceph Ops Recommendations
• Decreasing recovery and backfilling performance impact.
• Settings for recovery and backfilling:
  'osd max backfills'        - maximum backfills allowed to/from an OSD [default 10]
  'osd recovery max active'  - recovery requests per OSD at one time [default 15]
  'osd recovery threads'     - the number of threads for recovering data [default 1]
  'osd recovery op priority' - priority for recovery ops [default 10]
• Note: Decreasing these values slows down and prolongs the recovery/backfill process but reduces the impact on client I/O; increasing them speeds up recovery/backfill at the cost of client performance.
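A sketch of throttling these options at runtime on a running cluster (the values are illustrative, chosen to favour client I/O over recovery speed):

    # ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # and persisted in ceph.conf:
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1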
Ceph Performance Measurement Guidelines
For best measurement results, follow these rules while testing:
• Change one option at a time.
• Check what is changing.
• Choose the right performance test for the changed option.
• Re-test the changes - at least ten times.
• Run tests for hours, not seconds.
• Trace for any errors.
• Look at the results critically.
• Always try to estimate results and look at the standard deviation to eliminate spikes and false tests.

Tuning:
• Ceph clusters can be parametrized after deployment to better fit the requirements of the workload.
• Some configuration options can affect data redundancy and have significant implications for the stability and safety of data.
• Tuning should be performed on a test environment prior to issuing any commands or configuration changes on production.
Any questions?
Thank You
Swami Reddy | [email protected] | swamireddy @ irc
Satish | [email protected] | satish @ irc
Pandiyan M | [email protected] | maestropandy @ irc
Reference Links
• Ceph documentation
• Previous OpenStack Summit presentations
• Tech Talk: Ceph
• A few blogs on Ceph:
  • https://www.sebastien-han.fr/blog/categories/ceph/
  • https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
Appendix
Ceph H/W Best Practices
OSD host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per 1 TB of OSD storage

MDS host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per daemon

MON host
  CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
  RAM: 1 GB per daemon
HDD, SDD, Controllers
• Ceph best practice is to run operating systems, OSD data and OSD journals on separate drives.

Hard Disk Drives (HDD)
• Minimum hard disk drive size of 1 terabyte.
• ~1 GB of RAM for 1 TB of storage space.
• NOTE: It is NOT a good idea to run:
  1. multiple OSDs on a single disk.
  2. an OSD and a monitor or metadata server on a single disk.

Solid State Drives (SSD)
• Use SSDs to improve performance.

Controllers
• Disk controllers also have a significant impact on write throughput.
Ceph OSD Journal - Results
• [Chart: write operations, with and without SSD journal]
• [Chart: sequential read operations, with and without SSD journal]
• [Chart: random read operations, with and without SSD journal]