abhishek presentation
![Page 1: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/1.jpg)
Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken
Microsoft Research
IMC, November 2009
Abhishek [email protected]
THE NATURE OF DATACENTER TRAFFIC: MEASUREMENTS & ANALYSIS
![Page 2: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/2.jpg)
Outline
Introduction
Data & Methodology
Application
Traffic Characteristics
Tomography
Conclusion
![Page 3: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/3.jpg)
Introduction
Analysis and mining of data sets
Processing on the order of petabytes of data
This paper describes the characteristics of datacenter traffic:
Detailed view of traffic
Congestion conditions and patterns
![Page 4: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/4.jpg)
Contribution
Measurement instrumentation
Measures traffic at the servers rather than at the switches
Traffic characteristics
Flow, congestion and rate of change of the traffic mix
Tomography inference accuracy
How well tomography performs
Cluster = 1,500 servers; rack = 20 servers
Measurement period: 2 months
![Page 5: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/5.jpg)
Data & Methodology
ISPs
SNMP counters
Sampled flow
Deep packet inspection
Data center
Measurements at the server
Servers, storage and network
Linkage of network traffic with application-level logs
![Page 6: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/6.jpg)
Socket-level events at each server
ETW – Event Tracing for Windows
One event per application read or write
Each event aggregates over several packets
http://msdn.microsoft.com/en-us/magazine/cc163437.aspx#S1
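As a rough illustration of how such socket-level events could be rolled up into a server-to-server traffic matrix, here is a minimal sketch. The event layout (timestamp, source, destination, byte count) and the function name `build_traffic_matrix` are assumptions made for this example; they are not the ETW schema or the authors' tooling.

```python
from collections import defaultdict


def build_traffic_matrix(events):
    """Sum the bytes exchanged per (source, destination) server pair.

    Each event is a hypothetical (timestamp_s, src, dst, num_bytes) tuple,
    one per application read or write rather than one per packet.
    """
    matrix = defaultdict(int)
    for _timestamp, src, dst, num_bytes in events:
        matrix[(src, dst)] += num_bytes
    return dict(matrix)


# Made-up events for illustration only.
events = [
    (0.001, "s1", "s2", 4096),
    (0.004, "s1", "s2", 8192),
    (0.010, "s2", "s3", 1024),
]
tm = build_traffic_matrix(events)
print(tm)
```

Because each event already covers several packets, the log stays far smaller than a packet trace while still giving per-pair byte counts.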
![Page 7: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/7.jpg)
ETW – Event tracing for Windows
![Page 8: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/8.jpg)
Application Workload
Jobs are written in Scope, a SQL-like programming language
Phases of different types
Extract
Partition
Aggregate
Combine
Jobs range from short interactive programs to long-running programs
![Page 9: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/9.jpg)
Traffic Characteristics
![Page 10: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/10.jpg)
Patterns
![Page 11: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/11.jpg)
Work-seeks-bandwidth and scatter-gather patterns in datacenter traffic exchanged between server pairs
![Page 12: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/12.jpg)
Work-seeks-bandwidth
Within the same server
Among servers in the same rack
Among servers in the same VLAN
Scatter-gather patterns
Data is divided into small parts, and each server works on a particular part
The partial results are then aggregated
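The scatter-gather pattern described above can be sketched in a few lines. This is only an illustration: worker threads stand in for servers, and the function names and the record-counting task are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_records_starting_with(records, letter):
    """Work done by one worker on its own part of the data."""
    return sum(1 for record in records if record.startswith(letter))


def scatter_gather(records, num_workers, letter):
    """Split the records, process each part in parallel, then aggregate."""
    # Scatter: deal the records out into num_workers parts.
    parts = [records[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(
            pool.map(lambda part: count_records_starting_with(part, letter), parts)
        )
    # Gather: combine the partial results into one answer.
    return sum(partial_counts)


records = ["Apple", "Avocado", "Banana", "Apricot", "Cherry"]
total = scatter_gather(records, num_workers=3, letter="A")
print(total)
```

In a real cluster the gather step is what produces the many-to-one traffic toward the aggregating server.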
![Page 13: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/13.jpg)
How much traffic is exchanged between server pairs?
![Page 14: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/14.jpg)
Server pairs within the same rack are more likely to exchange more bytes
Probability of exchanging no traffic
89% for server pairs within the same rack
99.5% for server pairs in different racks
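A statistic like the one above could be computed from a traffic matrix roughly as follows. This is a sketch under assumptions: the `traffic` and `rack_of` dictionaries are hypothetical inputs, and pairs are counted as ordered (source, destination) pairs, which may differ from how the paper counts them.

```python
from itertools import permutations


def zero_traffic_fraction(traffic, rack_of):
    """Return (same_rack, cross_rack) fractions of pairs that exchanged no bytes.

    traffic maps (src, dst) to bytes; rack_of maps each server to its rack.
    """
    counts = {True: [0, 0], False: [0, 0]}  # same_rack -> [zero pairs, all pairs]
    for src, dst in permutations(rack_of, 2):
        same_rack = rack_of[src] == rack_of[dst]
        counts[same_rack][1] += 1
        if traffic.get((src, dst), 0) == 0:
            counts[same_rack][0] += 1

    def fraction(zero_pairs, all_pairs):
        return zero_pairs / all_pairs if all_pairs else 0.0

    return fraction(*counts[True]), fraction(*counts[False])


# Toy cluster: two racks with two servers each, and two active pairs.
rack_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
traffic = {("a1", "a2"): 500, ("a1", "b1"): 10}
same_rack, cross_rack = zero_traffic_fraction(traffic, rack_of)
print(same_rack, cross_rack)
```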
![Page 15: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/15.jpg)
How many other servers does a server correspond with?
![Page 16: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/16.jpg)
A server either talks to almost all the other servers within its rack or to fewer than 25% of them
A server either doesn't talk to servers outside its rack or talks to about 1-10% of outside servers
![Page 17: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/17.jpg)
Congestion within the Datacenter
![Page 18: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/18.jpg)
Run the network at as high a utilization as possible without adversely affecting throughput
Low network utilization indicates either that:
The application by nature demands more of other resources, such as CPU and disk, than of the network
The applications can be rewritten to make better use of available network bandwidth
![Page 19: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/19.jpg)
Where and when congestion happens in the data center
![Page 20: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/20.jpg)
Congestion rate
86% of links observe congestion lasting at least 10 seconds
15% of links observe congestion lasting at least 100 seconds
Short congestion periods are highly correlated across many tens of links and are due to brief spurts of high demand from the application
Long-lasting congestion periods tend to be more localized to a small set of links
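To make "congestion lasting at least N seconds" concrete, the sketch below extracts congestion episodes from one link's utilization samples. The 70% threshold, the one-second sampling interval, and the function name are assumptions for illustration; the slides do not state how a congested period is defined.

```python
def congestion_episodes(utilization, threshold=0.7, interval_s=1.0):
    """Return the durations (in seconds) of maximal runs at or above threshold.

    utilization is a list of link utilization samples in [0, 1], taken every
    interval_s seconds. Both parameters are assumed values.
    """
    durations = []
    run_length = 0
    for sample in utilization:
        if sample >= threshold:
            run_length += 1
        elif run_length:
            durations.append(run_length * interval_s)
            run_length = 0
    if run_length:  # the trace ended while the link was still congested
        durations.append(run_length * interval_s)
    return durations


# Made-up trace for a single link.
samples = [0.2, 0.8, 0.9, 0.3, 0.75, 0.1, 0.95, 0.95, 0.95]
episodes = congestion_episodes(samples)
longest = max(episodes)
print(episodes, longest)
```

Running this per link and keeping each link's longest episode would give the kind of per-link figure the slide reports.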
![Page 21: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/21.jpg)
Length of Congestion Events
![Page 22: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/22.jpg)
Compares the rates of flows that overlap high-utilization periods with the rates of all flows
![Page 23: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/23.jpg)
Impact of high utilization
![Page 24: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/24.jpg)
Read failure – the job is killed
Congestion
To attribute network traffic to the applications that generate it, they merge the network event logs with logs at the application-level that describe which job and phase were active at that time
![Page 25: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/25.jpg)
Reduce phase – Data in each partition that is present at multiple servers in the cluster has to be pulled to the server that handles the reduce for the partition
e.g. count the number of records that begin with ‘A’
Extract phase – Extracting the data; involves the largest amount of data
Evaluation phase – Problem
Conclusion – High-utilization epochs are caused by application demand and have a moderate negative impact on job performance
![Page 26: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/26.jpg)
Flow Characteristics
![Page 27: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/27.jpg)
Traffic mix changes frequently
![Page 28: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/28.jpg)
How traffic changes over time within the data center
![Page 29: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/29.jpg)
Change in traffic
The 10th and 90th percentiles of the change are 37% and 149%; the median change in traffic is roughly 82%
Even when the total traffic in the matrix remains the same, the server pairs that are involved in these traffic exchanges change appreciably
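One way to see how the change can be large, or even exceed 100%, while total traffic stays constant is to measure the entry-wise difference between two traffic matrices relative to the earlier one. The metric below is an assumed definition written for illustration, not necessarily the exact one used in the paper.

```python
def normalized_change(tm_before, tm_after):
    """Sum of absolute per-pair differences, divided by the earlier total.

    Both arguments map (src, dst) pairs to bytes; missing pairs count as 0.
    """
    total_before = sum(abs(volume) for volume in tm_before.values())
    if total_before == 0:
        raise ValueError("the earlier traffic matrix carries no traffic")
    pairs = set(tm_before) | set(tm_after)
    difference = sum(
        abs(tm_after.get(pair, 0) - tm_before.get(pair, 0)) for pair in pairs
    )
    return difference / total_before


# Same total volume (200 bytes), but one exchange moved to another pair.
tm_before = {("a", "b"): 100, ("c", "d"): 100}
tm_after = {("a", "c"): 100, ("c", "d"): 100}
change = normalized_change(tm_before, tm_after)
print(change)
```

Here half of the traffic moved to a different server pair, so the metric reports a 100% change although the total volume is unchanged.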
![Page 30: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/30.jpg)
Short bursts cause spikes at the shorter time-scale (in dashed line) that smooth out at the longer time scale (in solid line) whereas gradual changes appear conversely, smoothed out at shorter time-scales yet pronounced on the longer time-scale
Variability - key aspect for data center
![Page 31: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/31.jpg)
Inter-arrival times in the entire cluster, at top-of-rack switches and at servers
![Page 32: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/32.jpg)
Inter-arrivals at both servers and top-of-rack switches have pronounced periodic modes spaced apart by roughly 15 ms
This is likely due to the stop-and-go behavior of the application that rate-limits the creation of new flows
Median arrival rate of all flows in the cluster is 10^5 flows per second, or 100 flows every millisecond
![Page 33: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/33.jpg)
Tomography
Network tomography methods infer traffic matrices
If the methods used in ISP networks are applicable to datacenters, they would help to unravel the nature of traffic
Why?
The number of flow volumes to estimate is quadratic, n(n - 1), whereas the number of link measurements is much smaller
Assumptions
Gravity model – the amount of traffic a node (origin) sends to another node (destination) is proportional to the traffic volume received by the destination
Scalability
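The gravity assumption can be written down directly: each origin spreads what it sends across destinations in proportion to what each destination receives. The sketch below is a simple gravity estimate only, not the full tomogravity method, and the function and variable names are assumptions for illustration.

```python
def gravity_estimate(sent, received):
    """Estimate a traffic matrix from per-node totals using a gravity model.

    sent[o] is the total volume node o sends and received[d] is the total
    volume node d receives. Entry (o, d) is sent[o] * received[d] / total.
    Self-pairs are kept so that each row still sums to sent[o].
    """
    total_received = sum(received.values())
    if total_received == 0:
        raise ValueError("no traffic received by any node")
    return {
        (origin, dest): sent[origin] * received[dest] / total_received
        for origin in sent
        for dest in received
    }


# Made-up per-node totals.
sent = {"a": 60, "b": 30, "c": 10}
received = {"a": 10, "b": 40, "c": 50}
estimate = gravity_estimate(sent, received)
print(estimate[("a", "b")], estimate[("b", "c")])
```

Every node with non-zero totals gets a non-zero entry toward every other such node, so the estimate is dense by construction.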
![Page 34: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/34.jpg)
Methodology
Compute link counts from the ground-truth TM, then measure how well the TM estimated by tomography from these link counts approximates the true TM
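A comparison of this kind needs an error measure between the estimated and the true TM. The sketch below uses the per-entry relative error over non-zero ground-truth entries and summarizes it with the median; this choice of metric and the names used are assumptions for illustration, and the paper may use a different measure.

```python
from statistics import median


def relative_errors(true_tm, estimated_tm):
    """Relative error of each estimated entry whose true volume is non-zero.

    Both arguments map (src, dst) pairs to volumes; pairs missing from the
    estimate count as 0. Pairs with zero true volume are skipped.
    """
    return [
        abs(estimated_tm.get(pair, 0.0) - volume) / volume
        for pair, volume in true_tm.items()
        if volume > 0
    ]


# Made-up matrices for illustration only.
true_tm = {("a", "b"): 100, ("b", "c"): 50, ("a", "c"): 0}
estimated_tm = {("a", "b"): 80, ("b", "c"): 75, ("a", "c"): 20}
errors = relative_errors(true_tm, estimated_tm)
print(errors, median(errors))
```

Because zero-volume pairs are skipped, this metric does not penalize traffic that an estimator invents between servers that never communicated, so the sparsity of the estimate is worth checking separately.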
![Page 35: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/35.jpg)
Tomogravity and Sparsity Maximization
![Page 36: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/36.jpg)
Tomogravity – Communication is more likely between nodes running the same job than among all nodes, whereas the gravity model, not being aware of these job clusters, introduces traffic across clusters, resulting in many non-zero TM entries
Sparsity maximization – Error rates start at several hundred percent
![Page 37: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/37.jpg)
Comparison of the TMs estimated by various tomography methods with the ground truth
![Page 38: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/38.jpg)
Ground-truth TMs are sparser than tomogravity-estimated TMs and denser than sparsity-maximized TMs
![Page 39: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/39.jpg)
Conclusion
The measurements capture both
Macroscopic patterns – which servers talk to which others, when and for what reasons
Microscopic characteristics – flow durations, inter-arrival times
Tighter coupling between network, computing, and storage in datacenter applications
Congestion and negative application impact do occur, demanding improvement - better understanding of traffic and mechanisms that steer demand
![Page 40: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/40.jpg)
My Take
More data should be examined, over a period of 1 year instead of 2 months
I would certainly like to see some mining of data and applications running at the datacenters of companies like Google, Yahoo, etc.
![Page 41: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/41.jpg)
Related Work
T. Benson, A. Anand, A. Akella, and M. Zhang: Understanding Datacenter Traffic Characteristics, In SIGCOMM WREN workshop, 2009.
A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta: VL2: A Scalable and Flexible Data Center Network, In ACM SIGCOMM, 2009.
![Page 42: Abhishek Presentation](https://reader035.vdocumento.com/reader035/viewer/2022062310/577cc6a31a28aba7119ebf19/html5/thumbnails/42.jpg)
Thank You