8/8/2019 Yahoo Chid Presentation
Large Scale Distributed Infrastructures (using Hadoop)
Chidambaran Kollengode
Hadoop Engineering
Cloud Computing and Data Infrastructure Group
Yahoo India R & D, Bangalore
Workshop on Cloud Computing, 18-20 Aug 2010, at IIT Madras, Chennai
-
Agenda
- Demystifying the Cloud
- Why have "Cloudy" datacenters?
- Why does Yahoo! need a cloud infrastructure?
- Case studies in Yahoo!
- Hadoop Architecture: bird's-eye view
- Key Challenges in Cloud Computing
  - And the continued learning from those challenges!
- Q&A
-
Demystifying the Cloud.
-
Is the Cloud concept new?
- First Gen
  - Network as a cloud: message in and out
  - Cloud hid processing from users
- Next Evolution
  - WWW: Cloud around documents
  - URL in, document out
-
And the present really rocks!
But now with cloud computing we have...
[Diagram: YOUR BUSINESS and YOUR DATA (Control, Processing, Storage) connected through The Internet to HOSTED SERVICES, which hold YOUR DATA?]
-
So what is a cloud?
- Cloud Computing is either:
  - Hosted data processing services, or
  - Hosted web services
- Which are:
  - A highly distributed and elastic computing environment
  - With predictable availability
- Basically, it's a self-scaling computing resource
  - Fire and forget
- Thanks, in part, to:
  - Cheaper bandwidth and hardware
  - Much faster machines
  - Abstraction
  - Businesses, the research community & Open Source
-
Why should new datacenters be clouds?
-
Cloud ROI
- Scalability on demand
- Concentrate on improving business process / more time for innovation
- Streamlining data centers (including off-loading to public cloud)
- Level playing field: minimizing startup costs
-
Two facets of Scalability
- With a lot of machines
  - More users, more data, more mining, more ads, etc.
- or
  - Potential to do things log(N) years earlier than others
The latter is innovation!
-
Where did all my time go?
- "If only my team has more bandwidth we can innovate"
- UC Berkeley study:
  - 30-40%: differentiated and value creation
  - 70-60%: undifferentiated tasks (hardware, installs, upgrades, provisioning, load balancing!!)
-
Data Center streamlining
- Data Center challenges
  - Conflicting demands: bring costs down yet provide innovative solutions
- Can on-premise datacenters do the balancing act?
- Enterprises will begin with private clouds (and solve scalability problems!)
- Offloading spikes to public clouds
-
On the horizon
- Every business has to have Web presence
  - Zero control on who, how many, when, how long..
  - No choice but to migrate this to scalable infrastructures
- Blurring the line between apps for employees versus customers (web will enable this), so why have two experiences?
- Enterprise then runs its apps as metered utilities: increased machine usage and ROI!
- This means cloud
-
Why Cloud @ Yahoo!
-
Yahoo Business Model
Customer Experience
Traffic
Ads
Simple Growth Model
For this growth:
- Incremental scaling is the key: add one node at a time!
- Reverse scalability (redirecting resources to apps)
The 21st century is about understanding people and the experiences they want. It is a lot more than infrastructure.
-
Yahoo! is Perfect for Cloud Computing
- HUNDREDS OF PROPERTIES / PRODUCTS
- 600M UNIQUE USERS / MONTH
- 300M+ YAHOO! MAIL USERS / MONTH
- HUNDREDS OF PETABYTES OF STORAGE
- BILLIONS OF OBJECTS STORED
- PETABYTES OF TRAFFIC DAILY
-
Why Cloud Infrastructure is the only answer
- Cost effective
- Multi-tenant
- Rapid experimentation
- Handle failure daily
- Unpredictable peaks (scale)
Only cloud can enable this
-
What is Yahoo! doing?
Private cloud for internal use. But, many open source components.
- Hadoop: Open source (Apache) framework for running applications on large clusters on commodity hardware
  - Is the largest (only?) open source framework for data-intensive apps (petabytes)
- PIG: Parallel Programming Language and Runtime
- Zookeeper: High Availability Directory and Configuration Service
-
Yahoo! Cloud Services: Horizontal and Functional
(Hadoop)
-
How is Yahoo! seeing the space?
Yahoo! sees two kinds of Cloud services:
- Horizontal Cloud Services
  - Functionality enabling tenants to build applications or new services on top of the cloud
  - The focus of CCDI
- Functional Cloud Services
  - Functionality that is useful in and of itself to tenants
  - Yahoo!'s IndexTools; Yahoo! properties aimed at end users, e.g., Flickr, Groups, Mail, News, Shopping
  - Could be built on top of horizontal cloud services or from scratch
-
Yahoo!'s Cloud Use Case
- Advertising Optimization & Delivery
- Content Optimization
- Search Index
- Image/Video Storage & Delivery
- RSS Feeds
- Caching, Load Balancing
- Machine Learning (e.g. spam filters)
-
Large Applications (2008 vs. 2009)
- Webmap
  - 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
  - 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output, +55% hardware
- Sort benchmarks (Jim Gray contest)
  - 2008: 1 terabyte sorted in 209 seconds on 900 nodes
  - 2009: 1 terabyte sorted in 62 seconds on 1500 nodes; 1 petabyte sorted in 16.25 hours on 3700 nodes
- Largest cluster
  - 2008: 2000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
  - 2009: 4000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
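The sort benchmark figures can be turned into throughput numbers for a rough comparison. The sketch below only does arithmetic on the values quoted on this slide; decimal units (1 TB = 10^12 bytes) and the helper name `throughput` are assumptions made for illustration.

```python
# Aggregate and per-node sort throughput from the benchmark figures on the slide.
# Assumption: decimal units, i.e. 1 TB = 10**12 bytes and 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15


def throughput(bytes_sorted, seconds, nodes):
    """Return (aggregate GB/s, per-node MB/s) for one benchmark run."""
    aggregate = bytes_sorted / seconds
    return aggregate / 1e9, aggregate / nodes / 1e6


runs = {
    "2008, 1 TB": throughput(1 * TB, 209, 900),
    "2009, 1 TB": throughput(1 * TB, 62, 1500),
    "2009, 1 PB": throughput(1 * PB, 16.25 * 3600, 3700),
}

for name, (gb_per_s, mb_per_node) in runs.items():
    print(f"{name}: {gb_per_s:.2f} GB/s aggregate, {mb_per_node:.2f} MB/s per node")
```

Per-node throughput is lower for the petabyte run than for the 2009 terabyte run, which is a reminder that sorting time does not scale purely with node count.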
-
Example: Search Assist(TM)
                   Before Hadoop   After Hadoop
Time               26 days         20 minutes
Language           C++             Python
Development Time   2-3 weeks       2-3 days
Database for Search Assist is built using Hadoop.
- 3 years of log data
- 20 steps of map-reduce
-
Image Search components
200 Node HDFS cluster
RHEL AS 4 U4, 64-bit
1TB * 4 Disks, JBOD (RF as 2 in HDFS)
8 tasks per machine
Dump jobs: 800% performance gains
-
Excellent Cloud use case
NYTIMES
- Needed offline conversion of public domain articles from 1851-1922
- Used Hadoop to convert scanned images to PDF
- Ran 100 Amazon EC2 instances for around 24 hours
  - 4 TB of input
  - 1.5 TB of output
Published 1892, copyright New York Times
-
Hadoop: Big Data Processor!
A bird's-eye view of Architecture
-
How to Process BigData?
- Just reading 100 terabytes of data can be overwhelming
  - Takes ~11 days to read on a standard computer
  - Takes a day across a 10 Gbit link (very high end storage solution)
  - But it only takes 15 minutes on 1000 standard computers!
- Using clusters of standard computers, you get
  - Linear scalability
  - Commodity pricing
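The read-time figures on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes a sustained read rate of about 105 MB/s per standard computer (a value back-derived from the "~11 days" figure, not stated in the presentation) and perfectly parallel reads across the cluster.

```python
# Rough check of the read times for 100 TB quoted on the slide.
TB = 10**12                      # assumption: decimal terabytes
DATA_BYTES = 100 * TB

# Assumption: ~105 MB/s sustained read per machine, chosen to match "~11 days".
DISK_BYTES_PER_S = 105e6
# A 10 Gbit/s link moves 10e9 / 8 bytes per second.
LINK_BYTES_PER_S = 10e9 / 8


def read_seconds(total_bytes, bytes_per_s, machines=1):
    """Time to read total_bytes when the work is split evenly over machines."""
    return total_bytes / (bytes_per_s * machines)


one_machine_days = read_seconds(DATA_BYTES, DISK_BYTES_PER_S) / 86400
link_hours = read_seconds(DATA_BYTES, LINK_BYTES_PER_S) / 3600
cluster_minutes = read_seconds(DATA_BYTES, DISK_BYTES_PER_S, machines=1000) / 60

print(f"one machine:   {one_machine_days:.1f} days")
print(f"10 Gbit link:  {link_hours:.1f} hours")
print(f"1000 machines: {cluster_minutes:.1f} minutes")
```

Under these assumptions the three results come out at roughly 11 days, 22 hours, and 16 minutes, in line with the slide.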
-
Yahoo! Hadoop Cluster
What 25,000 Hadoop nodes look like
-
How does Hadoop scale?
Map/Reduce
[Diagram: Input -> Map tasks -> Transient Data -> Reduce tasks -> Results]
- Split into bits
- Process the bits on each node (Map)
- Shuffle into bins
- Collate each bin on each node (Reduce)
- Join it all together
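The split / map / shuffle / reduce / join flow can be sketched in a few lines of single-process Python. This is only an illustration of the data flow using word counting as the job; it is not Hadoop's API, and every function name here is hypothetical.

```python
from collections import defaultdict


def split_input(lines, n_splits):
    """Split into bits: deal the input lines out to n_splits chunks."""
    return [lines[i::n_splits] for i in range(n_splits)]


def map_phase(split):
    """Process one bit: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped, n_bins):
    """Shuffle into bins: all pairs with the same key land in the same bin."""
    bins = [defaultdict(list) for _ in range(n_bins)]
    for pairs in mapped:
        for key, value in pairs:
            bins[hash(key) % n_bins][key].append(value)
    return bins


def reduce_phase(one_bin):
    """Collate one bin: sum the values collected for each key."""
    return {key: sum(values) for key, values in one_bin.items()}


def word_count(lines, n_splits=4, n_bins=4):
    mapped = [map_phase(s) for s in split_input(lines, n_splits)]
    reduced = [reduce_phase(b) for b in shuffle(mapped, n_bins)]
    result = {}
    for partial in reduced:  # Join it all together
        result.update(partial)
    return result


counts = word_count(["the cloud hid processing", "the cloud scales", "hadoop scales"])
print(counts)
```

In a real cluster each map and reduce call would run on a different node; here they run sequentially, which keeps the example self-contained.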
-
Map-Reduce and HDFS
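[Diagram: the Namenode records which blocks make up each file, file1 (1,3) and file2 (2,4,5); replicas of blocks 1-5 are spread across the data nodes; the JobTracker hands map tasks and reduce tasks to the TaskTrackers (TT) running on those nodes]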
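One way to read the diagram: because the Namenode knows where each block lives, map tasks can be placed on nodes that already hold the data. The sketch below is a toy illustration of that idea. The file-to-block mapping comes from the slide, but the replica placement, node names, and the `schedule_map_tasks` helper are hypothetical and do not represent Hadoop's actual scheduler.

```python
# File -> block IDs, as shown on the slide.
FILE_BLOCKS = {"file1": [1, 3], "file2": [2, 4, 5]}

# Block -> nodes holding a replica. Hypothetical placement for illustration only.
BLOCK_LOCATIONS = {
    1: ["node-a", "node-b"],
    2: ["node-a", "node-c"],
    3: ["node-b", "node-d"],
    4: ["node-c", "node-d"],
    5: ["node-d", "node-e"],
}


def schedule_map_tasks(filename, free_slots):
    """Assign one map task per block, preferring a node that holds a replica.

    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block in FILE_BLOCKS[filename]:
        local = [n for n in BLOCK_LOCATIONS[block] if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        slots[node] -= 1
        assignments[block] = (node, bool(local))
    return assignments


plan = schedule_map_tasks("file2", {"node-a": 1, "node-b": 1, "node-e": 1})
print(plan)
```

With these made-up slots, blocks 2 and 5 are processed where they are stored, while block 4 has to be read over the network by a node without a replica.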
-
Map-Reduce on a larger scale
- Take the previous example and make it Web scale
  - Billions of web pages
  - Index can reach a few petabytes
  - Thousands of machines
  - Run multiple jobs/programs
  - Compute and process intensive
- We need a platform to do this
  - HADOOP!
  - We are building Hadoop with the community!
  - Hadoop is open source
-
Challenges & Learnings
Surprises we've encountered along the way... and our approach
-
Key Challenges
- Elastic scaling
  - Typically, with commodity infrastructure
- Availability
  - Trading consistency/performance/availability
- Handling failures
  - What can be counted on after a failure?
- Operational efficiency
  - Managing and tuning multi-tenanted clouds
- The right abstractions
  - Data, security, and services in the cloud
Don't forget failures!
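A quick calculation shows why failures dominate at this scale. The sketch below assumes independent node failures and a mean time between failures of three years per node; both are illustrative assumptions, not figures from the presentation.

```python
import math


def expected_failures_per_day(nodes, mtbf_days):
    """Expected number of node failures per day for a cluster of this size."""
    return nodes / mtbf_days


def prob_any_failure(nodes, mtbf_days, days=1.0):
    """Probability of at least one failure in the period (Poisson approximation)."""
    return 1.0 - math.exp(-days * nodes / mtbf_days)


# Assumption: each node fails on average once every three years.
MTBF_DAYS = 3 * 365

for nodes in (10, 2000, 4000):
    per_day = expected_failures_per_day(nodes, MTBF_DAYS)
    p_any = prob_any_failure(nodes, MTBF_DAYS)
    print(f"{nodes:>5} nodes: {per_day:.2f} failures/day, P(at least one today) = {p_any:.3f}")
```

Under these assumptions a 10-node cluster rarely sees a failure on a given day, while a 4000-node cluster should expect several every day.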
-
Data Diversity Challenges
Types of Data include:
- Static Text
  - Web page crawl
- Dynamic Text
  - Social properties (Answers, Flickr)
- Structured Data (Autos, Local, Shopping)
- Streams (Finance, News)
- Multimedia
- Mail
How to analyze and integrate this Big Data?
-
Growth Challenges
Challenge                           Opportunity
Data transfer bottlenecks           FedEx-ing disks, data backup/archival
Performance unpredictability        Improved VM support, flash memory, scheduling VMs
Scalable structured storage         Major research opportunity
Bugs in large distributed systems   Invent debugger that relies on distributed VMs
Scaling quickly                     Snapshots (maybe?)
RAD Labs
-
Adoption Challenges (Public Clouds)
Challenge                               Opportunity
Availability / business continuity      Multiple providers & DCs
Data lock-in                            Standardization
Data confidentiality and auditability   Encryption, VLANs, firewalls; jurisdiction of data storage
RAD Labs
-
Users!! Can't live with them, can't shoot them!
- There is always a new way to crash the system!
- Tragedy of the commons
  - When have you seen a shared drive that is not full?
- We do love them of course, they pay our wages
  - We engage them!
- Make shared costs visible!
  - Bad designs lead to bad results
-
Challenges in Hadoop QE and RE
- Reliability
  - Loss of nodes
  - Data corruption
  - Loss of data blocks
- Scale
  - Use simulation
  - DataNode / TaskTracker simulation
- Repeatability
  - Deployment on multi-node clusters
  - Configs for variety of clusters
- Continuous Integration (daily integration)
-
Testing -> Stability and Agility
- Two competing needs:
  - Rapid development: adding new features / innovate
  - Increase stability: Hadoop is mission critical / pressure to move slowly!
- How do you move the curve?
  - Invest in automated testing!
  - Continuous integration
  - Stress testing
-
Research Problems
- Checkpointing parallel applications
- Rescheduling policies
- Performance modeling
- Energy-based optimizations
-
Performance Problems for Hadoop / Hadoop Clusters
- Understand external failure characteristics and the cost
  - External failures (i.e., other than those caused by Hadoop bugs): faulty disks/memory/network/CPU
- Auto-tuning Hadoop for performance
  - Tools that can tune Hadoop clusters with the right defaults (e.g., Map and Reduce slots) for typical workloads
  - Auto-tune Hadoop job configurations to optimize execution time
- Build tools to pinpoint hotspots that caused applications to run slowly
- Visualize job progress and cluster utilization
-
Challenges for Cloud providers
- Hardware
  - Datacenter investment (machines, power, cooling)
  - What kind of HW/OS? Homogeneous? Commodity?
  - Insert new machines or remove bad ones, without disrupting service
- Software
  - What software stack to provide
- Data
  - How does customer data get on/off cloud?
- QoS
  - High availability, always available, updating SW/HW without bringing service down
  - Pay as you go? Easy/automatic elasticity? Pay for what you use?
-
Challenges for Cloud users
- Existing applications
  - Need elasticity?
  - Cost effectiveness (HW/SW, ops) of running on the Cloud
  - Migration (large business opportunity in future!)
- New apps. Do new things with elastic resources?
  - Lots of data?
  - Batch processing, analytics
-
Cloud Computing is NOT about saving money (as it exists today)
-
"The future is here; it's just not widely distributed yet." -- William Gibson
Chid Kollengode
-
For any specific questions related to Yahoo! (other than covered in this presentation) please contact pr