![Page 1: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/1.jpg)
The Future of Real-Time in Spark
Reynold Xin @rxinSpark Summit, New York, Feb 18, 2016
![Page 2: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/2.jpg)
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud• Monitoring industrial machinery• Human-facing dashboards• …
![Page 3: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/3.jpg)
Streaming Engine
Noun.
Takes an input stream and produces an output stream.
![Page 4: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/4.jpg)
SQL Streaming MLlib
Spark Core
GraphX
Spark Unified Stack
![Page 5: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/5.jpg)
StreamingSQL MLlib
Spark Core
GraphXStreaming
Introduced 3 years ago in Spark 0.750% users consider most important part of Spark
Spark Unified Stack
![Page 6: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/6.jpg)
Spark Streaming
• First attempt at unifying streaming and batch• State management built in• Exactly once semantics• Features required for large clusters
• Straggler mitigation, dynamic load balancing, fast fault-recovery
![Page 7: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/7.jpg)
Streaming computations don’t run in isolation.
![Page 8: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/8.jpg)
Use Case: Fraud Detection
STREAM
ANOMALY
Machine learning modelcontinuously updates to detect new anomalies
Ad-hoc analyze historic data
![Page 9: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/9.jpg)
Continuous Application
noun.
An end-to-end application that acts on real-time data.
![Page 10: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/10.jpg)
Challenges Building Continuous Applications
Integration with non-streaming systems often an after-thought• Interactive, batch, relational databases, machine learning, …
Streaming programming models are complex
![Page 11: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/11.jpg)
Integration Example
Streaming engine
Stream(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
What can go wrong?• Late events• Partial outputs to MySQL• State recovery on failure• Distributed reads/writes • ...
MySQL
Page Minute Visits
home 10:09 21
pricing 10:10 30
... ... ...
![Page 12: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/12.jpg)
ProcessingBusiness logic change & new ops
(windows, sessions)
Complex Programming Models
OutputHow do we define
output over time & correctness?
DataLate arrival, varying distribution over time, …
![Page 13: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/13.jpg)
Structured Streaming
![Page 14: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/14.jpg)
The simplest way to perform streaming analyticsis not having to reason about streaming.
![Page 15: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/15.jpg)
Spark 2.0Infinite DataFrames
Spark 1.3Static DataFrames
Single API !
![Page 16: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/16.jpg)
Structured Streaming
High-level streaming API built on Spark SQL engine• Runs the same queries on DataFrames• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries• Aggregate data in a stream, then serve using JDBC• Change queries at runtime• Build and apply ML models
![Page 17: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/17.jpg)
output fordata at 1
Result
Que
ry
Time
data upto PT 1
Input
completeoutput
Output
1 2 3
Trigger: every 1 sec
data upto PT 2
output fordata at 2
data upto PT 3
output fordata at 3
Model
![Page 18: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/18.jpg)
deltaoutput
output fordata at 1
Result
Que
ry
Time
data upto PT 2
data upto PT 3
data upto PT 1
Input
output fordata at 2
output fordata at 3
Output
1 2 3
Trigger: every 1 secModel
![Page 19: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/19.jpg)
Model Details
Input sources: append-only tables
Queries: new operators for windowing, sessions, etc
Triggers: based on time (e.g. every 1 sec)
Output modes: complete, deltas, update-in-place
![Page 20: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/20.jpg)
Example: ETL
Input: files in S3
Query: map (transform each record)
Trigger: “every 5 sec”
Output mode: “new records”, into S3 sink
![Page 21: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/21.jpg)
Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every 5 sec”
Output mode: “update-in-place”, into MySQL sink
Note: this will automatically update “old” records on late data!
![Page 22: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/22.jpg)
Logically:DataFrame operations on static data(i.e. as easy to understand as batch)
Physically:Spark automatically runs the query in streaming fashion(i.e. incrementally and continuously)
DataFrame
Logical Plan
Continuous, incremental execution
Catalyst optimizer
Execution
![Page 23: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/23.jpg)
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.save("jdbc:mysql//...")
Example: Batch Aggregation
![Page 24: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/24.jpg)
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
Example: Continuous Aggregation
![Page 25: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/25.jpg)
T = 0 Aggregate
AggregateT = 1
AggregateT = 2
…
Automatic Incremental Execution
![Page 26: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/26.jpg)
Rest of Spark will follow
• Interactive queries should just work
• Spark’s data source API will be updated to support seamless streaming integration• Exactly once semantics end-to-end• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
![Page 27: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/27.jpg)
What can we do with this that’s hard with other engines?Ad-hoc, interactive queries
Dynamic changing queries
Benefits of Spark: elastic scaling, straggler mitigation, etc
![Page 28: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/28.jpg)
Use Case: Fraud Detection
STREAM
ANOMALY
Machine Learning Modelcontinuously updates to detect new anomalies
Analyze Historic Data
![Page 29: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/29.jpg)
Timeline
Spark 2.0• API foundation• Kafka, file systems, and
databases• Event-time aggregations
Spark 2.1 + • Continuous SQL• BI app integration• Other streaming sources / sinks• Machine learning
![Page 30: Tecnicas e Instrumentos de Recoleccion de Datos](https://reader030.vdocumento.com/reader030/viewer/2022021801/58f12cf21a28abbd3e8b456d/html5/thumbnails/30.jpg)
Thank you.@rxin