tech spark presentation
TRANSCRIPT
Big Data Stephen Borg
Tech Spark – Microsoft Innovation Center
What we are covering today
• Problems that Big Data can help with
• Infrastructure setup through Azure
• Architecture for our use case
• Spark application, NoSQL database
• Presentation of data through Zeppelin
Problems…
• Time: the ETL process takes most of the night
• Processes crash during peak times due to application processing
• You have outgrown what you can do with an RDBMS such as MySQL, SQL Server or Oracle
• Scaling out your environment is a big issue
Problems…
• Non-structured data doesn’t fit my RDBMS
• Hitting the limit with optimisation
• Optimising the backend or the SQL only buys time
• Not being able to be proactive because analytics take too long to process
• Analytics not being on time!
Moving on to Hadoop
Typical components of big data systems
• Distributed databases: HBase, Hive
• Distributed processing systems: MapReduce & YARN
• Distributed file systems: HDFS
Industry-wide problem: Payment Fraud
• Merchants are losing $250B globally
• Cost of fraud is around 1% of revenue for retailers (2014)
• Fraud increases due to newer channels on the market
• Reference: http://www.lexisnexis.com/risk/downloads/assets/true-cost-fraud-2014.pdf
The requirement… Fraud toolkit
Knowledge from past payment fraud
Batch processing
RT Analytics
Interactive Analytics
Fraud: Anomaly
• Generic rules
• Base rules on scores
• Use of models to detect fraud
CATCH THEM IN THE ACT!
Fraudsters: flag examples
• Stolen cards
• Buy expensive items
• In larger than usual quantities
• Very quickly
• At odd hours based on country
• Risky country?
• Blacklisted IP
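The flags above can be combined into a score, in the spirit of "base rules on scores". The following is a minimal Python sketch, not the demo's actual rules: every threshold, weight and lookup set is a hypothetical placeholder, and the "very quickly" flag is left out because it needs per-customer history.

```python
# Hypothetical lookup data and cut-offs; none of these values come from the demo.
RISKY_COUNTRIES = {"XX"}
BLACKLISTED_IPS = {"203.0.113.7"}
FLAG_THRESHOLD = 50


def score_transaction(tx, local_hour):
    """Return (score, reasons) for one transaction dict.

    local_hour is the hour of day (0-23) in the customer's country.
    """
    checks = [
        # (reason, did the rule fire?, weight added to the score)
        ("expensive item", tx["itemPrice"] > 1000.0, 30),
        ("large quantity", tx["quantity"] > 10, 20),
        ("odd hour", local_hour < 6, 15),
        ("risky country", tx["countryCode"] in RISKY_COUNTRIES, 15),
        ("blacklisted IP", tx["ip"] in BLACKLISTED_IPS, 40),
    ]
    reasons = [name for name, hit, _ in checks if hit]
    score = sum(weight for _, hit, weight in checks if hit)
    return score, reasons


def is_potential_fraud(tx, local_hour):
    """A transaction is flagged once its score reaches the threshold."""
    score, _ = score_transaction(tx, local_hour)
    return score >= FLAG_THRESHOLD
```

Keeping the reasons next to the score makes it possible to explain later why a customer was flagged.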
What are the right tools?
Answer: there isn’t just one! However… pick wisely…
Our selection of tools and languages…
Microsoft Azure for a virtual env.
Our node configuration…
[Diagram] Nodes: services01, services02, services03, gateway, ambari, Worker01, Worker02, Worker03
[Diagram] Roles: Resource Manager, Active NN, Metrics & Ambari, Service Clients, HBase Master, Secondary NN, Spark History Server, Zookeeper Server, Kafka Cluster (×3)
[Diagram] Node sizes: 2 core 14GB RAM (×3), 1 core 3.5GB, 2 core 7GB RAM
Total of 8 cores and 42GB, 84GB SSD
Architecture working together
Setting up the infrastructure
• Static IPs & passwordless authentication from the Ambari node to all cluster nodes
Administer your cluster through Ambari
• Alerts
• Change configurations easily
• Handle rolling restarts
• Add more nodes to your cluster
• Manage upgrades of Hadoop
Brief overview of Ambari
First step: Mock Data!
• Using fluttercode random test data to mock payments
• Typical transaction:
{"cardNumber":"3584237251420382","longtitude":"46.88295","latitude":"36.54631","itemPrice":6.2724032E7,"quantity":1010366557,"currencyCode":"MXN","email":"[email protected]","ip":"184.149.76.20","username":"[email protected]","customerId":"1af87f6a-d6a8-4c7e-b5f5-d11410878eb8","countryCode":"KH","paymentMethodName":"Mastercard","productName":"Nike T-Shirt M","productCategory":"Clothes","premium":false,"timestamp":1479941448382}
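As a rough illustration of what a data mocker produces, here is a minimal Python sketch that builds a record with the same keys as the sample above. It is not the demo's mocker: the value ranges, the currency, country and payment method choices, and the `example.com` address are all assumptions made for illustration.

```python
import json
import random
import time
import uuid


def mock_transaction(rng):
    """Build one fake payment with the same keys as the sample record."""
    return {
        "cardNumber": "".join(rng.choice("0123456789") for _ in range(16)),
        "longtitude": f"{rng.uniform(-180, 180):.5f}",  # key spelled as in the sample
        "latitude": f"{rng.uniform(-90, 90):.5f}",
        "itemPrice": round(rng.uniform(1, 5000), 2),      # assumed price range
        "quantity": rng.randint(1, 20),                   # assumed quantity range
        "currencyCode": rng.choice(["EUR", "USD", "MXN"]),
        "email": "user@example.com",                      # placeholder address
        "ip": ".".join(str(rng.randint(1, 254)) for _ in range(4)),
        "username": "user@example.com",
        "customerId": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "countryCode": rng.choice(["MT", "KH", "MX"]),
        "paymentMethodName": rng.choice(["Mastercard", "Visa"]),
        "productName": "Nike T-Shirt M",
        "productCategory": "Clothes",
        "premium": rng.random() < 0.1,
        "timestamp": int(time.time() * 1000),             # epoch milliseconds
    }


if __name__ == "__main__":
    # One JSON document per line, ready to be sent as a message.
    print(json.dumps(mock_transaction(random.Random())))
```

Passing the random generator in makes the output reproducible when a fixed seed is used.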
Sending data to Kafka
• Our Java application will be our data mocker, and sends data to Kafka as a producer
• Kafka uses Zookeeper for configurations
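The producer side boils down to serialising each transaction and handing it to a Kafka client. The demo does this in Java; the Python sketch below only shows the shape of that loop. The `send(topic, key, value)` callable and the topic name are hypothetical: in a real producer `send` would wrap the Kafka client's own send method.

```python
import json


def publish(transactions, send, topic):
    """Serialise each transaction to JSON bytes and pass it to `send`.

    `send` is a hypothetical callable taking (topic, key, value); it stands in
    for a real Kafka producer so that the loop can run without a broker.
    Returns the number of messages handed over.
    """
    count = 0
    for tx in transactions:
        key = tx["customerId"].encode("utf-8")    # messages keyed by customer
        value = json.dumps(tx).encode("utf-8")    # one JSON document per message
        send(topic, key, value)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a producer: just print what would be sent.
    publish(
        [{"customerId": "c1", "quantity": 2}],
        lambda topic, key, value: print(topic, key, value),
        "demo-topic",  # hypothetical topic name
    )
```

Keying by customer is a design choice for this sketch: with Kafka's default partitioner, records that share a key go to the same partition, which keeps one customer's transactions in order.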
Messages received can be seen by Kafka console consumers
Preparing to store data - NoSQL
• Use of HBase
• Create several tables to store lookup data
• These tables can be populated by an overnight process
• Compare incoming transactions (TX) with this data
Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.
Create HBase tables & insert data
• Open the hbase shell and execute “list” to list the available tables
Querying HBase has a bit of a learning curve…
• Query with a filter
scan 'PotentialFraudStream', { COLUMNS => ['info:Day'], FILTER => "RowFilter(=, 'substring:2016-11-24')" }
• Query limiting results
scan "PotentialFraudStream", {LIMIT => 5}
Apache Phoenix puts SQL back in NoSQL…
• Open up a session in Phoenix…
• Create a view that maps to our HBase table
CREATE VIEW "PotentialFraudStream" (
  "CustomerId" VARCHAR PRIMARY KEY,
  "info"."Score" VARCHAR,
  "info"."Lon" VARCHAR,
  "info"."Lat" VARCHAR,
  "info"."Message" VARCHAR,
  "info"."LastTransactionTime" VARCHAR,
  "info"."LastTransactionJSON" VARCHAR,
  "info"."Day" VARCHAR,
  "info"."Country" VARCHAR
);
And there you have it…
• We can use SQL again…
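To give a feel for the kind of SQL that becomes possible, here is a small Python sketch that runs an aggregate query against a table named like the view. It uses the standard-library `sqlite3` module purely as a stand-in so the example runs anywhere; Phoenix itself is reached over JDBC, its SQL dialect differs in places, and the rows below are made up.

```python
import sqlite3

# In-memory SQLite table standing in for the Phoenix view (subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "PotentialFraudStream" ('
    '"CustomerId" TEXT PRIMARY KEY, "Score" TEXT, "Country" TEXT, "Day" TEXT)'
)
conn.executemany(
    'INSERT INTO "PotentialFraudStream" VALUES (?, ?, ?, ?)',
    [
        # Made-up rows for illustration only.
        ("c1", "80", "KH", "2016-11-24"),
        ("c2", "65", "KH", "2016-11-24"),
        ("c3", "90", "MX", "2016-11-24"),
    ],
)


def flagged_per_country(connection, day):
    """Count flagged customers per country for one day, busiest country first."""
    query = (
        'SELECT "Country", COUNT(*) AS flagged '
        'FROM "PotentialFraudStream" WHERE "Day" = ? '
        'GROUP BY "Country" ORDER BY flagged DESC, "Country"'
    )
    return connection.execute(query, (day,)).fetchall()


if __name__ == "__main__":
    print(flagged_per_country(conn, "2016-11-24"))
```

The double-quoted identifiers matter: the view was created with quoted, mixed-case names, so queries have to quote them the same way.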
Benchmarking Phoenix
• Phoenix vs Hive (running over HDFS & HBase)
https://phoenix.apache.org/performance.html
Recap…
• So now we have our tables in HBase to be accessed in real time
• We have a table that can be queried by users in real time
• So let’s insert the lookup information
https://stephenborg.atlassian.net/wiki/display/BD/TechSpark+Demo
Apache Spark
• Large-scale in-memory processing engine
• Processes data in micro-batches
• Supports Java, Scala, Python & R
• Configure batches of X seconds and allocate resources to a context, which then processes the batches
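The micro-batch idea can be shown without Spark at all. The Python sketch below groups events into fixed windows of X seconds. It is a simplification and an assumption on two counts: Spark Streaming forms batches by arrival time on the receiver, whereas this sketch uses the timestamp carried in each event, and the function name is made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp_ms, payload) events into fixed windows.

    Returns a list of (window_start_ms, [payloads]) sorted by window start.
    """
    window_ms = batch_seconds * 1000
    batches = defaultdict(list)
    for timestamp_ms, payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp_ms // window_ms) * window_ms
        batches[window_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


if __name__ == "__main__":
    events = [(0, "a"), (14_999, "b"), (15_000, "c"), (31_000, "d")]
    for start, payloads in micro_batches(events, batch_seconds=15):
        print(start, payloads)
```

Each batch is then processed as one small job, which is why the batch interval trades latency against per-batch overhead.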
Submitting the Spark application to YARN
spark-submit --conf "spark.ui.port=4099" --master yarn --class com.techsparkdemo.StreamBooter /home/stephen.borg/spark-streamer-1.0-SNAPSHOT-jar-with-dependencies.jar "/tmp/SparkCheckpoints/StreamingCheckpoint" "Tech Spark Demo" "yarn-client" 15 "104.40.216.218:2181,52.174.110.65:2181,13.95.23.152:2181" "hdfs://13.95.23.152:8020" "purchasesStreamingDemo" "Purchases" "/hbase-unsecure" 10
Additional options can be:
• --num-executors 3 (number of executors; the cluster has three workers)
• --executor-memory 10G (memory allocated per executor)
• --driver-memory 2G (memory for the driver; in yarn-client mode the driver runs on the machine that submits the application)
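When the command is assembled by a script, the one thing to get right is ordering: spark-submit options go before the application jar, and everything after the jar is passed to the application. A minimal Python sketch of that assembly follows; the helper name and its parameters are hypothetical, while the flags themselves are the ones named above.

```python
import shlex


def build_spark_submit(jar, main_class, app_args, num_executors=None,
                       executor_memory=None, driver_memory=None):
    """Return the spark-submit command as a list of arguments."""
    cmd = ["spark-submit", "--master", "yarn", "--class", main_class]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    if driver_memory is not None:
        cmd += ["--driver-memory", driver_memory]
    # Options must precede the jar; what follows it goes to the application.
    cmd.append(jar)
    cmd += [str(arg) for arg in app_args]
    return cmd


if __name__ == "__main__":
    command = build_spark_submit(
        "app.jar",  # placeholder jar path
        "com.techsparkdemo.StreamBooter",
        ["Tech Spark Demo", 15],
        num_executors=3,
        executor_memory="10G",
        driver_memory="2G",
    )
    # shlex.join quotes arguments that contain spaces (Python 3.8+).
    print(shlex.join(command))
```

Building the command as a list rather than a string avoids quoting mistakes with arguments such as the application name.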
Spark History Server
• Optimize your Spark application
• Know where resources are allocated, and debug failures
Recap again…
• So we now have our data mocker sending us data
• We have data being processed by the Spark application; potential fraud customers are funneled into a stream and written into a specific table in HBase
Let’s now visualise that data…
• Zeppelin: a web-based notebook that allows developers to create interactive analyses in different languages such as Scala, Java and SQL
• You tell Zeppelin which language you are supplying with the special reserved keyword %[interpreter] at the beginning of each notebook paragraph
What interpreters will we use?
• Up to the developer’s choice, but we will use:
• %dep – to load a dependent JAR
• %jdbc (phoenix) – to communicate with Phoenix
• %angular – to plot an OpenStreetMap using Angular
A mini ETL to show potential
• Step 1: Load dependencies
• Step 2: Collect and prepare data in an array using Scala
• Step 3: Plot chart via Angular script
• Step 4: Schedule via embedded cron scheduler
• Next: Results can be used to take action
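Step 2 is mostly data cleaning. The notebook does it in Scala; the Python sketch below only illustrates the idea, under the assumption that rows come back as (customer id, latitude, longitude, message) tuples. Because the Phoenix view declares every column as VARCHAR, coordinates arrive as strings and have to be parsed before they can be plotted. The marker field names are hypothetical.

```python
def to_markers(rows):
    """Turn (customer_id, lat, lon, message) rows into map marker dicts.

    Rows whose coordinates are missing or not numeric are skipped.
    """
    markers = []
    for customer_id, lat, lon, message in rows:
        try:
            marker = {
                "id": customer_id,
                "lat": float(lat),   # VARCHAR in the view, so parse it
                "lon": float(lon),
                "popup": message,
            }
        except (TypeError, ValueError):
            continue  # unusable coordinates: leave this row off the map
        markers.append(marker)
    return markers


if __name__ == "__main__":
    rows = [("c1", "36.54631", "46.88295", "high score"), ("c2", None, "1.0", "no fix")]
    print(to_markers(rows))
```

Skipping bad rows rather than failing keeps a scheduled notebook run from breaking on a single malformed record.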
An OpenStreetMap for fraud
Most important point
• Phoenix uses JDBC, and many tools can extract data either via UI or code
Points that can backfire
• So what if people buy expensive items? Rich person
• Someone can also buy a lot… Buying gifts
• Very quickly… Busy person
• At odd hours… Who doesn’t have a long night every now and then?
Room for improvement
• We can apply data science models to the Spark application and not just rules
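As a hint of what "not just rules" could mean, the Python sketch below replaces a fixed quantity threshold with a per-customer z-score, which is about the simplest statistical model there is. It is an illustration only, not something the demo implements, and the function name and the choice of quantity as the feature are assumptions.

```python
from statistics import mean, pstdev


def quantity_z_score(history, new_quantity):
    """How many standard deviations new_quantity lies from the customer's past mean.

    Returns 0.0 when there is too little history to judge.
    """
    if len(history) < 2:
        return 0.0
    spread = pstdev(history)
    if spread == 0:
        # All past purchases were identical: any deviation is maximally unusual.
        return 0.0 if new_quantity == history[0] else float("inf")
    return (new_quantity - mean(history)) / spread


if __name__ == "__main__":
    print(quantity_z_score([1, 2, 3], 10))
```

Unlike a global rule such as "more than 10 items", this adapts to each customer: a bulk buyer's usual order is not unusual for them, which addresses some of the false positives a fixed threshold produces.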
Worth mentioning
• ZeppelinHub for keeping track of your notebooks
• All that we mentioned today is documented at https://stephenborg.atlassian.net/wiki/display/BD/TechSpark+Demo
• All source code for this demo: https://bitbucket.org/stephenborg1987/techspark
References
• https://www.forrester.com/report/Stop+Billions+In+Fraud+Losses+With+Machine+Learning/-/E-RES120912
• https://pkghosh.wordpress.com/2013/10/21/real-time-fraud-detection-with-sequence-mining/
Any help, questions
• Feel free to drop a line