Big Data software infrastructures

Gábor Hermann, Zoltán Zvara, András Benczúr
Informatics Laboratory, "Big Data – Momentum" research group
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)

MTA SZTAKI, 10/11/2016
Outline

- Motivation for "Big Data"
- "Big Data" is about software infrastructure
- Batch and streaming approaches
- What we do at SZTAKI
- Solving a problem: handling data skew
Some data about data

- Google was guesstimated to have over 1M machines in 2012
- Large Hadron Collider (LHC) collisions generated about 75 petabytes over the past 3 years
- Facebook users sent 31.25 million messages every minute (2015)
The Project Triangle

- A small machine to store and process the data: slow
- Large servers are fast but costly
- Distributed computing?
Problem with distributed data processing

- When failures happen...
- The left side will not know about new data on the right
- An immediate response from the left side might give an incorrect answer
- If failures can happen (the network gets partitioned), we can choose either correctness (consistency) or fast response (availability)

[Figure: two replicas separated by a network partition]
CAP theorem (Fox & Brewer)

Theorem: you may choose two of C-A-P:

- C: Consistency (good)
- A: Availability (fast)
- P: Partition-resilience (cheap)

AP: some replicas may give an erroneous answer.

[Figure: C-A-P triangle]
Failures do happen in distributed data processing!

- We must have P, so we need to choose between A and C
- Fast response vs. correct results
- Most applications need a fast response
- The best we can do: eventual consistency (once the connection resumes, data can be exchanged)

- "Big Data" today is mostly about software infrastructure
- Trying to do the best within these constraints
Approach 1: batch processing

- Process the whole dataset
- Consistent, but takes time (hours, days)
- If a failure happens, wait for recovery (choosing CP)

- Apache Hadoop
  - MapReduce, HDFS (distributed file system)
  - Data is stored in chunks across many machines, replicated
  - Bring the computation close to the data
  - (A word-count sketch of the MapReduce model follows this slide)
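To make the programming model concrete, here is a minimal sketch of MapReduce-style word counting using plain Scala collections in place of a real cluster. The input lines are made up for illustration; in Hadoop they would be read from replicated HDFS chunks, with the map and reduce phases running on the machines that hold the data.

```scala
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data", "big software", "data data")

    // Map phase: emit a (word, 1) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle: group the pairs by key, as the framework would do across machines.
    val grouped = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each word.
    val counts = grouped.map { case (word, ones) => (word, ones.map(_._2).sum) }

    counts.foreach(println) // (big,2), (data,3), (software,1) in some order
  }
}
```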
Beyond Hadoop and MapReduce (batch)

- Hadoop was the first open-source distributed implementation of MapReduce
- Limitations:
  - Joins and more complex primitives
  - Graphs, machine learning

- Alternatives:
  - Graph processing: Apache Giraph, Apache HAMA, ...
  - In-memory: Apache Spark (a small caching sketch follows this slide)
  - Streaming dataflow engine: Apache Flink
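As a rough illustration of what "in-memory" buys, here is a minimal Spark sketch (the app name, local master setting, and data are hypothetical): an RDD that is reused by several actions can be cached in memory instead of being recomputed or re-read from disk, which is what sets Spark apart from disk-based MapReduce.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(x => x.toLong * x).cache() // keep in memory

    println(squares.sum()) // first action computes and caches the RDD
    println(squares.max()) // second action reuses the cached data

    sc.stop()
  }
}
```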
Approach 2: stream processing

- Continuously process all incoming data
- Faster response time (low latency, within 1 second)
- If a failure happens: wait for recovery
- Still choosing CP, but sophisticated recovery mechanisms give lower downtime

- Harder to implement and reason about
- Stream processing frameworks: Apache Flink, Apache Storm, Apache Spark
- (A windowed word-count sketch in Flink follows this slide)
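For a taste of the streaming model, here is a minimal word-count sketch on Flink's Scala API (as of the 1.x releases current in 2016), assuming a hypothetical text source on a local socket; counts are aggregated in 5-second windows rather than over the whole dataset.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines of text arriving on a local socket.
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .map(word => (word, 1))               // one (word, 1) pair per word
      .keyBy(0)                             // partition the stream by word
      .timeWindow(Time.seconds(5))          // aggregate in 5-second windows
      .sum(1)                               // sum the counts in each window

    counts.print()
    env.execute("Streaming word count sketch")
  }
}
```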
STREAMLINE H2020

- A new initiative on top of Apache Flink
- A general data processing framework to unify batch and stream processing
- At SZTAKI: machine learning

Partners:
- TU Berlin (DE) – Volker Markl
- DFKI (DE)
- SICS (SE)
- Portugal Telecom (PT)
- Internet Memory (FR)
- Rovio (FI)
- SZTAKI (HU)
Batch and stream: same execution engine

Flink: an engine that puts equal emphasis on streaming and batch.

[Figure: Flink architecture. Sources: real-time data streams and event logs (Kafka, RabbitMQ, ...) on the streaming side, historic data (HDFS, JDBC, ...) on the batch side. On top of the same engine: ETL, graphs, machine learning, relational queries (batch); low-latency windowing, aggregations, ... (streaming)]

(A batch counterpart of the earlier streaming job is sketched below.)
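To underline the "same engine" point, here is a minimal batch counterpart of the streaming word count above, on Flink's DataSet API; the small in-memory collection is a made-up stand-in for historic data that would normally come from HDFS or JDBC.

```scala
import org.apache.flink.api.scala._

object BatchWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val text = env.fromElements("big data", "big software", "data data")

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .map(word => (word, 1))               // one (word, 1) pair per word
      .groupBy(0)                           // group by the word field
      .sum(1)                               // sum the counts per word

    counts.print()
  }
}
```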
What we do for "Big Data" at SZTAKI

- Developing Apache Flink (STREAMLINE H2020)
  - Machine learning algorithms, experimenting

- Projects with industrial partners
  - Using Spark, Flink, Cassandra, Hadoop, etc.

- Research
  - Improvements to current systems
  - Ongoing project: handling data skew
Solving a problem: handling data skew

- We developed an application aggregating telco data
- After a while, on a real dataset, it could become slow or even crash
- We investigated the problem: data skew
  - 80% of the traffic is generated by 20% of the communication towers
The problem

- Default hashing is not going to distribute the data uniformly
- The data distribution is not known in advance
- The heavy keys might even change over time
- (A toy illustration follows this slide)

[Figure: a stage boundary with a shuffle; even partitioning of skewed data still yields slow tasks]
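Here is a toy illustration (the data is invented, not the project's telco dataset) of how default hash partitioning behaves under a skewed distribution: the hypothetical tower ID "t0" produces 80% of the records, so whichever partition it hashes to ends up a straggler.

```scala
object SkewDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    // 8000 records from the heavy tower, 2000 spread over 50 light towers.
    val records = Seq.fill(8000)("t0") ++
      (1 to 2000).map(i => s"t${i % 50 + 1}")

    // Mimic default hash partitioning: partition = hash(key) mod p.
    val sizes = records
      .groupBy(key => math.abs(key.hashCode) % numPartitions)
      .map { case (p, rs) => (p, rs.size) }

    (0 until numPartitions).foreach { p =>
      println(s"partition $p: ${sizes.getOrElse(p, 0)} records")
    }
    // One partition holds at least the 8000 "t0" records: 80% of the data.
  }
}
```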
Our solution: Dynamic Repartitioning

- Monitors the tasks and repartitions based on the collected data
- System-aware
- No significant overhead

- Can handle arbitrary data distributions
- Does everything on the fly
- Works for both streaming and batch

- Pluggable
  - Initially built on Spark batch and streaming
  - Plugged into Flink streaming

(A simplified sketch of the core idea follows this slide.)
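The sketch below is a heavily simplified illustration of the core idea, not SZTAKI's actual implementation: once run-time monitoring has identified the heavy keys, a key-aware Spark partitioner can give each heavy key a partition of its own and hash the light keys over the remaining partitions. The real system does this on the fly, without a priori knowledge of the keys.

```scala
import org.apache.spark.Partitioner

// Hypothetical key-aware partitioner: heavy keys get dedicated partitions.
class SkewAwarePartitioner(heavyKeys: Seq[Any], lightPartitions: Int)
    extends Partitioner {

  private val heavyIndex: Map[Any, Int] = heavyKeys.zipWithIndex.toMap

  override def numPartitions: Int = heavyKeys.size + lightPartitions

  override def getPartition(key: Any): Int =
    heavyIndex.getOrElse(key, {
      // Light keys fall back to hashing over the remaining partitions
      // (the double modulo keeps the result non-negative).
      val h = ((key.hashCode % lightPartitions) + lightPartitions) % lightPartitions
      heavyKeys.size + h
    })
}
```

A pair RDD could then be repartitioned with, e.g., `pairs.partitionBy(new SkewAwarePartitioner(Seq("t0"), 7))`, giving the hypothetical heavy tower from the earlier toy example a partition of its own.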
Execution visualization of Spark jobs
Future work in handling data skew

- Generalizing the problem
- Balancing load within a processing system
- Balancing resources between processing systems on a cluster (YARN)
Conclusion

- We can tame "Big Data" with better software
- A lot remains to be done...
  - The connection between batch and streaming
  - Highly scalable machine learning (both batch and online)
  - Optimizations (e.g. handling data skew)
  - Better understanding of our tools (e.g. visualization)
  - ...

- The field is evolving fast, but we can take part in it
Questions?

You can reach us:
- [email protected]
- [email protected]
- [email protected]
References

- STREAMLINE H2020: https://streamline.sics.se/
- Dynamic Repartitioning: https://spark-summit.org/2016/events/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning/
- Visualizations: http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
References (continued)

- Apache Hadoop: https://hadoop.apache.org/
- Apache Flink: https://flink.apache.org/
- Apache Spark: https://spark.apache.org/
References (continued) n E. A. Brewer. Towards robust distributed systems, 2000 n J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, 2004 n N. Marz. How to beat the CAP theorem, 2011 n http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html