Big Data software infrastructures

Gábor Hermann, Zoltán Zvara, András Benczúr
Informatics Laboratory, "Big Data – Momentum" research group
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI)

MTA SZTAKI, 10/11/2016
Outline

- Motivation for "Big Data"
- "Big Data" is about software infrastructure
- Batch and streaming approaches
- What we do at SZTAKI
- Solving a problem: handling data skew
Some data about data

- Google was guesstimated to have over 1M machines in 2012
- Large Hadron Collider (LHC) collisions generated about 75 petabytes over the past 3 years
- Facebook users sent 31.25 million messages every minute (2015)
The Project Triangle

- A small machine to store and process the data: slow
- Large servers are fast but costly
- Distributed computing?
Problem with distributed data processing

- When failures happen...
- The left side will not know about new data on the right
- An immediate response from the left side might give an incorrect answer
- If failures can happen (the network gets partitioned), we can choose either correctness (consistency) or fast response (availability)

[Figure: two replicas separated by a network partition]
CAP theorem (Fox & Brewer)

Theorem: you may choose two of C-A-P:

- C: Consistency (good)
- A: Availability (fast)
- P: Partition-resilience (cheap)

AP: some replicas may give an erroneous answer.

[Figure: C-A-P triangle]
Failures do happen in distributed data processing!

- We must have P, so we need to choose between A and C
- Fast response vs. correct results
- Most applications need a fast response
- The best we can do: eventual consistency (once the connection resumes, data can be exchanged)

- "Big Data" today is mostly about software infrastructure
- Trying to do the best within these constraints
Approach 1: batch processing

- Process the whole dataset
- Consistent, but takes time (hours, days)
- If a failure happens, wait for recovery (choosing CP)

- Apache Hadoop
  - MapReduce, HDFS (distributed file system)
  - Data is stored in chunks across many machines, replicated
  - Bring the computation close to the data
  - (A word-count sketch of the MapReduce model follows this slide)
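To make the programming model concrete, here is a minimal sketch of MapReduce-style word counting using plain Scala collections in place of a real cluster. The input lines are made up for illustration; in Hadoop they would be read from replicated HDFS chunks, with the map and reduce phases running on the machines that hold the data.

```scala
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data", "big software", "data data")

    // Map phase: emit a (word, 1) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle: group the pairs by key, as the framework would do across machines.
    val grouped = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each word.
    val counts = grouped.map { case (word, ones) => (word, ones.map(_._2).sum) }

    counts.foreach(println) // (big,2), (data,3), (software,1) in some order
  }
}
```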
Beyond Hadoop and MapReduce (batch)

- Hadoop was the first open-source distributed implementation of MapReduce
- Limitations:
  - Joins and more complex primitives
  - Graphs, machine learning

- Alternatives:
  - Graph processing: Apache Giraph, Apache HAMA, ...
  - In-memory: Apache Spark (a small caching sketch follows this slide)
  - Streaming dataflow engine: Apache Flink
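As a rough illustration of what "in-memory" buys, here is a minimal Spark sketch (the app name, local master setting, and data are hypothetical): an RDD that is reused by several actions can be cached in memory instead of being recomputed or re-read from disk, which is what sets Spark apart from disk-based MapReduce.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(x => x.toLong * x).cache() // keep in memory

    println(squares.sum()) // first action computes and caches the RDD
    println(squares.max()) // second action reuses the cached data

    sc.stop()
  }
}
```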
Approach 2: stream processing

- Continuously process all incoming data
- Faster response time (low latency, within 1 second)
- If a failure happens: wait for recovery
- Still choosing CP, but sophisticated recovery mechanisms give lower downtime

- Harder to implement and reason about
- Stream processing frameworks: Apache Flink, Apache Storm, Apache Spark
- (A windowed word-count sketch in Flink follows this slide)
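For a taste of the streaming model, here is a minimal word-count sketch on Flink's Scala API (as of the 1.x releases current in 2016), assuming a hypothetical text source on a local socket; counts are aggregated in 5-second windows rather than over the whole dataset.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical source: lines of text arriving on a local socket.
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .map(word => (word, 1))               // one (word, 1) pair per word
      .keyBy(0)                             // partition the stream by word
      .timeWindow(Time.seconds(5))          // aggregate in 5-second windows
      .sum(1)                               // sum the counts in each window

    counts.print()
    env.execute("Streaming word count sketch")
  }
}
```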
STREAMLINE H2020

- A new initiative on top of Apache Flink
- A general data processing framework to unify batch and stream processing
- At SZTAKI: machine learning

Partners:
- TU Berlin (DE) – Volker Markl
- DFKI (DE)
- SICS (SE)
- Portugal Telecom (PT)
- Internet Memory (FR)
- Rovio (FI)
- SZTAKI (HU)
Batch and stream: same execution engine

Flink: an engine that puts equal emphasis on streaming and batch.

[Figure: Flink architecture. Sources: real-time data streams and event logs (Kafka, RabbitMQ, ...) on the streaming side, historic data (HDFS, JDBC, ...) on the batch side. On top of the same engine: ETL, graphs, machine learning, relational queries (batch); low-latency windowing, aggregations, ... (streaming)]

(A batch counterpart of the earlier streaming job is sketched below.)
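To underline the "same engine" point, here is a minimal batch counterpart of the streaming word count above, on Flink's DataSet API; the small in-memory collection is a made-up stand-in for historic data that would normally come from HDFS or JDBC.

```scala
import org.apache.flink.api.scala._

object BatchWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val text = env.fromElements("big data", "big software", "data data")

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .map(word => (word, 1))               // one (word, 1) pair per word
      .groupBy(0)                           // group by the word field
      .sum(1)                               // sum the counts per word

    counts.print()
  }
}
```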
What we do for "Big Data" at SZTAKI

- Developing Apache Flink (STREAMLINE H2020)
  - Machine learning algorithms, experimenting

- Projects with industrial partners
  - Using Spark, Flink, Cassandra, Hadoop, etc.

- Research
  - Improvements to current systems
  - Ongoing project: handling data skew
Solving a problem: handling data skew

- We developed an application aggregating telco data
- After a while, on a real dataset, it could become slow or even crash
- We investigated the problem: data skew
  - 80% of the traffic is generated by 20% of the communication towers
The problem

- Default hashing is not going to distribute the data uniformly
- The data distribution is not known in advance
- The heavy keys might even change over time
- (A toy illustration follows this slide)

[Figure: a stage boundary with a shuffle; even partitioning of skewed data still yields slow tasks]
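Here is a toy illustration (the data is invented, not the project's telco dataset) of how default hash partitioning behaves under a skewed distribution: the hypothetical tower ID "t0" produces 80% of the records, so whichever partition it hashes to ends up a straggler.

```scala
object SkewDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    // 8000 records from the heavy tower, 2000 spread over 50 light towers.
    val records = Seq.fill(8000)("t0") ++
      (1 to 2000).map(i => s"t${i % 50 + 1}")

    // Mimic default hash partitioning: partition = hash(key) mod p.
    val sizes = records
      .groupBy(key => math.abs(key.hashCode) % numPartitions)
      .map { case (p, rs) => (p, rs.size) }

    (0 until numPartitions).foreach { p =>
      println(s"partition $p: ${sizes.getOrElse(p, 0)} records")
    }
    // One partition holds at least the 8000 "t0" records: 80% of the data.
  }
}
```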
Our solution: Dynamic Repartitioning

- Monitors the tasks and repartitions based on the collected data
- System-aware
- No significant overhead

- Can handle arbitrary data distributions
- Does everything on the fly
- Works for both streaming and batch

- Pluggable
  - Initially built on Spark batch and streaming
  - Plugged into Flink streaming

(A simplified sketch of the core idea follows this slide.)
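The sketch below is a heavily simplified illustration of the core idea, not SZTAKI's actual implementation: once run-time monitoring has identified the heavy keys, a key-aware Spark partitioner can give each heavy key a partition of its own and hash the light keys over the remaining partitions. The real system does this on the fly, without a priori knowledge of the keys.

```scala
import org.apache.spark.Partitioner

// Hypothetical key-aware partitioner: heavy keys get dedicated partitions.
class SkewAwarePartitioner(heavyKeys: Seq[Any], lightPartitions: Int)
    extends Partitioner {

  private val heavyIndex: Map[Any, Int] = heavyKeys.zipWithIndex.toMap

  override def numPartitions: Int = heavyKeys.size + lightPartitions

  override def getPartition(key: Any): Int =
    heavyIndex.getOrElse(key, {
      // Light keys fall back to hashing over the remaining partitions
      // (the double modulo keeps the result non-negative).
      val h = ((key.hashCode % lightPartitions) + lightPartitions) % lightPartitions
      heavyKeys.size + h
    })
}
```

A pair RDD could then be repartitioned with, e.g., `pairs.partitionBy(new SkewAwarePartitioner(Seq("t0"), 7))`, giving the hypothetical heavy tower from the earlier toy example a partition of its own.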
Execution visualization of Spark jobs
Future work in handling data skew

- Generalizing the problem
- Balancing load within a processing system
- Balancing resources between processing systems on a cluster (YARN)
Conclusion

- We can tame "Big Data" with better software
- A lot remains to be done...
  - The connection between batch and streaming
  - Highly scalable machine learning (both batch and online)
  - Optimizations (e.g. handling data skew)
  - Better understanding of our tools (e.g. visualization)
  - ...

- The field is evolving fast, but we can take part in it
Questions?

You can reach us:
- [email protected]
- [email protected]
- [email protected]
References

- STREAMLINE H2020: https://streamline.sics.se/
- Dynamic Repartitioning: https://spark-summit.org/2016/events/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning/
- Visualizations: http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
References (continued)

- Apache Hadoop: https://hadoop.apache.org/
- Apache Flink: https://flink.apache.org/
- Apache Spark: https://spark.apache.org/
References (continued) n E. A. Brewer. Towards robust distributed systems, 2000 n J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, 2004 n N. Marz. How to beat the CAP theorem, 2011 n http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html