
10 essential skills: (4) understanding big data processing frameworks.

  • Writer: Pamela Kinga Gill
  • Jan 19, 2019
  • 7 min read

Updated: Jan 20, 2019


It boggles my mind that 90% of the world's data was created in just the last two years, from 2016 to 2018.



Our activities and interactions on the internet, in communication, on social media, and in snapping digital photos are all adding to this deluge of "big data". As discussed in 10 essential skills: (3) knowledge of databases, all of this data needs to be stored; otherwise, how can companies even begin to uncover its value? And yet there is an important next step: the process of extracting information and insight from stored data.


The methods and frameworks used to process large data sets are considered "big data technology", and you'll find that advancements in the area are widely discussed in computer science journals. Having read through a few of these papers, I'll do my best to distill the important features below.


The obvious champions of converting user-generated data into insights, products, revenue, etc... are Google, Amazon, and Facebook. These companies are also among the most valuable in the world. They understand better than anyone the importance of big data processing to their core business. In fact, these companies are also pioneering the very big data technologies that are changing the world we live in today!


This article discusses big data processing frameworks - the technologies behind the achievements mentioned above - and provides some examples. The significance of processing frameworks becomes clear when we define "big data."


Step 1: So what is big data?


In 2001, Gartner analyst Doug Laney put forward the three defining characteristics of "big data": volume, velocity, and variety. Volume refers to the size of the data; velocity is the speed at which that data is coming in; and variety is the diversity of data types (structured, unstructured, and in between). Each of these vectors presents new challenges for data processing. Old-school data management is insufficient given the volume, velocity, and variety of "big data", which requires non-traditional technologies and strategies to gather, organize, and process information assets.



Processing frameworks

There are three types of processing frameworks: batch, stream, and a hybrid of the two.

  1. Batch processing

Batch processing operates over a large, fixed data set and returns the results once complete. This type of processing requires that the data set be finite, persistent, and large - historical data is a good example. Batch processing is popular across many organizations, where operations can often run their large batch jobs overnight and return to the results in the morning... (if no faults occurred... perhaps another conversation!?) An example of a widely used batch processing framework is Apache Hadoop.

Hadoop was built for big data - I think this is why its mascot is an elephant - and it is powerful open-source software that exclusively performs batch processing. It is a distributed processing framework, which means it can manage hundreds or thousands of servers to store and process structured, unstructured, and semi-structured big data. This scalability and cluster management is what makes Hadoop so impressive.


At the heart of Hadoop is its native engine - a brilliantly simple, yet powerful, programming paradigm - MapReduce. This key algorithm distributes work across the cluster by first mapping the input data from each file block into key-value pairs, then grouping those pairs by key, and finally reducing each group into an aggregated result. At the same time, processing is distributed across multiple servers running in parallel.
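
To make the map and reduce phases concrete, here is a minimal word-count sketch in plain Python (not the actual Hadoop API, which is Java-based): the map step emits key-value pairs, the pairs are grouped by key, and the reduce step aggregates each group.

```python
from collections import defaultdict

# A toy "input split": in real Hadoop these lines would come from HDFS blocks.
lines = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each group into a single result per key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}
```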


Hadoop is built on three layers: HDFS, YARN and MapReduce.


HDFS is the Hadoop Distributed File System, which manages the storage and replication of data across the cluster. It is the layer responsible for taking huge files and breaking them into smaller parts known as data blocks, which are then stored and replicated. YARN (Yet Another Resource Negotiator) is the layer that schedules and monitors (at times very complex) processing jobs across the servers.
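
As a back-of-the-envelope illustration of what HDFS does with a large file (assuming the common defaults of a 128 MB block size and a replication factor of 3, both of which are configurable):

```python
import math

file_size_mb = 1024          # a 1 GB file
block_size_mb = 128          # common default HDFS block size
replication_factor = 3       # common default HDFS replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)
total_block_copies = num_blocks * replication_factor

print(num_blocks)            # 8 data blocks
print(total_block_copies)    # 24 block copies spread across the cluster's servers
```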


The Hadoop cluster is so powerful that it is the building block for other software. Hadoop clusters are in use at Yahoo and Facebook, among others, and the framework itself grew out of papers Google published on its own MapReduce engine and file system.

Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its own hardware for this purpose. Hadoop is just one of many Big Data technologies employed at Facebook. - Ken Rudin, Facebook Inc. analytics chief

  2. Stream processing

In contrast to the batch paradigm, stream processing computes over data as it enters the system, so results are immediate and constantly updated as data becomes available. This framework therefore applies to individual data points rather than to the data as a complete set; with stream processing, there is no "end" to the data, as it continuously enters the system and is processed in 'real-time'. An example of a widely used stream processing framework is Apache Storm.



Apache Storm provides stream processing with incredibly 'low latency' - meaning it can return insights in less time than, say, batch frameworks, or in near 'real-time'. This is important when a user's experience depends on immediate results, among other use cases.


Apache Storm's stream processing works by creating Directed Acyclic Graphs (DAGs) from user-defined topologies. Given this functionality, it is important that Apache Storm has multi-language capabilities, so that its users can write specific topologies for the Storm cluster and process data in a way that is unique to their requirements.


The steps in a topology are small, discrete transformations applied to each data point entering the system. Sort of like: addition, subtraction, taking the inverse, etc... Of course, these transformations can get quite complex when necessary.


In the Storm framework, a topology is a graph of spouts and bolts that are connected with stream groupings. Just as Hadoop runs MapReduce jobs, Storm runs topologies. The difference is that MapReduce jobs finish, while topologies run forever or until "killed". Hopefully, at this point, it is clear why this distinction is purposeful.
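
To get a feel for spouts and bolts without Storm's actual API, here is a conceptual Python sketch: the "spout" is a generator emitting an unbounded stream of tuples, and each "bolt" is a small transformation chained onto it. This is a toy stand-in for a topology, not real Storm code; the sensor names and threshold are made up for illustration.

```python
import itertools
import random

def sensor_spout():
    """A toy 'spout': emits an unbounded stream of (sensor_id, reading) tuples."""
    for i in itertools.count():
        yield (f"sensor-{i % 3}", random.uniform(0.0, 100.0))

def celsius_to_fahrenheit_bolt(stream):
    """A toy 'bolt': one small transformation applied to each tuple as it arrives."""
    for sensor_id, reading_c in stream:
        yield (sensor_id, reading_c * 9 / 5 + 32)

def alert_bolt(stream, threshold_f=200.0):
    """Another 'bolt': filters the stream, passing along only hot readings."""
    for sensor_id, reading_f in stream:
        if reading_f > threshold_f:
            yield (sensor_id, reading_f)

# Wiring spout -> bolt -> bolt approximates a (very small) topology.
topology = alert_bolt(celsius_to_fahrenheit_bolt(sensor_spout()))

# Like a real topology, this would run "forever"; we stop after a few results.
for event in itertools.islice(topology, 5):
    print("ALERT:", event)
```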


When topologies are applied to individual data points, this is considered pure stream processing, and herein lies Storm's true strength. However, for reasons that need not be discussed just yet, there is also a layer called Trident which allows for processing in micro-batches, but has the effect of increasing the latency of output - i.e. results are returned in "less-than near real-time." Once Trident is applied, Storm loses its real-time advantage and other frameworks become more attractive.
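
For intuition on micro-batching in general (a generic sketch, not Trident's actual API): instead of handling each event the moment it arrives, events are buffered into small groups and each group is processed as a unit, which is where the added latency comes from.

```python
import itertools

def micro_batches(stream, batch_size):
    """Group a (potentially endless) stream of events into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        yield batch

events = range(1, 11)  # stand-in for a stream of incoming events

# Each batch is processed as a unit -- results arrive per batch, not per event.
for batch in micro_batches(events, batch_size=3):
    print("processing batch:", batch, "-> sum =", sum(batch))
```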


  3. Hybrid processing: batch and stream

For diverse processing requirements, hybrid processing is designed to handle both batch and stream activities. There are two popular hybrid processing frameworks: Apache Spark and Apache Flink.



Apache Spark is a batch processing framework - like Hadoop - but with streaming capabilities via "micro-batching". It is distinct from Hadoop's native MapReduce engine because it processes data in-memory, avoiding costly reads and writes to disk between steps. A disk-based storage level can be specified in Spark for special cases, but this is not playing to Spark's underlying strengths. In-memory processing is extremely fast, and in some settings Spark can operate 100 times faster than MapReduce. The underlying technology contributing to this speed advantage is explained below, imperfectly.


At the heart of Spark is the concept of Resilient Distributed Datasets (RDDs). Spark uses RDDs as immutable, partitioned collections of data for its in-memory processing. Operations running on an RDD are also split across its partitions for scalable parallel processing. A cool thing about RDDs is that operations applied to a dataset generate a new RDD; these RDDs then become connected by their lineage of transformations. This gives Spark the important feature of fault tolerance (hence the term "resilient" in RDD), so that if any part of an RDD is lost, it will automatically be recomputed using the original transformations that generated it in the first place. If the storage level selected spills to disk (i.e., not purely in-memory), Spark still provides resiliency and fault tolerance, but the user must wait for re-computation across RDDs. With such resiliency, processing can continue with barely a glitch or delay if something corrupts during the process. This can be very valuable.
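
A minimal PySpark sketch of the idea (assuming a local Spark installation): each transformation returns a new RDD, nothing runs until an action is called, and Spark remembers the lineage so lost partitions can be recomputed.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-sketch")

# An initial RDD, partitioned across workers (here, local threads).
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Each transformation produces a *new* immutable RDD; nothing runs yet (lazy).
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# An action triggers the whole lineage: parallelize -> map -> filter -> sum.
print(even_squares.sum())

# If a partition of even_squares were lost, Spark would rebuild it by
# replaying the recorded transformations on the original data.
sc.stop()
```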


Another key feature of Apache Spark is persisting (or caching) a dataset in memory across operations. This allows future actions to be faster, because once an RDD is stored, the computations behind it can be reused instead of repeated. On top of Spark's versatility in running both batch and stream workloads, it also has an ecosystem of libraries that can be drawn on for machine learning, for example.
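
Caching is a one-line call in PySpark; a small sketch (again assuming a local installation): the first action materializes and stores the RDD, and later actions reuse the cached partitions.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

words = sc.parallelize(["spark", "hadoop", "storm", "flink"] * 10_000)

# Mark the RDD for in-memory caching; it is materialized on the first action.
lengths = words.map(len).cache()

print(lengths.count())  # first action: computes the map and caches the result
print(lengths.sum())    # second action: reuses the cached partitions

sc.stop()
```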


However, a significant drawback of Spark compared to Hadoop is that it is necessarily resource-intensive: its in-memory storage is more costly to operate and takes up a great deal of memory. The trade-off is Spark's incredible speed.




Apache Flink is a stream processing framework that handles batch-like tasks by treating them as "bounded" streams (and ordinary streaming workloads as "unbounded" streams), moving away from Spark's "micro-batch" architecture. It is still a relatively new technology that continues to be developed. Its advantage is that it is designed for organizations that need powerful stream processing and only some batch processing capacity. It aims to deliver the speed of in-memory computation while also automating many aspects of its design.


A key strength of Flink is its support for event-driven semantics. Applications that benefit from this include fraud detection, anomaly detection, rules-based alerting, business process monitoring, and social networks. Flink features a library for Complex Event Processing (CEP) to detect patterns in data streams, which makes it potentially invaluable as a future processing framework.
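
Flink's CEP library itself is Java/Scala-centric and beyond the scope here, but to illustrate the kind of pattern an event-driven application looks for, here is a plain-Python sketch of a hypothetical fraud rule: flag an account with two large withdrawals inside a short time window. The rule, amounts, and window are all made up for illustration.

```python
from collections import defaultdict

# A toy event stream: (account_id, event_time_seconds, amount_withdrawn)
events = [
    ("acct-1", 0,   50.0),
    ("acct-2", 5,  900.0),
    ("acct-1", 12, 750.0),
    ("acct-1", 20, 800.0),   # second large withdrawal within 60s -> suspicious
    ("acct-2", 400, 950.0),  # large, but far outside acct-2's window
]

LARGE_AMOUNT = 500.0
WINDOW_SECONDS = 60

recent_large = defaultdict(list)  # account_id -> times of recent large withdrawals

for account, t, amount in events:
    if amount < LARGE_AMOUNT:
        continue
    # Keep only large withdrawals that are still inside the time window.
    recent_large[account] = [s for s in recent_large[account] if t - s <= WINDOW_SECONDS]
    recent_large[account].append(t)
    if len(recent_large[account]) >= 2:
        print(f"possible fraud on {account}: "
              f"{len(recent_large[account])} large withdrawals within {WINDOW_SECONDS}s")
```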


Something I'll be watching in Flink is its advantage of continuous data pipelines over periodic ETL jobs. This kind of reduced latency in moving data from multiple storage systems to its destination could serve well in the productization of certain developments and applications in the space. Continuous data pipelines are also more versatile and can be deployed in many scenarios because they continuously digest and output data. I can only imagine what benefits this will deliver.


In conclusion

Processing frameworks can be tricky to understand, but they are well worth knowing. The volume, velocity, and variety of today's "big data" stores necessitate a suite of frameworks that can handle the plethora of requirements we expect and benefit from. Furthermore, you may find yourself interacting with any one of these frameworks in your career - at least I hope so!


References

Gurusamy, Vairaprakash, Subbu Kannan, and K. Nandhini (2017). "The Real Time Big Data Processing Framework: Advantages and Limitations." International Journal of Computer Sciences and Engineering, 5(12), 305-312. doi:10.26438/ijcse/v5i12.305312.
