Big data keeps getting bigger, fed by a constant stream of incoming data. In high-volume environments this data arrives at tremendous rates and must be stored and analyzed effectively.
Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low cost. MapReduce has proven to be an ideal platform for complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.
Apache Spark addresses these limitations by generalizing the MapReduce computation model while dramatically improving performance and ease of use.
Brisk and Simple Big Data Processing with Apache Spark
Apache Spark provides a generic programming model that enables developers to write an application by composing arbitrary operators, such as mappers, joins, reducers, group-bys, and filters. This architecture makes it simple to express a huge array of computations, including iterative machine learning, streaming, complex queries, and batch processing.
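For instance, the following minimal sketch (Scala, using Spark's classic RDD API) composes map, filter, join, and reduceByKey into a single pipeline. The file paths and record layouts are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OperatorPipeline"))

    // (userId, pageUrl) pairs parsed from a hypothetical access log
    val visits = sc.textFile("hdfs:///logs/access.log")
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(fields => (fields(0), fields(1)))

    // (userId, country) pairs from a hypothetical user table
    val users = sc.textFile("hdfs:///data/users.tsv")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1)))

    // Join the two datasets, then count visits per country:
    // arbitrary operators chained freely in one application
    val visitsPerCountry = visits.join(users)   // (userId, (pageUrl, country))
      .map { case (_, (_, country)) => (country, 1) }
      .reduceByKey(_ + _)

    visitsPerCountry.collect().foreach(println)
    sc.stop()
  }
}
```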
Spark keeps track of the data that each operator produces and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses.
The Spark computing framework provides a programming abstraction and parallel runtime that hide the complexities of fault tolerance and slow machines. These features enable:
· Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds, and
· Efficient iterative algorithms, by having subsequent iterations share data through memory or repeatedly access the same dataset, as the sketch below illustrates.
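A minimal Scala sketch of both points (the input path, data format, and learning rate are hypothetical): cache() pins the parsed working set in memory, and a toy gradient-descent loop then reuses that cached dataset on every pass instead of rereading it from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeExample"))

    // Parse (x, y) points once, then keep them in memory across iterations
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(_.split(","))
      .map(a => (a(0).toDouble, a(1).toDouble))
      .cache()                                 // working set stays in RAM

    var w = 0.0                                // model parameter
    for (_ <- 1 to 10) {
      // Each iteration scans the cached RDD at memory speed
      val gradient = points
        .map { case (x, y) => (w * x - y) * x }
        .reduce(_ + _) / points.count()
      w -= 0.1 * gradient                      // hypothetical learning rate
    }
    println(s"fitted weight: $w")
    sc.stop()
  }
}
```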
In Memory Can Make a Big Difference
In-memory analytics is an approach to querying data when it resides in a computer’s random access memory (RAM), as opposed to querying data that is stored on physical disks. This results in vastly shortened query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.
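As a hedged illustration using the Spark SQL API (the file path, schema, and query are hypothetical), a table can be cached in RAM once and then serve repeated BI-style queries at memory speed:

```scala
import org.apache.spark.sql.SparkSession

object InMemoryQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InMemoryQuery").getOrCreate()

    // Load a hypothetical sales dataset from disk once
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/sales.csv")

    sales.createOrReplaceTempView("sales")
    spark.catalog.cacheTable("sales")          // columnar in-memory cache

    // Repeated ad hoc queries now scan RAM rather than physical disks
    spark.sql(
      "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region"
    ).show()

    spark.stop()
  }
}
```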
Spark’s generic programming model also makes it easy to use. Spark can run different kinds of workloads in the same process, so when various workloads interact they are easier to manage and secure together, something that remains a limitation of MapReduce.
Faster Decision Making with Spark
Spark’s machine learning capabilities facilitate user decisions in the form of recommendation systems, ad targeting, or predictive analytics. One of the key properties of any decision is latency, that is, the time it takes to make the decision from the moment the input data is available. Reducing decision latency can significantly increase a decision’s effectiveness, and ultimately increase the company’s return on investment. Spark is an ideal fit for speeding up such decisions.
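As one illustrative sketch, a recommendation decision can be powered by MLlib’s ALS (alternating least squares) collaborative filtering. The ratings file and its user,item,rating format below are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object RecommendExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RecommendExample"))

    // Parse a hypothetical ratings file: user,item,rating per line
    val ratings = sc.textFile("hdfs:///data/ratings.csv")
      .map(_.split(","))
      .map(a => Rating(a(0).toInt, a(1).toInt, a(2).toDouble))

    // Train a collaborative-filtering model in memory
    val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10)

    // Low-latency decision: top 5 product recommendations for user 42
    model.recommendProducts(42, 5).foreach(println)
    sc.stop()
  }
}
```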
Figure source: http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/
By: Ayush Kesarwani