Forum Posts

Raihan Ali
Apr 09, 2022
In Welcome to the Forum
Spark improves on the speed of the earlier MapReduce model by adding the ability to process data in memory. It is also more efficient on disk. In-memory processing is built on Spark's basic data unit, the RDD (resilient distributed dataset). Spark keeps as many datasets in memory as possible for the full life cycle of the job, which saves disk I/O. Some data can be spilled to disk if the memory limit is exceeded.

A benchmark of execution time for logistic regression on Apache Hadoop and Spark shows the difference: Hadoop took 110 seconds, while Spark finished the same job in just 0.9 seconds. Spark does not store all data in memory. However, when the data is in memory, it makes the best use of an LRU cache to process the data faster. It is up to 100 times faster than Hadoop when computing on data in memory, and still faster than Hadoop on disk.

Spark's distributed data storage model, the resilient distributed dataset (RDD), guarantees fault tolerance and minimizes network I/O. The Spark paper explains that RDDs provide fault tolerance through the concept of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs that only that partition needs to be rebuilt. Therefore, there is no need to replicate the data to achieve fault tolerance.

Spark's MapReduce also works differently from Hadoop's. In Hadoop, the output is spilled to disk and read back in. In Spark, the mapper's output is kept in the OS buffer cache, and the reducer pulls it to its side and writes it directly to memory. Spark's memory cache is well suited to machine learning algorithms that need to use the same data over and over again.

Spark can use a directed acyclic graph (DAG) to run complex jobs and multi-step data pipelines. Spark is written in Scala and runs on the JVM (Java Virtual Machine). It provides development APIs for Java, Scala, Python, and R. Spark runs on Hadoop YARN and Apache Mesos, and it also has its own standalone cluster manager.
In 2014, Spark set a world record in a sorting benchmark by sorting 100 TB of data.
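
To make the lineage idea above a bit more concrete, here is a minimal sketch in plain Python (not Spark itself, and not how Spark is actually implemented). ToyRDD and all of its methods are hypothetical names made up for this illustration. The point it tries to show is only that a derived dataset remembers its parent and the function applied to it, so a lost partition can be rebuilt on its own without keeping a second copy of the data.

```python
class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD. Not Spark's real API."""

    def __init__(self, num_partitions, parent=None, fn=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent      # lineage: the dataset this one was derived from
        self.fn = fn              # lineage: function applied to each parent element
        self.source = source      # raw data, only set on the root dataset
        self.cache = {}           # partitions currently held in memory
        self.computations = 0     # how many times a partition was (re)built

    @classmethod
    def from_partitions(cls, partitions):
        # Build a root dataset from a list of lists (one list per partition).
        return cls(len(partitions), source=[list(p) for p in partitions])

    def map(self, fn):
        # Record the transformation instead of running it right away.
        return ToyRDD(self.num_partitions, parent=self, fn=fn)

    def partition(self, i):
        # Serve from memory when possible, otherwise rebuild from lineage.
        if i in self.cache:
            return self.cache[i]
        self.computations += 1
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = [self.fn(x) for x in self.parent.partition(i)]
        self.cache[i] = data
        return data

    def collect(self):
        # Gather every partition, in order, into one list.
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulate losing one in-memory partition (e.g. a failed worker).
        self.cache.pop(i, None)


base = ToyRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squares = base.map(lambda x: x * x)

first = squares.collect()     # builds all three partitions
squares.lose_partition(1)     # drop one partition from memory
second = squares.collect()    # only the lost partition is rebuilt
```

After the second collect, squares.computations is 4: three partitions built the first time, plus one rebuild for the lost partition. The other two partitions are served from memory and are not recomputed.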