EduLearn - Online Education Platform

Welcome to EduLearn!

Start learning today with our wide range of courses taught by industry experts. Gain new skills, advance your career, or explore new interests.

Browse Courses

Scaling Out

Map Reduce works for small inputs; now it’s time to take a bird’s-eye view of the system and look at the data flow for large inputs.

For simplicity, the examples so far have used files on the local filesystem.

However, to scale out, we need to store the data in a distributed filesystem (HDFS).

This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop’s resource management system, called YARN.

Data Flow

First, some terminology. A MapReduce job is a unit of work that the client wants to be performed:

it consists of the input data, the MapReduce program, and configuration information.

Hadoop runs the job by dividing it into tasks, of which there are two types:map tasks and reduce tasks.

The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be automatically rescheduled to run on a different node.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.

Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

MapReduce data flow with multiple reduce tasks

Finally, it’s also possible to have zero reduce tasks.

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks.

Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function.

Because the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.

In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

What is Scaling Out and Data Flow Hadoop in Bigdata

Leave a Reply

Your email address will not be published. Required fields are marked *

EduLearn - Online Education Platform

Welcome to EduLearn!

Start learning today with our wide range of courses taught by industry experts. Gain new skills, advance your career, or explore new interests.

Browse Courses

Popular Courses

[Course Image]

Introduction to Programming

Learn the fundamentals of programming with Python in this beginner-friendly course.

12 Hours Beginner
[Course Image]

Data Science Essentials

Master the basics of data analysis, visualization, and machine learning.

20 Hours Intermediate
[Course Image]

Web Development Bootcamp

Build modern websites with HTML, CSS, JavaScript and popular frameworks.

30 Hours Beginner
[Course Image]

Digital Marketing Fundamentals

Learn SEO, social media marketing, email campaigns and analytics.

15 Hours Beginner
Educational Website Footer