Data

The volume of data being made publicly available increases every year.

Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.

A Brief History of Apache Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The Origin of the Name “Hadoop”

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Projects in the Hadoop ecosystem also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example).

Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name.

For example, the namenode manages the filesystem namespace.
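
To make this concrete, here is a toy sketch, in Python rather than Hadoop's own Java, of what "managing the filesystem namespace" amounts to: the namenode holds only metadata, mapping each file path to the IDs of the blocks that store its contents. The class and method names below are purely illustrative and are not Hadoop's API.

    # Toy sketch (not Hadoop code): the namenode tracks which blocks make up
    # each file; the block contents themselves live on other machines.
    class ToyNameNode:
        def __init__(self):
            self.namespace = {}      # file path -> list of block IDs
            self.next_block_id = 0

        def create(self, path, num_blocks):
            """Register a new file and allocate block IDs for it."""
            if path in self.namespace:
                raise FileExistsError(path)
            blocks = list(range(self.next_block_id, self.next_block_id + num_blocks))
            self.next_block_id += num_blocks
            self.namespace[path] = blocks
            return blocks

        def get_blocks(self, path):
            """Look up which blocks hold a file's data."""
            return self.namespace[path]

    nn = ToyNameNode()
    nn.create("/crawl/segment1", num_blocks=3)
    print(nn.get_blocks("/crawl/segment1"))   # [0, 1, 2]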

Web Search

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It is expensive, too: Mike Cafarella and Doug Cutting estimated that a system supporting a one-billion-page index would cost around $500,000 in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged.

However, its creators realized that their architecture wouldn’t scale to the billions of pages on the Web.

Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.

NDFS

GFS, or something like it, would solve their storage needs for the very large files generated as part of the web crawl and indexing process.

In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes.

In 2004, Nutch’s developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).

In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
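
For readers who have not seen the model that paper described, the sketch below shows the essential map/shuffle/reduce flow using the canonical word-count example. It is a single-process illustration of the programming model only, not the Nutch or Hadoop API, and the function names are made up for this example.

    # Minimal illustration of the MapReduce model: map emits (key, value)
    # pairs, the framework groups values by key, and reduce combines each
    # group. Runs in one process purely to show the idea.
    from collections import defaultdict

    def map_fn(document):
        """Emit (word, 1) for every word in the document."""
        for word in document.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        """Sum the counts emitted for one word."""
        return word, sum(counts)

    def run_mapreduce(documents):
        grouped = defaultdict(list)          # the "shuffle": group values by key
        for doc in documents:
            for key, value in map_fn(doc):
                grouped[key].append(value)
        return dict(reduce_fn(k, v) for k, v in grouped.items())

    docs = ["Hadoop runs MapReduce", "MapReduce runs on NDFS"]
    print(run_mapreduce(docs))
    # {'hadoop': 1, 'runs': 2, 'mapreduce': 2, 'on': 1, 'ndfs': 1}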

NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.

This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

