Do you need to understand big data and how it will impact your business? This Specialization is for you. You will gain an understanding of what insights big data can provide through hands-on experience with the tools and systems used by big data scientists and engineers. No prior programming experience is required. You will be guided through the basics of using Hadoop with MapReduce, Spark, Pig and Hive.
You will learn how to perform predictive modeling and leverage graph analytics to model problems.
This specialization will prepare you to ask the right questions about data, communicate effectively with data scientists, and do basic exploration of large, complex datasets.
To enhance your learning experience, you will also work on real-world, industry-based projects.
Description
Hadoop is a framework for storing and processing big data. It is specifically designed to provide the distributed storage and parallel data processing that big data requires. Hadoop is an open-source project from the Apache Software Foundation.
It provides a software framework for distributing and running applications on clusters of servers, inspired by Google’s MapReduce programming model as well as its file system (GFS).
Hadoop was originally written for the Nutch search engine project.
Hadoop is written in Java and efficiently processes large volumes of data on a cluster of commodity hardware.
Hadoop can be set up on a single machine, but its real power comes with a cluster: it can be scaled from one machine to thousands of nodes. Hadoop consists of two key parts: the Hadoop Distributed File System (HDFS) and MapReduce.
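To make those two parts concrete, below is the classic WordCount job, a minimal sketch written against Hadoop's standard MapReduce API (org.apache.hadoop.mapreduce). The HDFS input and output paths are taken from the command line and are placeholders here:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this class into a jar and submit it with the hadoop jar command; YARN then schedules the map and reduce tasks across the cluster.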
What will I learn
Describe the Big Data landscape, including examples of real-world big data problems and the three key sources of Big Data: people, organizations, and sensors.
Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection, monitoring, storage, analysis and reporting.
Get value out of Big Data by using a 5-step process to structure your analysis.
Identify what is and what is not a big data problem, and be able to recast big data problems as data science questions.
Provide an explanation of the architectural components and programming models used for scalable big data analysis.
Summarize the features and value of core Hadoop stack components, including the YARN resource and job management system, the HDFS file system, and the MapReduce programming model.
Understand the concepts of an ML dataset and an ML algorithm, and perform model selection via cross-validation (see the sketch after this list).
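As a companion to the last learning outcome above, here is a minimal, self-contained sketch of k-fold cross-validation in plain Java. The train and meanSquaredError helpers are hypothetical stand-ins (a mean-label predictor scored by squared error), not part of any Hadoop or course API; they only illustrate how the dataset is split into folds and the validation scores are averaged:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFoldCV {

  // Each row: features followed by a numeric label in the last position.
  public static double crossValidate(List<double[]> data, int k) {
    List<double[]> rows = new ArrayList<>(data);
    Collections.shuffle(rows, new Random(42)); // fixed seed for repeatability
    int foldSize = rows.size() / k;
    double totalError = 0.0;

    for (int fold = 0; fold < k; fold++) {
      int start = fold * foldSize;
      int end = (fold == k - 1) ? rows.size() : start + foldSize;

      // Hold out one fold for validation; train on the remaining k-1 folds.
      List<double[]> validation = rows.subList(start, end);
      List<double[]> training = new ArrayList<>(rows.subList(0, start));
      training.addAll(rows.subList(end, rows.size()));

      double model = train(training); // stand-in "model"
      totalError += meanSquaredError(model, validation);
    }
    return totalError / k; // average validation error across the k folds
  }

  // Hypothetical learner: predicts the mean label of the training set.
  static double train(List<double[]> training) {
    double sum = 0.0;
    for (double[] row : training) sum += row[row.length - 1];
    return sum / training.size();
  }

  // Hypothetical metric: mean squared error of a constant prediction.
  static double meanSquaredError(double prediction, List<double[]> validation) {
    double sum = 0.0;
    for (double[] row : validation) {
      double err = prediction - row[row.length - 1];
      sum += err * err;
    }
    return sum / validation.size();
  }

  public static void main(String[] args) {
    List<double[]> data = new ArrayList<>();
    Random rng = new Random(7);
    for (int i = 0; i < 100; i++) {
      double x = rng.nextDouble();
      data.add(new double[] {x, 2 * x + rng.nextGaussian() * 0.1});
    }
    System.out.println("5-fold CV error: " + crossValidate(data, 5));
  }
}
```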
Why learn Hadoop
Flexible - Hadoop stores huge files as they are (raw), without requiring any schema to be specified.
High scalability - We can add any number of nodes, dramatically enhancing performance.
High availability - In Hadoop, data remains highly available despite hardware failure. If a machine or a few components crash, the data can still be accessed from another replica (see the sketch after this list).
Reliable - Data is reliably stored on the cluster despite machine failure.
Economical - Hadoop runs on a cluster of commodity hardware, which is not very expensive.
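The availability and reliability points above both come down to HDFS block replication. Below is a minimal sketch using Hadoop's FileSystem Java API; it assumes a running HDFS cluster reachable through the configuration on the classpath (fs.defaultFS), and /demo/sample.txt is a hypothetical path used only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS (hypothetical path).
    Path file = new Path("/demo/sample.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Ask HDFS to keep 3 copies of each block; with 3 replicas the file
    // stays readable even if one or two DataNodes fail.
    fs.setReplication(file, (short) 3);

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor = " + status.getReplication());

    fs.close();
  }
}
```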