Do you need to understand big data and how it will impact your business? This Specialization is for you. You will gain an understanding of what insights big data can provide through hands-on experience with the tools and systems used by big data scientists and engineers. No prior programming experience is required. You will be guided through the basics of using Hadoop with MapReduce, Spark, Pig and Hive.
You will learn how to perform predictive modeling and leverage graph analytics to model problems.
This specialization will prepare you to ask the right questions about data, communicate effectively with data scientists, and do basic exploration of large, complex datasets.
To enhance your learning experience, you will also work on real-world, industry-based projects.
Description
Hadoop is a framework for storing and processing big data. It is specifically designed to provide the distributed storage and parallel data processing that big data requires. Hadoop is an open-source project from the Apache Software Foundation.
It provides a software framework for distributing and running applications on clusters of servers, inspired by Google’s MapReduce programming model as well as its file system (GFS).
Hadoop was originally written for the Nutch search engine project.
Hadoop is written in Java and efficiently processes large volumes of data on a cluster of commodity hardware.
Hadoop can be set up on a single machine, but its real power comes with a cluster: it can be scaled from one machine to thousands of nodes. Hadoop consists of two key parts: the Hadoop Distributed File System (HDFS) and MapReduce.
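To make those two parts concrete, below is the classic WordCount job, a minimal sketch written against Hadoop's standard MapReduce API (org.apache.hadoop.mapreduce). The HDFS input and output paths are taken from the command line and are placeholders here:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this class into a jar and submit it with the hadoop jar command; YARN then schedules the map and reduce tasks across the cluster.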
What will I learn
Describe the Big Data landscape, including examples of real-world big data problems and the three key sources of Big Data: people, organizations, and sensors.
Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection, monitoring, storage, analysis and reporting.
Get value out of Big Data by using a 5-step process to structure your analysis.
Identify what is and what is not a big data problem, and be able to recast big data problems as data science questions.
Provide an explanation of the architectural components and programming models used for scalable big data analysis.
Summarize the features and value of core Hadoop stack components, including the YARN resource and job management system, the HDFS file system, and the MapReduce programming model.
Understand the concepts of an ML dataset and an ML algorithm, and perform model selection via cross-validation (see the sketch after this list).
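As a companion to the last learning outcome above, here is a minimal, self-contained sketch of k-fold cross-validation in plain Java. The train and meanSquaredError helpers are hypothetical stand-ins (a mean-label predictor scored by squared error), not part of any Hadoop or course API; they only illustrate how the dataset is split into folds and the validation scores are averaged:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFoldCV {

  // Each row: features followed by a numeric label in the last position.
  public static double crossValidate(List<double[]> data, int k) {
    List<double[]> rows = new ArrayList<>(data);
    Collections.shuffle(rows, new Random(42)); // fixed seed for repeatability
    int foldSize = rows.size() / k;
    double totalError = 0.0;

    for (int fold = 0; fold < k; fold++) {
      int start = fold * foldSize;
      int end = (fold == k - 1) ? rows.size() : start + foldSize;

      // Hold out one fold for validation; train on the remaining k-1 folds.
      List<double[]> validation = rows.subList(start, end);
      List<double[]> training = new ArrayList<>(rows.subList(0, start));
      training.addAll(rows.subList(end, rows.size()));

      double model = train(training); // stand-in "model"
      totalError += meanSquaredError(model, validation);
    }
    return totalError / k; // average validation error across the k folds
  }

  // Hypothetical learner: predicts the mean label of the training set.
  static double train(List<double[]> training) {
    double sum = 0.0;
    for (double[] row : training) sum += row[row.length - 1];
    return sum / training.size();
  }

  // Hypothetical metric: mean squared error of a constant prediction.
  static double meanSquaredError(double prediction, List<double[]> validation) {
    double sum = 0.0;
    for (double[] row : validation) {
      double err = prediction - row[row.length - 1];
      sum += err * err;
    }
    return sum / validation.size();
  }

  public static void main(String[] args) {
    List<double[]> data = new ArrayList<>();
    Random rng = new Random(7);
    for (int i = 0; i < 100; i++) {
      double x = rng.nextDouble();
      data.add(new double[] {x, 2 * x + rng.nextGaussian() * 0.1});
    }
    System.out.println("5-fold CV error: " + crossValidate(data, 5));
  }
}
```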
Why learn Hadoop
Flexible - Hadoop stores huge files as they are (raw), without requiring any schema to be specified.
High scalability - We can add any number of nodes, dramatically enhancing performance.
High availability - In Hadoop, data remains highly available despite hardware failure. If a machine or a few components crash, the data can still be accessed from another replica (see the sketch after this list).
Reliable - Data is reliably stored on the cluster despite machine failure.
Economical - Hadoop runs on a cluster of commodity hardware, which is not very expensive.
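The availability and reliability points above both come down to HDFS block replication. Below is a minimal sketch using Hadoop's FileSystem Java API; it assumes a running HDFS cluster reachable through the configuration on the classpath (fs.defaultFS), and /demo/sample.txt is a hypothetical path used only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS (hypothetical path).
    Path file = new Path("/demo/sample.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Ask HDFS to keep 3 copies of each block; with 3 replicas the file
    // stays readable even if one or two DataNodes fail.
    fs.setReplication(file, (short) 3);

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor = " + status.getReplication());

    fs.close();
  }
}
```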