Big Data Hadoop Developer

Course Code: HDPG01

 

Duration: 56 Hours

The Big Data Hadoop Developer course by Computing Academics is comprehensively designed to take you from the basics of Big Data and the Hadoop ecosystem through to advanced implementation concepts. The course is built around extensive hands-on practice, so that candidates gain expertise in every concept taught and are prepared to work on real-world projects. In addition to lab exercises for each of the modules listed, you will be assigned projects based on real-world requirements and problems. On completing this course, you will be well equipped to perform the job functions of a Big Data Hadoop developer.

Audience

• Data Analysts and Business Analysts
• Application, Information and Data Architects
• Project Managers
• Mainframe, enterprise data warehouse and business intelligence professionals
• Application testing professionals
• Recent graduates who want to learn and pursue a career in Data Science and Big Data

Prerequisites

Required Prerequisites:

Basic knowledge of database management systems, data processing, UNIX/Linux, and Java.
We provide an orientation program for those who need help meeting this prerequisite.

Suggested Prerequisites:

None

Objective

Upon successful completion of the course, participants are expected to be proficient in:
1. What Data Science and Big Data are, and where and why they are implemented
2. Job prospects for data science experts, and how to leverage them
3. The Hadoop ecosystem and its core components
4. Hadoop Distributed File System (HDFS) architecture and features
5. YARN architecture and implementation
6. Hadoop configuration
7. MapReduce basic and advanced concepts
8. Pig architecture and implementation
9. Hive architecture and implementation
10. Basics of Apache Spark
11. Resilient Distributed Dataset (RDD) concepts and implementation
12. Applying the concepts learned to real-time industry requirements

Day 1

Module: Introduction to Big Data

- Characteristics of Big Data
- Challenges for Big Data
- Popular Tools Used to Store, Process, Analyze and Visualize Big Data
- Use Cases for Big Data

Day 1

Module: Hadoop Eco-system and Architecture

- What is Hadoop?
- Hadoop's Key Characteristics
- Hadoop Eco-system and Core Components
- Where Does Hadoop Fit?
- Traditional vs. Hadoop’s Data Analytics Architecture
- When to Use and When Not to Use Hadoop
- Apache Hadoop and Distributions
- Hadoop Job Trends
- Exercises

Day 2

Module: HDFS Architecture

- Introduction to Hadoop Distributed File System
- HDFS Architecture and Features
- Files and Data Blocks (see the sketch below)
- Anatomy of a File Read / Write on HDFS
- Replication and Rack Awareness
- Exercises
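
For illustration only (not part of the module outline above), here is a minimal Python sketch of the block and replication arithmetic behind "Files and Data Blocks" and "Replication and Rack Awareness". The 128 MB block size and replication factor of 3 used here are the common Hadoop defaults and are assumed purely for the example; both are configurable per cluster.

    import math

    # Illustrative figures: 128 MB is the default HDFS block size in recent Hadoop
    # releases, and 3 is the default replication factor; both are cluster settings.
    BLOCK_SIZE_MB = 128
    REPLICATION_FACTOR = 3

    def hdfs_storage_estimate(file_size_mb: float) -> dict:
        """Estimate how a single file is laid out on HDFS."""
        blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)        # logical blocks
        replicas = blocks * REPLICATION_FACTOR                  # physical block copies
        raw_storage_mb = file_size_mb * REPLICATION_FACTOR      # raw capacity consumed
        return {"blocks": blocks, "block_replicas": replicas, "raw_storage_mb": raw_storage_mb}

    # A 1 GB (1024 MB) file -> 8 blocks, 24 block replicas, about 3 GB of raw storage.
    print(hdfs_storage_estimate(1024))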

Day 2

Module: YARN Architecture

- Classic MapReduce (MRv1) vs. YARN
- YARN Daemons
- Containers
- Speculative Execution
- HDFS Federation
- Authentication and High Availability
- Exercises

Day 2

Module: Hadoop Setup Part 1

- Hadoop Deployment Modes
- Setting up a Pseudo-distributed Cluster
- Hortonworks Sandbox Installation and Configuration
- Linux Terminal Commands
- Configuration Parameters and Values (see the sketch below)
- Exercises
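
As an illustrative sketch of the kind of configuration covered in this module (not the exact lab steps), the snippet below generates the two minimal XML files of a pseudo-distributed setup. The property names fs.defaultFS and dfs.replication are standard Hadoop parameters; the localhost:9000 address, the single-replica value and the conf/ output directory are assumptions made for the example (on a real install the files live under $HADOOP_HOME/etc/hadoop).

    from pathlib import Path

    # Minimal pseudo-distributed configuration sketch (illustrative values only).
    # fs.defaultFS points Hadoop clients at a single local NameNode, and
    # dfs.replication is dropped to 1 because there is only one DataNode.
    def property_xml(name, value):
        return f"  <property><name>{name}</name><value>{value}</value></property>"

    core_site = "\n".join([
        "<configuration>",
        property_xml("fs.defaultFS", "hdfs://localhost:9000"),
        "</configuration>",
    ])
    hdfs_site = "\n".join([
        "<configuration>",
        property_xml("dfs.replication", "1"),
        "</configuration>",
    ])

    conf_dir = Path("conf")      # stand-in for $HADOOP_HOME/etc/hadoop in a real install
    conf_dir.mkdir(exist_ok=True)
    (conf_dir / "core-site.xml").write_text(core_site + "\n")
    (conf_dir / "hdfs-site.xml").write_text(hdfs_site + "\n")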

Day 2

Module: Hadoop Setup Part 2

- HDFS File System Operations (see the sketch below)
- Working with Hadoop Services using Ambari
- HDFS, MapReduce and YARN Parameters
- Hadoop Web UI
- Filesystem and Linux Commands
- Exercises
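
To give a flavour of the HDFS file system operations practiced in this module, here is a small Python sketch that drives the standard hdfs dfs shell commands through subprocess. The /user/student paths and localfile.txt are hypothetical; the commands themselves (-mkdir, -put, -ls, -cat, -setrep) are the usual HDFS shell operations and assume an hdfs client on the PATH.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' file system command and raise if it fails."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    # Typical first exercises with the HDFS shell ('/user/student' is a hypothetical path).
    hdfs("-mkdir", "-p", "/user/student/data")                  # create a directory tree
    hdfs("-put", "localfile.txt", "/user/student/data/")        # copy a local file into HDFS
    hdfs("-ls", "/user/student/data")                           # list directory contents
    hdfs("-cat", "/user/student/data/localfile.txt")            # print file contents
    hdfs("-setrep", "2", "/user/student/data/localfile.txt")    # change the replication factor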

Day 2

Module: MapReduce Basics

- What is MapReduce?
- MapReduce Framework, Architecture and Use Cases
- Input Splits
- Hands-on MapReduce Programming (see the word-count sketch below)
- Packaging MapReduce Jobs in a JAR
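
As a hands-on taste of the MapReduce programming covered here, below is the classic word count written as two Hadoop Streaming scripts in Python. Streaming is an alternative to the Java API taught in the module (where jobs are packaged as JARs); the scripts are a minimal sketch, and the input and output paths are whatever you pass on the command line.

    #!/usr/bin/env python3
    # mapper.py -- emit (word, 1) for every word read from standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key, so counts can be summed per word.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

Such a job is submitted with the hadoop-streaming JAR, passing the two scripts via -mapper and -reducer along with HDFS -input and -output directories.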

Day 2

Module: MapReduce Advanced

- Setting Mapper and Reducer Counts
- Combiners
- Partitioners and Custom Partitioners
- Input and Output Formats
- Sequence Files and Compressions
- Distributed Cache
- Map-Side and Reduce-Side Joins (see the join sketch below)
- Exercises
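
The reduce-side join listed above can be sketched in the same Hadoop Streaming style. Everything here is illustrative: the hypothetical inputs are customers.csv lines of the form cust_id,name and orders.csv lines of the form cust_id,order_id,amount, and the mapper tells them apart simply by column count. The mapper tags each record with its source; the framework groups records by key, and the reducer pairs the two sides.

    #!/usr/bin/env python3
    # join_mapper.py -- tag each record with its source so the reducer can join on cust_id.
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) == 2:                        # customer record: cust_id,name
            cust_id, name = fields
            print(f"{cust_id}\tC\t{name}")
        elif len(fields) == 3:                      # order record: cust_id,order_id,amount
            cust_id, order_id, amount = fields
            print(f"{cust_id}\tO\t{order_id},{amount}")

    #!/usr/bin/env python3
    # join_reducer.py -- reduce-side join: buffer both sides of each key, then emit the pairs.
    import sys

    def flush(cust_id, customers, orders):
        for name in customers:
            for order in orders:
                print(f"{cust_id}\t{name}\t{order}")

    current, customers, orders = None, [], []
    for line in sys.stdin:
        cust_id, tag, value = line.rstrip("\n").split("\t", 2)
        if cust_id != current:
            if current is not None:
                flush(current, customers, orders)
            current, customers, orders = cust_id, [], []
        (customers if tag == "C" else orders).append(value)
    if current is not None:
        flush(current, customers, orders)

A map-side join, by contrast, avoids the shuffle by loading the smaller dataset into memory on every mapper, typically via the distributed cache.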

Day 3

Module: Using Pig

- Background
- Pig Architecture
- Pig Latin Basics
- Pig Execution Modes
- Pig Processing – Loading and Transforming Data (see the sketch below)
- Pig Built-in Functions
- Filtering, Grouping, Sorting Data
- Relational Join Operators
- Pig User Defined Functions
- Exercises
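
For a feel of the Pig Latin taught in this module, the sketch below embeds a small, made-up script (hypothetical transactions.csv input and top_stores output) and runs it from Python in Pig's local mode, assuming the pig command is available on the PATH.

    import subprocess
    from pathlib import Path

    # Illustrative Pig Latin: load CSV transactions, filter, group, aggregate and sort.
    script = """
    txns     = LOAD 'transactions.csv' USING PigStorage(',')
               AS (store:chararray, amount:double);
    big      = FILTER txns BY amount > 100.0;
    by_store = GROUP big BY store;
    totals   = FOREACH by_store GENERATE group AS store, SUM(big.amount) AS total;
    ranked   = ORDER totals BY total DESC;
    STORE ranked INTO 'top_stores' USING PigStorage(',');
    """

    Path("top_stores.pig").write_text(script)
    # -x local runs Pig against the local filesystem instead of a Hadoop cluster.
    subprocess.run(["pig", "-x", "local", "top_stores.pig"], check=True)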

Day 4

Module: Using Hive

- Background of Hive
- Hive Architecture
- Warehouse Directory and Metastore
- Hive Query Language (see the sketch below)
- Managed and External Tables
- Data Processing – Loading Data into Tables
- Using Hive Built-in Functions
- Using Joins in Hive
- Partitioning Data using Hive - Static and Dynamic
- Bucketing in Hive
- Exercises
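
As an illustrative sketch of the Hive Query Language material (table, file and partition names are made up for the example), the snippet writes a short HiveQL script covering a managed, partitioned table, a LOAD into one static partition and an aggregate query, then hands it to the Hive command-line client, which is assumed to be installed and on the PATH.

    import subprocess
    from pathlib import Path

    hql = """
    CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA LOCAL INPATH '/tmp/sales_2024-01-01.csv'
    INTO TABLE sales PARTITION (sale_date = '2024-01-01');

    SELECT sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date;
    """

    Path("sales.hql").write_text(hql)
    # 'hive -f' runs the statements in the script file against the configured metastore.
    subprocess.run(["hive", "-f", "sales.hql"], check=True)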

Day 5

Module: Basics of Apache Spark

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark (see the sketch below)
- Quiz
- A Closer Look at RDDs
- Key-Value Pair RDDs
- Other Pair RDD Operations
- Quiz
- Exercises
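
To preview the RDD and key-value pair operations listed above, here is a minimal PySpark word count, the usual first Spark exercise. The input.txt file is hypothetical; run the script with spark-submit (inside the pyspark shell a SparkContext named sc already exists, so the creation line is skipped there).

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCountSketch")

    counts = (sc.textFile("input.txt")                    # RDD of lines
                .flatMap(lambda line: line.split())       # RDD of words
                .map(lambda word: (word, 1))              # key-value pair RDD
                .reduceByKey(lambda a, b: a + b))         # aggregate counts per key

    for word, count in counts.take(10):
        print(word, count)

    sc.stop()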

Day 5

Module: RDDs in Spark

- RDD Lineage
- Caching Overview
- Distributed Persistence
- Storage Levels of RDD Persistence
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Machine Learning
- Example: k-means
- Quiz
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames (see the sketch below)
- Exercises
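
As a sketch of the caching, storage-level and Spark SQL topics in this module (the rows are made-up sample data), the snippet below builds a small DataFrame, persists it with an explicit storage level, and queries it both through the DataFrame API and through SQL.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )
    df.persist(StorageLevel.MEMORY_AND_DISK)   # keep it cached across the queries below

    df.filter(df.age > 30).show()              # DataFrame transformation plus action

    df.createOrReplaceTempView("people")       # expose the DataFrame to Spark SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()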

Days 6 and 7

Module: Project Work (from use cases applied in various industries)

- Requirement specification study
- Design drafting
- Development
- Testing against requirements and functions
- Evaluation and review (by the course instructor)