VESIT-CLOUDERA



Course Description:
The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently. Students who complete this course will understand key Spark and Hadoop concepts, and they will learn to apply Spark and Hadoop tools to develop applications that solve the types of problems faced by enterprises and research institutions today.

Prerequisites:
This course is designed for developers and engineers who have programming experience. Apache Spark examples and homework labs are presented in Scala and Python; therefore, the ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.

Course Objectives:
During this course, the learner will study:

  • How the Hadoop Ecosystem fits in with the data processing lifecycle
  • How data is distributed, stored and processed in a Hadoop cluster
  • How to use Sqoop and Flume to ingest data
  • How to process distributed data with Spark
  • How to model structured data as tables in Impala and Hive
  • How to choose a data storage format for your data usage patterns
  • Best practices for data storage

Course Outcomes:
After completing the course, the learner will be able to:

  • Understand the components of Hadoop and the Hadoop Ecosystem
  • Access and process data on a distributed file system
  • Manage job execution in a Hadoop environment
  • Ingest data using Sqoop and Flume
  • Analyze Big Data using Hive and Impala
  • Develop Big Data applications using Spark and the Hadoop Ecosystem


Module 1: Introduction

  • About This Course
  • About Cloudera

Session 1 (2 Hrs)

Module 2: Introduction to Hadoop and the Hadoop Ecosystem

  • Hadoop
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Homework Labs
  • Homework Labs: Setup and General Notes

Session 2 (2 Hrs)

Hadoop Architecture and HDFS
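
For orientation, data in HDFS is managed from the file system shell; a few representative commands of the kind used in this session (the paths and file names are illustrative, not from the course materials):

    hdfs dfs -mkdir /user/training/weblogs            # create a directory in HDFS
    hdfs dfs -put access.log /user/training/weblogs   # copy a local file into HDFS
    hdfs dfs -ls /user/training/weblogs               # list the directory
    hdfs dfs -cat /user/training/weblogs/access.log   # print a file's contents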


Session 3 (2 Hrs)

Module 3: Importing and Modeling Structured Data - Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2
  • Homework Labs: Import Data from MySQL Using Sqoop
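
As a rough sketch of the kind of command this session's lab builds up to (the database, credentials, and table names here are placeholders, not the actual lab values):

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username dbuser --password dbpass \
      --table customers \
      --target-dir /user/hive/warehouse/customers

Each row of the customers table becomes a record in a set of files under the target HDFS directory.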

Session 4 (2 Hrs)

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Querying Data With Impala and Hive
  • Comparing Hive and Impala to Traditional Databases
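
Both tools accept largely standard SQL; for a flavor of the session, a query such as the following (the products table is hypothetical) runs unchanged in either Impala or Hive:

    -- count records per category, largest first
    SELECT category, COUNT(*) AS cnt
    FROM products
    GROUP BY category
    ORDER BY cnt DESC;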

Session 5 (2 Hrs)

Modeling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching
  • Homework Labs: Create and Populate Tables in Impala or Hive
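
A minimal sketch of the DDL covered here, in HiveQL (the database, table, and path names are illustrative):

    -- define a database and a comma-delimited table
    CREATE DATABASE IF NOT EXISTS salesdb;
    CREATE TABLE salesdb.orders (
      order_id    INT,
      customer_id INT,
      total       DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- move data already in HDFS into the table's directory
    LOAD DATA INPATH '/user/training/orders.csv' INTO TABLE salesdb.orders;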

Session 6 (2 Hrs)

Data Formats

  • File Formats
  • Avro Schemas
  • Avro Schema Evolution
  • Using Avro with Impala, Hive and Sqoop
  • Using Parquet with Impala, Hive and Sqoop
  • Compression
  • Homework Labs: Select a Format for a Data File
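
For orientation, an Avro schema is a JSON document listing a record's fields; a minimal hypothetical example:

    {"type": "record",
     "name": "Order",
     "fields": [
       {"name": "order_id",    "type": "int"},
       {"name": "customer_id", "type": "int"},
       {"name": "total",       "type": "double"}
     ]}

Schema evolution works by adding new fields with a "default" value, so readers using the new schema can still process data written with the old one.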

Session 7 (2 Hrs)

Data File Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive
  • Conclusion
  • Homework Labs: Partition Data in Impala or Hive
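
A sketch of the idea in HiveQL (the table and column names are illustrative): each partition value becomes its own HDFS subdirectory, so queries that filter on the partition column read only the matching directories.

    CREATE TABLE weblogs (ip STRING, url STRING)
    PARTITIONED BY (log_date STRING);

    -- scans only the 2016-01-15 partition, not the whole table
    SELECT COUNT(*) FROM weblogs WHERE log_date = '2016-01-15';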

Session 8 (2 Hrs)

Module 4: Ingesting Streaming Data

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • Homework Labs: Collect Web Server Logs with Flume
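
A Flume agent is defined in a properties file naming its sources, channels, and sinks; a minimal sketch (component names and paths are illustrative):

    agent1.sources  = websrc
    agent1.channels = memch
    agent1.sinks    = hdfssink

    # watch a spool directory for new log files
    agent1.sources.websrc.type = spooldir
    agent1.sources.websrc.spoolDir = /var/flume/incoming
    agent1.sources.websrc.channels = memch

    # buffer events in memory between source and sink
    agent1.channels.memch.type = memory

    # write events to a directory in HDFS
    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.hdfs.path = /user/training/weblogs
    agent1.sinks.hdfssink.channel = memch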

Session 9 (2 Hrs)

Module 5: Distributed Data Processing with Spark - Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
  • Homework Labs: View the Spark Documentation; Explore RDDs Using the Spark Shell; Use RDDs to Transform a Dataset
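
A taste of the lab, in Python (the file path and its contents are hypothetical); in the Spark shell the SparkContext is already available as sc:

    # load a text file from HDFS as an RDD of lines
    lines = sc.textFile("weblogs/access.log")

    # transformations such as filter are lazy ...
    jpgs = lines.filter(lambda line: ".jpg" in line)

    # ... and only run when an action like count() is called
    print(jpgs.count())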

Session 10 (2 Hrs)

Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations
  • Homework Labs: Process Data Files with Spark
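
Besides reading files, an RDD can be created from an in-memory collection, which is handy for experiments (a minimal Python sketch):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)
    print(squares.collect())   # [1, 4, 9, 16, 25]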

Session 11 (2 Hrs)

Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations
  • Homework Labs: Use Pair RDDs to Join Two Datasets
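
Two Python sketches of the patterns named above (the input data is hypothetical): the classic word count, and a join of two pair RDDs by key as in the lab.

    # word count: one (word, 1) pair per word, summed by key
    counts = sc.textFile("mydata.txt") \
               .flatMap(lambda line: line.split()) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda a, b: a + b)

    # join matches pairs from both RDDs that share a key
    users  = sc.parallelize([(1, "alice"), (2, "bob")])
    orders = sc.parallelize([(1, 250.0), (2, 80.0)])
    print(users.join(orders).collect())   # [(1, ('alice', 250.0)), (2, ('bob', 80.0))]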

Session 12 (2 Hrs)

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Homework Labs: Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging
  • Homework Labs: Configure a Spark Application
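
A minimal standalone application in Python, written in the Spark 1.x style to match the SparkContext topic above (the file name and logic are illustrative):

    # CountJPGs.py
    import sys
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        conf = SparkConf().setAppName("Count JPGs")
        sc = SparkContext(conf=conf)
        count = sc.textFile(sys.argv[1]) \
                  .filter(lambda line: ".jpg" in line) \
                  .count()
        print(count)
        sc.stop()

Unlike the interactive shell, the application creates its own SparkContext and is launched on the cluster with spark-submit, e.g. spark-submit CountJPGs.py weblogs/.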

Session 13 (2 Hrs)

Parallel Processing in Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
  • Homework Labs: View Jobs and Stages in the Spark Application UI
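
A short Python sketch of inspecting and changing an RDD's partitioning (the input path is hypothetical):

    rdd = sc.textFile("weblogs/")       # by default, one partition per HDFS block
    print(rdd.getNumPartitions())

    rdd8 = rdd.repartition(8)           # redistribute the data into 8 partitions
    print(rdd8.toDebugString())         # shows the lineage and stages Spark will run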

Session 14 (2 Hrs)

Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence
  • Homework Labs: Persist an RDD
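
A minimal Python sketch (the input path is hypothetical): persisting an RDD stores its partitions after the first computation, so later actions skip recomputing the lineage.

    from pyspark import StorageLevel

    jpgs = sc.textFile("weblogs/").filter(lambda line: ".jpg" in line)
    jpgs.persist(StorageLevel.MEMORY_ONLY)   # cache() is shorthand for this level

    jpgs.count()   # first action computes and caches the partitions
    jpgs.count()   # reuses the cached partitions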

Session 15 (2 Hrs)

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means
  • Homework Labs: Iterative Processing in Spark
  • Optional Homework Lab: Partition Data Files Using Spark
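
The shape of an iterative algorithm in Spark, in Python (an illustrative toy update step, not the course's k-means code): the input RDD is cached once and reused on every pass, while each iteration refines a driver-side value.

    points = sc.parallelize([1.0, 2.0, 3.0, 4.0]).persist()
    center = 0.0
    for i in range(10):
        # nudge the center toward the mean of the points on each pass;
        # k-means would instead recompute cluster assignments and centers
        center += 0.5 * points.map(lambda p: p - center).mean()
        print(i, center)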

Session 16 (2 Hrs)

Spark SQL and DataFrames

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala and Hive-on-Spark
  • Homework Labs: Use Spark SQL for ETL
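
A minimal Python sketch in the Spark 1.x style matching the "SQL Context" topic above (the JSON file is hypothetical):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("people.json")   # schema is inferred from the records

    df.printSchema()
    df.where(df["age"] > 21).select("name").show()

    # the same DataFrame can be queried with SQL once registered
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()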

Session 17 (2 Hrs)

Conclusion and Project Discussion

Session 18 (2 Hrs)

Evaluation:
Students registered for the Cloudera course are evaluated on the following metrics:

  1. Project (40%)
    • Leaderboard Rank (on a competition website such as Kaggle, Data
    • Presentation of Solution
    • Report (soft copy)
  2. End-of-course Exam (40%) - 50 Marks
  3. Attendance (20%)

Recommended Text Books:

  • Learning Spark, by Karau, Konwinski, Wendell, and Zaharia
Optional
  • Hadoop: The Definitive Guide (third edition), by Tom White
  • Using Flume, by Hari Shreedharan
  • Hadoop Operations, by Eric Sammer
  • Programming Hive, by Capriolo, Wampler, and Rutherglen
  • Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills

Tools:
Cloudera supplies two fully configured Hadoop VMs (virtual machines), which include datasets for homework labs. The professor VM contains solutions to homework assignments as well as a set of supporting examples. The student VM does not contain solutions or examples; we leave it to the professor to decide whether to share these items.

Resource Persons:

  1. Dr. Mrs. M. Vijayalakshmi, Vice Principal
  2. Dr. Mrs. Sujata Khedkar, Associate Professor, Department of Computer Engineering
  3. Mrs. Asha Bharambe, Associate Professor, Department of Information Technology
  4. Mrs. Sangeeta Oswal, Assistant Professor, Department of Master of Computer Applications
  5. Mrs. Jayshree Hajgude, Assistant Professor, Department of Information Technology