
Big Data Analytical Training

Learn Hadoop and Big Data in Chennai at BITA Academy – the No. 1 Big Data training institute in Chennai. Call 956600-4616 for more details.

Register today for free demo classes with our industry experts.


Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing application software. Data with many cases (rows) offers greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety and velocity. Other concepts later attributed to big data are veracity (i.e., how much noise is in the data) and value.

Course Syllabus   

INTRODUCTION
CCA 175 Spark and Hadoop Developer – Curriculum

GETTING STARTED
Introduction and Curriculum
Setup Environment – Options
Setup Environment – Locally
Setup Environment – using Cloudera Quickstart VM
Using Windows – Putty and WinSCP
Using Windows – Cygwin
HDFS Quick Preview
YARN Quick Preview
Setup Data Sets

HADOOP CONCEPTS
Hadoop Commands

SCALA FUNDAMENTALS
Introduction and Setting up of Scala
Setup Scala on Windows
Basic Programming Constructs
Functions
Object Oriented Concepts – Classes
Object Oriented Concepts – Objects
Object Oriented Concepts – Case Classes
Collections – Seq, Set and Map
Basic Map Reduce Operations
Setting up Data Sets for Basic I/O Operations
Basic I/O Operations and using Scala Collections APIs
Tuples
Development Cycle – Developing Source code
Development Cycle – Compile source code to jar using SBT
Development Cycle – Setup SBT on Windows
Development Cycle – Compile changes and run jar with arguments
Development Cycle – Setup IntelliJ with Scala
Development Cycle – Develop Scala application using SBT in IntelliJ
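To give a feel for the Scala fundamentals module, here is a minimal sketch that uses a case class, a Seq collection and basic map/reduce style operations; the Product type and the sample values are hypothetical and exist only for illustration.

object ScalaBasicsSketch {
  // Case class: an immutable record with built-in equals/toString
  case class Product(id: Int, category: String, price: Double)

  def main(args: Array[String]): Unit = {
    val products: Seq[Product] = Seq(
      Product(1, "Electronics", 299.99),
      Product(2, "Electronics", 99.49),
      Product(3, "Apparel", 39.99)
    )

    // map: project each product to its price; reduce: add the prices up
    val totalPrice = products.map(_.price).reduce(_ + _)

    // groupBy: bucket products by category; then compute an average per bucket
    val avgPriceByCategory: Map[String, Double] = products
      .groupBy(_.category)
      .map { case (category, items) => category -> (items.map(_.price).sum / items.size) }

    println(s"Total price: $totalPrice")
    avgPriceByCategory.foreach { case (category, avg) => println(s"$category -> $avg") }
  }
}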

DATA INGESTION – APACHE SQOOP
Introduction and Objectives
Accessing Sqoop Documentation
Preview of MySQL on labs
Sqoop connect string and validating using list commands
Run queries in MySQL using eval
Sqoop Import – Simple Import
Sqoop Import – Execution Life Cycle
Sqoop Import – Managing Directories
Sqoop Import – Using split by
Sqoop Import – auto reset to one mapper
Sqoop Import – Different file formats
Sqoop Import – Using compression
Sqoop Import – Using Boundary Query
Sqoop Import – columns and query
Sqoop Import – Delimiters and handling nulls
Sqoop Import – Incremental Loads
Sqoop Import – Hive – Create Hive Database
Sqoop Import – Hive – Simple Hive Import
Sqoop Import – Hive – Managing Hive tables
Sqoop Import – Import all tables
Role of Sqoop in typical data processing life cycle
Sqoop Export – Simple export with delimiters
Sqoop Export – Understanding export behavior
Sqoop Export – Column Mapping
Sqoop Export – Update and Upsert
Sqoop Export – Stage Tables

TRANSFORM, STAGE, STORE – SPARK
Introduction
Introduction to Spark
Setup Spark on Windows
Quick overview about Spark documentation
Initializing Spark job using spark-shell
Create Resilient Distributed Data Sets (RDD)
Previewing data from RDD
Reading different file formats – Brief overview using JSON
Transformations Overview
Manipulating Strings as part of transformations using Scala
Row level transformations using map
Row level transformations using flatMap
Filtering the data
Joining data sets – inner join
Joining data sets – outer join
Aggregations – Getting Started
Aggregations – using actions (reduce and countByKey)
Aggregations – understanding combiner
Aggregations using groupByKey – least preferred API for aggregations
Aggregations using reduceByKey
Aggregations using aggregateByKey
Sorting data using sortByKey
Global Ranking – using sortByKey with take and takeOrdered
By Key Ranking – Converting (K, V) pairs into (K, Iterable[V]) using groupByKey
Get topNPrices using Scala Collections API
Get topNPricedProducts using Scala Collections API
Get top n products by category using groupByKey, flatMap and Scala function
Set Operations – union, intersect, distinct as well as minus
Save data in Text File Format
Save data in Text File Format using Compression
Saving data in standard file formats – Overview
Revision of Problem Statement and Design the solution
Solution – Get Daily Revenue per Product – Launching Spark Shell
Solution – Get Daily Revenue per Product – Read and join orders and order_items
Solution – Get Daily Revenue per Product – Compute daily revenue per product id
Solution – Get Daily Revenue per Product – Read products data and create RDD
Solution – Get Daily Revenue per Product – Sort and save to HDFS
Solution – Add spark dependencies to sbt
Solution – Develop as Scala based application
Solution – Run locally using spark-submit
Solution – Ship and run it on big data cluster
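For context on the solution walked through above, the following is a hedged sketch of the daily revenue per product computation with the core RDD API. It assumes comma-delimited retail_db-style orders and order_items data sets as commonly used in CCA 175 preparation; the HDFS paths and field positions are assumptions, not the exact lab layout.

import org.apache.spark.{SparkConf, SparkContext}

object DailyRevenuePerProduct {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyRevenuePerProduct").setMaster("local[*]"))

    // orders: order_id,order_date,order_customer_id,order_status
    val orders = sc.textFile("/public/retail_db/orders")
      .map(_.split(","))
      .filter(o => o(3) == "COMPLETE" || o(3) == "CLOSED")
      .map(o => (o(0).toInt, o(1)))                            // (order_id, order_date)

    // order_items: item_id,order_id,product_id,quantity,subtotal,price
    val orderItems = sc.textFile("/public/retail_db/order_items")
      .map(_.split(","))
      .map(oi => (oi(1).toInt, (oi(2).toInt, oi(4).toDouble))) // (order_id, (product_id, subtotal))

    // inner join on order_id, then aggregate revenue by (date, product_id)
    val dailyRevenue = orders.join(orderItems)
      .map { case (_, (date, (productId, subtotal))) => ((date, productId), subtotal) }
      .reduceByKey(_ + _)
      .sortByKey()

    dailyRevenue
      .map { case ((date, productId), revenue) => s"$date\t$productId\t$revenue" }
      .saveAsTextFile("/user/training/daily_revenue_per_product")

    sc.stop()
  }
}

Packaged into a jar with sbt and launched with spark-submit, the same code can be run locally or shipped to the cluster, as covered in the development cycle topics.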

DATA ANALYSIS – SPARK SQL OR HQL
Different interfaces to run Hive queries
Create Hive tables and load data in text file format
Create Hive tables and load data in ORC file format
Using spark-shell to run Hive queries or commands
Functions – Getting Started
Functions – Manipulating Strings
Functions – Manipulating Dates
Functions – Aggregation
Functions – CASE
Row level transformations
Joins
Aggregations
Sorting
Analytics Functions – Ranking
Windowing Functions
Create Data Frame and Register as Temp table
Writing Spark SQL Applications – process data
Writing Spark SQL Applications – Save data into Hive tables
Data Frame Operations
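As a companion to the Spark SQL topics above, here is a brief sketch in the Spark 1.6-style API often used for CCA 175 preparation (SQLContext and registerTempTable): a Data Frame is built from an RDD of a case class, registered as a temp table and queried. The orders path and column layout are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object OrdersSparkSQLSketch {
  // Column layout assumed from retail_db-style orders data
  case class Order(order_id: Int, order_date: String, order_customer_id: Int, order_status: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OrdersSparkSQLSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a Data Frame from a text file and register it as a temp table
    val ordersDF = sc.textFile("/public/retail_db/orders")
      .map(_.split(","))
      .map(o => Order(o(0).toInt, o(1), o(2).toInt, o(3)))
      .toDF()
    ordersDF.registerTempTable("orders")

    // Run a Hive-style query: order count per status, highest first
    sqlContext.sql(
      """SELECT order_status, count(1) AS order_count
        |FROM orders
        |GROUP BY order_status
        |ORDER BY order_count DESC""".stripMargin).show()

    sc.stop()
  }
}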

DATA INGEST – REAL TIME, NEAR REAL TIME AND STREAMING ANALYTICS
Introduction
Flume – Getting Started
Flume – Web Server Logs to HDFS – Introduction
Flume – Web Server Logs to HDFS – Setup Data
Flume – Web Server Logs to HDFS – Source exec
Flume – Web Server Logs to HDFS – Sink HDFS – Getting Started
Flume – Web Server Logs to HDFS – Sink HDFS – Customize properties
Flume – Web Server Logs to HDFS – Deep dive to memory channel
Kafka – Getting Started – High Level Architecture
Kafka – Getting Started – Produce and consume messages using commands
Kafka – Anatomy of a topic
Flume and Kafka in Streaming analytics
Spark Streaming – Getting Started
Spark Streaming – Setting up netcat
Spark Streaming – Develop Word Count program
Spark Streaming – Ship and run word count program on the cluster
Spark Streaming – Data Structure (DStream) and APIs overview
Spark Streaming – Get department wise traffic – Problem Statement
Spark Streaming – Get department wise traffic – Development
Spark Streaming – Get department wise traffic – Run on the cluster
Flume and Spark Streaming – Department Wise Traffic – Setup Flume
Flume and Spark Streaming – Department Wise Traffic – Add sbt dependencies
Flume and Spark Streaming – Department Wise Traffic – Develop and build
Flume and Spark Streaming – Department Wise Traffic – Run and Validate
Flume and Kafka integration – Develop configuration file
Flume and Kafka integration – Run and validate
Kafka and Spark Streaming – Add dependencies
Kafka and Spark Streaming – Develop and build application
Kafka and Spark Streaming – Run and Validate
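To illustrate the Spark Streaming word count program listed above, here is a minimal sketch that reads lines from a netcat socket; the host and port (localhost:9999) are assumptions, with netcat started as nc -lk 9999 before submitting the job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // DStream of lines from the socket -> words -> (word, 1) -> per-batch counts
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print() // print the first few counts of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}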

SAMPLE SCENARIOS WITH SOLUTIONS
Introduction to Sample Scenarios and Solutions
Problem Statements – General Guidelines
Initializing the job – General Guidelines
Getting crime count per type per month – Understanding Data
Getting crime count per type per month – Implementing the logic – Core API
Getting crime count per type per month – Implementing the logic – Data Frames
Getting crime count per type per month – Validating Output
Get inactive customers – using Core Spark API (leftOuterJoin)
Get inactive customers – using Data Frames and SQL
Get top 3 crimes in RESIDENCE – using Core Spark API
Get top 3 crimes in RESIDENCE – using Data Frame and SQL
Convert NYSE data from text file format to parquet file format
Get word count – with custom control arguments, num keys and file format
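As a flavour of these sample scenarios, the following is a hedged sketch of the crime count per type per month logic using the core Spark API; the input path, the presence of a header row, the date format (MM/dd/yyyy followed by a time) and the column positions of the date and the primary crime type are all assumptions about the data set.

import org.apache.spark.{SparkConf, SparkContext}

object CrimeCountPerTypePerMonthSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CrimeCountPerTypePerMonth").setMaster("local[*]"))

    val crimeData = sc.textFile("/public/crime/csv")
    val header = crimeData.first()

    val countsPerMonthPerType = crimeData
      .filter(_ != header)                 // drop the header row
      .map(_.split(","))
      .map { record =>
        // date assumed like "01/15/2016 08:30:00 PM" in column 2 -> "201601"
        val month = record(2).substring(6, 10) + record(2).substring(0, 2)
        ((month, record(5)), 1)            // primary crime type assumed in column 5
      }
      .reduceByKey(_ + _)
      .sortByKey()
      .map { case ((month, crimeType), count) => s"$month\t$crimeType\t$count" }

    countsPerMonthPerType.saveAsTextFile("/user/training/crime_count_per_month_per_type")
    sc.stop()
  }
}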


Free Demo Classes