PySpark Training Course Syllabus
PART 1: Introduction to Big Data Hadoop
- What is Big Data?
- Big Data Customer Scenarios
- Limitations and Solutions of Existing Data Analytics Architecture
- How does Hadoop solve the Big Data problem?
- What is Hadoop?
- Key Characteristics of Hadoop
- Hadoop Ecosystem and HDFS
- Hadoop Core Components
- Rack Awareness and Block Replication
- YARN and its advantages
- Hadoop Cluster and its architecture
- Hadoop: Different Cluster modes
- Big Data Analytics with Batch and Real Time Processing
PART 2: Why do we need to use Spark with Python?
- History of Spark
- Why do we need Spark?
- How Spark differs from its competitors
PART 3: How to get an Environment and Data?
- CDH + Stack Overflow
- Prerequisites and known issues
- Upgrading Cloudera Manager and CDH
- How to install Spark?
- Stack Overflow and Stack Exchange Dumps
- Preparing your Big Data
PART 4: Basics of Python
- History of Python
- The Python Shell
- Syntax, Variables, Types and Operators
- Compound Variables: List, Tuples and Dictionaries
- Code Blocks, Functions, Loops, Generators and Flow Control
- Map, Filter, Group and Reduce
- Enter PySpark: Spark in the Shell
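The Python basics above (compound types plus map, filter and reduce) can be previewed with a short stdlib-only sketch; the variable names and sample data here are purely illustrative:

```python
from functools import reduce

# Compound types: list, tuple and dictionary
langs = ["python", "scala", "java"]   # list
point = (3, 4)                        # tuple
counts = {"spark": 2, "hadoop": 1}    # dictionary

# map / filter / reduce over a list
lengths = list(map(len, langs))                         # length of each name
long_names = list(filter(lambda s: len(s) > 4, langs))  # names longer than 4 chars
total = reduce(lambda a, b: a + b, lengths)             # sum of all lengths

print(lengths, long_names, total)
```

These same functional building blocks reappear in PySpark, where `map`, `filter` and `reduce`-style operations run on distributed data instead of local lists.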
PART 5: Functions and Modules in Python
- Function Parameters
- Global Variables
- Variable Scope and Returning Values
- Lambda functions
- Object Oriented Concepts
- Standard Libraries
- Modules used in Python
- The Import Statements
- Module Search Path
- Package Installation
PART 6: Overview of Spark
- Spark, Word Count, Operations and Transformations
- Fine Grained Transformations and Scalability
- How does Word Count work?
- Parallelism by Partitioning Data
- Spark Performance
- Narrow and Wide Transformations
- Lazy Execution, Lineage, Directed Acyclic Graph (DAG) and Fault Tolerance
- The Spark Libraries and Spark Packages
PART 7: Deep Dive on Spark
- Spark Architecture
- Storage in Spark and supported Data formats
- Low Level and High Level Spark API
- Performance Optimization: Tungsten and Catalyst
- Deep Dive on Spark Configuration
- Spark on YARN: The Cluster Manager
- Spark with Cloudera Manager and YARN UI
- Visualizing your Spark App: Web UI and History Server
PART 8: The Core of Spark – RDDs
- Deep Dive on Spark Core
- Spark Context: Entry Point to Spark App
- RDD and Pair RDD – Resilient Distributed Datasets
- Creating RDD with Parallelize
- Partition, Repartition, Saving as Text and HUE
- How to create RDDs from External Data Sets?
- How to create RDDs with transformations?
- Lambda functions in Spark
- A quick look at Map, FlatMap, Filter and Sort
- Why do we need Actions?
- Partition Operations: MapPartitions and PartitionBy
- Sampling your Data
- Set Operations
- Combining, Aggregating, Reducing and Grouping on Pair RDDs
- Comparison of ReduceByKey and GroupByKey
- How to group Data into buckets with Histogram?
- Caching and Data Persistence
- Accumulators and Broadcast Variables
- Developing a self-contained PySpark App, Package and Files
- Disadvantages of RDD
PART 9: DataFrames and Spark SQL
- How to Create Data Frames?
- DataFrames to RDDs
- Loading Data Frames: Text and CSV
- Parquet and JSON Data Loading
- Rows, Columns, Expressions and Operators
- Working with Columns
- User Defined Functions on Spark SQL
PART 10: Deep Dive on DataFrames and SQL
- Querying, Sorting and Filtering DataFrames
- How to handle missing or corrupt Data?
- Saving DataFrames
- How to query using temporary views?
- Loading Files and Views into DataFrames using SparkSQL
- Hive Support and External Databases
- Aggregating, Grouping and Joining
- The Catalog API
- A quick look at Data
PART 11: Apache Spark Streaming
- Why is Streaming necessary?
- What is Spark Streaming?
- Spark Streaming features and workflow
- Streaming Context and DStreams
- Transformation on DStreams
Why is PySpark trending now?
Enterprises need scalable storage and compute to analyze their growing data. Spark was open-sourced in 2010, when it had only about 1,600 lines of code; it was donated to the Apache Software Foundation in 2013 and became a top-level project in 2014. Spark continues to move strongly into the future, and its defining abstraction is the Resilient Distributed Dataset (RDD). The Python API is a big advantage, because Python offers many libraries and frameworks for data mining and machine learning, and its elegant, readable syntax makes that work easier. You will learn these key benefits in our PySpark developer course in Chennai.
PySpark also works with DataFrames, which are distributed collections of structured or semi-structured data; a DataFrame can be created from an RDD or from an explicit schema. Learning PySpark is therefore a strong starting point for IT professionals who want to begin a career in the Big Data field. Feel free to contact us if you have any queries; we are here to help. We offer special discounts for college students and freshers. Call us if you need a free demo session. We wish you all the best.
Frequently Asked Questions
We will first arrange a counselling session to understand your requirements, and based on that we will allot one of our trainers, all of whom are industry experts with real-time working experience in this field.
Yes. We will arrange a backup session for you if you miss any of the classes. However, we request you to attend regularly, as we have a limited number of training sessions per course.
Yes, you need a laptop to attend our classroom training sessions. We will provide you the details of the software required for the course.
Yes. Our tech team will assist you with installing the software required for the course, and we will guide you and offer technical support if you face any issues during the course period.
Yes. We have a proper process in place to share with you the materials and code that will be used in this course.
Yes, you can walk in to our office any time for practice sessions. Our support team is always available to support you.
You can call us or walk in to our office for more details.
Yes. We provide a certificate after completion of the course, which adds value to your profile when you attend job interviews.
Yes. We offer good discounts for professionals or students who join as a batch. Please call us for details on the current offers.
Yes, we offer corporate training at the best price, ensuring there is no compromise on quality. Call us if you need support there.