Introduction to Big Data using PySpark

Everyone has heard about Big Data, but what is it really? What can we do with it? And how can we handle datasets of several terabytes?

In this lesson, we introduce Big Data analysis using PySpark.

The Spark Python API (PySpark) exposes the Spark programming model to Python. Apache® Spark™ is an open-source project and one of the most popular Big Data frameworks for scaling up your tasks on a cluster. It was developed to exploit distributed, in-memory data structures to improve data processing speeds.

The goal of this lesson is to teach novice programmers to write Python code using the map-reduce programming model.
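As a first taste of the map-reduce model, here is a minimal sketch in plain Python (no Spark needed yet) of the three building blocks the lesson relies on:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element of a collection
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25, 36]

# filter: keep only the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4, 6]

# reduce: combine elements pairwise into a single result
total = reduce(lambda a, b: a + b, numbers)          # 21
```

The same pattern scales from a six-element list on a laptop to terabytes of data on a cluster, which is exactly what Spark automates.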

Prerequisites

A basic knowledge of Python is recommended, but you don’t need any previous knowledge of Big Data analysis or Apache Spark.

Schedule

00:00 What is Big Data?
00:30 Introduction to the UIO galaxy eduPortal
      How do I find my way around the UIO galaxy eduPortal?
      How do I interact with a Python/PySpark Jupyter notebook?
00:45 Map-filter-reduce in Python
      What is a lambda function in Python?
      What are the map, filter, and reduce functions in Python?
      How can I use map, filter, and reduce in Python?
01:20 Introduction to (Py)Spark
      What are Spark and PySpark?
      How do I use PySpark?
      How do I define a Spark context?
      How can I create an RDD (Resilient Distributed Dataset)?
02:35 Introduction to Spark SQL
      What is Spark SQL?
      How do I create a Spark DataFrame?
      How do I query a DataFrame with SQL?
03:20 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.