Introduction to Big Data using PySpark

Everyone has heard about Big Data, but what is it really? What can we do with it? And how can we handle datasets of several terabytes?

In this lesson, we introduce Big Data analysis using PySpark.

The Spark Python API (PySpark) exposes the Spark programming model to Python. Apache® Spark™ is an open-source project and one of the most popular Big Data frameworks for scaling up your tasks on a cluster. It was developed to exploit distributed, in-memory data structures to improve data processing speeds.

The goal of this lesson is to teach novice programmers to write Python code using the map-reduce programming model.
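As a first taste of the map-reduce model, here is a minimal sketch in plain Python (no Spark needed yet) of the three building blocks the lesson relies on:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element of a collection
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25, 36]

# filter: keep only the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4, 6]

# reduce: combine elements pairwise into a single result
total = reduce(lambda a, b: a + b, numbers)          # 21
```

The same pattern scales from a six-element list on a laptop to terabytes of data on a cluster, which is exactly what Spark automates.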

Prerequisites

A basic knowledge of Python is recommended, but you don’t need any previous knowledge of Big Data analysis or Apache Spark.

Schedule

00:00 What is Big Data?
00:30 Introduction to the UIO galaxy eduPortal
      How do I find my way around the UIO galaxy eduPortal?
      How do I interact with a Python/PySpark Jupyter notebook?
00:45 Map-filter-reduce in Python
      What is a lambda function in Python?
      What are the map, filter, and reduce functions in Python?
      How can I use map, filter, and reduce in Python?
01:20 Introduction to (Py)Spark
      What are Spark and PySpark?
      How do I use PySpark?
      How do I define a Spark context?
      How can I create an RDD (Resilient Distributed Dataset)?
02:35 Introduction to Spark SQL
      What is Spark SQL?
      How do I create a Spark DataFrame?
      How do I query a DataFrame with SQL?
03:20 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.