Introduction to big-data using PySpark

Introduction to UIO Galaxy eduPortal

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How do you find your way around the UIO Galaxy eduPortal?

  • How do you interact with a Python/PySpark Jupyter notebook?

Objectives
  • To gain familiarity with the various panes in the UIO Galaxy eduPortal

  • To gain familiarity with the buttons and options in the PySpark Jupyter notebook

  • To be able to manage your Galaxy history

Motivation

You have been using Python to analyze your data, but as your data grows in volume and complexity you are ready to take a further step. This lesson will teach you how to start using PySpark and introduce you to the map-reduce programming model.
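To give a first feel for the map-reduce model mentioned above, here is a minimal sketch in plain Python (no Spark needed). The sample sentences and the helper names `map_phase` and `reduce_phase` are made up for illustration. In PySpark the same idea is usually expressed with RDD operations such as `flatMap`, `map` and `reduceByKey`, which run the steps in parallel across many workers.

```python
from functools import reduce

# Made-up sample data: each string stands for one line of a text file.
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

def map_phase(line):
    """Map step: turn one line into a list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_phase(counts, pair):
    """Reduce step: fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

# Apply the map step to every line, then flatten the per-line lists.
pairs = [pair for mapped in map(map_phase, lines) for pair in mapped]

# Combine all pairs that share the same word into a single count.
word_counts = reduce(reduce_phase, pairs, {})

print(word_counts)
```

The map step looks at each line independently, which is what allows a framework like Spark to split the work across processors. Only the reduce step needs to bring values for the same key together.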

Before starting the workshop

To ease our work and avoid installing Spark on your laptop, we will be using the UIO Galaxy eduPortal. If you haven’t received a login and password yet, don’t panic. This can be handled in a few minutes during the workshop.

 

Remark: without changing your PySpark code, you can scale up to a hundred processors on UIO HPC Abel… For more information see

Introduction to UIO Galaxy eduPortal

We’ll be using the UIO Galaxy eduPortal. Galaxy is an open source, web-based platform for data-intensive research that was initially developed for biomedical research.

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make data-intensive research accessible to research scientists who do not have extensive computer programming experience. To learn more, take one of our Galaxy tours.

For most of the images below, you can click to view a short video or get detailed documentation on the corresponding subject.


Login panel [image: GalaxyLogin]


Basic layout [image: GalaxyLWelcome]


Get help [image: GalaxyHelp]


Upload data [image: GalaxyUpload]


Share Data [image: GalaxyShare]


Data Libraries [image: GalaxyDataLibraries]


Navigate histories [image: GalaxyHistories]


Introduction to the PySpark Jupyter notebook

Start a PySpark Jupyter notebook

A Jupyter notebook can be started either from an existing dataset in your History or from our Jupyter notebook template for Python 3.

[image: GalaxyHistories]


Then you should get:

[image: GalaxyHistories]
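Once the notebook is open, a first cell can confirm that it runs Python 3 and that the `pyspark` package is importable. The sketch below is only a suggested sanity check, not part of the eduPortal template; the helper name `environment_report` is hypothetical.

```python
import importlib.util
import sys

def environment_report():
    """Summarize the Python version and whether pyspark can be imported."""
    return {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "is_python3": sys.version_info.major == 3,
        # find_spec returns None when the package is not installed.
        "pyspark_available": importlib.util.find_spec("pyspark") is not None,
    }

report = environment_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `pyspark_available` is reported as `False`, the notebook was probably started from a template without Spark support, and it may be worth restarting from the PySpark template.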

       

Key Points