Introduction to big-data using PySpark

Introduction to Spark SQL

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is Spark SQL?

  • How do I create a Spark DataFrame?

  • How do I query a DataFrame with SQL?

Objectives
  • Learn about Spark SQL

Spark SQL

Spark SQL is a component on top of Spark Core that facilitates processing of structured and semi-structured data and integrates several data sources and formats (Hive, Parquet, JSON).

It lets you query and transform structured data using SQL (Structured Query Language).

To start Spark SQL within your notebook, you need to create a SQL context.

For this exercise, import a JSON file into a new history called “World Cup”. You can find the historical World Cup player dataset in JSON format in our Data Library, named “Historical world cup player data”.

Then create a new Python 3 Jupyter notebook (change the kernel if it defaults to Python 2) from this file:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create a local Spark context and wrap it in a SQL context
sc = SparkContext('local', 'Spark SQL')
sqlc = SQLContext(sc)
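
If you are running Spark 2.x or later, the SparkSession API is the recommended single entry point and can replace the SparkContext/SQLContext pair above; a minimal sketch:

from pyspark.sql import SparkSession

# SparkSession (Spark 2.x+) bundles SparkContext and SQLContext in one object
spark = SparkSession.builder \
    .master('local') \
    .appName('Spark SQL') \
    .getOrCreate()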

We can read the JSON file we have in our history and create a DataFrame (Spark SQL has a built-in JSON reader):

players = sqlc.read.json(get(1))
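
Here get(1) is the helper provided by the Galaxy notebook environment to fetch dataset number 1 from the current history. Outside Galaxy, you would point the reader at a file path instead; a minimal sketch (the filename below is just a placeholder):

# Read the JSON file directly from disk instead of the Galaxy history
# ("worldcup_players.json" is a placeholder path)
players = sqlc.read.json("worldcup_players.json")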

# Print the schema in a tree format

players.printSchema()
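
Beyond the schema, a couple of other quick checks are handy; these use standard DataFrame methods:

# List the column names
print(players.columns)

# Count the number of rows in the DataFrame
print(players.count())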

" Select only the "FullName" column
players.select("FullName").show(20)
+--------------------+
|            FullName|
+--------------------+
|        Ángel Bossio|
|        Juan Botasso|
|      Roberto Cherro|
|   Alberto Chividini|
|                    |
|                    |
|       Juan Evaristo|
|      Mario Evaristo|
|     Manuel Ferreira|
|          Luis Monti|
|                    |
|   Rodolfo Orlandini|
|Fernando Paternoster|
|   Natalio Perinetti|
|     Carlos Peucelle|
|     Edmundo Piaggio|
|  Alejandro Scopelli|
|      Carlos Spadaro|
|                    |
|                    |
+--------------------+
only showing top 20 rows
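
Notice the blank rows: some records have an empty FullName. You can filter those out with a where clause; a minimal sketch:

# Keep only rows where FullName is non-empty
players.select("FullName").where(players.FullName != "").show(5)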

Then we can create a temporary view of our DataFrame. The lifetime of this view is tied to the SparkSession (or SQLContext) that was used to create the DataFrame. (On Spark 2.x and later, registerTempTable is deprecated in favour of the equivalent createOrReplaceTempView.)

players.registerTempTable("players")

We can then query our view; for instance to get the names of all the Teams:

sqlc.sql("select distinct Team from players").show(10)
+--------+
|    Team|
+--------+
|England |
|Paraguay|
|     POL|
|  Russia|
|     BRA|
| Senegal|
|  Sweden|
|     FRA|
|     ALG|
|  Spain |
+--------+
only showing top 10 rows
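
Note the mixed conventions in the data: some rows store full team names ("Paraguay"), others FIFA codes ("BRA"), and a few have trailing spaces. Spark SQL's built-in string functions can help tidy this up; for instance, trim removes the stray whitespace:

# Strip leading/trailing whitespace from team names before deduplicating
sqlc.sql("select distinct trim(Team) as Team from players").show(10)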

The full power of SQL is available for building queries:

# Select the team names from 2014 only
team2014 = sqlc.sql("select distinct Team from players where Year = 2014")

# The results of SQL queries are DataFrame objects.
# The rdd attribute returns the content as a pyspark.RDD of Row objects.
teamNames = team2014.rdd.map(lambda p: "Name: " + p.Team).collect()
for name in teamNames:
    print(name)
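
The same query can also be expressed directly with the DataFrame API, without writing any SQL; a minimal sketch of the equivalent calls:

# Equivalent query using DataFrame methods instead of SQL
team2014 = players.where(players.Year == 2014).select("Team").distinct()
team2014.show(10)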

Pandas

Within PySpark, it is also easy to convert a Spark DataFrame to a pandas DataFrame.

pandas DataFrames can likewise be converted back to Spark DataFrames (see the sketch after the example below).

players.toPandas().head()
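
Going the other way, SQLContext.createDataFrame accepts a pandas DataFrame; a minimal sketch (the pandas frame here is only illustrative):

import pandas as pd

# Build a small pandas DataFrame (illustrative data)
pdf = pd.DataFrame({"Team": ["BRA", "FRA"], "Year": [2014, 2014]})

# Convert it to a Spark DataFrame
spark_df = sqlc.createDataFrame(pdf)
spark_df.show()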


Key Points
  • Spark SQL lets you query DataFrames using SQL.

  • DataFrames can be created from JSON files and registered as temporary views.

  • Spark DataFrames can be converted to and from pandas DataFrames.