
PySpark Databricks Exercise: DataFrame Part 1


The purpose of this series is to provide some exercises in using PySpark on a cloud platform.

Load the libraries and create a handle for Spark with SparkSession

# imports: the SparkSession entry point, type definitions, and Row (used below)
from pyspark.sql import SparkSession
from pyspark.sql import types
from pyspark.sql import Row
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
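Note that on Databricks, spark and sc already exist in the notebook, so getOrCreate() simply returns the running session. A quick sanity check (a minimal sketch):

# confirm the session is live and check the Spark version
print(spark.version)
print(sc.defaultParallelism)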

Practice how to create a DataFrame

Create an input list with two people's names and ages

namelist = [('Ana', 25), ('Bob', 30)]

Create a DataFrame from an RDD? RDD -> Row -> DataFrame

# parallelize the list into an RDD, wrap each tuple in a Row, then build the DataFrame
people_rdd = sc.parallelize(namelist)
people_row = people_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
people_df = spark.createDataFrame(people_row)
people_df.show()
+---+----+
|age|name|
+---+----+
| 25| Ana|
| 30| Bob|
+---+----+
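Side note: the Row step can be skipped, since createDataFrame also accepts an RDD of tuples together with a list of column names. A minimal sketch (people_df_alt is just an illustrative name):

# shortcut: tuples plus column names, no Row objects needed
people_df_alt = spark.createDataFrame(people_rdd, ['name', 'age'])
people_df_alt.show()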

Create a DataFrame from a file on disk? Disk -> DataFrame

Generate a file and save it to the Databricks file system (DBFS)

# remove any previous copy, then write the sample file to DBFS
dbutils.fs.rm("/yytest/namelist.txt")
dbutils.fs.put("/yytest/namelist.txt", "Ana,25\nBob,30\n")
Wrote 14 bytes.
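To verify the write succeeded, dbutils.fs.head can print the beginning of the file:

# peek at the file contents (reads up to the first 65536 bytes by default)
print(dbutils.fs.head("/yytest/namelist.txt"))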

Create a schema for the file

schema = types.StructType([
    types.StructField("name", types.StringType(), True),
    types.StructField("age", types.IntegerType(), True)
])
# read from the disk
people_df2 = spark.read.csv("/yytest/namelist.txt", schema)
people_df2.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
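Equivalently, the same read can be expressed with the generic DataFrameReader chain; a minimal sketch (people_df2b is just an example name):

# same read expressed with the generic reader API
people_df2b = (spark.read
               .format("csv")
               .schema(schema)
               .load("/yytest/namelist.txt"))
people_df2b.show()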
Create a DataFrame from a SQL table? SQL table -> DataFrame

# prepare the table: register the DataFrame as a temporary view
people_df2.createOrReplaceTempView('people_table')
# sql table => df
people_df3 = spark.table('people_table')
people_df3.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
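With the temp view in place, plain SQL via spark.sql works too; for example:

# run an ad-hoc SQL query against the temp view
spark.sql("SELECT name FROM people_table WHERE age > 26").show()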

Create a DataFrame from a pandas DataFrame? pandas DF -> Spark DF

First, create a pandas DataFrame

# create a pandas DataFrame
pddf_name = pd.DataFrame(namelist, columns=['name', 'age'])

Convert the pandas DataFrame to a Spark DataFrame

people_df4 = spark.createDataFrame(pddf_name)
people_df4.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
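The conversion also works the other way: toPandas() collects a Spark DataFrame back into pandas on the driver (fine for two rows; be careful with large tables). A minimal sketch (pddf_back is just an example name):

# Spark DF -> pandas DF; collects all rows to the driver
pddf_back = people_df4.toPandas()
print(pddf_back)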

Practice basic view and query operations on a DataFrame

View and query DataFrame properties

# Check the type of the dataframe
type(people_df)

Out[21]: pyspark.sql.dataframe.DataFrame
# Describe the properties of the dataframe
people_df.describe().show()

+-------+------------------+----+
|summary|               age|name|
+-------+------------------+----+
|  count|                 2|   2|
|   mean|              27.5|null|
| stddev|3.5355339059327378|null|
|    min|                25| Ana|
|    max|                30| Bob|
+-------+------------------+----+
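describe() can also be restricted to specific columns, which keeps the output small on wide tables; for example:

# describe only the numeric column
people_df.describe('age').show()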
# display the dataframe as a rendered table in the notebook
display(people_df)

age | name
----+-----
 25 | Ana
 30 | Bob
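Note that display() is Databricks-specific; show() and take() are the portable equivalents. A minimal sketch:

# portable alternatives to display()
people_df.show(truncate=False)   # print the rows as text
first_row = people_df.take(1)    # returns a list of Row objects
print(first_row)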

View and query column/row properties

# print out the schema of the dataframe
people_df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
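The same schema information is available programmatically through the dtypes and schema attributes; for example:

# (column, type) pairs and the full StructType
print(people_df.dtypes)   # [('age', 'bigint'), ('name', 'string')]
print(people_df.schema)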
# print the column names and the number of records of the dataframe
people_df.columns
people_df.count()

Out[32]: 2
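Building on these basics, simple select and filter queries look like this (a minimal sketch):

# select a single column and filter rows by a condition
people_df.select('name').show()
people_df.filter(people_df.age > 26).show()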