
PySpark Databricks Exercise: DataFrame Part 1


The purpose of this series is to provide some exercises in using PySpark on a cloud platform.

Load the libraries and create a handle for Spark with SparkSession

# imports: the SparkSession entry point, type definitions, and Row (used below)
from pyspark.sql import SparkSession
from pyspark.sql import types
from pyspark.sql import Row
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
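Note that on Databricks, spark and sc already exist in the notebook, so getOrCreate() simply returns the running session. A quick sanity check (a minimal sketch):

# confirm the session is live and check the Spark version
print(spark.version)
print(sc.defaultParallelism)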

Practice how to create a DataFrame

Create an input list with two people's names and ages

namelist = [('Ana', 25), ('Bob', 30)]

Create a DataFrame from an RDD? RDD -> Row -> DataFrame

# parallelize the list into an RDD, wrap each tuple in a Row, then build the DataFrame
people_rdd = sc.parallelize(namelist)
people_row = people_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
people_df = spark.createDataFrame(people_row)
people_df.show()
+---+----+
|age|name|
+---+----+
| 25| Ana|
| 30| Bob|
+---+----+
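Side note: the Row step can be skipped, since createDataFrame also accepts an RDD of tuples together with a list of column names. A minimal sketch (people_df_alt is just an illustrative name):

# shortcut: tuples plus column names, no Row objects needed
people_df_alt = spark.createDataFrame(people_rdd, ['name', 'age'])
people_df_alt.show()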

Create a DataFrame from a file on disk? Disk -> DataFrame

Generate a file and save it to the Databricks file system (DBFS)

# remove any previous copy, then write the sample file to DBFS
dbutils.fs.rm("/yytest/namelist.txt")
dbutils.fs.put("/yytest/namelist.txt", "Ana,25\nBob,30\n")
Wrote 14 bytes.
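To verify the write succeeded, dbutils.fs.head can print the beginning of the file:

# peek at the file contents (reads up to the first 65536 bytes by default)
print(dbutils.fs.head("/yytest/namelist.txt"))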

Create a schema for the file

schema = types.StructType([
    types.StructField("name", types.StringType(), True),
    types.StructField("age", types.IntegerType(), True)
])
# read from the disk
people_df2 = spark.read.csv("/yytest/namelist.txt", schema)
people_df2.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
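Equivalently, the same read can be expressed with the generic DataFrameReader chain; a minimal sketch (people_df2b is just an example name):

# same read expressed with the generic reader API
people_df2b = (spark.read
               .format("csv")
               .schema(schema)
               .load("/yytest/namelist.txt"))
people_df2b.show()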
Create a DataFrame from a SQL table? SQL table -> DataFrame

# prepare the table: register the DataFrame as a temporary view
people_df2.createOrReplaceTempView('people_table')
# sql table => df
people_df3 = spark.table('people_table')
people_df3.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
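With the temp view in place, plain SQL via spark.sql works too; for example:

# run an ad-hoc SQL query against the temp view
spark.sql("SELECT name FROM people_table WHERE age > 26").show()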

Create a DataFrame from a pandas DataFrame? pandas DF -> Spark DF

First, create a pandas DataFrame

# create a pandas DataFrame
pddf_name = pd.DataFrame(namelist, columns=['name', 'age'])

Convert the pandas DataFrame to a Spark DataFrame

people_df4 = spark.createDataFrame(pddf_name)
people_df4.show()
+----+---+
|name|age|
+----+---+
| Ana| 25|
| Bob| 30|
+----+---+
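The conversion also works the other way: toPandas() collects a Spark DataFrame back into pandas on the driver (fine for two rows; be careful with large tables). A minimal sketch (pddf_back is just an example name):

# Spark DF -> pandas DF; collects all rows to the driver
pddf_back = people_df4.toPandas()
print(pddf_back)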

Practice basic view and query operations on a DataFrame

View and query DataFrame properties

# Check the type of the dataframe
type(people_df)

Out[21]: pyspark.sql.dataframe.DataFrame
# Describe the properties of the dataframe
people_df.describe().show()

+-------+------------------+----+
|summary|               age|name|
+-------+------------------+----+
|  count|                 2|   2|
|   mean|              27.5|null|
| stddev|3.5355339059327378|null|
|    min|                25| Ana|
|    max|                30| Bob|
+-------+------------------+----+
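describe() can also be restricted to specific columns, which keeps the output small on wide tables; for example:

# describe only the numeric column
people_df.describe('age').show()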
# display the dataframe as a rendered table in the notebook
display(people_df)

age | name
----+-----
 25 | Ana
 30 | Bob
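Note that display() is Databricks-specific; show() and take() are the portable equivalents. A minimal sketch:

# portable alternatives to display()
people_df.show(truncate=False)   # print the rows as text
first_row = people_df.take(1)    # returns a list of Row objects
print(first_row)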

View and query column/row properties

# print out the schema of the dataframe
people_df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
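The same schema information is available programmatically through the dtypes and schema attributes; for example:

# (column, type) pairs and the full StructType
print(people_df.dtypes)   # [('age', 'bigint'), ('name', 'string')]
print(people_df.schema)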
# print the column names and the number of records of the dataframe
people_df.columns
people_df.count()

Out[32]: 2
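Building on these basics, simple select and filter queries look like this (a minimal sketch):

# select a single column and filter rows by a condition
people_df.select('name').show()
people_df.filter(people_df.age > 26).show()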