Pyspark Databricks Exercise
The purpose of this series is to provide some exercise of using pyspark on a cloud platform.
Pros:
Load library and create a handler for spark with SparkSession
1 2 3 4 5
| from pyspark.sql import SparkSession from pyspark.sql import types spark = SparkSession.builder.getOrCreate() import pandas as pd sc = spark.sparkContext
|
Practice how to create Dataframe
1
| namelist = [('Ana',25),('Bob',30)]
|
Create a Dataframe from RDD ? RDD -> ROW -> DataFrame
1 2 3 4
| people_rdd = sc.parallelize(list) people_row = people_rdd.map(lambda x: Row(name=x[0],age=int(x[1]))) people_df = spark.createDataFrame(people_row) people_df.show()
|
+---+----+
age|name|
+---+----+
25| Ana|
30| Bob|
+---+----+
Create a DataFrame from File on the Disk? Disk -> DataFrame
Generate a file and save it to Databricks file system
1 2
| dbutils.fs.rm("/yytest/namelist.txt") dbutils.fs.put("/yytest/namelist.txt","Ana,25\nBob,30\n")
|
create a schema for the file
1 2 3 4
| schema = types.StructType([types.StructField("name",types.StringType(),True),types.StructField("age",types.IntegerType(),True)]) # read from the disk people_df2 = spark.read.csv("/yytest/namelist.txt", schema) people_df2.show()
|
Wrote 14 bytes.
+----+---+
name|age|
+----+---+
Ana| 25|
Bob| 30|
+----+---+
1 2 3 4 5 6
| #1.3 read from sql table # prepare for the table people_df2.createOrReplaceTempView('people_table') # sql table => df people_df3 = spark.table('people_table') people_df3.show()
|
+----+---+
name|age|
+----+---+
Ana| 25|
Bob| 30|
+----+---+
Create a DataFrame from Pandas dataframe? Pandas DF -> Spark DF
First, create a pandas dataframe
1 2
| pddf_name = pd.DataFrame(namelist,columns=['name','age'])
|
Convert the pandas dataframe to spark dataframe
1 2
| people_df4 = spark.createDataFrame people_df4.show
|
+----+---+
name|age|
+----+---+
Ana| 25|
Bob| 30|
+----+---+
Practice Basic View and Query operations on Dataframe
View and Query of DataFrame properties
1 2
| # Check the type of dataframe type(people_df)
|
Out[21]: pyspark.sql.dataframe.DataFrame
1 2
| # Describe the properties of dataframe people_df.describe().show()
|
+-------+------------------+----+
summary| age|name|
+-------+------------------+----+
count| 2| 2|
mean| 27.5|null|
stddev|3.5355339059327378|null|
min| 25| Ana|
max| 30| Bob|
+-------+------------------+----+
1 2
| # display the dataframe as a framed table on notebook display(people_df)
|
View and Query of columns/row properties
1 2
| # print out schema of the dataframe people_df.printSchema()
|
root
-- age: long (nullable = true)
-- name: string (nullable = true)
1 2 3
| # print columns names, number of records of dataframe people_df.columns people_df.count()
|
Out[32]: 2