PySpark SQL: Adding Constant Columns to a DataFrame

Published: 2021-11-04 | Author: 小白学苑


1. Construct a DataFrame

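from pyspark.sql import SparkSession

# Reuse the active session (for example in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

# Sample rows as a list of dicts
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
        {"Category": 'Category B', "ID": 2, "Value": 30.10},
        {"Category": 'Category C', "ID": 3, "Value": 100.01}
       ]

# Create the DataFrame; the schema is inferred from the dicts
df = spark.createDataFrame(data)
df.show()
df.printSchema()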

Running the code above produces the following output:

+----------+---+------+
|  Category| ID| Value|
+----------+---+------+
|Category A|  1|  12.4|
|Category B|  2|  30.1|
|Category C|  3|100.01|
+----------+---+------+

root
 |-- Category: string (nullable = true)
 |-- ID: long (nullable = true)
 |-- Value: double (nullable = true)

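2. Add constant columns with the lit function

The lit function wraps a literal value in a Column, so it can be used to add a column that holds the same constant value in every row of a DataFrame.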

from datetime import date
from pyspark.sql.functions import lit

df1 = df.withColumn('ConstantColumn1', lit(1)) \
        .withColumn('ConstantColumn2', lit(date.today()))
df1.show()

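Running the code above produces output like the following (ConstantColumn2 holds the date on which the code was run):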

+----------+---+------+---------------+---------------+
|  Category| ID| Value|ConstantColumn1|ConstantColumn2|
+----------+---+------+---------------+---------------+
|Category A|  1|  12.4|              1|     2020-08-11|
|Category B|  2|  30.1|              1|     2020-08-11|
|Category C|  3|100.01|              1|     2020-08-11|
+----------+---+------+---------------+---------------+

3. Add new constant columns with Spark SQL

df.createOrReplaceTempView("tb1")
df2 = spark.sql("select *, 1 as ConstantColumn1, current_date as ConstantColumn2 from tb1")
df2.show()

Running the code above produces the following output:

+----------+---+------+---------------+---------------+
|  Category| ID| Value|ConstantColumn1|ConstantColumn2|
+----------+---+------+---------------+---------------+
|Category A|  1|  12.4|              1|     2020-08-11|
|Category B|  2|  30.1|              1|     2020-08-11|
|Category C|  3|100.01|              1|     2020-08-11|
+----------+---+------+---------------+---------------+

4. Add a new constant column with a UDF

from pyspark.sql.functions import udf

@udf("int")
def const_col():
    return 1

df3 = df.withColumn('ConstantColumn1', const_col())
df3.show()

Running the code above produces the following output:

+----------+---+------+---------------+
|  Category| ID| Value|ConstantColumn1|
+----------+---+------+---------------+
|Category A|  1|  12.4|              1|
|Category B|  2|  30.1|              1|
|Category C|  3|100.01|              1|
+----------+---+------+---------------+