PySpark SQL: Filtering Columns with None or Null Values

1. Create a DataFrame where a column contains None values

from decimal import Decimal
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List of rows; the Value column holds Decimal values, one of them None
data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)},
        {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)},
        {"Category": 'Category C', "ID": 3, "Value": None},
        {"Category": 'Category D', "ID": 4, "Value": Decimal(1.0)},
       ]

# Create the DataFrame
df = spark.createDataFrame(data)
df.show(truncate=False)

The output is as follows:

+----------+---+---------------------+
|Category  |ID |Value                |
+----------+---+---------------------+
|Category A|1  |12.400000000000000355|
|Category B|2  |30.100000000000001421|
|Category C|3  |null                 |
|Category D|4  |1.000000000000000000 |
+----------+---+---------------------+
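
The long tails such as 12.400000000000000355 appear because Decimal(12.40) is built from a binary float, which cannot represent 12.40 exactly. A minimal sketch of the usual fix, constructing Decimal from strings (the names exact_data and df_exact are illustrative):

# Decimal built from a string keeps the exact decimal value,
# so Value would display as 12.400000000000000000
exact_data = [{"Category": 'Category A', "ID": 1, "Value": Decimal("12.40")},
              {"Category": 'Category B', "ID": 2, "Value": Decimal("30.10")},
              {"Category": 'Category C', "ID": 3, "Value": None},
              {"Category": 'Category D', "ID": 4, "Value": Decimal("1.0")},
             ]
df_exact = spark.createDataFrame(exact_data)
df_exact.show(truncate=False)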

2. Filter with a standard ANSI SQL expression

Both filter and where (an alias of filter) accept a SQL expression string:

df.filter("Value is not null").show(truncate=False)
df.where("Value is null").show(truncate=False)

The output is as follows:

+----------+---+---------------------+
|Category  |ID |Value                |
+----------+---+---------------------+
|Category A|1  |12.400000000000000355|
|Category B|2  |30.100000000000001421|
|Category D|4  |1.000000000000000000 |
+----------+---+---------------------+

+----------+---+-----+
|Category  |ID |Value|
+----------+---+-----+
|Category C|3  |null |
+----------+---+-----+
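
The same predicates can also run through spark.sql against a temporary view; a minimal sketch (the view name "t" is arbitrary):

df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE Value IS NOT NULL").show(truncate=False)
spark.sql("SELECT * FROM t WHERE Value IS NULL").show(truncate=False)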

3. Filter with a BooleanType Column object

If a boolean-valued Column is available, it can be passed directly as the filter condition; the isNull and isNotNull methods on a Column return exactly such a BooleanType column:

df.filter(df['Value'].isNull()).show(truncate=False)
df.where(df.Value.isNotNull()).show(truncate=False)

The output is as follows:

+----------+---+-----+
|Category  |ID |Value|
+----------+---+-----+
|Category C|3  |null |
+----------+---+-----+

+----------+---+---------------------+
|Category  |ID |Value                |
+----------+---+---------------------+
|Category A|1  |12.400000000000000355|
|Category B|2  |30.100000000000001421|
|Category D|4  |1.000000000000000000 |
+----------+---+---------------------+
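
When the goal is simply to discard rows whose Value is null, the DataFrame API also offers dropna and its na.drop equivalent; a minimal sketch:

# subset limits the null check to the Value column;
# both calls produce the same result as the isNotNull filter above
df.na.drop(subset=["Value"]).show(truncate=False)
df.dropna(subset=["Value"]).show(truncate=False)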

4. Use the col or column function to obtain a Column object

# column is simply an alias of col; both look a column up by name
from pyspark.sql.functions import col, column

df.filter(col('Value').isNull()).show(truncate=False)
df.filter(column('Value').isNull()).show(truncate=False)

df.where(col('Value').isNotNull()).show(truncate=False)
df.where(column('Value').isNotNull()).show(truncate=False)

The output is as follows:

+----------+---+-----+
|Category  |ID |Value|
+----------+---+-----+
|Category C|3  |null |
+----------+---+-----+

+----------+---+-----+
|Category  |ID |Value|
+----------+---+-----+
|Category C|3  |null |
+----------+---+-----+

+----------+---+---------------------+
|Category  |ID |Value                |
+----------+---+---------------------+
|Category A|1  |12.400000000000000355|
|Category B|2  |30.100000000000001421|
|Category D|4  |1.000000000000000000 |
+----------+---+---------------------+

+----------+---+---------------------+
|Category  |ID |Value                |
+----------+---+---------------------+
|Category A|1  |12.400000000000000355|
|Category B|2  |30.100000000000001421|
|Category D|4  |1.000000000000000000 |
+----------+---+---------------------+
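
There is also a functional form, isnull, and Column conditions compose with & and |; a minimal sketch (note the parentheses around the comparison, since & binds more tightly than > in Python):

from pyspark.sql.functions import isnull

# isnull() is the functional equivalent of Column.isNull()
df.filter(isnull(col('Value'))).show(truncate=False)

# combine a null check with another predicate
df.where(col('Value').isNotNull() & (col('ID') > 1)).show(truncate=False)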

Source: 《PySpark原理深入与编程实战》