PySpark SQL:过滤带有None或Null值的列
2021-11-05 09:26:58.0
1、构造某些列带有None值的DataFrame
from decimal import Decimal # List data = [{"Category": 'Category A', "ID": 1, "Value": Decimal(12.40)}, {"Category": 'Category B', "ID": 2, "Value": Decimal(30.10)}, {"Category": 'Category C', "ID": 3, "Value": None}, {"Category": 'Category D', "ID": 4, "Value": Decimal(1.0)}, ] # 创建DataFrame df = spark.createDataFrame(data) df.show(truncate=False)
输出结果如下:
+----------+---+---------------------+ |Category |ID |Value | +----------+---+---------------------+ |Category A|1 |12.400000000000000355| |Category B|2 |30.100000000000001421| |Category C|3 |null | |Category D|4 |1.000000000000000000 | +----------+---+---------------------+
2、使用标准ANSI-SQL SQL表达式来过滤列
df.filter("Value is not null").show(truncate=False) df.where("Value is null").show(truncate=False)
输出结果如下:
+----------+---+---------------------+ |Category |ID |Value | +----------+---+---------------------+ |Category A|1 |12.400000000000000355| |Category B|2 |30.100000000000001421| |Category D|4 |1.000000000000000000 | +----------+---+---------------------+ +----------+---+-----+ |Category |ID |Value| +----------+---+-----+ |Category C|3 |null | +----------+---+-----+
3、使用type.BooleanType列对象来过滤
如果在DataFrame中存在有boolean列,则可以直接将其作为条件传入:
df.filter(df['Value'].isNull()).show(truncate=False) df.where(df.Value.isNotNull()).show(truncate=False)
输出结果如下:
+----------+---+-----+ |Category |ID |Value| +----------+---+-----+ |Category C|3 |null | +----------+---+-----+ +----------+---+---------------------+ |Category |ID |Value | +----------+---+---------------------+ |Category A|1 |12.400000000000000355| |Category B|2 |30.100000000000001421| |Category D|4 |1.000000000000000000 | +----------+---+---------------------+
4、使用col或column函数将列转换为Column对象
from pyspark.sql.functions import col,column df.filter(col('Value').isNull()).show(truncate=False) df.filter(column('Value').isNull()).show(truncate=False) df.where(col('Value').isNotNull()).show(truncate=False) df.where(column('Value').isNotNull()).show(truncate=False)
输出结果如下:
+----------+---+-----+ |Category |ID |Value| +----------+---+-----+ |Category C|3 |null | +----------+---+-----+ +----------+---+-----+ |Category |ID |Value| +----------+---+-----+ |Category C|3 |null | +----------+---+-----+ +----------+---+---------------------+ |Category |ID |Value | +----------+---+---------------------+ |Category A|1 |12.400000000000000355| |Category B|2 |30.100000000000001421| |Category D|4 |1.000000000000000000 | +----------+---+---------------------+ +----------+---+---------------------+ |Category |ID |Value | +----------+---+---------------------+ |Category A|1 |12.400000000000000355| |Category B|2 |30.100000000000001421| |Category D|4 |1.000000000000000000 | +----------+---+---------------------+