PySpark RDD编程案例_Top N问题

本节我们应用前面所学到的知识,实现几个常见的算法场景。

【示例】给出一个员工信息名单,找出收入最高的前10名员工(Top N问题)。

样本数据 employees.csv内容如下:

ename,title,department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
张三,paramedic i/c,fire,f,salary,,91080.00,
李四,lieutenant,fire,f,salary,,114846.00,
王老五,sergeant,police,f,salary,,104628.00,
赵六,police officer,police,f,salary,,96060.00,
钱七,clerk iii,police,f,salary,,53076.00,
周扒皮,firefighter,fire,f,salary,,87006.00,
吴用,law clerk,law,f,hourly,35,,14.51

实现代码如下。

from pyspark.sql import SparkSession

# 构建SparkSession和SparkContext实例
spark = SparkSession.builder \
   .master("spark://xueai8:7077") \
   .appName("pyspark demo") \
   .getOrCreate()

sc = spark.sparkContext

# 构造RDD
inputPath = "file:///home/hduser/data/spark/employees.csv"
rdd = sc.textFile(inputPath)

# 排序函数
def sortFun(arr):
    if len(arr[6]) > 0:
        return float(arr[6])
    else:
        return 0.0

# 计算过程
sortedData = rdd \
    .filter(lambda line: not line.startswith("ename")) \
    .map(lambda line: line.split(",")) \
    .sortBy(sortFun, False)

# 取前3个
top = sortedData.take(3)

for row in top:
    print(row)

执行以上代码,得到如下的结果:

['李四', 'lieutenant', 'fire', 'f', 'salary', '', '114846.00', '']
['王老五', 'sergeant', 'police', 'f', 'salary', '', '104628.00', '']
['赵六', 'police officer', 'police', 'f', 'salary', '', '96060.00', '']

《PySpark原理深入与编程实战》