PySpark RDD编程案例_Top N问题
本节我们应用前面所学到的知识,实现几个常见的算法场景。
【示例】给出一个员工信息名单,找出收入最高的前10名员工(Top N问题)。
样本数据 employees.csv内容如下:
ename,title,department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate 张三,paramedic i/c,fire,f,salary,,91080.00, 李四,lieutenant,fire,f,salary,,114846.00, 王老五,sergeant,police,f,salary,,104628.00, 赵六,police officer,police,f,salary,,96060.00, 钱七,clerk iii,police,f,salary,,53076.00, 周扒皮,firefighter,fire,f,salary,,87006.00, 吴用,law clerk,law,f,hourly,35,,14.51
实现代码如下。
from pyspark.sql import SparkSession # 构建SparkSession和SparkContext实例 spark = SparkSession.builder \ .master("spark://xueai8:7077") \ .appName("pyspark demo") \ .getOrCreate() sc = spark.sparkContext # 构造RDD inputPath = "file:///home/hduser/data/spark/employees.csv" rdd = sc.textFile(inputPath) # 排序函数 def sortFun(arr): if len(arr[6]) > 0: return float(arr[6]) else: return 0.0 # 计算过程 sortedData = rdd \ .filter(lambda line: not line.startswith("ename")) \ .map(lambda line: line.split(",")) \ .sortBy(sortFun, False) # 取前3个 top = sortedData.take(3) for row in top: print(row)
执行以上代码,得到如下的结果:
['李四', 'lieutenant', 'fire', 'f', 'salary', '', '114846.00', ''] ['王老五', 'sergeant', 'police', 'f', 'salary', '', '104628.00', ''] ['赵六', 'police officer', 'police', 'f', 'salary', '', '96060.00', '']