Spark SQL编程案例2_电商用户评论数据集分析
本案例的数据集包含电子产品类别的大约169万条Amazon评论。我们可以直接读取JSON数据集来创建Spark SQLDataFrame。
【例】从JSON文件中读取一组订单记录进行分析。
实现代码如下:
def main(args: Array[String]): Unit = { import org.apache.spark.sql.SparkSession // 创建SparkSession的实例 val spark = SparkSession.builder() .master("local[*]") .appName("Spark Basic Example") .getOrCreate() val sc = spark.sparkContext // 数据集路径 val filePath = "src/main/resources/amazon/Electronics_5.json" // 加载数据文件,创建DataFrame val reviewsDF = spark.read.json(filePath).cache() // 简单查看 reviewsDF.printSchema() // reviewsDF.show(5) // 创建临时视图 reviewsDF.createOrReplaceTempView("reviewsTable") // 执行SQL查询,找出给出综合评分(overall)大于3的评论 val sqlStr = """ SELECT asin, overall, reviewTime, reviewerID, reviewerName FROM reviewsTable WHERE overall >= 3 """ val selectedDF = spark.sql(sqlStr) selectedDF.show(5) import spark.implicits._ // 提取数组元素 val selectedJSONArrayElementDF = reviewsDF .select($"asin", $"overall", $"helpful") .where($"helpful".getItem(0) < 3) selectedJSONArrayElementDF.show(5) }
执行以上代码,输出结果如下:
root |-- asin: string (nullable = true) |-- helpful: array (nullable = true) | |-- element: long (containsNull = true) |-- overall: double (nullable = true) |-- reviewText: string (nullable = true) |-- reviewTime: string (nullable = true) |-- reviewerID: string (nullable = true) |-- reviewerName: string (nullable = true) |-- summary: string (nullable = true) |-- unixReviewTime: long (nullable = true) +----------+-------+-----------+--------------+--------------------+ | asin|overall| reviewTime| reviewerID| reviewerName| +----------+-------+-----------+--------------+--------------------+ |0528881469| 5.0| 06 2, 2013| AO94DHGC771SJ| amazdnu| |0528881469| 3.0| 09 9, 2010|A3N7T0DY83Y4IG| C. A. Freeman| |0594451647| 5.0| 01 3, 2014|A2JXAZZI9PHK9Z|Billy G. Noland "...| |0594451647| 5.0| 05 4, 2014| AAZ084UMH8VZ2|D. L. Brown "A Kn...| |0594451647| 4.0|07 11, 2014| AEZ3CR6BKIROJ| Mark Dietter| +----------+-------+-----------+--------------+--------------------+ only showing top 5 rows +----------+-------+-------+ | asin|overall|helpful| +----------+-------+-------+ |0528881469| 5.0| [0, 0]| |0528881469| 1.0| [0, 0]| |0594451647| 2.0| [0, 0]| |0594451647| 5.0| [0, 0]| |0594451647| 4.0| [0, 0]| +----------+-------+-------+ only showing top 5 rows