Spark SQL编程案例2_电商用户评论数据集分析

本案例的数据集包含电子产品类别的大约169万条Amazon评论。我们可以直接读取JSON数据集来创建Spark SQLDataFrame。

【例】从JSON文件中读取一组订单记录进行分析。

实现代码如下:

def main(args: Array[String]): Unit = {

    import org.apache.spark.sql.SparkSession

    // 创建SparkSession的实例
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark Basic Example")
      .getOrCreate()
    val sc = spark.sparkContext

    // 数据集路径
    val filePath = "src/main/resources/amazon/Electronics_5.json"

    // 加载数据文件,创建DataFrame
    val reviewsDF = spark.read.json(filePath).cache()

    // 简单查看
    reviewsDF.printSchema()
    // reviewsDF.show(5)

    // 创建临时视图
    reviewsDF.createOrReplaceTempView("reviewsTable")

    // 执行SQL查询,找出给出综合评分(overall)大于3的评论
    val sqlStr =
      """
        SELECT asin, overall, reviewTime, reviewerID, reviewerName
        FROM reviewsTable
        WHERE overall >= 3
      """
    val selectedDF = spark.sql(sqlStr)
    selectedDF.show(5)

    import spark.implicits._
    // 提取数组元素
    val selectedJSONArrayElementDF = reviewsDF
      .select($"asin", $"overall", $"helpful")
      .where($"helpful".getItem(0) < 3)
    selectedJSONArrayElementDF.show(5)

  }

执行以上代码,输出结果如下:

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

+----------+-------+-----------+--------------+--------------------+
|      asin|overall| reviewTime|    reviewerID|        reviewerName|
+----------+-------+-----------+--------------+--------------------+
|0528881469|    5.0| 06 2, 2013| AO94DHGC771SJ|             amazdnu|
|0528881469|    3.0| 09 9, 2010|A3N7T0DY83Y4IG|       C. A. Freeman|
|0594451647|    5.0| 01 3, 2014|A2JXAZZI9PHK9Z|Billy G. Noland "...|
|0594451647|    5.0| 05 4, 2014| AAZ084UMH8VZ2|D. L. Brown "A Kn...|
|0594451647|    4.0|07 11, 2014| AEZ3CR6BKIROJ|        Mark Dietter|
+----------+-------+-----------+--------------+--------------------+
only showing top 5 rows

+----------+-------+-------+
|      asin|overall|helpful|
+----------+-------+-------+
|0528881469|    5.0| [0, 0]|
|0528881469|    1.0| [0, 0]|
|0594451647|    2.0| [0, 0]|
|0594451647|    5.0| [0, 0]|
|0594451647|    4.0| [0, 0]|
+----------+-------+-------+
only showing top 5 rows

《Spark原理深入与编程实战》