PySpark fails to read a Kafka topic: java.lang.NoClassDefFoundError

2022-02-25 18:11:32.0

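Problem description

A PySpark Structured Streaming program that uses Kafka as its streaming source and reads a specified Kafka topic fails at run time with the following exception: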

20/06/14 12:19:18 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:607)
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
    at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:51)
    at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:39)
    ......
    
Caused by: java.lang.ClassNotFoundException: org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 22 more

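Cause

Spark's classpath is missing the commons-pool2 JAR. The class named in the exception, org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig, belongs to that library, and the Kafka connector (org.apache.spark.sql.kafka010) needs it when it initializes KafkaDataConsumer.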
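
As a quick check before changing anything, the sketch below lists any commons-pool2 JARs found in a given directory. The function name find_jars and the example path in the usage note are illustrative assumptions, not part of the original post.

```python
from pathlib import Path


def find_jars(jars_dir, prefix="commons-pool2"):
    """Return the sorted names of JAR files in jars_dir that start with prefix."""
    directory = Path(jars_dir)
    if not directory.is_dir():
        # A missing directory is treated the same as "no matching JARs".
        return []
    return sorted(p.name for p in directory.glob(f"{prefix}*.jar"))
```

For example, find_jars("/opt/spark/jars") (a hypothetical install path) returning an empty list would suggest that the JAR is indeed absent from that Spark installation.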

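Solution

First, download the missing commons-pool2-2.11.1.jar from: https://mvnrepository.com/artifact/org.apache.commons/commons-pool2/2.11.1

Then use one of the following two approaches:

Option 1: Copy commons-pool2-2.11.1.jar into the jars/ directory of the Spark installation, then restart the Spark cluster.

Option 2: When building the SparkSession, configure the classpath so that the JAR is visible.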

from pyspark.sql import SparkSession

# Put the downloaded JAR on the driver's classpath
spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", "/home/hdusers/jars/commons-pool2-2.11.1.jar") \
    ......
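
The setting above covers only the driver, while the exception was reported by an Executor task. In local mode the executor runs inside the driver JVM, so the driver setting may be enough. On a cluster, one possible variation is to set spark.executor.extraClassPath as well, assuming the JAR exists at the same path on every worker node. A minimal sketch follows; the helper names classpath_conf and build_session and the application name are hypothetical.

```python
def classpath_conf(jar_path):
    """Build Spark settings that expose jar_path to both driver and executors."""
    return {
        "spark.driver.extraClassPath": jar_path,
        # Assumption: the same path is valid on every worker node.
        "spark.executor.extraClassPath": jar_path,
    }


def build_session(jar_path, app_name="kafka-reader"):
    """Create a SparkSession with the extra classpath entries applied."""
    # Imported lazily so classpath_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in classpath_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session("/home/hdusers/jars/commons-pool2-2.11.1.jar") would then take the place of the builder chain shown above. These settings need to be in place before the session is first created, since classpath options are read when the JVM starts.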

《PySpark原理深入与编程实战》