PySpark fails to read a Kafka topic: java.lang.NoClassDefFoundError
2022-02-25 18:11:32.0
Problem description
A PySpark Structured Streaming program that uses Kafka as its streaming source and reads a specified topic fails at runtime with the following exception:
20/06/14 12:19:18 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:607)
    at org.apache.spark.sql.kafka010.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
    at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:51)
    at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:39)
    ......
Caused by: java.lang.ClassNotFoundException: org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 22 more
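Cause
Spark is missing the commons-pool2 jar. The class named in the trace, org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig, belongs to Apache Commons Pool 2, which Spark's Kafka source (org.apache.spark.sql.kafka010) needs at runtime but cannot find on the classpath.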
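To confirm this diagnosis before changing anything, one option is to scan the jars in Spark's jars/ directory for the class named in the stack trace. The following is a minimal sketch using only the Python standard library; the helper name jars_providing is hypothetical, and the directory to scan is whatever your installation uses (for example $SPARK_HOME/jars, which is an assumption about your setup).

```python
import zipfile
from pathlib import Path

# Class named in the stack trace, written as an entry path inside a jar.
MISSING_CLASS = "org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig.class"


def jars_providing(jars_dir, class_entry=MISSING_CLASS):
    """Return the sorted names of jars in jars_dir that contain class_entry."""
    found = []
    for jar in sorted(Path(jars_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_entry in zf.namelist():
                    found.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives.
            continue
    return found
```

An empty result for the Spark jars/ directory would be consistent with the cause described above; a non-empty result suggests the jar is present and the problem lies elsewhere, such as a different classpath being used at runtime.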
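Solution
First, download the missing commons-pool2-2.11.1.jar from: https://mvnrepository.com/artifact/org.apache.commons/commons-pool2/2.11.1
Then use either of the following two approaches:
Option 1: Copy commons-pool2-2.11.1.jar into the jars/ directory under the Spark installation directory, then restart the Spark cluster.
Option 2: When building the SparkSession, configure the jar so that it is visible on the classpath.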
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", "/home/hdusers/jars/commons-pool2-2.11.1.jar") \
    ......
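The exception above is logged by an Executor. In local mode the executor runs inside the driver JVM, so the driver setting alone can be enough, but on a cluster the executors run in separate JVMs and would likely need the same jar on spark.executor.extraClassPath as well. The sketch below is an assumption that goes beyond the original snippet: the helper names classpath_conf and build_session and the application name are hypothetical, and the jar path must exist at the same location on every node.

```python
def classpath_conf(jar_path):
    """Spark settings that put jar_path on both driver and executor classpaths."""
    return {
        "spark.driver.extraClassPath": jar_path,
        # Assumption: executors run in separate JVMs and need the jar too.
        "spark.executor.extraClassPath": jar_path,
    }


def build_session(jar_path, app_name="kafka-reader"):
    """Create a SparkSession with the jar visible to driver and executors."""
    # Imported lazily so classpath_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in classpath_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, build_session("/home/hdusers/jars/commons-pool2-2.11.1.jar") would apply both settings with the path used above. These classpath settings are read when the JVM starts, so they need to be set before the first SparkSession is created.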