Date Writing Issue in Spark 3

2022-10-17 10:04:02

Problem Description

Recently, while using a Spark SQL-based ETL job to write data into the Hive ODS layer, I hit the following exception:

Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: 
writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet INT96 files can be dangerous, 
as the files may be read by Spark 2.x or legacy versions of Hive later, 
which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. 
See more details in SPARK-31404. 
You can set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, 
to get maximum interoperability. Or set spark.sql.legacy.parquet.int96RebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, 
if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar.
...

Problem Analysis

Spark 3.0 changed how dates and timestamps are handled: it uses the Proleptic Gregorian calendar, whereas Spark 2.x and legacy Hive use a hybrid Julian+Gregorian calendar.
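The practical difference between the two calendars is easy to demonstrate on a plain JVM, with no Spark involved. A minimal sketch (runnable in spark-shell or any Scala REPL; the exact printed Date output depends on the JVM time zone):

import java.time.LocalDate
import java.util.GregorianCalendar

// java.time follows the Proleptic Gregorian calendar, like Spark 3.0+:
// 1582-10-10 is a valid date.
println(LocalDate.of(1582, 10, 10)) // 1582-10-10

// java.util.GregorianCalendar follows the hybrid Julian+Gregorian calendar,
// like Spark 2.x and legacy Hive: 1582-10-05 through 1582-10-14 do not exist,
// and in (default) lenient mode the same field values roll forward past the cutover.
val cal = new GregorianCalendar(1582, 9, 10) // month is 0-based: 9 = October
println(cal.getTime) // e.g. Wed Oct 20 00:00:00 ... 1582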

Because the two calendars disagree on old dates, reading or writing files in Spark 2.x's legacy format can cause trouble.

In particular, writing timestamps before 1900-01-01T00:00:00Z (or dates before 1582-10-15) into Parquet INT96 files fails with the exception above by default.
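The following is a minimal reproduction sketch, assuming a Spark 3.1+ spark-shell session with default settings (Parquet timestamps are written as INT96, and int96RebaseModeInWrite defaults to EXCEPTION); the output path is a placeholder:

import java.sql.Timestamp
import spark.implicits._

val df = Seq(Timestamp.valueOf("1880-01-01 00:00:00")).toDF("ts")

// Throws SparkUpgradeException: the value predates 1900-01-01T00:00:00Z,
// so Spark refuses to write it as Parquet INT96 without an explicit rebase mode.
df.write.parquet("/tmp/int96_rebase_demo")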

Solution

This can be handled by setting the spark.sql.legacy.parquet.int96RebaseModeInWrite property in Spark. It accepts three values:

  • EXCEPTION (default): Spark fails the write when it sees old dates/timestamps that are ambiguous between the two calendars.
  • LEGACY: when writing Parquet files, Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar.
  • CORRECTED: Spark writes the dates/timestamps as-is, without any rebasing.

In code, the settings look like this:

spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
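Note that spark.conf.set only affects the current session and must run before the offending write is executed. The read-side property, int96RebaseModeInRead, matters when Spark 3 later reads Parquet files that were written with the legacy calendar (e.g., by Spark 2.x or old versions of Hive).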

Or, equivalently, as interpreter properties (here in a Zeppelin %spark.conf paragraph):

%spark.conf
spark.sql.legacy.parquet.int96RebaseModeInWrite   LEGACY
spark.sql.legacy.parquet.int96RebaseModeInRead   LEGACY
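If the job is launched with spark-submit, the same properties can also be passed at submit time; a sketch, where the JAR name is a placeholder:

spark-submit \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=LEGACY \
  your-etl-job.jar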

《Spark原理深入与编程实战》