You need to add the Scala 3 dependency that provides Spark codecs (encoders)
https://github.com/vincenzobaz/spark-scala3
libraryDependencies += "io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"
and use the Scala 3 import
import scala3encoders.given
instead of the Scala 2 ones
import spark.implicits._ // FAILS
import spark.sqlContext.implicits._ // FAILS
Scala Spark Encoders.product[X] (where X is a case class) keeps giving me "No TypeTag available for X" error
Regarding BigInt, see "Does Spark support BigInteger type?":
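Spark does support Java BigIntegers, but possibly with some loss of precision. If the numerical value of the BigInteger fits in a long (i.e. between -2^63 and 2^63-1), then Spark stores it as LongType. Otherwise it is stored as DecimalType, but this type only supports 38 digits of precision.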
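The rule quoted above can be sketched in plain Scala 3 without any Spark dependency. This is only an illustration of the two thresholds (the long range and 38 decimal digits); the names `Storage` and `storageFor` are hypothetical and are not part of Spark or spark-scala3.

```scala
// Hypothetical classification mirroring the quoted rule; not Spark's own code.
enum Storage:
  case AsLong, AsDecimal38, TooLarge

def storageFor(n: BigInt): Storage =
  // isValidLong is true exactly when -2^63 <= n <= 2^63-1
  if n.isValidLong then Storage.AsLong
  // abs drops the sign, so the string length is the number of decimal digits
  else if n.abs.toString.length <= 38 then Storage.AsDecimal38
  // more than 38 digits does not fit into a decimal of precision 38
  else Storage.TooLarge
```

For instance, `storageFor(BigInt(2).pow(63) - 1)` falls into the long range, while `storageFor(BigInt(2).pow(63))` is one past it and needs the decimal representation.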
The correct codecs for relatively small BigInts (fitting into LongType) are
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, LongType, ObjectType}
given Deserializer[BigInt] with
  // The column is stored in Spark as LongType
  def inputType: DataType = LongType
  // Rebuild the BigInt by calling BigInt.apply on the stored long
  def deserialize(path: Expression): Expression =
    StaticInvoke(
      BigInt.getClass,
      ObjectType(classOf[BigInt]),
      "apply",
      path :: Nil,
      returnNullable = false
    )

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  // Calls longValue on the BigInt; a value outside the long range would be truncated
  def serialize(inputObject: Expression): Expression =
    Invoke(inputObject, "longValue", LongType, returnNullable = false)
import scala3encoders.given
https://github.com/DmytroMitin/spark_stackoverflow/blob/87ef5361dd3553f8cc5ced26fed4c17c0061d6a2/src/main/scala/main.scala
(https://github.com/databricks/Spark-The-Definitive-Guide)
https://github.com/yashwanthreddyg/spark_stackoverflow/pull/1
https://gist.github.com/DmytroMitin/3c0fe6983a254b350ff9feedbb066bef
https://github.com/vincenzobaz/spark-scala3/pull/22
For large BigInts (not fitting into LongType, when DecimalType is necessary) the codecs are
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, DataTypes, Decimal, ObjectType}
val decimalType = DataTypes.createDecimalType(38, 0)
given Deserializer[BigInt] with
  // The column is stored in Spark as DecimalType(38, 0)
  def inputType: DataType = decimalType
  // Calls toScalaBigInt on the stored Decimal
  def deserialize(path: Expression): Expression =
    Invoke(path, "toScalaBigInt", ObjectType(classOf[scala.math.BigInt]), returnNullable = false)

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  // Wraps the BigInt in a Spark Decimal via Decimal.apply
  def serialize(inputObject: Expression): Expression =
    StaticInvoke(
      Decimal.getClass,
      decimalType,
      "apply",
      inputObject :: Nil,
      returnNullable = false
    )
import scala3encoders.given
which is almost the same as
import org.apache.spark.sql.catalyst.DeserializerBuildHelper.createDeserializerForScalaBigInt
import org.apache.spark.sql.catalyst.SerializerBuildHelper.createSerializerForScalaBigInt
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.{DataType, DataTypes, ObjectType}
val decimalType = DataTypes.createDecimalType(38, 0)
given Deserializer[BigInt] with
  def inputType: DataType = decimalType
  def deserialize(path: Expression): Expression =
    createDeserializerForScalaBigInt(path)

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  def serialize(inputObject: Expression): Expression =
    createSerializerForScalaBigInt(inputObject)
import scala3encoders.given
https://gist.github.com/DmytroMitin/8124d2a4cd25c8488c00c5a32f244f64
The runtime exception you observed means that the BigInts in your parquet file are relatively small (they fit into LongType) and you tried my codecs for large BigInts (DecimalType):
https://gist.github.com/DmytroMitin/ad77677072c1d8d5538c94cb428c8fa4 (ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java': A method named "toScalaBigInt" is not declared in any enclosing class nor any supertype, nor through a static import)
And vice versa, for large BigInts (DecimalType) and the codecs for small BigInts (LongType): https://gist.github.com/DmytroMitin/3a3a61082fbfc12447f6e926fc45c7cd (ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java': No applicable constructor/method found for actual parameters "org.apache.spark.sql.types.Decimal"; candidates are: ...)
We can't have both codecs, for LongType and for DecimalType, at the same time: https://gist.github.com/DmytroMitin/32040a6b702fff5c53c727616b318cb5 (Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags. The input types found are LongType DecimalType(38,0))
For a mixture of small and large BigInts, the correct choice is the codecs for DecimalType
https://gist.github.com/DmytroMitin/626e09a63a387e6ff1d7fe264fc14d6b
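When the values are available in memory before a Dataset is built, the choice between the two codec sets could be made by inspecting them up front. The following is a minimal sketch in plain Scala 3; `ColumnCodec` and `chooseCodec` are hypothetical names introduced here for illustration, and the function only looks at the values. For data read from files, the stored column type decides, as the errors above show.

```scala
// Hypothetical helper: which of the two codec sets suits a collection of values.
enum ColumnCodec:
  case LongCodec, DecimalCodec

def chooseCodec(values: Iterable[BigInt]): ColumnCodec =
  // A single value outside the long range forces the decimal codecs for all values,
  // because only one of the two codec sets can be in scope.
  if values.forall(_.isValidLong) then ColumnCodec.LongCodec
  else ColumnCodec.DecimalCodec
```

A collection such as `Seq(BigInt(1), BigInt(2).pow(70))` mixes small and large values, so `chooseCodec` selects `ColumnCodec.DecimalCodec` for it.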
The approach with manually created TypeTags seems to work too (without using scala3encoders)
// libraryDependencies += scalaOrganization.value % "scala-reflect" % "2.13.10" // in Scala 3
import scala.reflect.api
import scala.reflect.runtime.universe.{NoType, Type, TypeTag, internal}
import scala.reflect.runtime.universe
inline def createTypeTag[T](mirror: api.Mirror[_ <: api.Universe with Singleton], tpe: mirror.universe.Type): mirror.universe.TypeTag[T] = {
  mirror.universe.TypeTag.apply[T](mirror.asInstanceOf[api.Mirror[mirror.universe.type]],
    new api.TypeCreator {
      override def apply[U <: api.Universe with Singleton](m: api.Mirror[U]): m.universe.Type = {
        tpe.asInstanceOf[m.universe.Type]
      }
    }
  )
}
val rm = universe.runtimeMirror(this.getClass.getClassLoader)
// val bigIntTpe = internal.typeRef(internal.typeRef(NoType, rm.staticPackage("scala.math"), Nil), rm.staticClass("scala.math.BigInt"), Nil)
// val strTpe = internal.typeRef(internal.typeRef(NoType, rm.staticPackage("java.lang"), Nil), rm.staticClass("java.lang.String"), Nil)
val flightTpe = internal.typeRef(NoType, rm.staticClass("Flight"), Nil)
// given TypeTag[BigInt] = createTypeTag[BigInt](rm, bigIntTpe)
// given TypeTag[String] = createTypeTag[String](rm, strTpe)
given TypeTag[Flight] = createTypeTag[Flight](rm, flightTpe)
import spark.implicits._
https://gist.github.com/DmytroMitin/bb0ccd5f1c533b2baec1756da52f8824