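I am trying to programmatically enforce a schema (JSON) on a textFile whose lines look like JSON. I tried jsonFile, but to create a DataFrame from a list of JSON files Spark has to make one pass over the data to infer the schema. That means parsing all of the data, which takes a long time (4 hours, since my data is compressed and terabytes in size). So I would like to read it as a text file instead and enforce a schema that picks out only the fields I am interested in, so that I can query the resulting DataFrame later. I am not sure how to map the schema onto the input, though. Can someone point me to a reference on how to map a schema onto JSON-like input?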
input :
This is the full schema:
records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>]
But I am only interested in a few fields, such as:
res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":false,"rating":0,"originalRating":1}]
val records = sc.textFile("s3://testData/sample.json.gz")
val schema = StructType(Array(StructField("requestId",StringType,true),
  StructField("siteName",StringType,true),
  StructField("model",StringType,true),
  StructField("pageType",StringType,true),
  StructField("inputs", ArrayType(
    StructType(
      StructField("inputType",StringType,true),
      StructField("originalRating",LongType,true),
      StructField("processed",BooleanType,true),
      StructField("rating",LongType,true),
      StructField("methodId",StringType,true)
    ),true),true)))
val rowRDD = ??
val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")
sql("select * from input").foreach(println)
Is there any way to map this? Or do I need to use some kind of JSON parser? Because of these constraints, I want to use textFile.
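For reference, here is a rough sketch of how the textFile lines might be mapped onto a schema without building a rowRDD by hand. It assumes a spark-shell session on Spark 1.4 or later, where `sc` and `sqlContext` are already in scope, and it relies on `DataFrameReader.json` accepting an `RDD[String]`. The names `inputStruct`, `wantedSchema`, `lines` and `df` are made up for illustration, and a one-line `parallelize` stands in for the real S3 file.

```scala
import org.apache.spark.sql.types._

// Assumption: running inside spark-shell (Spark 1.4+), so `sc` and `sqlContext` exist.
// Only the fields of interest are declared; other keys in the JSON are simply not read.
val inputStruct = StructType(Seq(
  StructField("inputType", StringType, true),
  StructField("processed", BooleanType, true),
  StructField("rating", LongType, true),
  StructField("methodId", StringType, true)))

val wantedSchema = StructType(Seq(
  StructField("requestId", StringType, true),
  StructField("siteName", StringType, true),
  StructField("inputs", ArrayType(inputStruct, true), true)))

// Stand-in for sc.textFile("s3://testData/sample.json.gz"): one JSON document per line.
val lines = sc.parallelize(Seq(
  """{"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0}]}"""))

// Each line is parsed against the supplied schema, so no separate inference pass is needed.
val df = sqlContext.read.schema(wantedSchema).json(lines)
df.registerTempTable("input")
sqlContext.sql("select requestId, siteName from input").collect().foreach(println)
```

With the real data, `lines` would be the RDD returned by `sc.textFile(...)`. On Spark 1.3, `sqlContext.jsonRDD(lines, wantedSchema)` should play the same role.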
Tried:
val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")
But I keep getting the error:
<console>:37: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
^
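The error above lists the three `StructType.apply` overloads (an `Array`, a `java.util.List`, or a `Seq` of `StructField`). None of them takes the fields as separate arguments, which is how the struct inside `ArrayType(...)` is written. A minimal sketch of the schema with that inner struct wrapped in `Array(...)` follows. The name `inputElement` is only for illustration, and the field list is copied from the line quoted in the error.

```scala
import org.apache.spark.sql.types._

// StructType.apply accepts an Array, a Seq, or a java.util.List of StructField,
// so the struct describing each element of `inputs` is wrapped in Array(...).
val inputElement = StructType(Array(
  StructField("inputType", StringType, true),
  StructField("originalRating", LongType, true),
  StructField("processed", BooleanType, true),
  StructField("rating", LongType, true),
  StructField("score", DoubleType, true),
  StructField("methodId", StringType, true)))

val schema = StructType(Array(
  StructField("requestId", StringType, true),
  StructField("siteName", StringType, true),
  StructField("model", StringType, true),
  StructField("pageType", StringType, true),
  StructField("inputs", ArrayType(inputElement, true), true)))

// Print the nested structure to check it matches the intended shape.
println(schema.treeString)
```

With a schema built this way, the `sqlContext.read.schema(schema).json(...)` call should at least get past the compiler. Because the schema is supplied up front, Spark should not need to scan the data to infer one.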