I can't seem to write Parquet from a JavaRDD&lt;T&gt; where T is, say, a Person class, which I define as:
public class Person implements Serializable
{
private static final long serialVersionUID = 1L;
private String name;
private String age;
private Address address;
....
with Address:
public class Address implements Serializable
{
private static final long serialVersionUID = 1L;
private String City;
private String Block;
...<getters and setters>
Then I create a JavaRDD like this:
JavaRDD<Person> people = sc.textFile("/user/johndoe/spark/data/people.txt").map(new Function<String, Person>()
{
public Person call(String line)
{
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge("2");
Address address = new Address("HomeAdd","141H");
person.setAddress(address);
return person;
}
});
Note: I am setting the Address manually, the same for every person, so this is essentially a nested RDD. When I try to save it as a Parquet file:
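To isolate the parsing step, here is a minimal, Spark-free sketch of what the map function above does to a single input line. The Person/Address shapes mirror the classes in this post; the sample line "John,30" and the field accessors are assumptions, since the post shows neither the input file nor the getters:

```java
public class ParseDemo {
    static class Address {
        final String city;
        final String block;
        Address(String city, String block) { this.city = city; this.block = block; }
    }

    static class Person {
        String name;
        String age;
        Address address;
    }

    // Same logic as the call(String line) body in the question.
    static Person parse(String line) {
        String[] parts = line.split(",");
        Person person = new Person();
        person.name = parts[0];
        person.age = "2";                                // hard-coded, as in the post
        person.address = new Address("HomeAdd", "141H"); // same for everyone
        return person;
    }

    public static void main(String[] args) {
        Person p = parse("John,30"); // hypothetical people.txt line
        System.out.println(p.name + "," + p.age + "," + p.address.city + "," + p.address.block);
    }
}
```

The parsing itself is fine; the failure only shows up once Spark tries to turn the nested bean into a Row.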
DataFrame dfschemaPeople = sqlContext.createDataFrame(people, Person.class);
dfschemaPeople.write().parquet("/user/johndoe/spark/data/out/people.parquet");
The Address class is:
import java.io.Serializable;
public class Address implements Serializable
{
public Address(String city, String block)
{
super();
City = city;
Block = block;
}
private static final long serialVersionUID = 1L;
private String City;
private String Block;
//Omitting getters and setters
}
I get this error:
Caused by: java.lang.ClassCastException: com.test.schema.Address cannot be cast to org.apache.spark.sql.Row
I am running Spark 1.4.1.
- Is this a known bug?
- If I do the same thing by importing a nested JSON file of the same shape, I can save it to Parquet.
- Even if I create a sub-DataFrame such as:
DataFrame dfSubset = sqlContext.sql("SELECT address.city FROM PersonTable");
I still get the same error.
So what gives? How can I read a complex data structure from a text file and save it as Parquet? It seems I can't.
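For comparison, the JSON path that does work can be sketched like this (the people.json path is an assumption; I'm assuming a file with the same nested shape):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JsonToParquet").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Reading nested JSON infers a struct schema, so the nested address
        // is already represented as a Row internally and the write succeeds.
        DataFrame df = sqlContext.read().json("/user/johndoe/spark/data/people.json");
        df.printSchema(); // address shows up as a struct, not a bean
        df.write().parquet("/user/johndoe/spark/data/out/people.parquet");
        sc.stop();
    }
}
```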
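For what it's worth, if I read the Spark 1.4 SQL programming guide correctly, bean-based schema inference does not support nested JavaBeans, which would match the ClassCastException above. One workaround I've seen sketched is to map each bean to a Row against an explicit StructType, so the nested address is itself a Row; the getter names here are assumptions, and `people` and `sqlContext` are as defined earlier:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema: address is declared as a nested struct.
StructType addressType = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("city", DataTypes.StringType, true),
        DataTypes.createStructField("block", DataTypes.StringType, true)));
StructType personType = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.StringType, true),
        DataTypes.createStructField("address", addressType, true)));

// Map each bean to a Row, nesting the address as its own Row.
JavaRDD<Row> rows = people.map(new Function<Person, Row>() {
    public Row call(Person p) {
        return RowFactory.create(p.getName(), p.getAge(),
                RowFactory.create(p.getAddress().getCity(),
                                  p.getAddress().getBlock()));
    }
});

DataFrame df = sqlContext.createDataFrame(rows, personType);
df.write().parquet("/user/johndoe/spark/data/out/people.parquet");
```

Can anyone confirm whether this is the intended way around the limitation, or whether a bean-only path exists?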