Sequence File SerDe issue: org.apache.hadoop.io.Text appearing where they shouldn't be

James Kebinger Mon, 11 Mar 2013 18:26:53 -0700

Hello everyone, I have an odd problem I'm trying to track down. I have a
sequence file of Protobufs wrapped in Writables, where we write the
length of the protobuf byte array as an int and then write the byte array.
There are two different base classes of these wrappers, so my SerDe tests
for those. There shouldn't be anything else, but for some files the code
falls through both checks and finds an instance of Text with nothing in it.
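
For anyone unfamiliar with the framing described above, here is a minimal sketch of it in plain Java, with no Hadoop dependency. LengthPrefixedPayload and its methods are hypothetical stand-ins for illustration, not the real wrapper classes; the only thing taken from the description is "a 4-byte int length, then the bytes".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the Writable wrappers; illustrates the framing only.
class LengthPrefixedPayload {

    // Writes a 4-byte big-endian length followed by the raw payload bytes,
    // the same layout DataOutput.writeInt + write produce inside a Writable.
    static byte[] write(byte[] payload) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the length prefix, then exactly that many payload bytes.
    static byte[] read(byte[] framed) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] framed = write(new byte[] {0x08, 0x01});
        // 4-byte length prefix plus 2 payload bytes
        System.out.println(framed.length); // prints 6
    }
}
```

In the real classes the payload would be the serialized protobuf message, and the two methods would correspond to the Writable's write and readFields.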


I've run the sequence files that Hive is choking on through a regular
MapReduce job that just dumps the data, and it completes without any
complaint, so I can't figure out:
a) where this Text instance is coming from, and
b) whether there is a way to ignore a row in a SerDe so that I can handle
this error and move on.

I'm using CDH 4.2 with Hive 0.10.0. An outline of the deserialize
method and my table definition are below.

Thanks in advance for any pointers anyone can provide.

@Override
public Object deserialize(Writable blob) throws SerDeException {
    if (blob instanceof HadoopProto) {
        return ProtoBuffObjectInspectorFactory.toStruct(((HadoopProto<?, ?>) blob).get());
    } else if (blob instanceof HadoopActivity) {
        return ProtoBuffObjectInspectorFactory.toStruct(((HadoopActivity) blob).get());
    } else {
        LOG.info(String.format("field is instance of %s with value %s returning null",
                blob.getClass(), blob.toString()));
    }
    List<Object> row = Lists.newArrayList();
    row.add(null);
    return row;
}
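
One variation on the fallback branch, sketched here as an untested idea: build a row with one null per declared column instead of a single null. This assumes the object inspector expects the row list to line up with the five non-partition columns in the table definition; I have not confirmed that against Hive 0.10.0. NullRows and nullRow are hypothetical names, not part of Hive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Hive or of the SerDe in this post.
class NullRows {

    // Returns a mutable list holding columnCount null entries.
    static List<Object> nullRow(int columnCount) {
        return new ArrayList<Object>(Collections.<Object>nCopies(columnCount, null));
    }

    public static void main(String[] args) {
        // The table in this post declares five non-partition columns.
        List<Object> row = nullRow(5);
        System.out.println(row.size()); // prints 5
    }
}
```

Rows built this way would still reach the query as all-null rows, so they would need to be filtered out there (for example with a WHERE clause on a column that is never null in good data).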



The table definition:

CREATE EXTERNAL TABLE blogupdates (
  timestampX BIGINT,
  portal_id INT,
  index_url STRING,
  post_data ARRAY<STRUCT<url: STRING, author: STRING, published_date: BIGINT, comment_count: INT>>,
  platform STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'com.hubspot.hadoop.HadoopProtoSerde'
WITH SERDEPROPERTIES ("serialization.class" =
  "com.hubspot.externalblogs.data.ExternalBlogsProtos$BlogDataWalMessage")
STORED AS SEQUENCEFILE;
