Hello everyone, I have an odd problem I'm trying to track down. I have a sequence file with Protobufs wrapped in writables, where we write the length of the protobuf byte array as an int and then write the byte array. I have two different base classes of these, and so I test for those; there shouldn't be anything else, but in some files I fall through that code and find an instance of Text, with nothing in it.
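To make the framing concrete, here is a rough sketch of the length-prefixed layout described above. It is not our actual Writable code: the class name LengthPrefixedFraming and its frame/unframe methods are made up for illustration, and it uses java.nio.ByteBuffer (big-endian by default, the same byte order as DataOutput.writeInt) rather than the Hadoop Writable interface.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the record layout: a 4-byte int length
// followed by exactly that many bytes of serialized protobuf.
final class LengthPrefixedFraming {
    private LengthPrefixedFraming() {}

    // Prefix the payload with its length as a big-endian int.
    static byte[] frame(byte[] payload) {
        return ByteBuffer.allocate(Integer.BYTES + payload.length)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    // Read the length prefix, then return exactly that many payload bytes.
    static byte[] unframe(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        if (buf.remaining() < Integer.BYTES) {
            throw new IllegalArgumentException("missing length prefix");
        }
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining()) {
            throw new IllegalArgumentException("bad length prefix: " + length);
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload;
    }
}
```

In this sketch a record with an empty payload is still a valid frame: four bytes holding a length of zero and nothing after it.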
I've run the sequence files that Hive is choking on through a regular MapReduce job that just dumps the data, and it runs without any complaint, so I can't figure out a) where this Text instance is coming from, or b) whether there is a way to ignore a row in a SerDe so that I can handle this error and move on. I'm using CDH 4.2 with Hive 0.10.0. I'll paste an outline of the deserialize method and my table definition below. Thanks in advance for any pointers anyone can provide.

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        if (blob instanceof HadoopProto) {
            return ProtoBuffObjectInspectorFactory.toStruct(((HadoopProto<?, ?>) blob).get());
        } else if (blob instanceof HadoopActivity) {
            return ProtoBuffObjectInspectorFactory.toStruct(((HadoopActivity) blob).get());
        } else {
            LOG.info(String.format("field is instance of %s with value %s returning null",
                    blob.getClass(), blob.toString()));
        }
        List<Object> row = Lists.newArrayList();
        row.add(null);
        return row;
    }

The table definition:

    CREATE EXTERNAL TABLE blogupdates (
        timestampX BIGINT,
        portal_id INT,
        index_url STRING,
        post_data ARRAY<STRUCT<url: STRING, author: STRING, published_date: BIGINT, comment_count: INT>>,
        platform STRING
    )
    PARTITIONED BY (year INT, month INT, day INT)
    ROW FORMAT SERDE 'com.hubspot.hadoop.HadoopProtoSerde'
    WITH SERDEPROPERTIES (
        "serialization.class" = "com.hubspot.externalblogs.data.ExternalBlogsProtos$BlogDataWalMessage"
    )
    STORED AS SEQUENCEFILE;