Hi, I have a lot of tweets saved as text files. I created an external table on top of them to access them as a textfile table. I need to convert these to SequenceFiles, with each tweet as its own record. To do this, I created another table stored as a SequenceFile, like so:
    CREATE EXTERNAL TABLE tweetseq ( tweet STRING )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
    STORED AS SEQUENCEFILE
    LOCATION '/user/hdfs/tweetseq';

Now when I insert into this table from my original tweets table, each line gets its own record as expected. This is great. However, I don't have any record ids here. Short of writing my own UDF to make that happen, are there any obvious solutions I am missing?

PS: I need the ids to be there because Mahout's seq2sparse expects them. Without ids, it fails with:

    java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

Regards,
S
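PPS: The fallback I'm considering, short of a UDF, is to pre-process the raw text and prepend a synthetic id to each tweet before loading it into a two-column (id, tweet) table. A rough sketch of that idea (the "tweet-N" id scheme and the sample input are just placeholders, not anything Hive or Mahout prescribes):

```python
# Sketch: assign each tweet a synthetic id so every record has a key.
# The "tweet-N" id format is a placeholder; any unique string would do.
def add_ids(lines, prefix="tweet"):
    """Yield (id, tweet) pairs, one per non-empty input line."""
    for n, line in enumerate(lines):
        tweet = line.rstrip("\n")
        if tweet:
            yield ("%s-%d" % (prefix, n), tweet)

if __name__ == "__main__":
    # Placeholder sample standing in for the raw tweet file.
    sample = ["first tweet\n", "second tweet\n"]
    for tweet_id, text in add_ids(sample):
        # Comma-delimited to match the '\054' field terminator above.
        print("%s,%s" % (tweet_id, text))
```

The resulting id column could then serve as the key seq2sparse wants, though I'd rather not add an extra pass over the data if Hive can do it.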