Hi, Akhil. Thank you for your reply. I tried what you suggested, but it produced the following error.
The source code was:

    JavaPairRDD<LongWritable,Text> distFile = sc.hadoopFile(
            "hdfs://cMaster:9000/wcinput/data.txt",
            DataInputFormat.class, LongWritable.class, Text.class);

where the DataInputFormat class is defined as follows:

    class DataInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException {
            return new DataRecordReader();
        }
    }

The DataRecordReader class is derived from RecordReader<LongWritable, Text>.
I then got the following error:

[image: inline image 1]

Then I changed the source code to the following, and it seems to work. Thank you again!

    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
            "hdfs://cMaster:9000/wcinput/data.txt",
            DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

2015-06-23 15:40 GMT+08:00 Akhil Das <ak...@sigmoidanalytics.com>:

> Did you happen to try this?
>
>     JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile(
>             "/sigmoid", DataInputFormat.class, LongWritable.class,
>             Text.class)
>
> Thanks
> Best Regards
>
> On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 <yadanfu1...@gmail.com> wrote:
>
>> Hello, everyone! I'm new to Spark. I have already written programs for
>> Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I
>> want to port my code to Spark using Java. The first problem I encountered
>> is how to turn a big txt file in local storage into an RDD in a way that
>> is compatible with my Hadoop program. I found functions in SparkContext
>> that may be helpful, but I don't know how to use them. E.g.:
>>
>>     public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
>>     RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
>>                                             Class<F> fClass,
>>                                             Class<K> kClass,
>>                                             Class<V> vClass,
>>                                             org.apache.hadoop.conf.Configuration conf)
>>
>> (see
>> http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)
>>
>> Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
>> and extra configuration options to pass to the input format.
>>
>> '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> In Java, the following is wrong:
>>
>>     ///// option one
>>     Configuration confHadoop = new Configuration();
>>     JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
>>             "hdfs://cMaster:9000/wcinput/data.txt",
>>             DataInputFormat, LongWritable, Text, confHadoop);
>>
>>     ///// option two
>>     Configuration confHadoop = new Configuration();
>>     DataInputFormat input = new DataInputFormat();
>>     LongWritable longType = new LongWritable();
>>     Text text = new Text();
>>     JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
>>             "hdfs://cMaster:9000/wcinput/data.txt",
>>             input, longType, text, confHadoop);
>>
>> Can anyone help me? Thank you so much.
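
For reference, below is a minimal sketch of the working new-API setup as a
complete program, assuming the DataInputFormat and DataRecordReader classes
described above are on the classpath (the class name NewApiHadoopFileExample
and the variable names are illustrative). The mapToPair step copies each
record out of Hadoop's re-used Writable objects before caching, as the
Javadoc note quoted above recommends.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class NewApiHadoopFileExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("NewApiHadoopFileExample");
            JavaSparkContext sc = new JavaSparkContext(conf);
            Configuration confHadoop = new Configuration();

            // newAPIHadoopFile expects Class objects (not instances), and the
            // InputFormat must extend org.apache.hadoop.mapreduce.InputFormat;
            // hadoopFile is only for the old org.apache.hadoop.mapred API.
            JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
                    "hdfs://cMaster:9000/wcinput/data.txt",
                    DataInputFormat.class, LongWritable.class, Text.class,
                    confHadoop);

            // Hadoop's RecordReader re-uses the same Writable objects for every
            // record, so copy into immutable Java types before caching or shuffling.
            JavaPairRDD<Long, String> copied = distFile.mapToPair(
                    new PairFunction<Tuple2<LongWritable, Text>, Long, String>() {
                        @Override
                        public Tuple2<Long, String> call(Tuple2<LongWritable, Text> record) {
                            return new Tuple2<>(record._1().get(), record._2().toString());
                        }
                    });
            copied.cache();

            System.out.println("records: " + copied.count());
            sc.stop();
        }
    }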