Hello, everyone! I'm new to Spark. I have already written programs on Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I want to port that code to Spark using Java. The first problem I ran into is how to turn a big txt file into an RDD that is compatible with the code I wrote for Hadoop. I found functions in SparkContext that look helpful, but I don't know how to use them. For example:
One candidate is SparkContext.newAPIHadoopFile. From the Spark 1.4.0 Javadoc (the return type is an RDD, see http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html):

    public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
    RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path, Class<F> fClass,
        Class<K> kClass, Class<V> vClass, org.apache.hadoop.conf.Configuration conf)

"Get an RDD for a given Hadoop file with an arbitrary new API InputFormat and extra configuration options to pass to the input format. Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function."

In Java, both of my attempts below are wrong.

Option one:

    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat, LongWritable, Text, confHadoop);

Option two:

    Configuration confHadoop = new Configuration();
    DataInputFormat input = new DataInputFormat();
    LongWritable longType = new LongWritable();
    Text text = new Text();
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        input, longType, text, confHadoop);

Can anyone help me? Thank you so much.
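Update: from re-reading the Javadoc, my guess is that the fClass/kClass/vClass parameters expect Class objects (.class literals) rather than type names or instances, and that on the Java side the call goes through JavaSparkContext rather than SparkContext. Here is a sketch of what I think the call should look like, assuming DataInputFormat is my own class extending the new-API FileInputFormat<LongWritable, Text>; please correct me if this is still off:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf().setAppName("CustomInputFormatTest");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration confHadoop = new Configuration();

    // Pass Class objects for the InputFormat and the key/value types,
    // not instances of those classes.
    JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
        "hdfs://cMaster:9000/wcinput/data.txt",
        DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

Given the note above about the RecordReader re-using Writable objects, I assume I would also need to copy the key/value objects with a map before caching or shuffling the result.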