Hi, Akhil. Thank you for your reply. I tried what you suggested, but it produced the following error.
The source code was:

    JavaPairRDD<LongWritable,Text> distFile = sc.hadoopFile(
            "hdfs://cMaster:9000/wcinput/data.txt",
            DataInputFormat.class, LongWritable.class, Text.class);

where the DataInputFormat class is defined as follows:

    class DataInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException {
            return new DataRecordReader();
        }
    }

The DataRecordReader class is derived from RecordReader<LongWritable, Text>.
I then got the following error:

[image: inline image 1]

Then I changed the source code to the following, and it seems to work. Thank you again!

    Configuration confHadoop = new Configuration();
    JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
            "hdfs://cMaster:9000/wcinput/data.txt",
            DataInputFormat.class, LongWritable.class, Text.class, confHadoop);

2015-06-23 15:40 GMT+08:00 Akhil Das <ak...@sigmoidanalytics.com>:

> Did you happen to try this?
>
>     JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile(
>             "/sigmoid", DataInputFormat.class, LongWritable.class,
>             Text.class)
>
> Thanks
> Best Regards
>
> On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 <yadanfu1...@gmail.com> wrote:
>
>> Hello, everyone! I'm new to Spark. I have already written programs for
>> Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I
>> want to port my code to Spark using Java. The first problem I encountered
>> is how to turn a big txt file in local storage into an RDD in a way that
>> is compatible with my Hadoop program. I found functions in SparkContext
>> that may be helpful, but I don't know how to use them. E.g.:
>>
>>     public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
>>     RDD<scala.Tuple2<K,V>> newAPIHadoopFile(String path,
>>                                             Class<F> fClass,
>>                                             Class<K> kClass,
>>                                             Class<V> vClass,
>>                                             org.apache.hadoop.conf.Configuration conf)
>>
>> (see
>> http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/RDD.html)
>>
>> Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
>> and extra configuration options to pass to the input format.
>>
>> '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> In Java, the following is wrong:
>>
>>     ///// option one
>>     Configuration confHadoop = new Configuration();
>>     JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
>>             "hdfs://cMaster:9000/wcinput/data.txt",
>>             DataInputFormat, LongWritable, Text, confHadoop);
>>
>>     ///// option two
>>     Configuration confHadoop = new Configuration();
>>     DataInputFormat input = new DataInputFormat();
>>     LongWritable longType = new LongWritable();
>>     Text text = new Text();
>>     JavaPairRDD<LongWritable,Text> distFile = sc.newAPIHadoopFile(
>>             "hdfs://cMaster:9000/wcinput/data.txt",
>>             input, longType, text, confHadoop);
>>
>> Can anyone help me? Thank you so much.
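
For reference, below is a minimal sketch of the working new-API setup as a
complete program, assuming the DataInputFormat and DataRecordReader classes
described above are on the classpath (the class name NewApiHadoopFileExample
and the variable names are illustrative). The mapToPair step copies each
record out of Hadoop's re-used Writable objects before caching, as the
Javadoc note quoted above recommends.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    public class NewApiHadoopFileExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("NewApiHadoopFileExample");
            JavaSparkContext sc = new JavaSparkContext(conf);
            Configuration confHadoop = new Configuration();

            // newAPIHadoopFile expects Class objects (not instances), and the
            // InputFormat must extend org.apache.hadoop.mapreduce.InputFormat;
            // hadoopFile is only for the old org.apache.hadoop.mapred API.
            JavaPairRDD<LongWritable, Text> distFile = sc.newAPIHadoopFile(
                    "hdfs://cMaster:9000/wcinput/data.txt",
                    DataInputFormat.class, LongWritable.class, Text.class,
                    confHadoop);

            // Hadoop's RecordReader re-uses the same Writable objects for every
            // record, so copy into immutable Java types before caching or shuffling.
            JavaPairRDD<Long, String> copied = distFile.mapToPair(
                    new PairFunction<Tuple2<LongWritable, Text>, Long, String>() {
                        @Override
                        public Tuple2<Long, String> call(Tuple2<LongWritable, Text> record) {
                            return new Tuple2<>(record._1().get(), record._2().toString());
                        }
                    });
            copied.cache();

            System.out.println("records: " + copied.count());
            sc.stop();
        }
    }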