Hello Spark Users,

This is my first email to the Spark mailing list, and I am looking forward
to being part of it. I have been working on Solr and, in the past, have
used Java thread pooling to parallelize Solr indexing with SolrJ.

Now I am again working on indexing data, this time from JSON files (in the
hundreds of thousands). Before I try parallelizing the operations using
Spark (reading each JSON file and posting its content to Solr), I wanted to
confirm my understanding.


Reading the JSON files using wholeTextFiles and then posting the content to
Solr:

- Would this be roughly equivalent to what I achieve today with Java
multi-threading / thread pooling via the Executor framework?
- What additional advantages would I get by using Spark (less code...)?
- How can we parallelize/batch this further? For example, in my Java
multi-threaded version I parallelize not only the reading / data
acquisition but also the posting, in parallel batches (see the
foreachPartition sketch after the code snippet below).


Below is a code snippet to give you an idea of what I am thinking of
starting with. Please feel free to suggest corrections to my understanding
and to the code structure below.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");
JavaSparkContext sc = new JavaSparkContext(conf);

// each element is a (filePath, fileContent) pair
JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");

rdd.foreach(new VoidFunction<Tuple2<String, String>>() {

    @Override
    public void call(Tuple2<String, String> arg0) throws Exception {
        // post content to Solr
        String jsonContent = arg0._2();
        ...
        ...
    }
});
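To make the batching question more concrete, here is the kind of structure
I have in mind (just a sketch on my side, not tested): switch from foreach
to foreachPartition, so that each partition opens a single SolrJ client and
sends documents in batches rather than one request per file. The Solr URL,
the batch size, and the toSolrDoc() helper that maps JSON text to a
SolrInputDocument are placeholders I made up; the HttpSolrClient
construction also differs between SolrJ versions (Builder in 6+, a plain
constructor in 5.x).

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {

    @Override
    public void call(Iterator<Tuple2<String, String>> partition) throws Exception {
        // one client per partition, reused for every file in that partition
        // (SolrJ 6+; on SolrJ 5.x: new HttpSolrClient("http://localhost:8983/solr/mycollection"))
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        int batchSize = 1000; // placeholder batch size

        while (partition.hasNext()) {
            Tuple2<String, String> file = partition.next();
            // toSolrDoc() is a hypothetical helper mapping the JSON text to a SolrInputDocument
            batch.add(toSolrDoc(file._2()));
            if (batch.size() >= batchSize) {
                solr.add(batch); // one HTTP request for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
        solr.close();
    }
});

If I understand correctly, the partitions then run in parallel across the
local[8] threads (or executors on a cluster), which should give the same
two levels of parallelism as the thread-pool version: parallel reading and
parallel batched posting.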


Thanks,

Susheel
