Hello Spark Users,

This is my first email to the Spark mailing list, and I am looking forward to being part of it. I have been working on Solr, and in the past I have used Java thread pooling to parallelize Solr indexing with SolrJ.
Now I am again working on indexing data, this time from JSON files (on the order of 100 thousand of them). Before I try parallelizing the operation with Spark (reading each JSON file and posting its content to Solr), I wanted to confirm my understanding:

- Would reading the JSON files with wholeTextFiles and then posting the content to Solr be roughly equivalent to what I achieve today with Java multi-threading / thread pooling via the Executor framework?
- What additional advantages would I get from using Spark (less code, ...)?
- How can this be parallelized/batched further? For example, in my Java multi-threaded version I not only parallelize the reading / data acquisition but also post to Solr in batches, in parallel.

Below is a code snippet to give you an idea of what I am thinking of starting with. Please feel free to suggest corrections to my understanding and to the code structure.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");

rdd.foreach(new VoidFunction<Tuple2<String, String>>() {
    @Override
    public void call(Tuple2<String, String> arg0) throws Exception {
        // post content to Solr: arg0._1 is the file path, arg0._2 is the file content
        ...
        ...
    }
});

Thanks,
Susheel
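
P.S. On the batching question, below is a rough sketch of what I have in mind using foreachPartition, so that each partition collects documents and sends them to Solr in batches instead of one at a time. The Solr URL, field names, and batch size are just placeholders, and I am assuming SolrJ 6.x (HttpSolrClient.Builder), so please treat this as an untested sketch rather than working code:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
    @Override
    public void call(Iterator<Tuple2<String, String>> partition) throws Exception {
        // One SolrClient per partition, so each executor task reuses a single connection.
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
        List<SolrInputDocument> batch = new ArrayList<>();
        while (partition.hasNext()) {
            Tuple2<String, String> file = partition.next();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file._1());            // file path as the document id
            doc.addField("content_json", file._2());  // raw JSON; real field mapping TBD
            batch.add(doc);
            if (batch.size() >= 500) {                // flush every 500 docs (arbitrary size)
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);                          // flush the remainder
        }
        solr.commit();
        solr.close();
    }
});

Does this look like a reasonable way to get per-partition batching, or is there a more idiomatic Spark approach?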