Hi,

We have configured a total of *11 nodes*. Each node has 8 cores and 32 GB of RAM.

*Technologies and their versions:*
Apache Spark 1.5.2 and YARN: 6 nodes
DSE 4.7 [Cassandra 2.1.8 and Solr]: 5 nodes
HDFS (Hadoop version 2.7.1): 3 nodes

*Stack:*
3 separate nodes for HDFS
3 separate nodes for Spark + YARN
2 separate seed nodes for DSE Cassandra
3 nodes shared by Cassandra and Spark
Replication factor for both HDFS and Cassandra: 3
DSE Solr is used for indexing records in Cassandra.
The code is written in Java.

*Job flow:*
1. The driver program initializes Spark and Cassandra with the 2 seed nodes.
2. Fetch the JSON file from HDFS.
3. Run mapPartitions on the files, using a flatMap function to iterate over the data.
4. Each line of the file represents a record. In the flatMap function, we use Gson to convert the JSON to a POJO.
5. Invoke Solr HTTP GET requests based on the fields of the POJO. We invoke roughly 10 HTTP requests per POJO constructed in the previous step. Each HTTP request targets any one of the 5 Cassandra IPs, to distribute the GET request load across nodes.
6. These POJOs are collected in an ArrayList and returned to the driver.
7. We then invoke the mapToRow function to insert these RDDs into Cassandra.

*Queries:*
1. Deployment: from a deployment standpoint, does the technology stack on each node make sense?
2. How should we determine the partition size? We currently use the formula: size in MB / 16. Should we determine the number of cores, executors, and memory based on the data size or on the number of rows in the file?
3. TableWriter issue: while writing RDDs into Cassandra, the computation halts and takes longer to complete. We are using the YJP profiler to monitor these stats. How can we overcome this latency?
4. Are there any performance-related parameters in Spark, Cassandra, or Solr that would reduce the job time?

Any help in improving performance will be appreciated.

Thanks
--
Ashish Gadkari
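
For reference on query 2, the sizing rule "size in MB / 16" could be sketched in Java roughly as follows. This is only a minimal sketch under stated assumptions: it reads the formula as one partition per 16 MB of input, rounds up, and never returns fewer than one partition. The class and method names (PartitionSizing, partitionsFor) are hypothetical and are not taken from the actual job.

```java
public class PartitionSizing {
    // Assumption: "size in MB / 16" means one partition per 16 MB of input.
    static final long TARGET_PARTITION_MB = 16;

    // Hypothetical helper: derives a partition count from the input file size.
    static int partitionsFor(long fileSizeBytes) {
        long sizeMb = fileSizeBytes / (1024L * 1024L);
        // Round up so a trailing partial chunk still gets its own partition.
        long partitions = (sizeMb + TARGET_PARTITION_MB - 1) / TARGET_PARTITION_MB;
        // Always use at least one partition, even for very small files.
        return (int) Math.max(1, partitions);
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024L * 1024L;
        System.out.println(partitionsFor(oneGb)); // prints 64
    }
}
```

The resulting count could then be passed as the minPartitions argument of JavaSparkContext.textFile(path, minPartitions) when reading the file from HDFS. Whether 16 MB is a good target depends on the per-record work (here, roughly 10 Solr requests per record), so the constant is a starting point rather than a recommendation.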
