Hi,

In my Spark Streaming application I'm trying to partition a data stream into multiple substreams. I read data from a Kafka topic and process the records in real time; the data comes in through a JavaInputDStream created as a direct stream, and it is received without any loss. I need to partition the data I receive, so I called repartition(), but instead of dividing the records across partitions (i.e., hashing them into the number of partitions specified), it appears to duplicate the same set of data in every partition. Could anyone suggest a solution? A sample code snippet is shown below for reference.
//--------------------------------------------------------
JavaStreamingContext ssc = null;
System.setProperty("spark.executor.memory", "512m");
System.setProperty("spark.streaming.unpersist", "false");

Map<String, String> map = new HashMap<>();
map.put("spark.executor.memory", "512m");
map.put("spark.streaming.unpersist", "false");

try {
    ssc = new JavaStreamingContext("spark://host1:7077", "app",
            new Duration(1000), System.getenv("SPARK_HOME"),
            JavaStreamingContext.jarOfClass(Benchmark.class), map);
} catch (Exception e) {
    e.printStackTrace();
}

Map<String, Object> kafkaParams = new HashMap<>();
// ....

Set<String> topic = Collections.singleton("inputStream");
final JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
        KafkaUtils.createDirectStream(ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(topic, kafkaParams));

JavaDStream<Tuple> incrementStream1 = stream.map(incrementFunction2);
JavaDStream<Tuple> stream2 = incrementStream1.repartition(2); // duplicates data instead of splitting it
//--------------------------------------------------------
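
For comparison, here is a minimal sketch of the behaviour I expected, written out explicitly as keying each record and hash-partitioning by that key. The key extractor (record.key()) is only illustrative; the real key would come from my payload.

//--------------------------------------------------------
import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Key each Kafka record by its message key (illustrative choice).
JavaPairDStream<String, byte[]> keyed = stream.mapToPair(
        record -> new Tuple2<>(record.key(), record.value()));

// Explicitly hash-partition each batch RDD into 2 partitions,
// so records sharing a key land in the same partition.
JavaPairDStream<String, byte[]> hashed = keyed.transformToPair(
        rdd -> rdd.partitionBy(new HashPartitioner(2)));
//--------------------------------------------------------

With something like this I would expect the total record count to stay the same and the records to be divided between the two partitions, which is what I hoped repartition(2) would do.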