Hi,

In my Spark Streaming application I'm trying to partition a data stream into
multiple substreams. I read data from a Kafka producer and process the
records in real time. The data is consumed through a JavaInputDStream
created as a direct stream, and it arrives without any loss. I need to
partition the data I receive, so I used repartition(), but this duplicated
the same set of records across all partitions rather than dividing (i.e.,
hashing) them across the specified number of partitions. Could anyone please
suggest a solution? A sample code snippet is shown below for reference.



//--------------------------------------------------------
            JavaStreamingContext ssc = null;

            System.setProperty("spark.executor.memory", "512m");
            System.setProperty("spark.streaming.unpersist", "false");
            Map<String, String> map = new HashMap<String, String>();
            map.put("spark.executor.memory", "512m");
            map.put("spark.streaming.unpersist", "false");

            try {
                ssc = new JavaStreamingContext("spark://host1:7077", "app",
                        new Duration(1000), System.getenv("SPARK_HOME"),
                        JavaStreamingContext.jarOfClass(Benchmark.class),
                        map);
            } catch (Exception e) {
                e.printStackTrace();
            }

            Map<String, Object> kafkaParams = new HashMap<String, Object>();
            //....

            Set<String> topic = Collections.singleton("inputStream");

            final JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
                    KafkaUtils.createDirectStream(ssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, byte[]>Subscribe(
                                    topic, kafkaParams));

            JavaDStream<Tuple> incrementStream1 =
                    stream.map(incrementFunction2);
            JavaDStream<Tuple> stream2 = incrementStream1.repartition(2);

//--------------------------------------------------------
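To clarify what I expected repartition(2) to do, here is a minimal sketch in plain Java (no Spark dependency; the class `HashPartitionSketch` and its methods are mine, written only to illustrate the idea): hash partitioning assigns each record to exactly one of n partitions via its hash code modulo n, so the partitions are disjoint and their union equals the input, with no duplication.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the behavior I expected from repartition(2):
// each record lands in exactly one partition, chosen by hashing,
// rather than being copied into every partition.
public class HashPartitionSketch {

    // Non-negative hash modulo the partition count; Math.floorMod keeps
    // the result in [0, numPartitions) even for negative hash codes.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Split the records into numPartitions disjoint sublists.
    static List<List<String>> partition(List<String> records, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String r : records) {
            parts.get(partitionFor(r, numPartitions)).add(r);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d");
        List<List<String>> parts = partition(records, 2);
        // Every record appears in exactly one partition, so the
        // partition sizes sum to the input size: 2 + 2 != 8.
        System.out.println(parts.get(0).size() + parts.get(1).size()); // prints 4
    }
}
```

In my application, however, both partitions of stream2 appear to contain the full set of input records.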



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-function-duplicates-data-tp28383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
