Hi,

In my Spark Streaming application I'm trying to partition a data stream into multiple substreams. I read data from a Kafka topic and process the records in real time; the data comes in through a JavaInputDStream created as a direct stream, and it is received without any loss. I need to partition the data I receive, so I called repartition(), but instead of dividing the records across partitions (i.e., hashing them into the number of partitions specified), it appears to duplicate the same set of data in every partition. Could anyone suggest a solution? A sample code snippet is shown below for reference.
//--------------------------------------------------------
JavaStreamingContext ssc = null;
System.setProperty("spark.executor.memory", "512m");
System.setProperty("spark.streaming.unpersist", "false");

Map<String, String> map = new HashMap<>();
map.put("spark.executor.memory", "512m");
map.put("spark.streaming.unpersist", "false");

try {
    ssc = new JavaStreamingContext("spark://host1:7077", "app",
            new Duration(1000), System.getenv("SPARK_HOME"),
            JavaStreamingContext.jarOfClass(Benchmark.class), map);
} catch (Exception e) {
    e.printStackTrace();
}

Map<String, Object> kafkaParams = new HashMap<>();
// ....

Set<String> topic = Collections.singleton("inputStream");
final JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
        KafkaUtils.createDirectStream(ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(topic, kafkaParams));

JavaDStream<Tuple> incrementStream1 = stream.map(incrementFunction2);
JavaDStream<Tuple> stream2 = incrementStream1.repartition(2); // duplicates data instead of splitting it
//--------------------------------------------------------
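
For comparison, here is a minimal sketch of the behaviour I expected, written out explicitly as keying each record and hash-partitioning by that key. The key extractor (record.key()) is only illustrative; the real key would come from my payload.

//--------------------------------------------------------
import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Key each Kafka record by its message key (illustrative choice).
JavaPairDStream<String, byte[]> keyed = stream.mapToPair(
        record -> new Tuple2<>(record.key(), record.value()));

// Explicitly hash-partition each batch RDD into 2 partitions,
// so records sharing a key land in the same partition.
JavaPairDStream<String, byte[]> hashed = keyed.transformToPair(
        rdd -> rdd.partitionBy(new HashPartitioner(2)));
//--------------------------------------------------------

With something like this I would expect the total record count to stay the same and the records to be divided between the two partitions, which is what I hoped repartition(2) would do.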