(Removing dev from the To: list as it is not relevant.) It would be good to see some sample data and the Cassandra schema to get a more concrete idea of the problem space.

Some thoughts: reduceByKey could still be used to 'pick' one element per key, for example by arbitrarily choosing the first one: reduceByKey { case (e1, e2) => e1 }.
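In context, a minimal self-contained sketch of that idea; the Message case class, the "id"/"payload" fields and the queueStream stand-in source are just placeholders for your actual Kafka dstream:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Message(id: String, payload: String)

    val conf = new SparkConf().setAppName("pick-one-per-key").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Stand-in source for the sketch; in your case this is the (unioned) Kafka dstream.
    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq(
      Message("k1", "v1"), Message("k1", "v1-updated"), Message("k2", "v2"))))
    val messages = ssc.queueStream(rddQueue)

    val onePerKey = messages
      .map(m => (m.id, m))                   // key by the primary key
      .reduceByKey { case (e1, e2) => e1 }   // arbitrarily keep one value per key, per batch

    onePerKey.print()   // one (key, message) pair per key in each batch
    ssc.start()
    ssc.awaitTermination()

The same reduceByKey would apply after unioning the dstreams, so each key is reduced to a single value before anything is written.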
The question to be answered is: what should happen to the multiple values that arrive for one key? And why are they creating duplicates in Cassandra? If they have the same key, they will result in an overwrite (which is not desirable anyway, due to tombstones).

For reference, a rough end-to-end sketch of the union + reduceByKey + write approach discussed further down this thread is appended at the bottom of this mail.

-kr, Gerard.

On Tue, Aug 4, 2015 at 1:03 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Yes... union would be one solution. I am not doing any aggregation, hence
> reduceByKey would not be useful. If I use groupByKey, messages with the
> same key would be obtained in a partition. But groupByKey is a very
> expensive operation as it involves a shuffle. My ultimate goal is to write
> the messages to Cassandra. If messages with the same key are handled by
> different streams, there would be concurrency issues. To resolve this I
> can union the dstreams and apply a hash partitioner so that all the same
> keys are brought to a single partition, or do a groupByKey, which does the
> same.
>
> As groupByKey is expensive, is there any workaround for this?
>
> On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá
> <juan.rodriguez.hort...@gmail.com> wrote:
>
>> Hi,
>>
>> Just my two cents. I understand your problem is that you have messages
>> with the same key in two different dstreams. What I would do is make a
>> union of all the dstreams with StreamingContext.union, or several calls
>> to DStream.union, then create a pair dstream with the primary key as the
>> key, and then use groupByKey or reduceByKey (or combineByKey etc.) to
>> combine the messages with the same primary key.
>>
>> Hope that helps.
>>
>> Greetings,
>>
>> Juan
>>
>> 2015-07-30 10:50 GMT+02:00 Priya Ch <learnings.chitt...@gmail.com>:
>>
>>> Hi All,
>>>
>>> Can someone throw insights on this?
>>>
>>> On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch
>>> <learnings.chitt...@gmail.com> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> Thanks for the info. I have a scenario like this.
>>>>
>>>> I am reading the data from a Kafka topic. Let's say Kafka has 3
>>>> partitions for the topic. In my streaming application, I configure 3
>>>> receivers with 1 thread each, so that they receive 3 dstreams (from the
>>>> 3 partitions of the Kafka topic), and I also implement a partitioner.
>>>> Now there is a possibility of receiving a message with the same primary
>>>> key twice or more: once when the message is created, and again whenever
>>>> any field of the same message is updated.
>>>>
>>>> If two messages M1 and M2 with the same primary key are read by 2
>>>> receivers, then even with the partitioner Spark would still end up
>>>> processing them in parallel, as they are in different dstreams
>>>> altogether. How do we address this situation?
>>>>
>>>> Thanks,
>>>> Padma Ch
>>>>
>>>> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com>
>>>> wrote:
>>>>
>>>>> You have to partition the data in Spark Streaming by the primary key,
>>>>> and then make sure you insert the data into Cassandra atomically per
>>>>> key, or per set of keys in the partition. You can use the combination
>>>>> of the (batch time, partition id) of the RDD inside foreachRDD as the
>>>>> unique id for the data you are inserting. This will guard against
>>>>> multiple attempts to run the task that inserts into Cassandra.
>>>>>
>>>>> See
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>>>>
>>>>> TD
>>>>>
>>>>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch
>>>>> <learnings.chitt...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a problem when writing streaming data to Cassandra. Our
>>>>>> existing product is on an Oracle DB, in which locks are maintained
>>>>>> while writing data so that duplicates in the DB are avoided.
>>>>>>
>>>>>> But as Spark has a parallel processing architecture, if more than 1
>>>>>> thread is trying to write the same data, i.e. with the same primary
>>>>>> key, is there any scope to create duplicates? If yes, how do we
>>>>>> address this problem, either from the Spark side or from the
>>>>>> Cassandra side?
>>>>>>
>>>>>> Thanks,
>>>>>> Padma Ch
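As promised above, a rough end-to-end sketch of the approach discussed in this thread: union the receiver dstreams, key by the primary key, reduce to one value per key, and write once per batch with the spark-cassandra-connector. The keyspace, table, topic and consumer group names, the "id,payload" line format and the Message case class are all invented for illustration; adjust to the real schema.

    import com.datastax.spark.connector._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class Message(id: String, payload: String)

    val conf = new SparkConf()
      .setAppName("union-dedup-write")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed Cassandra host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Three receivers, one per Kafka partition, as in the original setup.
    // Zookeeper address, consumer group and topic name are placeholders.
    val streams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("my_topic" -> 1))
        .map { case (_, line) =>
          val Array(id, payload) = line.split(",", 2)   // assumed "id,payload" format
          Message(id, payload)
        }
    }

    // 1. Union, so that all copies of a key end up in the same batch job,
    // 2. key by the primary key and reduce, so each key is handled by exactly
    //    one task per batch and there are no concurrent writes for one key.
    val latestPerKey = ssc.union(streams)
      .map(m => (m.id, m))
      .reduceByKey { case (_, e2) => e2 }   // or pick by a timestamp/version field

    // 3. Write once per batch. Cassandra writes are upserts, so a retried task
    //    simply overwrites the same rows with the same values; if a stronger
    //    guard is needed, the (batchTime, partition id) pair TD mentions can be
    //    stored with the row and checked before applying the write.
    latestPerKey.foreachRDD { (rdd, batchTime) =>
      rdd.values.saveToCassandra("my_keyspace", "messages", SomeColumns("id", "payload"))
    }

    ssc.start()
    ssc.awaitTermination()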