(Removing dev from the To: list as it is not relevant.) It would be good to see some sample data and the Cassandra schema to get a more concrete idea of the problem space.

Some thoughts: reduceByKey could still be used to 'pick' one element per key, for example by arbitrarily choosing the first one: reduceByKey { case (e1, e2) => e1 }.
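In context, a minimal self-contained sketch of that idea; the Message case class, the "id"/"payload" fields and the queueStream stand-in source are just placeholders for your actual Kafka dstream:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Message(id: String, payload: String)

    val conf = new SparkConf().setAppName("pick-one-per-key").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Stand-in source for the sketch; in your case this is the (unioned) Kafka dstream.
    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq(
      Message("k1", "v1"), Message("k1", "v1-updated"), Message("k2", "v2"))))
    val messages = ssc.queueStream(rddQueue)

    val onePerKey = messages
      .map(m => (m.id, m))                   // key by the primary key
      .reduceByKey { case (e1, e2) => e1 }   // arbitrarily keep one value per key, per batch

    onePerKey.print()   // one (key, message) pair per key in each batch
    ssc.start()
    ssc.awaitTermination()

The same reduceByKey would apply after unioning the dstreams, so each key is reduced to a single value before anything is written.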
The question to be answered is: what should happen to the multiple values that arrive for one key? And why are they creating duplicates in Cassandra? If they have the same key, they will result in an overwrite (which is not desirable anyway, due to tombstones).

For reference, a rough end-to-end sketch of the union + reduceByKey + write approach discussed further down this thread is appended at the bottom of this mail.

-kr, Gerard.

On Tue, Aug 4, 2015 at 1:03 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Yes... union would be one solution. I am not doing any aggregation, hence
> reduceByKey would not be useful. If I use groupByKey, messages with the
> same key would be obtained in a partition. But groupByKey is a very
> expensive operation as it involves a shuffle. My ultimate goal is to write
> the messages to Cassandra. If messages with the same key are handled by
> different streams, there would be concurrency issues. To resolve this I
> can union the dstreams and apply a hash partitioner so that all the same
> keys are brought to a single partition, or do a groupByKey, which does the
> same.
>
> As groupByKey is expensive, is there any workaround for this?
>
> On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá
> <juan.rodriguez.hort...@gmail.com> wrote:
>
>> Hi,
>>
>> Just my two cents. I understand your problem is that you have messages
>> with the same key in two different dstreams. What I would do is make a
>> union of all the dstreams with StreamingContext.union, or several calls
>> to DStream.union, then create a pair dstream with the primary key as the
>> key, and then use groupByKey or reduceByKey (or combineByKey etc.) to
>> combine the messages with the same primary key.
>>
>> Hope that helps.
>>
>> Greetings,
>>
>> Juan
>>
>> 2015-07-30 10:50 GMT+02:00 Priya Ch <learnings.chitt...@gmail.com>:
>>
>>> Hi All,
>>>
>>> Can someone throw insights on this?
>>>
>>> On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch
>>> <learnings.chitt...@gmail.com> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> Thanks for the info. I have a scenario like this.
>>>>
>>>> I am reading the data from a Kafka topic. Let's say Kafka has 3
>>>> partitions for the topic. In my streaming application, I configure 3
>>>> receivers with 1 thread each, so that they receive 3 dstreams (from the
>>>> 3 partitions of the Kafka topic), and I also implement a partitioner.
>>>> Now there is a possibility of receiving a message with the same primary
>>>> key twice or more: once when the message is created, and again whenever
>>>> any field of the same message is updated.
>>>>
>>>> If two messages M1 and M2 with the same primary key are read by 2
>>>> receivers, then even with the partitioner Spark would still end up
>>>> processing them in parallel, as they are in different dstreams
>>>> altogether. How do we address this situation?
>>>>
>>>> Thanks,
>>>> Padma Ch
>>>>
>>>> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com>
>>>> wrote:
>>>>
>>>>> You have to partition the data in Spark Streaming by the primary key,
>>>>> and then make sure you insert the data into Cassandra atomically per
>>>>> key, or per set of keys in the partition. You can use the combination
>>>>> of the (batch time, partition id) of the RDD inside foreachRDD as the
>>>>> unique id for the data you are inserting. This will guard against
>>>>> multiple attempts to run the task that inserts into Cassandra.
>>>>>
>>>>> See
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>>>>
>>>>> TD
>>>>>
>>>>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch
>>>>> <learnings.chitt...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a problem when writing streaming data to Cassandra. Our
>>>>>> existing product is on an Oracle DB, in which locks are maintained
>>>>>> while writing data so that duplicates in the DB are avoided.
>>>>>>
>>>>>> But as Spark has a parallel processing architecture, if more than 1
>>>>>> thread is trying to write the same data, i.e. with the same primary
>>>>>> key, is there any scope to create duplicates? If yes, how do we
>>>>>> address this problem, either from the Spark side or from the
>>>>>> Cassandra side?
>>>>>>
>>>>>> Thanks,
>>>>>> Padma Ch
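As promised above, a rough end-to-end sketch of the approach discussed in this thread: union the receiver dstreams, key by the primary key, reduce to one value per key, and write once per batch with the spark-cassandra-connector. The keyspace, table, topic and consumer group names, the "id,payload" line format and the Message case class are all invented for illustration; adjust to the real schema.

    import com.datastax.spark.connector._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class Message(id: String, payload: String)

    val conf = new SparkConf()
      .setAppName("union-dedup-write")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed Cassandra host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Three receivers, one per Kafka partition, as in the original setup.
    // Zookeeper address, consumer group and topic name are placeholders.
    val streams = (1 to 3).map { _ =>
      KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("my_topic" -> 1))
        .map { case (_, line) =>
          val Array(id, payload) = line.split(",", 2)   // assumed "id,payload" format
          Message(id, payload)
        }
    }

    // 1. Union, so that all copies of a key end up in the same batch job,
    // 2. key by the primary key and reduce, so each key is handled by exactly
    //    one task per batch and there are no concurrent writes for one key.
    val latestPerKey = ssc.union(streams)
      .map(m => (m.id, m))
      .reduceByKey { case (_, e2) => e2 }   // or pick by a timestamp/version field

    // 3. Write once per batch. Cassandra writes are upserts, so a retried task
    //    simply overwrites the same rows with the same values; if a stronger
    //    guard is needed, the (batchTime, partition id) pair TD mentions can be
    //    stored with the row and checked before applying the write.
    latestPerKey.foreachRDD { (rdd, batchTime) =>
      rdd.values.saveToCassandra("my_keyspace", "messages", SomeColumns("id", "payload"))
    }

    ssc.start()
    ssc.awaitTermination()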