Good to know! I am bumping the priority of this issue in my head. Thanks
for the feedback. Others seeing this thread, please comment if you think
that this is an important issue for you as well.

Not at my computer right now but I will make a Jira for this.

TD
On Jul 17, 2014 11:22 PM, "Yan Fang" <yanfang...@gmail.com> wrote:

> Thank you, TD. This is important information for us. Will keep an eye on
> that.
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>
>
> On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Yes, this is the limitation of the current implementation. But this will
>> be improved a loooot when we have IndexedRDD
>> <https://github.com/apache/spark/pull/1297> in the Spark that allows
>> faster single value updates to a key-value (within each partition, without
>> processing the entire partition.
>>
>> Soon.....
>>
>> TD
>>
>>
>> On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang <yanfang...@gmail.com> wrote:
>>
>>> Hi TD,
>>>
>>> Thank you. Yes, it behaves as you described. Sorry for missing this
>>> point.
>>>
>>> Then my only concern is in the performance side - since Spark Streaming
>>> operates on all the keys everytime a new batch comes, I think it is fine
>>> when the state size is small. When the state size becomes big, say, a few
>>> GBs, if we still go through the whole key list, would the operation be a
>>> little inefficient then? Maybe I miss some points in Spark Streaming, which
>>> consider this situation.
>>>
>>> Cheers,
>>>
>>> Fang, Yan
>>> yanfang...@gmail.com
>>> +1 (206) 849-4108
>>>
>>>
>>> On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>>> The updateFunction given in updateStateByKey should be called on ALL
>>>> the keys are in the state, even if there is no new data in the batch for
>>>> some key. Is that not the behavior you see?
>>>>
>>>> What do you mean by "show all the existing states"? You have access to
>>>> the latest state RDD by doing stateStream.foreachRDD(...). There you can do
>>>> whatever operation on all the key-state pairs.
>>>>
>>>> TD
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang <yanfang...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi TD,
>>>>>
>>>>> Thank you for the quick replying and backing my approach. :)
>>>>>
>>>>> 1) The example is this:
>>>>>
>>>>> 1. In the first 2 second interval, after updateStateByKey, I get a few
>>>>> keys and their states, say, ("a" -> 1, "b" -> 2, "c" -> 3)
>>>>> 2. In the following 2 second interval, I only receive "c" and "d" and
>>>>> their value. But I want to update/display the state of "a" and "b"
>>>>> accordingly.
>>>>> * It seems I have no way to "access" the "a" and "b" and get their
>>>>> states.
>>>>> * also, do I have a way to show all the existing states?
>>>>>
>>>>> I guess the approach to solve this will be similar to what you
>>>>> mentioned for 2). But the difficulty is that, if I want to display all the
>>>>> existing states, need to bundle all the rest keys to one key.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Fang, Yan
>>>>> yanfang...@gmail.com
>>>>> +1 (206) 849-4108
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das <
>>>>> tathagata.das1...@gmail.com> wrote:
>>>>>
>>>>>> For accessing previous version, I would do it the same way. :)
>>>>>>
>>>>>> 1. Can you elaborate on what you mean by that with an example? What
>>>>>> do you mean by "accessing" keys?
>>>>>>
>>>>>> 2. Yeah, that is hard to do with the ability to do point lookups into
>>>>>> an RDD, which we dont support yet. You could try embedding the related 
>>>>>> key
>>>>>> in the values of the keys that need it. That is, B will is present in the
>>>>>> value of key A. Then put this transformed DStream through 
>>>>>> updateStateByKey.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to