Re: Multiple aggregations over streaming dataframes

Arnaud Bailly Fri, 08 Jul 2016 01:01:34 -0700

Thanks for your answers. I know Kafka's model, but I would rather avoid
having to set up both Spark and Kafka to handle my use case. I wonder
whether this could be handled using Spark's standard streams?
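
For reference, the part of Kafka's model that matters here (an append-only
log, each consumer owning its own cursor, and a TTL deciding when data may be
dropped) can be sketched in a few lines of plain Python. This is only an
illustration under assumptions: `TopicLog`, `Consumer` and all method names
are hypothetical, and none of this is Kafka's or Spark's API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Append-only log for one topic (hypothetical, not Kafka's API)."""
    ttl_seconds: float
    base_offset: int = 0                          # offset of the oldest retained record
    records: list = field(default_factory=list)   # (timestamp, value) pairs

    def append(self, value, now=None):
        ts = time.time() if now is None else now
        self.records.append((ts, value))

    def expire(self, now=None):
        """Drop records older than the TTL, whether or not anyone has read them."""
        now = time.time() if now is None else now
        while self.records and now - self.records[0][0] > self.ttl_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        """Return (values, next_offset) for everything retained from `offset` on."""
        start = max(offset, self.base_offset)     # skip anything already expired
        values = [v for _, v in self.records[start - self.base_offset:]]
        return values, start + len(values)


class Consumer:
    """Reader that owns its cursor; the log keeps no per-consumer state."""

    def __init__(self, log):
        self.log = log
        self.cursor = 0

    def poll(self):
        values, self.cursor = self.log.read_from(self.cursor)
        return values
```

Because the log holds no per-consumer state, a slow consumer cannot hold back
the others; the price is that a consumer lagging behind the TTL simply misses
the expired records.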


-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/

On Fri, Jul 8, 2016 at 12:00 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Kafka has an interesting model that might be applicable.
>
> You can think of Kafka as enabling a queue system. Writers are called
> producers, and readers are called consumers. The server is called a broker.
> A “topic” is like a named queue.
>
> Producers are independent. They can write to a “topic” at will. Consumers
> (i.e. your nested aggregates) need to be independent of each other and of
> the broker. The broker receives data from producers and stores it using
> memory and disk. Consumers read from the broker and maintain their own
> cursor. Because the client maintains the cursor, one consumer cannot
> impact other producers and consumers.
>
> I would think the tricky part for Spark would be to know when the data can
> be deleted. In the Kafka world each topic is allowed to define a TTL SLA,
> i.e. the consumer must read the data within a limited window of time.
>
> Andy
>
> From: Michael Armbrust <mich...@databricks.com>
> Date: Thursday, July 7, 2016 at 2:31 PM
> To: Arnaud Bailly <arnaud.oq...@gmail.com>
> Cc: Sivakumaran S <siva.kuma...@me.com>, "user @spark" <
> user@spark.apache.org>
> Subject: Re: Multiple aggregations over streaming dataframes
>
> We are planning to address this issue in the future.
>
> At a high level, we'll have to add a delta mode so that updates can be
> communicated from one operator to the next.
>
> On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly <arnaud.oq...@gmail.com>
> wrote:
>
>> Indeed. But nested aggregation does not work with Structured Streaming,
>> that's the point. I would like to know if there is a workaround, or what
>> the plan is regarding this feature, which seems quite useful to me. If the
>> implementation is not overly complex and it is just a matter of manpower,
>> I am fine with devoting some time to it.
>>
>>
>>
>> --
>> Arnaud Bailly
>>
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>
>> On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <siva.kuma...@me.com>
>> wrote:
>>
>>> Arnaud,
>>>
>>> You could aggregate the first table and then merge it with the second
>>> table (assuming that they are similarly structured) and then carry out the
>>> second aggregation. Unless the data is very large, I don’t see why you
>>> should persist it to disk. IMO, nested aggregation is more elegant and
>>> readable than a complex single stage.
>>>
>>> Regards,
>>>
>>> Sivakumaran
>>>
>>>
>>>
>>> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <arnaud.oq...@gmail.com>
>>> wrote:
>>>
>>> It's aggregation at multiple levels in a query: first do some
>>> aggregation on one table, then join with another table and do a second
>>> aggregation. I could probably rewrite the query in such a way that it does
>>> the aggregation in one pass, but that would obfuscate the purpose of the
>>> various stages.
>>> On 7 Jul 2016 12:55, "Sivakumaran S" <siva.kuma...@me.com> wrote:
>>>
>>>> Hi Arnaud,
>>>>
>>>> Sorry for the doubt, but what exactly is multiple aggregation? What is
>>>> the use case?
>>>>
>>>> Regards,
>>>>
>>>> Sivakumaran
>>>>
>>>>
>>>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <arnaud.oq...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I understand that multiple aggregations over streaming dataframes are not
>>>> currently supported in Spark 2.0. Is there a workaround? Off the top of
>>>> my head I could think of a two-stage approach:
>>>>  - first query writes output to disk/memory using "complete" mode
>>>>  - second query reads from this output
>>>>
>>>> Does this make sense?
>>>>
>>>> Furthermore, I would like to understand which technical hurdles are
>>>> preventing Spark SQL from implementing multiple aggregations right
>>>> now.
>>>>
>>>> Thanks,
>>>> --
>>>> Arnaud Bailly
>>>>
>>>> twitter: abailly
>>>> skype: arnaud-bailly
>>>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>>>
>>>>
>>>>
>>>
>>
>
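
The two-stage workaround discussed in this thread (a first query writing its
full result in "complete" mode, a second query reading that output, joining
it with another table and aggregating again) can be sketched in plain Python.
This is a minimal illustration under assumptions, not Spark code: the
function names, the (user, amount) event shape and the user-to-country table
are all hypothetical, and the intermediate result is kept in memory rather
than written to a sink.

```python
from collections import defaultdict


def stage_one(events_so_far):
    """First aggregation: total amount per user over all events seen so far."""
    totals = defaultdict(float)
    for user, amount in events_so_far:
        totals[user] += amount
    return dict(totals)  # the whole result table, as "complete" mode would emit


def stage_two(user_totals, user_country):
    """Join the stage-one output with a second table, then aggregate again."""
    by_country = defaultdict(float)
    for user, total in user_totals.items():
        country = user_country.get(user)
        if country is not None:  # inner join: users without a country are dropped
            by_country[country] += total
    return dict(by_country)


def run(batches, user_country):
    """Replay micro-batches, recomputing both stages after each batch."""
    seen = []
    outputs = []
    for batch in batches:
        seen.extend(batch)
        intermediate = stage_one(seen)  # stands in for the disk/memory output
        outputs.append(stage_two(intermediate, user_country))
    return outputs


if __name__ == "__main__":
    batches = [[("ann", 10.0), ("bob", 5.0)], [("ann", 2.0), ("cid", 7.0)]]
    countries = {"ann": "FR", "bob": "FR", "cid": "US"}
    for result in run(batches, countries):
        print(result)
```

Note that the second stage rereads the entire first-stage table on every
batch, so the cost grows with the size of that table rather than with the
size of the batch.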
