Unfortunately this feature falls in a category of *incredibly useful*
features that have gotten the -1 over the years because they don't scale
like we want them to.  As far as basic aggregations go, it's remarkably
trivial to roll up 100K-1MM items using very little memory, so at first it
seems like an easy problem.

There's a rub though.  Duy Hai is correct, there's a big issue with
pagination.  Paginating through results right now relies on tokens & not
offsets.  Paginating through aggregated data would require some serious
changes to how this works (I think).

It might be possible to generate temporary tables / partitions of the
aggregated results that are stored on disk & replicated to other nodes in
order to make pagination work correctly, but it starts to move into a fuzzy
area as to whether it's even worth it.

For smaller datasets (under a few hundred thousand datapoints), I wouldn't
bother with Spark; it's overkill and imo the wrong tool for the job.  Ed
Capriolo had a suggestion a while ago that I loved - grab all the
raw data and operate on it in memory using H2 (on the JVM) or Pandas / NumPy
(Python).  This at least works with every version and won't require waiting
till Cassandra 5/6/7 is out.  Perform any rollups you might want and cache
them somewhere, perhaps back into a TTL'ed C* partition.
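To make that concrete, here's a minimal sketch of the in-memory approach
with pandas.  The device names and values are made up for illustration; in
a real app the rows would come from a plain SELECT over the time window via
the Python driver, and the result would be written back to a TTL'ed table.

```python
# Sketch of the "pull raw data, aggregate in memory" idea.  The rows below
# stand in for the result of a SELECT against Cassandra; device names and
# values are hypothetical.
import pandas as pd

rows = [
    {"device": "dev-a", "value": 10.0},
    {"device": "dev-a", "value": 30.0},
    {"device": "dev-b", "value": 5.0},
    {"device": "dev-c", "value": 50.0},
]

df = pd.DataFrame(rows)

# Average per device, then keep the top N by that average -- the
# group by / order by / limit that CQL can't do on aggregated values.
top_n = df.groupby("device")["value"].mean().nlargest(2)  # N = 2 here

print(top_n.to_dict())  # -> {'dev-c': 50.0, 'dev-a': 20.0}
```

The resulting top-N series is small, so caching it back into a TTL'ed C*
partition (as suggested above) is cheap regardless of how many raw
datapoints were scanned.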

Jon

On Tue, Jun 6, 2017 at 11:39 AM DuyHai Doan <doanduy...@gmail.com> wrote:

> The problem is not that it's not feasible from the Cassandra side; it is.
>
> The problem is that when doing an arbitrary ORDER BY, Cassandra needs to resort to
> in-memory sorting of a potentially huge amount of data --> more pressure on
> heap --> impact on cluster stability
>
> Whereas delegating this kind of job to Spark, which has appropriate data
> structures to lower heap pressure (DataFrames, Project Tungsten), is a better
> idea.
>
> "but in the Top N use case, far more data has to be transferred to the
> client when the client has to do the sorting"
>
> --> This is not true if you co-locate your Spark workers with the Cassandra
> nodes. In that case, Spark reads from Cassandra nodes are always
> node-local
>
>
>
> On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) <rfis...@brocade.com>
> wrote:
>
>> Hi DuyHai,
>>
>>
>>
>> this is in response to the other points in your response.
>>
>>
>>
>> My application is a real-time application. It monitors devices in the
>> network and displays the top N devices for various parameters averaged over
>> a time period. A query may involve anywhere from 10 to 50k devices, and
>> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
>> seconds.
>>
>>
>>
>> My impression was that Spark is aimed at larger scale analytics.
>>
>>
>>
>> I am ok with the limitation on “group by”. I am intending to use async
>> queries and token-aware load balancing to partition the query and execute
>> it in parallel on each node.
>>
>>
>>
>> Thanks…
>>
>>
>>
>> Roger
>>
>>
>>
>>
>>
>> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
>> *Sent:* Tuesday, June 06, 2017 12:31 AM
>> *To:* Roger Fischer (CW) <rfis...@brocade.com>
>> *Cc:* user@cassandra.apache.org
>> *Subject:* Re: Order by for aggregated values
>>
>>
>>
>> First, Group By is only allowed on partition keys and clustering columns,
>> not on arbitrary columns. The internal implementation of group by tries to
>> fetch data in clustering order to avoid having to "re-sort" it in memory,
>> which would be very expensive
>>
>>
>>
>> Second, group by works best when restricted to a single partition;
>> otherwise it will force Cassandra to do a range scan, hence poor performance
>>
>>
>>
>>
>>
>> For all of those reasons I don't expect an "order by" on aggregated
>> values to be available any time soon
>>
>>
>>
>> Furthermore, Cassandra is optimised for real-time transactional
>> scenarios; group by / order by / limit is typically a classical analytics
>> scenario, so I would recommend using an appropriate tool like Spark for that
>>
>>
>>
>>
>>
>> Le 6 juin 2017 04:00, "Roger Fischer (CW)" <rfis...@brocade.com> a
>> écrit :
>>
>> Hello,
>>
>>
>>
>> is there any intent to support “order by” and “limit” on aggregated
>> values?
>>
>>
>>
>> For time series data, top n queries are quite common. Group-by was the
>> first step towards supporting such queries, but ordering by value and
>> limiting the results are also required.
>>
>>
>>
>> Thanks…
>>
>>
>>
>> Roger
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
