The problem is not that it's infeasible on the Cassandra side; it is feasible.

The problem is that for an arbitrary ORDER BY, Cassandra has to resort to
in-memory sorting of a potentially huge amount of data --> more pressure on
the heap --> impact on cluster stability.

Delegating this kind of job to Spark, which has appropriate data
structures to lower heap pressure (DataFrame, Project Tungsten), is a better
idea.

"but in the Top N use case, far more data has to be transferred to the
client when the client has to do the sorting"

--> That is not true if you co-locate your Spark workers with the Cassandra
nodes. In that case, Spark's reads from the Cassandra nodes are always
node-local.
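
To illustrate why node-local processing keeps the transfer small, here is a
minimal sketch in plain Python. It is not Spark or Cassandra code: the
function names (local_top_n, merge_top_n) and the sample rows are
hypothetical. It assumes all rows for a given device live on a single node
(for example, the device id is the partition key), so each node can compute
complete averages and ship only its own N best candidates.

```python
import heapq
from collections import defaultdict


def local_top_n(rows, n):
    """Average the values per device on one node and keep the n largest averages."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for device_id, value in rows:
        sums[device_id] += value
        counts[device_id] += 1
    averages = ((sums[d] / counts[d], d) for d in sums)
    # nlargest keeps at most n candidates instead of sorting every group.
    return heapq.nlargest(n, averages)


def merge_top_n(partials, n):
    """Combine the per-node candidate lists into the global top n."""
    return heapq.nlargest(n, (item for partial in partials for item in partial))


# Hypothetical sample data: (device_id, value) rows as held by two nodes.
node_a = [("dev1", 10.0), ("dev1", 20.0), ("dev2", 5.0)]
node_b = [("dev3", 40.0), ("dev4", 1.0), ("dev4", 3.0)]

top2 = merge_top_n([local_top_n(node_a, 2), local_top_n(node_b, 2)], 2)
print(top2)
```

Only N (average, device) pairs leave each node, regardless of how many raw
rows were read there. If a device's rows can be spread over several nodes,
the local averages are incomplete and this shortcut no longer holds.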



On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) <rfis...@brocade.com>
wrote:

> Hi DuyHai,
>
>
>
> this is in response to the other points in your response.
>
>
>
> My application is a real-time application. It monitors devices in the
> network and displays the top N devices for various parameters averaged over
> a time period. A query may involve anywhere from 10 to 50k devices, and
> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
> seconds.
>
>
>
> My impression was that Spark is aimed at larger scale analytics.
>
>
>
> I am ok with the limitation on “group by”. I am intending to use async
> queries and token-aware load balancing to partition the query and execute
> it in parallel on each node.
>
>
>
> Thanks…
>
>
>
> Roger
>
>
>
>
>
> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
> *Sent:* Tuesday, June 06, 2017 12:31 AM
> *To:* Roger Fischer (CW) <rfis...@brocade.com>
> *Cc:* user@cassandra.apache.org
> *Subject:* Re: Order by for aggregated values
>
>
>
> First, GROUP BY is only allowed on partition keys and clustering columns,
> not on arbitrary columns. The internal implementation of GROUP BY tries to
> fetch data in clustering order to avoid having to "re-sort" it in memory,
> which would be very expensive.
>
>
>
> Second, GROUP BY works best when restricted to a single partition;
> otherwise it forces Cassandra to do a range scan, which performs poorly.
>
>
>
>
>
> For all of those reasons I don't expect an "order by" on aggregated values
> to be available any time soon.
>
>
>
> Furthermore, Cassandra is optimised for real-time transactional scenarios,
> whereas group by/order by/limit is a classic analytics scenario. I
> would recommend using an appropriate tool like Spark for that.
>
>
>
>
>
> On 6 June 2017 04:00, "Roger Fischer (CW)" <rfis...@brocade.com> wrote:
>
> Hello,
>
>
>
> is there any intent to support “order by” and “limit” on aggregated values?
>
>
>
> For time series data, top n queries are quite common. Group-by was the
> first step towards supporting such queries, but ordering by value and
> limiting the results are also required.
>
>
>
> Thanks…
>
>
>
> Roger
