The problem is not that it's infeasible on the Cassandra side. The problem is that for an arbitrary ORDER BY, Cassandra has to resort to in-memory sorting of a potentially huge amount of data --> more pressure on the heap --> impact on cluster stability.
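
To make the memory argument concrete, here is a minimal Python sketch. It is only an illustration of the general difference between sorting a whole result set and keeping a bounded top-N heap; it does not represent Cassandra or Spark internals, and the function names and the (device, value) row shape are made up for the example.

```python
import heapq
import random


def top_n_full_sort(rows, n):
    """Sort every row, then slice: all rows must be held in memory at once."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def top_n_bounded(rows, n):
    """Stream rows through a heap of at most n entries: O(n) extra memory."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


# Hypothetical (device_id, averaged_value) pairs, standing in for the
# output of a GROUP BY over 50k devices.
random.seed(42)
rows = [(f"device-{i}", random.random()) for i in range(50_000)]

# Both approaches agree on the result; only the memory profile differs.
assert top_n_full_sort(rows, 10) == top_n_bounded(rows, 10)
```

An arbitrary ORDER BY without a small LIMIT corresponds to the first function, which is where the heap pressure comes from. The second function accepts any iterable, so rows can be fed in as they arrive without being collected first.
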

Delegating this kind of job to Spark, which has appropriate data structures to lower heap pressure (DataFrame, Project Tungsten), is a better idea.

"but in the Top N use case, far more data has to be transferred to the client when the client has to do the sorting"

--> That is not true if you co-locate your Spark workers with the Cassandra nodes. In that case, Spark's reads from the Cassandra nodes are always node-local.

On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) <rfis...@brocade.com> wrote:

> Hi DuyHai,
>
> this is in response to the other points in your response.
>
> My application is a real-time application. It monitors devices in the
> network and displays the top N devices for various parameters averaged over
> a time period. A query may involve anywhere from 10 to 50k devices, and
> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
> seconds.
>
> My impression was that Spark is aimed at larger scale analytics.
>
> I am ok with the limitation on “group by”. I am intending to use async
> queries and token-aware load balancing to partition the query and execute
> it in parallel on each node.
>
> Thanks…
>
> Roger
>
> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
> *Sent:* Tuesday, June 06, 2017 12:31 AM
> *To:* Roger Fischer (CW) <rfis...@brocade.com>
> *Cc:* user@cassandra.apache.org
> *Subject:* Re: Order by for aggregated values
>
> First, Group By is only allowed on partition keys and clustering columns,
> not on arbitrary columns. The internal implementation of group by tries to
> fetch data in clustering order to avoid having to "re-sort" it in memory,
> which would be very expensive.
>
> Second, group by works best when restricted to a single partition;
> otherwise it forces Cassandra to do a range scan, so performance is poor.
>
> For all of those reasons I don't expect an "order by" on aggregated values
> to be available any time soon.
>
> Furthermore, Cassandra is optimised for real-time transactional scenarios;
> group by/order by/limit is a classic analytics scenario, and I would
> recommend using an appropriate tool like Spark for that.
>
> On June 6, 2017 at 04:00, "Roger Fischer (CW)" <rfis...@brocade.com> wrote:
>
> Hello,
>
> is there any intent to support “order by” and “limit” on aggregated values?
>
> For time series data, top n queries are quite common. Group-by was the
> first step towards supporting such queries, but ordering by value and
> limiting the results are also required.
>
> Thanks…
>
> Roger