Unfortunately this feature falls in a category of *incredibly useful* features that have gotten the -1 over the years because it doesn't scale like we want it to. As far as basic aggregations go, it's remarkably trivial to roll up 100K-1MM items using very little memory, so at first it seems like an easy problem.
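(To make the "very little memory" point concrete: a streaming rollup only keeps one accumulator per group, never the raw rows, so memory is O(groups) rather than O(rows). A minimal sketch in plain Python; the device/value shape is made up for illustration:)

```python
from collections import defaultdict

def rollup(rows):
    """Stream (device_id, value) pairs, keeping one (count, sum)
    accumulator per device. Memory is proportional to the number of
    distinct groups, not the 100K-1MM input rows."""
    acc = defaultdict(lambda: [0, 0.0])  # device_id -> [count, sum]
    for device_id, value in rows:
        a = acc[device_id]
        a[0] += 1
        a[1] += value
    # Return per-device averages
    return {d: s / n for d, (n, s) in acc.items()}

# A million synthetic readings spread over 100 devices, fed as a
# generator so the raw data never sits in memory at once.
readings = ((i % 100, float(i % 7)) for i in range(1_000_000))
averages = rollup(readings)
print(len(averages))  # 100 groups
```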
There's a rub, though. Duy Hai is correct: there's a big issue with pagination. Paginating through results right now relies on tokens, not offsets. Paginating through aggregated data would require some serious changes to how this works (I think). It might be possible to generate temporary tables / partitions of the aggregated results that are stored on disk & replicated to other nodes in order to make pagination work correctly, but it starts to move into a fuzzy area where it may not even be worth it.

For smaller datasets (under a few hundred thousand datapoints), I wouldn't bother with Spark; it's overkill and IMO the wrong tool for the job. A while ago Ed Capriolo had a suggestion for me that I loved: grab all the raw data and operate on it in memory using H2 (for the JVM) or Pandas / NumPy (Python). This at least works with every version and won't require waiting till Cassandra 5/6/7 is out. Perform any rollups you might want and cache them somewhere, perhaps back into a TTL'ed C* partition.

Jon

On Tue, Jun 6, 2017 at 11:39 AM DuyHai Doan <doanduy...@gmail.com> wrote:

> The problem is not that it's not feasible from the Cassandra side, it is.
>
> The problem is that when doing an arbitrary ORDER BY, Cassandra needs to
> resort to in-memory sorting of a potentially huge amount of data --> more
> pressure on the heap --> impact on cluster stability.
>
> Whereas delegating this kind of job to Spark, which has appropriate data
> structures to lower heap pressure (DataFrame, Project Tungsten), is a
> better idea.
>
> "but in the Top N use case, far more data has to be transferred to the
> client when the client has to do the sorting"
>
> --> This is not true if you co-locate your Spark workers with the
> Cassandra nodes. In that case, Spark reads out of Cassandra are always
> node-local.
>
> On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) <rfis...@brocade.com>
> wrote:
>
>> Hi DuyHai,
>>
>> this is in response to the other points in your response.
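(Jon's Pandas suggestion might look like this in practice: pull the raw rows for the window, aggregate client-side, then order by the aggregate and limit, which is exactly what CQL can't express today. Column names and the tiny inline dataset are hypothetical; the write-back to a TTL'ed partition is only indicated in a comment:)

```python
import pandas as pd

# Raw rows as they might come back from fetching the whole time window;
# the column names here are made up for illustration.
raw = pd.DataFrame({
    "device_id": ["a", "b", "a", "c", "b", "c"],
    "value":     [10.0, 40.0, 30.0, 5.0, 60.0, 15.0],
})

# Roll up: average per device, then ORDER BY the aggregate and LIMIT.
top_n = (raw.groupby("device_id")["value"]
            .mean()
            .sort_values(ascending=False)
            .head(2))
print(top_n.to_dict())  # {'b': 50.0, 'a': 20.0}

# The rollup could then be cached, e.g. written back into a Cassandra
# partition with INSERT ... USING TTL (driver code omitted here).
```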
>> My application is a real-time application. It monitors devices in the
>> network and displays the top N devices for various parameters, averaged
>> over a time period. A query may involve anywhere from 10 to 50k devices,
>> and anywhere from 5 to 2000 intervals. We expect a query to take less
>> than 2 seconds.
>>
>> My impression was that Spark is aimed at larger-scale analytics.
>>
>> I am OK with the limitation on “group by”. I am intending to use async
>> queries and token-aware load balancing to partition the query and
>> execute it in parallel on each node.
>>
>> Thanks…
>>
>> Roger
>>
>> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
>> *Sent:* Tuesday, June 06, 2017 12:31 AM
>> *To:* Roger Fischer (CW) <rfis...@brocade.com>
>> *Cc:* user@cassandra.apache.org
>> *Subject:* Re: Order by for aggregated values
>>
>> First, GROUP BY is only allowed on partition keys and clustering
>> columns, not on arbitrary columns. The internal implementation of group
>> by tries to fetch data in clustering order to avoid having to "re-sort"
>> it in memory, which would be very expensive.
>>
>> Second, group by works best when restricted to a single partition;
>> otherwise it will force Cassandra to do a range scan, with poor
>> performance.
>>
>> For all of those reasons I don't expect an "order by" on aggregated
>> values to be available anytime soon.
>>
>> Furthermore, Cassandra is optimised for real-time transactional
>> scenarios; group by / order by / limit is typically a classical
>> analytics scenario. I would recommend using an appropriate tool like
>> Spark for that.
>>
>> On 6 June 2017 at 04:00, "Roger Fischer (CW)" <rfis...@brocade.com>
>> wrote:
>>
>> Hello,
>>
>> is there any intent to support “order by” and “limit” on aggregated
>> values?
>>
>> For time series data, top N queries are quite common.
>> Group-by was the first step towards supporting such queries, but
>> ordering by value and limiting the results are also required.
>>
>> Thanks…
>>
>> Roger
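(Roger's plan of fanning out token-aware async queries implies a client-side merge step: per-node partial (sum, count) pairs have to be combined before averaging and taking the top N, since averages themselves are not mergeable. A minimal, driver-independent sketch of that merge; all names are hypothetical:)

```python
import heapq
from collections import defaultdict

def merge_partials(partials, n):
    """Combine per-node partial aggregates (device_id -> (sum, count))
    and return the top-n devices by average. The division happens only
    after merging, because sums and counts merge but averages don't."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for device, (s, c) in partial.items():
            total[device][0] += s
            total[device][1] += c
    averages = {d: s / c for d, (s, c) in total.items()}
    return heapq.nlargest(n, averages.items(), key=lambda kv: kv[1])

# Two nodes report partials (replicas may cover overlapping devices):
node1 = {"a": (10.0, 1), "b": (40.0, 1)}
node2 = {"a": (30.0, 1), "c": (20.0, 2)}
print(merge_partials([node1, node2], 2))  # [('b', 40.0), ('a', 20.0)]
```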