Thanks guys! Good to know, that my approach is basically right, but I will check that lucene indices by time.
2016-10-04 14:22 GMT+02:00 DuyHai Doan <doanduy...@gmail.com>: > "What scatter/gather? " > > http://www.slideshare.net/doanduyhai/sasi-cassandra-on- > the-full-text-search-ride-voxxed-daybelgrade-2016/23 > > "If you partition your data by user_id then you query only 1 shard to get > sorted by time visitors for a user" > > Exact, but in this case, you're using a 2nd index only for sorting right ? > For SASI it's not even possible. Maybe it can work with Statrio Lucene impl > > On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha <dorian.ho...@gmail.com> > wrote: > >> @DuyHai >> >> What scatter/gather? If you partition your data by user_id then you query >> only 1 shard to get sorted by time visitors for a user. >> >> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan <doanduy...@gmail.com> wrote: >> >>> MV is right now your best choice for this kind of sorting behavior. >>> >>> Secondary index (whatever the impl, SASI or Lucene) has a cost of >>> scatter-gather if your cluster scale out. With MV you're at least >>> guaranteed to hit a single node everytime >>> >>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha <dorian.ho...@gmail.com> >>> wrote: >>> >>>> Can you use the lucene index https://github.com/Stratio/cas >>>> sandra-lucene-index ? >>>> >>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth <benjamin.r...@jaumo.com> >>>> wrote: >>>> >>>>> Hi! >>>>> >>>>> I have a frequently used pattern which seems to be quite costly in CS. >>>>> The pattern is always the same: I have a unique key and a sorting by a >>>>> different field. >>>>> >>>>> To give an example, here a real life example from our model: >>>>> CREATE TABLE visits.visits_in ( >>>>> user_id int, >>>>> user_id_visitor int, >>>>> created timestamp, >>>>> PRIMARY KEY (user_id, user_id_visitor) >>>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC) >>>>> >>>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS >>>>> SELECT user_id, created, user_id_visitor >>>>> FROM visits.visits_in >>>>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND >>>>> user_id_visitor IS NOT NULL >>>>> PRIMARY KEY (user_id, created, user_id_visitor) >>>>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC) >>>>> >>>>> This simply represents people, that visited my profile sorted by date >>>>> desc but only one entry per visitor. >>>>> Other examples with the same pattern could be a whats-app-like inbox >>>>> where the last message of each sender is shown by date desc. There are >>>>> lots >>>>> of examples for that pattern. >>>>> >>>>> E.g. in redis I'd just use a sorted set, where the key could be like >>>>> "visits_${user_id}", set key would be user_id_visitor and score >>>>> the created timestamp. >>>>> In MySQL I'd create the table with PK on user_id + user_id_visitor and >>>>> create an index on user_id + created >>>>> In C* i use an MV. >>>>> >>>>> Is this the most efficient approach? >>>>> I also could have done this without an MV but then the situation in >>>>> our app would be far more complex. >>>>> I know that denormalization is a common pattern in C* and I don't >>>>> hesitate to use it but in this case, it is not as simple as it's not an >>>>> append-only case but updates have to be handled correctly. >>>>> If it is the first visit of a user, it's that simple, just 2 inserts >>>>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st or >>>>> 2nd visit has to be deleted from the denormalized table before. Otherwise >>>>> the visit would not be unique any more. >>>>> Handling this case without an MV requires a lot more effort, I guess >>>>> even more effort than just using an MV. >>>>> 1. You need kind of app-side locking to deal with race conditions >>>>> 2. Read before write is required to determine if an old record has to >>>>> be deleted >>>>> 3. At least CL_QUORUM is required to make sure that read before write >>>>> is always consistent >>>>> 4. Old record has to be deleted on update >>>>> >>>>> I guess, using an MV here is more efficient as there is less roundtrip >>>>> between C* and the app to do all that and the MV does not require strong >>>>> consistency as MV updates are always local and are eventual consistent >>>>> when >>>>> the base table is. So there is also no need for distributed locks. >>>>> >>>>> I ask all this as we now use CS 3.x and have been advised that 3.x is >>>>> still not considered really production ready. >>>>> >>>>> I guess in a perfect world, this wouldn't even require an MV if SASI >>>>> indexes could be created over more than 1 column. E.g. in MySQL this case >>>>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, >>>>> filtering >>>>> by Partition Key (which should to be done anyway) and sorting by a field >>>>> would perfectly do the trick. But from the docs, this is not possible >>>>> right >>>>> now. >>>>> >>>>> Does anyone see a better solution or are all my assumptions correct? >>>>> >>>>> -- >>>>> Benjamin Roth >>>>> Prokurist >>>>> >>>>> Jaumo GmbH · www.jaumo.com >>>>> Wehrstraße 46 · 73035 Göppingen · Germany >>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1 >>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer >>>>> >>>> >>>> >>> >> > -- Benjamin Roth Prokurist Jaumo GmbH · www.jaumo.com Wehrstraße 46 · 73035 Göppingen · Germany Phone +49 7161 304880-6 · Fax +49 7161 304880-1 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer