Nothing in Cassandra prevents that; it's just a matter of how intelligent the driver API is. Submit a feature request to the Astyanax or DataStax driver projects.
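In the meantime, the pattern discussed further down the thread (one single-partition query per key, fired asynchronously, in chunks of about 100) can be approximated on the client side. Below is a minimal sketch, not a tested production implementation. It assumes a `session` object that behaves like the DataStax Python driver's `Session` (`prepare()` and `execute_async()`), and it reuses the `entity_lookup` table from the original question. The helper names `chunked` and `lookup_entity_ids` are made up for illustration, and the chunk size of 100 is simply the number Jeremy mentioned, not a tuned value.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def lookup_entity_ids(session, name, values, chunk_size=100):
    """Collect entity_id values for (name, value) keys, one async query per key.

    Assumption: `session` exposes prepare() and execute_async() like the
    DataStax Python driver's Session. chunk_size=100 is illustrative only.
    """
    # Prepared statements use '?' placeholders in the Python driver.
    prepared = session.prepare(
        "SELECT entity_id FROM entity_lookup WHERE name = ? AND value = ?"
    )
    entity_ids = set()
    for chunk in chunked(values, chunk_size):
        # Send the whole chunk before waiting, so the requests overlap.
        futures = [
            session.execute_async(prepared, (name, value)) for value in chunk
        ]
        for future in futures:
            # result() blocks until this particular response has arrived.
            for row in future.result():
                entity_ids.add(row.entity_id)
    return entity_ids
```

With a real cluster you would pass in a session created from a `Cluster` configured with a token-aware load balancing policy. As far as I know, the Python driver only derives the routing key automatically for prepared statements, which is why the sketch prepares the query instead of using `%s` string parameters. As Michael says below, benchmark it in your own environment before assuming it is faster than `IN`.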

On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:

> The bad design part (just my opinion, no intention to offend) is not allowing
> the possibility of sending batches directly to the data nodes, without
> using a coordinator.
> I would choose that option.
> []s
>
>
> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <doanduy...@gmail.com>:
>
>> Well it's kind of a trade-off.
>>
>> Either you send data directly to the primary replica nodes to take
>> advantage of data locality using a token-aware strategy, and the price to pay
>> is a high number of open connections on the client side.
>>
>> Or you just batch data to a random node playing the coordinator role to
>> dispatch requests to the right nodes. The price to pay is then a load spike
>> on 1 node (the coordinator) and intra-cluster bandwidth usage.
>>
>> The choice is yours, it has nothing to do with good or bad design.
>>
>>
>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>
>>> I am using Python + the CQL driver.
>>> I wonder how they do...
>>> These things seem of little importance, but they are fundamental to getting
>>> good performance in Cassandra...
>>> I wish there was a simpler way to query in batches. Opening a large
>>> number of connections and sending 1 message at a time seems bad to me, as
>>> sometimes you want to work with small rows.
>>> It's no surprise Cassandra performs better when we use average row
>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design.
>>> []s
>>>
>>>
>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>
>>>> That depends on the connection pooling implementation in your driver.
>>>> Astyanax will keep N connections open to each node (configurable) and route
>>>> each query in a separate message over an existing connection, waiting until
>>>> one becomes available if all are in use.
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> A question, not sure if you guys know the answer:
>>>>> Suppose I async query 1000 rows using token aware and suppose I have 10
>>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>>> How does async work in this case? Would it send each row query to each
>>>>> node in a different connection? Different message?
>>>>> I guess if there was a way to use batch with async, once you commit
>>>>> the batch for the 1000 queries, it would create 1 connection to each host
>>>>> and query 100 rows in a single message to each host.
>>>>> This would decrease resource usage, am I wrong?
>>>>>
>>>>> []s
>>>>>
>>>>>
>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>>>
>>>>>> I've found that if you have any amount of latency between your client
>>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>>> want to send them together to one node unless execution time is of no
>>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>>> complete all the queries, because you'll need fewer client -> node network
>>>>>> round trips.
>>>>>>
>>>>>> With large numbers of queries you will still want to make sure you
>>>>>> split them into manageable batches before sending them, to control memory
>>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>>> keys in scenarios like this.
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>>>>>>
>>>>>>> However my extensive benchmarking this week of the python driver
>>>>>>> from master shows a performance *decrease* when using 'token_aware'.
>>>>>>>
>>>>>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>>
>>>>>>> Also why do the work the coordinator will do for you: send all the
>>>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>>>> result?
>>>>>>>
>>>>>>> I would rather keep my app code simple.
>>>>>>>
>>>>>>> But the real point is that you should benchmark in your own
>>>>>>> environment.
>>>>>>>
>>>>>>> ml
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Yes, I am using the CQL DataStax drivers.
>>>>>>>> It was good advice, thanks a lot Jonathan.
>>>>>>>> []s
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>
>>>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>>>>>>> async.
>>>>>>>>>
>>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>>> pool for you. It can also multiplex queries over a single
>>>>>>>>> connection.
>>>>>>>>>
>>>>>>>>> I am assuming you're using one of the DataStax drivers for CQL,
>>>>>>>>> btw.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>>> > It might make sense then to use select = + async + token aware, I will try
>>>>>>>>> > to change my code.
>>>>>>>>> >
>>>>>>>>> > But would it be a "recommended solution" for these cases? Any other options?
>>>>>>>>> >
>>>>>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>>>>>> > random keys in a huge cluster. After all, the number of connections to
>>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>>> >
>>>>>>>>> > []s
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>> >
>>>>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>>> >>
>>>>>>>>> >> Realistically you're going to have a connection open to every server
>>>>>>>>> >> anyways. It's the difference between you querying for the data
>>>>>>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>>>>>>> >> the node with the data.
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>>>>>>> >> > The justification in the docs is I could query many nodes, but I would
>>>>>>>>> >> > still do it.
>>>>>>>>> >> >
>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>>> >> >
>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>>>>>>> >> >
>>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>>> >> >     # one IN query per name, with one %s placeholder per value
>>>>>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>>>>>>> >> >     args = [name] + values
>>>>>>>>> >> >     query_msg = query % tuple(args)
>>>>>>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>>>>>>> >> >
>>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>>> >> >     try:
>>>>>>>>> >> >         rows = future.result(timeout=100000)
>>>>>>>>> >> >         for row in rows:
>>>>>>>>> >> >             entity_ids.add(row.entity_id)
>>>>>>>>> >> >     except Exception:
>>>>>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>>>> >> >         raise
>>>>>>>>> >> >
>>>>>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of the
>>>>>>>>> >> > "values" array above.
>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>>>> >> >
>>>>>>>>> >> > []s
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>> >> >
>>>>>>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Among them is
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > "SELECT ... IN or index lookups
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT and
>>>>>>>>> >> >> > When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > "When not to use IN
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > The recommendations about when not to use an index apply to using IN in the
>>>>>>>>> >> >> > WHERE clause.
>>>>>>>>> >> >> > Under most conditions, using IN in the WHERE clause is not
>>>>>>>>> >> >> > recommended. Using IN can degrade performance because usually many nodes
>>>>>>>>> >> >> > must be queried. For example, in a single, local data center cluster having
>>>>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
>>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried are most likely
>>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in the token
>>>>>>>>> >> >> > range."
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };
>>>>>>>>> >> >> > USE Identification1;
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>> >> >> >   name varchar,
>>>>>>>>> >> >> >   value varchar,
>>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>>>>>>> >> >> > like that? I have several values I would like to search in Cassandra and
>>>>>>>>> >> >> > they might not be in the same partition, as above.
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Best regards,
>>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>>> >> >>
>>>>>>>>> >> >> --
>>>>>>>>> >> >> Jon Haddad
>>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>>> >> >> skype: rustyrazorblade