However, my extensive benchmarking this week of the Python driver from master shows a performance *decrease* when using 'token_aware'.
This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.

Also, why do the work the coordinator will do for you: send all the queries,
wait for everything to come back in whatever order, and sort the result? I
would rather keep my app code simple.

But the real point is that you should benchmark in your own environment.

ml

On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:

> Yes, I am using the CQL datastax drivers.
> It was good advice, thanks a lot Jonathan.
> []s
>
> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>> The only case in which it might be better to use an IN clause is if
>> the entire query can be satisfied from that machine. Otherwise, go
>> async.
>>
>> The native driver reuses connections and intelligently manages the
>> pool for you. It can also multiplex queries over a single connection.
>>
>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>
>> Jon
>>
>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>> <marc...@s1mbi0se.com.br> wrote:
>> > This is interesting, I didn't know that!
>> > It might make sense then to use select = + async + token aware, I will
>> > try to change my code.
>> >
>> > But would it be a "recommended solution" for these cases? Any other
>> > options?
>> >
>> > I still wonder if this is the right use case for Cassandra, to look for
>> > random keys in a huge cluster. After all, the amount of connections to
>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>> > Or when you use async the driver reuses the connection?
>> >
>> > []s
>> >
>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>> >
>> >> If you use async and your driver is token aware, it will go to the
>> >> proper node, rather than requiring the coordinator to do so.
>> >>
>> >> Realistically you're going to have a connection open to every server
>> >> anyways.
>> >> It's the difference between you querying for the data
>> >> directly and using a coordinator as a proxy. It's faster to just ask
>> >> the node with the data.
>> >>
>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>> >> <marc...@s1mbi0se.com.br> wrote:
>> >> > But using async queries wouldn't be even worse than using SELECT IN?
>> >> > The justification in the docs is I could query many nodes, but I
>> >> > would still do it.
>> >> >
>> >> > Today, I use both async queries AND SELECT IN:
>> >> >
>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>> >> >
>> >> > for name, values in identifiers.items():
>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>> >> >     args = [name] + values
>> >> >     query_msg = query % tuple(args)
>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>> >> >
>> >> > for query_msg, future in futures:
>> >> >     try:
>> >> >         rows = future.result(timeout=100000)
>> >> >         for row in rows:
>> >> >             entity_ids.add(row.entity_id)
>> >> >     except:
>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>> >> >         raise
>> >> >
>> >> > Using async just with select = would mean instead of 1 async query
>> >> > (example: in (0, 1, 2)), I would do several, one for each value of
>> >> > "values" array above.
>> >> > In my head, this would mean more connections to Cassandra and the
>> >> > same amount of work, right? What would be the advantage?
>> >> >
>> >> > []s
>> >> >
>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>> >> >
>> >> >> Your other option is to fire off async queries. It's pretty
>> >> >> straightforward w/ the java or python drivers.
>> >> >>
>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>> >> >> <marc...@s1mbi0se.com.br> wrote:
>> >> >> > I was taking a look at Cassandra anti-patterns list:
>> >> >> >
>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>> >> >> >
>> >> >> > Among them is
>> >> >> >
>> >> >> > SELECT ... IN or index lookups
>> >> >> >
>> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) should
>> >> >> > be avoided except for specific scenarios. See When not to use IN in
>> >> >> > SELECT and When not to use an index in Indexing in
>> >> >> > CQL for Cassandra 2.0"
>> >> >> >
>> >> >> > And looking at the SELECT doc, I saw:
>> >> >> >
>> >> >> > When not to use IN
>> >> >> >
>> >> >> > The recommendations about when not to use an index apply to using IN
>> >> >> > in the WHERE clause. Under most conditions, using IN in the WHERE
>> >> >> > clause is not recommended. Using IN can degrade performance because
>> >> >> > usually many nodes must be queried. For example, in a single, local
>> >> >> > data center cluster having 30 nodes, a replication factor of 3, and
>> >> >> > a consistency level of LOCAL_QUORUM, a single key query goes out to
>> >> >> > two nodes, but if the query uses the IN condition, the number of
>> >> >> > nodes being queried are most likely even higher, up to 20 nodes
>> >> >> > depending on where the keys fall in the token range."
>> >> >> >
>> >> >> > In my system, I have a column family called "entity_lookup":
>> >> >> >
>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>> >> >> >   'DC1' : 3 };
>> >> >> > USE Identification1;
>> >> >> >
>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >> >> >   name varchar,
>> >> >> >   value varchar,
>> >> >> >   entity_id uuid,
>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>> >> >> >
>> >> >> > And I use the following select to query it:
>> >> >> >
>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>> >> >> >
>> >> >> > Is this an anti-pattern?
>> >> >> >
>> >> >> > If not using SELECT IN, which other way would you recommend for
>> >> >> > lookups like that? I have several values I would like to search in
>> >> >> > Cassandra and they might not be in the same partition, as above.
>> >> >> >
>> >> >> > Is Cassandra the wrong tool for lookups like that?
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Marcelo Valle.
>> >> >>
>> >> >> --
>> >> >> Jon Haddad
>> >> >> http://www.rustyrazorblade.com
>> >> >> skype: rustyrazorblade
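
For anyone who wants to benchmark the "select = + async" variant from this thread against the IN version, here is a minimal sketch of it. The helper name lookup_entity_ids is hypothetical, and session is only assumed to behave like the Python driver's cassandra.cluster.Session (prepare, execute_async, and a future with result()); the table and column names come from the schema quoted above. It uses a prepared statement because, as far as I know, the driver can only work out a routing key on its own for bound statements, which token-aware routing depends on. Treat it as a starting point under those assumptions, not as the recommended implementation.

```python
def lookup_entity_ids(session, name, values):
    """Fetch entity_ids with one single-partition query per value.

    Hypothetical helper: `session` is assumed to expose prepare() and
    execute_async() like cassandra.cluster.Session does.
    """
    # One equality query per (name, value) partition instead of a single IN.
    select = session.prepare(
        "SELECT entity_id FROM entity_lookup WHERE name = ? AND value = ?"
    )

    # Fire every request first so they are all in flight concurrently.
    futures = [session.execute_async(select, (name, v)) for v in values]

    # Then gather; result() blocks until that query finishes or raises.
    entity_ids = set()
    for future in futures:
        for row in future.result():
            entity_ids.add(row.entity_id)
    return entity_ids
```

Whether requests are routed token-aware is decided by the load balancing policy chosen when the cluster and session are configured, not inside this function, so the same code can be timed both ways in your own environment.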