Well, it's a trade-off. Either you send data directly to the primary replica nodes with a token-aware strategy to take advantage of data locality, and the price to pay is a high number of open connections on the client side.
Or you just batch the data to a random node, which plays the coordinator role and dispatches requests to the right nodes. The price to pay is then a load spike on one node (the coordinator) and intra-cluster bandwidth usage.

The choice is yours; it has nothing to do with good or bad design.

On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:

> I am using python + CQL Driver.
> I wonder how they do...
> These things seem of little importance, but they are fundamental to get
> good performance in Cassandra...
> I wish there was a simpler way to query in batches. Opening a large number
> of connections and sending 1 message at a time seems bad to me, as
> sometimes you want to work with small rows.
> It's no surprise Cassandra performs better when we use average row sizes.
> But honestly I disagree with this part of Cassandra/Driver's design.
> []s
>
>
> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>
>> That depends on the connection pooling implementation in your driver.
>> Astyanax will keep N connections open to each node (configurable) and route
>> each query in a separate message over an existing connection, waiting until
>> one becomes available if all are in use.
>>
>>
>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle
>> <marc...@s1mbi0se.com.br> wrote:
>>
>>> A question, not sure if you guys know the answer:
>>> Suppose I async query 1000 rows using token aware and suppose I have 10
>>> nodes. Suppose also each node would receive 100 row queries.
>>> How does async work in this case? Would it send each row query to each
>>> node in a different connection? Different message?
>>> I guess if there was a way to use batch with async, once you commit the
>>> batch for the 1000 queries, it would create 1 connection to each host and
>>> query 100 rows in a single message to each host.
>>> This would decrease resource usage, am I wrong?
>>>
>>> []s
>>>
>>>
>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>
>>>> I've found that if you have any amount of latency between your client
>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>> want to send them together to one node unless execution time is of no
>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>> complete all the queries, because you'll need fewer client -> node network
>>>> round trips.
>>>>
>>>> With large numbers of queries you will still want to make sure you
>>>> split them into manageable batches before sending them, to control memory
>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>> keys in scenarios like this.
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael
>>>> <michael.la...@nytimes.com> wrote:
>>>>
>>>>> However my extensive benchmarking this week of the python driver from
>>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>>
>>>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>
>>>>> Also, why do the work the coordinator will do for you: send all the
>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>> result?
>>>>>
>>>>> I would rather keep my app code simple.
>>>>>
>>>>> But the real point is that you should benchmark in your own
>>>>> environment.
>>>>>
>>>>> ml
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle
>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>> It was good advice, thanks a lot Jonathan.
>>>>>> []s
>>>>>>
>>>>>>
>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>
>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>>>>> async.
>>>>>>>
>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>> pool for you. It can also multiplex queries over a single
>>>>>>> connection.
>>>>>>>
>>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>> > This is interesting, I didn't know that!
>>>>>>> > It might make sense then to use select = + async + token aware, I will try
>>>>>>> > to change my code.
>>>>>>> >
>>>>>>> > But would it be a "recommended solution" for these cases? Any other options?
>>>>>>> >
>>>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>>>> > random keys in a huge cluster. After all, the number of connections to
>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>> >
>>>>>>> > []s
>>>>>>> >
>>>>>>> >
>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>> >
>>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>> >>
>>>>>>> >> Realistically you're going to have a connection open to every server
>>>>>>> >> anyways. It's the difference between you querying for the data
>>>>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>>>>> >> the node with the data.
>>>>>>> >>
>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>>>>> >> > The justification in the docs is I could query many nodes, but I would
>>>>>>> >> > still do it.
>>>>>>> >> >
>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>> >> >
>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>>>>> >> >
>>>>>>> >> > for name, values in identifiers.items():
>>>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>>>>> >> >     args = [name] + values
>>>>>>> >> >     query_msg = query % tuple(args)
>>>>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>>>>> >> >
>>>>>>> >> > for query_msg, future in futures:
>>>>>>> >> >     try:
>>>>>>> >> >         rows = future.result(timeout=100000)
>>>>>>> >> >         for row in rows:
>>>>>>> >> >             entity_ids.add(row.entity_id)
>>>>>>> >> >     except:
>>>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>> >> >         raise
>>>>>>> >> >
>>>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of the
>>>>>>> >> > "values" array above.
>>>>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>> >> >
>>>>>>> >> > []s
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>> >> >
>>>>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>> >> >>
>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>>>>>> >> >> >
>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>> >> >> >
>>>>>>> >> >> > Among them is
>>>>>>> >> >> >
>>>>>>> >> >> > SELECT ... IN or index lookups
>>>>>>> >> >> >
>>>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>>>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>>>>> >> >> >
>>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>>> >> >> >
>>>>>>> >> >> > When not to use IN
>>>>>>> >> >> >
>>>>>>> >> >> > "The recommendations about when not to use an index apply to using IN in
>>>>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>>>>> >> >> > nodes must be queried. For example, in a single, local data center cluster
>>>>>>> >> >> > having 30 nodes, a replication factor of 3, and a consistency level of
>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
>>>>>>> >> >> > uses the IN condition, the number of nodes being queried are most likely
>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in the token
>>>>>>> >> >> > range."
>>>>>>> >> >> >
>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>> >> >> >
>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };
>>>>>>> >> >> > USE Identification1;
>>>>>>> >> >> >
>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>> >> >> >   name varchar,
>>>>>>> >> >> >   value varchar,
>>>>>>> >> >> >   entity_id uuid,
>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>> >> >> >
>>>>>>> >> >> > And I use the following select to query it:
>>>>>>> >> >> >
>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>>> >> >> >
>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>> >> >> >
>>>>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>>>>> >> >> > like that? I have several values I would like to search in Cassandra and
>>>>>>> >> >> > they might not be in the same partition, as above.
>>>>>>> >> >> >
>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>> >> >> >
>>>>>>> >> >> > Best regards,
>>>>>>> >> >> > Marcelo Valle.
>>>>>>> >> >> >
>>>>>>> >> >>
>>>>>>> >> >> --
>>>>>>> >> >> Jon Haddad
>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>> >> >> skype: rustyrazorblade
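
For what it's worth, here is a minimal sketch of the "select = + async" approach discussed in this thread, combined with the advice to split large key sets into manageable batches of 100. It is only an illustration under assumptions: the helper names (chunked, lookup_entity_ids) and the WINDOW constant are hypothetical, and `session` is assumed to behave like the python driver's Session, i.e. execute_async(query, parameters) returns a future whose result() yields rows with an entity_id attribute.

```python
from itertools import islice

# Hypothetical names: chunked(), lookup_entity_ids() and WINDOW are
# illustrations for this thread, not part of the driver.
SELECT_ENTITY_LOOKUP = (
    "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"
)
WINDOW = 100  # max queries in flight, following the "batches of 100" advice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def lookup_entity_ids(session, identifiers, window=WINDOW):
    """Run one single-partition query per (name, value), `window` at a time.

    `session` is assumed to expose execute_async(query, parameters),
    returning a future with a blocking .result() method.
    """
    entity_ids = set()
    pairs = [(name, value)
             for name, values in identifiers.items()
             for value in values]
    for chunk in chunked(pairs, window):
        # Fire the whole window without waiting, then collect the results.
        futures = [session.execute_async(SELECT_ENTITY_LOOKUP, pair)
                   for pair in chunk]
        for future in futures:
            for row in future.result():
                entity_ids.add(row.entity_id)
    return entity_ids
```

Each query here hits exactly one partition, so a token-aware driver can route it straight to a replica, while the window bounds how many requests are outstanding at once. Whether this beats a single IN query sent through one coordinator depends on your cluster and latency, so benchmark both in your own environment.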