What I consider bad design (just my opinion, no offense intended) is that there is no way to send batches directly to the data nodes without going through a coordinator. I would choose that option if it existed.

[]s

2014-06-20 16:05 GMT-03:00 DuyHai Doan <doanduy...@gmail.com>:

> Well it's kind of a trade-off.
>
> Either you send data directly to the primary replica nodes to take
> advantage of data-locality using token-aware strategy and the price to pay
> is a high number of opened connections from client side.
>
> Or you just batch data to a random node playing the coordinator role to
> dispatch requests to the right nodes. The price to pay is then spike load
> on 1 node (the coordinator) and intra-cluster bandwidth usage.
>
> The choice is yours, it has nothing to do with good or bad design.
>
>
> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <
> marc...@s1mbi0se.com.br> wrote:
>
>> I am using python + CQL Driver.
>> I wonder how they do...
>> These things seem of little importance, but they are fundamental to get
>> good performance in Cassandra...
>> I wish there was a simpler way to query in batches. Opening a large
>> amount of connections and sending 1 message at a time seems bad to me, as
>> sometimes you want to work with small rows.
>> It's no surprise Cassandra performs better when we use average row sizes.
>> But honestly I disagree with this part of Cassandra/Driver's design.
>> []s
>>
>>
>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>
>> That depends on the connection pooling implementation in your driver.
>>> Astyanax will keep N connections open to each node (configurable) and route
>>> each query in a separate message over an existing connection, waiting until
>>> one becomes available if all are in use.
>>>
>>>
>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
>>> marc...@s1mbi0se.com.br> wrote:
>>>
>>>> A question, not sure if you guys know the answer:
>>>> Suppose I async query 1000 rows using token aware and suppose I have 10
>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>> How does async work in this case? Would it send each row query to each
>>>> node in a different connection?
>>>> Different message?
>>>> I guess if there was a way to use batch with async, once you commit the
>>>> batch for the 1000 queries, it would create 1 connection to each host and
>>>> query 100 rows in a single message to each host.
>>>> This would decrease resource usage, am I wrong?
>>>>
>>>> []s
>>>>
>>>>
>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>>
>>>> I've found that if you have any amount of latency between your client
>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>> want to send them together to one node unless execution time is of no
>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>> complete all the queries, because you'll need fewer client -> node network
>>>>> round trips.
>>>>>
>>>>> With large numbers of queries you will still want to make sure you
>>>>> split them into manageable batches before sending them, to control memory
>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>> keys in scenarios like this.
>>>>>
>>>>>
>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>>>> michael.la...@nytimes.com> wrote:
>>>>>
>>>>>> However my extensive benchmarking this week of the python driver from
>>>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>>>
>>>>>> This is on 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>
>>>>>> Also why do the work the coordinator will do for you: send all the
>>>>>> queries, wait for everything to come back in whatever order, and sort the
>>>>>> result.
>>>>>>
>>>>>> I would rather keep my app code simple.
>>>>>>
>>>>>> But the real point is that you should benchmark in your own
>>>>>> environment.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>>> It was good advice, thanks a lot Jonathan.
>>>>>>> []s
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>
>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>>>>>> async.
>>>>>>>>
>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>> pool for you. It can also multiplex queries over a single
>>>>>>>> connection.
>>>>>>>>
>>>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>> > It might make sense then to use select = + async + token aware, I will try
>>>>>>>> > to change my code.
>>>>>>>> >
>>>>>>>> > But would it be a "recommended solution" for these cases? Any other options?
>>>>>>>> >
>>>>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>>>>> > random keys in a huge cluster. After all, the amount of connections to
>>>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>> >
>>>>>>>> > []s
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>> >
>>>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>>>> >>
>>>>>>>> >> Realistically you're going to have a connection open to every server
>>>>>>>> >> anyways. It's the difference between you querying for the data
>>>>>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>>>>>> >> the node with the data.
>>>>>>>> >>
>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>>>>>> >> > The justification in the docs is that I could query many nodes, but I
>>>>>>>> >> > would still do it.
>>>>>>>> >> >
>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>> >> >
>>>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>>>>>> >> >
>>>>>>>> >> > # One async query per name, with one %s placeholder per value in the IN list.
>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>>>>>> >> >     args = [name] + values
>>>>>>>> >> >     query_msg = query % tuple(args)
>>>>>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>>>>>> >> >
>>>>>>>> >> > # Wait for every query and collect the entity ids.
>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>> >> >     try:
>>>>>>>> >> >         rows = future.result(timeout=100000)
>>>>>>>> >> >         for row in rows:
>>>>>>>> >> >             entity_ids.add(row.entity_id)
>>>>>>>> >> >     except Exception:
>>>>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>>> >> >         raise
>>>>>>>> >> >
>>>>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each value of the
>>>>>>>> >> > "values" array above.
>>>>>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>>>> >> >
>>>>>>>> >> > []s
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>> >> >
>>>>>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>>>>>> >> >> straightforward w/ the java or python drivers.
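
A minimal sketch of the "one async query per value" approach suggested above, written against the entity_lookup table from this thread. It assumes `session` is a DataStax Python driver session (or anything exposing the same `execute_async(query, parameters)` call) whose rows allow attribute access; the function name `lookup_entity_ids` is hypothetical, and token-aware routing is a driver configuration choice that is not shown here.

```python
def lookup_entity_ids(session, name, values):
    """Issue one async SELECT per value, then collect all entity ids.

    `session` is assumed to behave like a DataStax Python driver session:
    execute_async() returns a future whose result() yields the rows.
    """
    query = "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"

    # Send every query before waiting on any of them, so that all of them
    # are in flight at the same time.
    futures = [session.execute_async(query, (name, value)) for value in values]

    entity_ids = set()
    for future in futures:
        # result() blocks until this query finishes and re-raises its error.
        for row in future.result():
            entity_ids.add(row.entity_id)
    return entity_ids
```

Each query here names a single partition key (name, value), which is the shape a token-aware driver can route directly to a replica. Whether that beats a single IN query is something to benchmark in your own environment, as Michael points out.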
>>>>>>>> >> >>
>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>>> >> >> >
>>>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>> >> >> >
>>>>>>>> >> >> > Among them is
>>>>>>>> >> >> >
>>>>>>>> >> >> > SELECT ... IN or index lookups
>>>>>>>> >> >> >
>>>>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>>>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>>>>>> >> >> > and When not to use an index in Indexing in
>>>>>>>> >> >> >
>>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>>> >> >> >
>>>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>>>> >> >> >
>>>>>>>> >> >> > When not to use IN
>>>>>>>> >> >> >
>>>>>>>> >> >> > "The recommendations about when not to use an index apply to using IN in
>>>>>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>>>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>>>>>> >> >> > nodes must be queried. For example, in a single, local data center cluster
>>>>>>>> >> >> > having 30 nodes, a replication factor of 3, and a consistency level of
>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the query
>>>>>>>> >> >> > uses the IN condition, the number of nodes being queried are most likely
>>>>>>>> >> >> > even higher, up to 20 nodes depending on where the keys fall in the token
>>>>>>>> >> >> > range."
>>>>>>>> >> >> >
>>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>>> >> >> >
>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>>> >> >> > USE Identification1;
>>>>>>>> >> >> >
>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>> >> >> >   name varchar,
>>>>>>>> >> >> >   value varchar,
>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>> >> >> >
>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>> >> >> >
>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>>>> >> >> >
>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>> >> >> >
>>>>>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>>>>>> >> >> > like that? I have several values I would like to search in Cassandra and
>>>>>>>> >> >> > they might not be in the same partition, as above.
>>>>>>>> >> >> >
>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>> >> >> >
>>>>>>>> >> >> > Best regards,
>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>> >> >>
>>>>>>>> >> >> --
>>>>>>>> >> >> Jon Haddad
>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>> >> >> skype: rustyrazorblade
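
One way to read Jeremy's advice about splitting a large set of lookups into batches of 100 keys is sketched below. It only shows the client-side splitting and the query text. The helper names `chunked` and `build_in_query` are hypothetical, and the query string follows the `%s` placeholder style already used in this thread.

```python
def chunked(keys, size=100):
    """Yield successive lists holding at most `size` keys each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def build_in_query(table, n_values):
    """Build a SELECT ... IN query with one %s placeholder per value."""
    placeholders = ",".join(["%s"] * n_values)
    return ("SELECT entity_id FROM " + table +
            " WHERE name=%s AND value IN (" + placeholders + ")")


# Example: 250 values become three batches of 100, 100 and 50 keys.
values = ["v%d" % i for i in range(250)]
for batch in chunked(values):
    query = build_in_query("entity_lookup", len(batch))
    args = ["some_name"] + batch  # parameters for this batch
```

Each (query, args) pair could then be passed to the driver's execute_async, which bounds how much work a single coordinator handles per request.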