I am using Python with the CQL driver.
I wonder how they do...
These things seem minor, but they are fundamental to getting good performance out of Cassandra...
I wish there were a simpler way to query in batches. Opening a large number of connections and sending one message at a time seems bad to me, since sometimes you want to work with small rows.
It's no surprise that Cassandra performs better with average-sized rows. But honestly, I disagree with this part of the Cassandra/driver design.

[]s
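For readers following along: as noted further down the thread, the driver reuses pooled connections and can multiplex queries over one connection, so many async queries need not mean many connections. What the client does control is how many requests are in flight at once. Below is a minimal sketch of that idea, assuming only that `session` exposes `execute_async(query, params)` returning a future with a blocking `.result()`, as the DataStax Python driver's session does. The function name `execute_bounded` and the `max_in_flight` parameter are hypothetical and not part of any driver.

```python
from collections import deque


def execute_bounded(session, query, params_list, max_in_flight=100):
    """Run one query per parameter tuple with a cap on pending requests.

    Assumption: `session.execute_async(query, params)` returns a future
    whose `.result()` blocks until the response arrives.
    Results are returned in the same order as `params_list`.
    """
    pending = deque()
    results = []
    for params in params_list:
        if len(pending) >= max_in_flight:
            # Wait for the oldest request before sending another one.
            results.append(pending.popleft().result())
        pending.append(session.execute_async(query, params))
    # Drain whatever is still outstanding.
    while pending:
        results.append(pending.popleft().result())
    return results
```

I believe the Python driver also ships a helper along these lines, `execute_concurrent_with_args` in `cassandra.concurrent`, but check the documentation for your driver version before relying on it.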
2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:

> That depends on the connection pooling implementation in your driver.
> Astyanax will keep N connections open to each node (configurable) and route
> each query in a separate message over an existing connection, waiting until
> one becomes available if all are in use.
>
> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
> marc...@s1mbi0se.com.br> wrote:
>
>> A question, not sure if you guys know the answer:
>> Suppose I async query 1000 rows using token aware and suppose I have 10
>> nodes. Suppose also that each node would receive 100 row queries.
>> How does async work in this case? Would it send each row query to each
>> node in a different connection? A different message?
>> I guess if there were a way to use batch with async, once you commit the
>> batch for the 1000 queries, it would create 1 connection to each host and
>> query 100 rows in a single message to each host.
>> This would decrease resource usage, am I wrong?
>>
>> []s
>>
>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>
>>> I've found that if you have any amount of latency between your client and
>>> nodes, and you are executing a large batch of queries, you'll usually want
>>> to send them together to one node unless execution time is of no concern.
>>> The tradeoff is resource usage on the connected node vs. time to complete
>>> all the queries, because you'll need fewer client -> node network round
>>> trips.
>>>
>>> With large numbers of queries you will still want to make sure you split
>>> them into manageable batches before sending them, to control memory usage
>>> on the executing node. I've been limiting queries to batches of 100 keys in
>>> scenarios like this.
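The "batches of 100 keys" advice above can be sketched in a few lines. This is only an illustration of the splitting step, not of how the batches are sent. The helper name `chunked` is hypothetical, and the default of 100 simply mirrors the figure quoted in the message.

```python
def chunked(keys, size=100):
    """Yield consecutive slices of `keys`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


# Example: the 1000 keys from the question above, split into batches of 100.
batches = list(chunked(list(range(1000)), 100))
```

Each batch could then be sent and fully awaited before the next one starts, which keeps the number of keys a node is working on at any moment bounded by the batch size.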
>>>
>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>> michael.la...@nytimes.com> wrote:
>>>
>>>> However my extensive benchmarking this week of the python driver from
>>>> master shows a performance *decrease* when using 'token_aware'.
>>>>
>>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>
>>>> Also why do the work the coordinator will do for you: send all the
>>>> queries, wait for everything to come back in whatever order, and sort the
>>>> result?
>>>>
>>>> I would rather keep my app code simple.
>>>>
>>>> But the real point is that you should benchmark in your own environment.
>>>>
>>>> ml
>>>>
>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Yes, I am using the CQL datastax drivers.
>>>>> That was good advice, thanks a lot, Jonathan.
>>>>> []s
>>>>>
>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>
>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>> the entire query can be satisfied from that machine. Otherwise, go
>>>>>> async.
>>>>>>
>>>>>> The native driver reuses connections and intelligently manages the
>>>>>> pool for you. It can also multiplex queries over a single connection.
>>>>>>
>>>>>> I am assuming you're using one of the datastax drivers for CQL, btw.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>> > This is interesting, I didn't know that!
>>>>>> > It might make sense then to use select = + async + token aware, I will
>>>>>> > try to change my code.
>>>>>> >
>>>>>> > But would it be a "recommended solution" for these cases? Any other
>>>>>> > options?
>>>>>> >
>>>>>> > I still wonder if this is the right use case for Cassandra, to look for
>>>>>> > random keys in a huge cluster. After all, the number of connections to
>>>>>> > Cassandra will still be huge, right... Wouldn't it be a problem?
>>>>>> > Or when you use async, does the driver reuse the connection?
>>>>>> >
>>>>>> > []s
>>>>>> >
>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>> >
>>>>>> >> If you use async and your driver is token aware, it will go to the
>>>>>> >> proper node, rather than requiring the coordinator to do so.
>>>>>> >>
>>>>>> >> Realistically you're going to have a connection open to every server
>>>>>> >> anyways. It's the difference between you querying for the data
>>>>>> >> directly and using a coordinator as a proxy. It's faster to just ask
>>>>>> >> the node with the data.
>>>>>> >>
>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>> >> > But wouldn't using async queries be even worse than using SELECT IN?
>>>>>> >> > The justification in the docs is that I could query many nodes, but I
>>>>>> >> > would still do it.
>>>>>> >> >
>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>> >> >
>>>>>> >> > SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)"
>>>>>> >> >
>>>>>> >> > # One async SELECT ... IN per name, with one placeholder per value.
>>>>>> >> > for name, values in identifiers.items():
>>>>>> >> >     query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values)))
>>>>>> >> >     args = [name] + values
>>>>>> >> >     query_msg = query % tuple(args)
>>>>>> >> >     futures.append((query_msg, self.session.execute_async(query, args)))
>>>>>> >> >
>>>>>> >> > # Collect the results, logging the failing query before re-raising.
>>>>>> >> > for query_msg, future in futures:
>>>>>> >> >     try:
>>>>>> >> >         rows = future.result(timeout=100000)
>>>>>> >> >         for row in rows:
>>>>>> >> >             entity_ids.add(row.entity_id)
>>>>>> >> >     except Exception:
>>>>>> >> >         logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>> >> >         raise
>>>>>> >> >
>>>>>> >> > Using async just with select = would mean instead of 1 async query
>>>>>> >> > (example: in (0, 1, 2)), I would do several, one for each
>>>>>> >> > value of the "values" array above.
>>>>>> >> > In my head, this would mean more connections to Cassandra and the same
>>>>>> >> > amount of work, right? What would be the advantage?
>>>>>> >> >
>>>>>> >> > []s
>>>>>> >> >
>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>> >> >
>>>>>> >> >> Your other option is to fire off async queries. It's pretty
>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>> >> >>
>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>> >> >> > I was taking a look at the Cassandra anti-patterns list:
>>>>>> >> >> >
>>>>>> >> >> > http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>> >> >> >
>>>>>> >> >> > Among them is
>>>>>> >> >> >
>>>>>> >> >> > SELECT ... IN or index lookups
>>>>>> >> >> >
>>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes) should be
>>>>>> >> >> > avoided except for specific scenarios. See When not to use IN in SELECT
>>>>>> >> >> > and When not to use an index in Indexing in CQL for Cassandra 2.0"
>>>>>> >> >> >
>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>> >> >> >
>>>>>> >> >> > When not to use IN
>>>>>> >> >> >
>>>>>> >> >> > "The recommendations about when not to use an index apply to using IN in
>>>>>> >> >> > the WHERE clause. Under most conditions, using IN in the WHERE clause is
>>>>>> >> >> > not recommended. Using IN can degrade performance because usually many
>>>>>> >> >> > nodes must be queried.
>>>>>> >> >> > For example, in a single, local data center cluster having
>>>>>> >> >> > 30 nodes, a replication factor of 3, and a consistency level of
>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if the
>>>>>> >> >> > query uses the IN condition, the number of nodes being queried are most
>>>>>> >> >> > likely even higher, up to 20 nodes depending on where the keys fall in
>>>>>> >> >> > the token range."
>>>>>> >> >> >
>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>> >> >> >
>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>> >> >> >   'DC1' : 3 };
>>>>>> >> >> > USE Identification1;
>>>>>> >> >> >
>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>> >> >> >   name varchar,
>>>>>> >> >> >   value varchar,
>>>>>> >> >> >   entity_id uuid,
>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>> >> >> >
>>>>>> >> >> > And I use the following select to query it:
>>>>>> >> >> >
>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>> >> >> >
>>>>>> >> >> > Is this an anti-pattern?
>>>>>> >> >> >
>>>>>> >> >> > If not using SELECT IN, which other way would you recommend for lookups
>>>>>> >> >> > like that? I have several values I would like to search for in Cassandra
>>>>>> >> >> > and they might not be in the same partition, as above.
>>>>>> >> >> >
>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>> >> >> >
>>>>>> >> >> > Best regards,
>>>>>> >> >> > Marcelo Valle.
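The arithmetic behind the documentation excerpt quoted in this message can be worked through as a rough upper bound. This sketch assumes the usual quorum size of floor(RF / 2) + 1, and it counts only the replicas whose answers are needed, ignoring the coordinator and any background read repair. The excerpt does not say how many keys its IN example uses; the value of 10 below is an assumption chosen because it reproduces the quoted figure of 20 nodes. Both function names are hypothetical.

```python
def quorum(replication_factor):
    """Replicas that must answer at QUORUM / LOCAL_QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def max_nodes_queried(num_keys, replication_factor, cluster_size):
    """Upper bound on distinct replicas read when every key is in its own partition.

    Worst case: no two keys share a replica, so each key adds a full quorum,
    capped by the number of nodes in the data center.
    """
    return min(num_keys * quorum(replication_factor), cluster_size)


single_key = max_nodes_queried(1, 3, 30)   # one partition key, RF 3, 30 nodes
ten_keys = max_nodes_queried(10, 3, 30)    # assumed 10 keys in the IN clause
```

With these inputs a single-key read needs two replicas, matching the excerpt, and ten scattered keys can reach twenty nodes. The same nodes are touched whether the keys arrive in one IN query or as separate async queries; what changes is which node does the fan-out.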
>>>>>> >> >> --
>>>>>> >> >> Jon Haddad
>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>> >> >> skype: rustyrazorblade