Re: Best way to do a multi_get using CQL

Jeremy Jongsma Fri, 20 Jun 2014 12:31:37 -0700

There is nothing preventing that in Cassandra, it's just a matter of how
intelligent the driver API is. Submit a feature request to the Astyanax or
DataStax driver projects.
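
Until a driver does this for you, the client-side pattern discussed further down the thread (one equality query per key, sent asynchronously, with roughly 100 keys in flight at a time) can be sketched as below. This is only a minimal sketch under stated assumptions: `multi_get` and `chunked` are hypothetical helper names, and `session` is assumed to behave like the DataStax Python driver's Session, whose execute_async() returns a future with a result() method. The table and column names are taken from the schema quoted below.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def multi_get(session, name, values, chunk_size=100):
    """Collect entity_ids for many (name, value) keys, one async query per key.

    Assumption: `session.execute_async(query, parameters)` returns a future
    whose result() yields rows exposing an `entity_id` attribute.
    """
    query = "SELECT entity_id FROM entity_lookup WHERE name=%s AND value=%s"
    entity_ids = set()
    for chunk in chunked(values, chunk_size):
        # Send every query in the chunk before waiting on any of them,
        # so at most `chunk_size` requests are in flight at once.
        futures = [session.execute_async(query, (name, value)) for value in chunk]
        for future in futures:
            for row in future.result():
                entity_ids.add(row.entity_id)
    return entity_ids
```

Whether this beats a single IN query or a batch sent to one coordinator depends on your cluster and driver settings, so benchmark it in your own environment.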



On Fri, Jun 20, 2014 at 2:27 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> The bad design part (just my opinion, no intention to offend) is not allowing
> the possibility of sending batches directly to the data nodes, without
> using a coordinator.
> I would choose that option.
> []s
>
>
> 2014-06-20 16:05 GMT-03:00 DuyHai Doan <doanduy...@gmail.com>:
>
>> Well it's kind of a trade-off.
>>
>>  Either you send data directly to the primary replica nodes to take
>> advantage of data-locality using token-aware strategy and the price to pay
>> is a high number of opened connections from client side.
>>
>> Or you just batch data to a random node playing the coordinator role to
>> dispatch requests to the right nodes. The price to pay is then spike load
>> on 1 node (the coordinator) and intra-cluster bandwidth usage.
>>
>>  The choice is yours, it has nothing to do with good or bad design.
>>
>>
>> On Fri, Jun 20, 2014 at 8:55 PM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> I am using python + CQL Driver.
>>> I wonder how they do...
>>> These things seem of little importance, but they are fundamental to getting
>>> good performance in Cassandra...
>>> I wish there was a simpler way to query in batches. Opening a large
>>> amount of connections and sending 1 message at a time seems bad to me, as
>>> sometimes you want to work with small rows.
>>> It's no surprise Cassandra performs better when we use average row
>>> sizes. But honestly I disagree with this part of Cassandra/Driver's design.
>>> []s
>>>
>>>
>>> 2014-06-20 14:37 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>
>>>> That depends on the connection pooling implementation in your driver.
>>>> Astyanax will keep N connections open to each node (configurable) and route
>>>> each query in a separate message over an existing connection, waiting until
>>>> one becomes available if all are in use.
>>>>
>>>>
>>>> On Fri, Jun 20, 2014 at 12:32 PM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> A question, not sure if you guys know the answer:
>>>>> Suppose I async query 1000 rows using token aware and suppose I have 10
>>>>> nodes. Suppose also each node would receive 100 row queries each.
>>>>> How does async work in this case? Would it send each row query to each
>>>>> node in a different connection? Different message?
>>>>> I guess if there was a way to use batch with async, once you commit
>>>>> the batch for the 1000 queries, it would create 1 connection to each host
>>>>> and query 100 rows in a single message to each host.
>>>>> This would decrease resource usage, am I wrong?
>>>>>
>>>>> []s
>>>>>
>>>>>
>>>>> 2014-06-20 12:12 GMT-03:00 Jeremy Jongsma <jer...@barchart.com>:
>>>>>
>>>>>> I've found that if you have any amount of latency between your client
>>>>>> and nodes, and you are executing a large batch of queries, you'll usually
>>>>>> want to send them together to one node unless execution time is of no
>>>>>> concern. The tradeoff is resource usage on the connected node vs. time to
>>>>>> complete all the queries, because you'll need fewer client -> node
>>>>>> network round trips.
>>>>>>
>>>>>> With large numbers of queries you will still want to make sure you
>>>>>> split them into manageable batches before sending them, to control memory
>>>>>> usage on the executing node. I've been limiting queries to batches of 100
>>>>>> keys in scenarios like this.
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 20, 2014 at 5:59 AM, Laing, Michael <
>>>>>> michael.la...@nytimes.com> wrote:
>>>>>>
>>>>>>> However my extensive benchmarking this week of the python driver
>>>>>>> from master shows a performance *decrease* when using 'token_aware'.
>>>>>>>
>>>>>>> This is on a 12-node, 2-datacenter, RF-3 cluster in AWS.
>>>>>>>
>>>>>>> Also, why do the work the coordinator will do for you: send all the
>>>>>>> queries, wait for everything to come back in whatever order, and sort
>>>>>>> the result?
>>>>>>>
>>>>>>> I would rather keep my app code simple.
>>>>>>>
>>>>>>> But the real point is that you should benchmark in your own
>>>>>>> environment.
>>>>>>>
>>>>>>> ml
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle <
>>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Yes, I am using the CQL datastax drivers.
>>>>>>>> It was good advice, thanks a lot Jonathan.
>>>>>>>> []s
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-06-20 0:28 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>
>>>>>>>>> The only case in which it might be better to use an IN clause is if
>>>>>>>>> the entire query can be satisfied from that machine.  Otherwise, go
>>>>>>>>> async.
>>>>>>>>>
>>>>>>>>> The native driver reuses connections and intelligently manages the
>>>>>>>>> pool for you.  It can also multiplex queries over a single
>>>>>>>>> connection.
>>>>>>>>>
>>>>>>>>> I am assuming you're using one of the datastax drivers for CQL,
>>>>>>>>> btw.
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
>>>>>>>>> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> > This is interesting, I didn't know that!
>>>>>>>>> > It might make sense then to use select = + async + token aware,
>>>>>>>>> > I will try to change my code.
>>>>>>>>> >
>>>>>>>>> > But would it be a "recommended solution" for these cases? Any
>>>>>>>>> > other options?
>>>>>>>>> >
>>>>>>>>> > I still wonder if this is the right use case for Cassandra, to
>>>>>>>>> > look for random keys in a huge cluster. After all, the number of
>>>>>>>>> > connections to Cassandra will still be huge, right... Wouldn't it
>>>>>>>>> > be a problem?
>>>>>>>>> > Or when you use async the driver reuses the connection?
>>>>>>>>> >
>>>>>>>>> > []s
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>> >
>>>>>>>>> >> If you use async and your driver is token aware, it will go to
>>>>>>>>> >> the proper node, rather than requiring the coordinator to do so.
>>>>>>>>> >>
>>>>>>>>> >> Realistically you're going to have a connection open to every
>>>>>>>>> >> server anyways.  It's the difference between you querying for the
>>>>>>>>> >> data directly and using a coordinator as a proxy.  It's faster to
>>>>>>>>> >> just ask the node with the data.
>>>>>>>>> >>
>>>>>>>>> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>>>>>>>>> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> > But wouldn't using async queries be even worse than using
>>>>>>>>> >> > SELECT IN?
>>>>>>>>> >> > The justification in the docs is that I could query many nodes,
>>>>>>>>> >> > but I would still do it.
>>>>>>>>> >> >
>>>>>>>>> >> > Today, I use both async queries AND SELECT IN:
>>>>>>>>> >> >
>>>>>>>>> >> > SELECT_ENTITY_LOOKUP = ("SELECT entity_id FROM " + ENTITY_LOOKUP +
>>>>>>>>> >> >     " WHERE name=%s and value in(%s)")
>>>>>>>>> >> >
>>>>>>>>> >> > for name, values in identifiers.items():
>>>>>>>>> >> >    query = self.SELECT_ENTITY_LOOKUP % ('%s',
>>>>>>>>> >> >       ','.join(['%s']*len(values)))
>>>>>>>>> >> >    args = [name] + values
>>>>>>>>> >> >    query_msg = query % tuple(args)
>>>>>>>>> >> >    futures.append(
>>>>>>>>> >> >       (query_msg, self.session.execute_async(query, args)))
>>>>>>>>> >> >
>>>>>>>>> >> > for query_msg, future in futures:
>>>>>>>>> >> >    try:
>>>>>>>>> >> >       rows = future.result(timeout=100000)
>>>>>>>>> >> >       for row in rows:
>>>>>>>>> >> >         entity_ids.add(row.entity_id)
>>>>>>>>> >> >    except:
>>>>>>>>> >> >       logging.error("Query '%s' returned ERROR " % (query_msg))
>>>>>>>>> >> >       raise
>>>>>>>>> >> >
>>>>>>>>> >> > Using async just with select = would mean instead of 1 async
>>>>>>>>> >> > query (example: in (0, 1, 2)), I would do several, one for each
>>>>>>>>> >> > value of the "values" array above.
>>>>>>>>> >> > In my head, this would mean more connections to Cassandra and
>>>>>>>>> >> > the same amount of work, right? What would be the advantage?
>>>>>>>>> >> >
>>>>>>>>> >> > []s
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>>>>>>> >> >
>>>>>>>>> >> >> Your other option is to fire off async queries.  It's pretty
>>>>>>>>> >> >> straightforward w/ the java or python drivers.
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle
>>>>>>>>> >> >> <marc...@s1mbi0se.com.br> wrote:
>>>>>>>>> >> >> > I was taking a look at Cassandra anti-patterns list:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Among them is
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT ... IN or index lookups
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > "SELECT ... IN and index lookups (formerly secondary indexes)
>>>>>>>>> >> >> > should be avoided except for specific scenarios. See When not
>>>>>>>>> >> >> > to use IN in SELECT and When not to use an index in Indexing in
>>>>>>>>> >> >> > CQL for Cassandra 2.0"
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And looking at the SELECT doc, I saw:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > When not to use IN
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > "The recommendations about when not to use an index apply to
>>>>>>>>> >> >> > using IN in the WHERE clause. Under most conditions, using IN
>>>>>>>>> >> >> > in the WHERE clause is not recommended. Using IN can degrade
>>>>>>>>> >> >> > performance because usually many nodes must be queried. For
>>>>>>>>> >> >> > example, in a single, local data center cluster having 30
>>>>>>>>> >> >> > nodes, a replication factor of 3, and a consistency level of
>>>>>>>>> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if
>>>>>>>>> >> >> > the query uses the IN condition, the number of nodes being
>>>>>>>>> >> >> > queried are most likely even higher, up to 20 nodes depending
>>>>>>>>> >> >> > on where the keys fall in the token range."
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > In my system, I have a column family called "entity_lookup":
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>>>>>>>>> >> >> >   WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
>>>>>>>>> >> >> >   'DC1' : 3 };
>>>>>>>>> >> >> > USE Identification1;
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>> >> >> >   name varchar,
>>>>>>>>> >> >> >   value varchar,
>>>>>>>>> >> >> >   entity_id uuid,
>>>>>>>>> >> >> >   PRIMARY KEY ((name, value), entity_id));
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > And I use the following select to query it:
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s)
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is this an anti-pattern?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > If not using SELECT IN, which other way would you recommend
>>>>>>>>> >> >> > for lookups like that? I have several values I would like to
>>>>>>>>> >> >> > search in Cassandra and they might not be in the same
>>>>>>>>> >> >> > partition, as above.
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Is Cassandra the wrong tool for lookups like that?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > Best regards,
>>>>>>>>> >> >> > Marcelo Valle.
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >> >
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> --
>>>>>>>>> >> >> Jon Haddad
>>>>>>>>> >> >> http://www.rustyrazorblade.com
>>>>>>>>> >> >> skype: rustyrazorblade
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Jon Haddad
>>>>>>>>> >> http://www.rustyrazorblade.com
>>>>>>>>> >> skype: rustyrazorblade
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jon Haddad
>>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>>> skype: rustyrazorblade
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
