Re: Cassandra Performance Benchmarking.

Pradeep Kumar Mantha Thu, 17 Jan 2013 16:48:32 -0800

Thanks Tyler.

I just moved the pool and cf variables, which hold the connection pool and
column family objects, to global scope.


I increased the number of entries in server_list from 1 to 4. (I think I can
increase it to at most 12, since I have 12 data nodes.)

When I created 8 threads using the Python threading package, I got the
error below.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/common/usg/python/2.7.1-20110310/lib64/python2.7/threading.py", line 530, in __bootstrap_inner
    self.run()
  File "my_cc.py", line 20, in run
    start_cassandra_client(self.name)
  File "my_cc.py", line 33, in start_cassandra_client
    cf.get(key)
  File "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/columnfamily.py", line 652, in get
    read_consistency_level or self.read_consistency_level)
  File "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/pool.py", line 553, in execute
    conn = self.get()
  File "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/pool.py", line 536, in get
    raise NoConnectionAvailable(message)
NoConnectionAvailable: ConnectionPool limit of size 5 reached, unable to obtain connection after 30 seconds
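
The "size 5" in this error matches pycassa's default pool_size, so 8 threads
sharing one default-sized pool can leave some threads waiting until the
timeout. Below is a minimal sketch of sizing a shared pool to the thread
count. It is an illustration, not the attached script: the helper names
pool_kwargs and make_column_family and the addresses in SERVER_LIST are
hypothetical placeholders, and only the server_list and pool_size arguments
of pycassa.ConnectionPool are used.

```python
NUM_THREADS = 8  # the thread count from the failing run

# Hypothetical placeholder addresses; substitute the real data nodes.
SERVER_LIST = ['node1:9160', 'node2:9160', 'node3:9160', 'node4:9160']


def pool_kwargs(num_threads, server_list):
    """Build ConnectionPool keyword arguments sized for num_threads."""
    return {
        'server_list': list(server_list),
        # One connection per querying thread, so no thread has to wait.
        'pool_size': num_threads,
    }


def make_column_family(keyspace, cf_name, num_threads, server_list):
    """Create one pool and one ColumnFamily to share across all threads."""
    import pycassa  # imported lazily so pool_kwargs works without pycassa
    pool = pycassa.ConnectionPool(
        keyspace, **pool_kwargs(num_threads, server_list))
    return pycassa.ColumnFamily(pool, cf_name)
```

With the names used in this thread, the call would look like
cf = make_column_family('Blast', 'Blast_NR', NUM_THREADS, SERVER_LIST),
made once at module level before the threads start.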


Please have a look at the attached script and let me know if I need to
change something. Please bear with me if I am doing something terribly
wrong.

I am running the script on an 8-processor node.

thanks
pradeep

On Thu, Jan 17, 2013 at 4:18 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> ConnectionPools and ColumnFamilies are thread-safe in pycassa, and it's best
> to share them across multiple threads.  Of course, when you do that, make
> sure to make the ConnectionPool large enough to support all of the threads
> making queries concurrently.  I'm also not sure if you're just omitting
> this, but pycassa's ConnectionPool will only open connections to servers you
> explicitly include in server_list; there's no autodiscovery of other nodes
> going on.
>
> Depending on your network latency, you'll top out on python performance with
> a fairly low number of threads due to the GIL.  It's best to use multiple
> processes if you really want to benchmark something.
>
>
> On Thu, Jan 17, 2013 at 6:05 PM, Pradeep Kumar Mantha <pradeep...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Thanks. I would like to benchmark Cassandra with our application so
>> that we understand the details of how the actual benchmarking is done.
>> I am not sure how easy it would be to integrate YCSB with our application.
>>
>> So, I am trying different client interfaces to Cassandra.
>>
>> Here is what I found for a Cassandra cluster of 12 data nodes and 1 client
>> node running 32 threads (each issuing X queries):
>>
>> cassandra-cli     took 133 seconds
>> pycassa took 521 seconds.
>>
>> Here is the Python pycassa function that each thread runs:
>>
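>> import pycassa
>>
>> def start_cassandra_client(Threadname):
>>         # Each thread builds its own pool against a single server.
>>         pool = pycassa.ConnectionPool('Blast', server_list=['xxx.xx.xx.xx'])
>>         cf = pycassa.ColumnFamily(pool, 'Blast_NR')
>>         # The input file holds one row key per line.
>>         with open("pycassa_100%_query") as inp_file:
>>                 for key in inp_file:
>>                         key = key.strip()
>>                         cf.get(key)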
>>
>> Would Java clients like Hector or Astyanax help here? I am more
>> comfortable with Python than Java, and our existing application is also
>> in Python.
>>
>> thanks
>> pradeep
>>
>>
>> On Thu, Jan 17, 2013 at 2:08 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>> > Wow, you managed to do a load test through the cassandra-cli. There
>> > should be a merit badge for that.
>> >
>> > You should use the built-in stress tool or YCSB.
>> >
>> > The CLI has to do much more string conversion than a normal client would,
>> > and it is not built for performance. You will definitely get better
>> > numbers through other means.
>> >
>> > On Thu, Jan 17, 2013 at 2:10 PM, Pradeep Kumar Mantha
>> > <pradeep...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am trying to maximize the number of read queries executed per second.
>> >>
>> >> Here is my cluster configuration.
>> >>
>> >> Replication - Default
>> >> 12 Data Nodes.
>> >> 16 Client Nodes - used for querying.
>> >>
>> >> Each client node runs 32 threads, and each thread executes 76896 read
>> >> queries using the cassandra-cli tool.
>> >>        I.e., all the read queries are stored in a file, and that file is
>> >> given to the cassandra-cli tool (using the -f option), which is executed
>> >> by a thread.
>> >> So the total number of queries for the 16 client nodes is 16 * 32 * 76896.
>> >>
>> >> The read queries on each client node are submitted at the same time. The
>> >> time taken for the 16 * 32 * 76896 read queries is nearly 742 seconds,
>> >> which is about 53k transactions/second.
>> >>
>> >> I would like to know if there is any other way/tool through which I
>> >> can improve the number of transactions/second.
>> >> Is the performance affected by cassandra-cli tool?
>> >>
>> >> thanks
>> >> pradeep
>> >
>> >
>
>
>
>
> --
> Tyler Hobbs
> DataStax
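
The advice above to benchmark with multiple processes rather than threads
could be sketched as follows. This is a rough illustration under stated
assumptions, not the attached script: chunk_keys, worker, run_benchmark and
queries_per_second are hypothetical helper names, each process opens its own
default-sized pool, and the measured time includes connection setup. The
last helper also reproduces the throughput arithmetic from the original
post (16 * 32 * 76896 queries in 742 seconds).

```python
import multiprocessing
import time


def chunk_keys(keys, num_chunks):
    """Split keys round-robin into num_chunks lists, one per process."""
    return [keys[i::num_chunks] for i in range(num_chunks)]


def queries_per_second(total_queries, elapsed_seconds):
    """Throughput as completed queries divided by wall-clock seconds."""
    return total_queries / float(elapsed_seconds)


def worker(args):
    """Run in a child process: open a pool, read every key, return a count."""
    keyspace, cf_name, server_list, keys = args
    import pycassa  # imported lazily so the pure helpers work without it
    pool = pycassa.ConnectionPool(keyspace, server_list=server_list)
    cf = pycassa.ColumnFamily(pool, cf_name)
    for key in keys:
        cf.get(key)  # raises if the row key does not exist
    return len(keys)


def run_benchmark(keyspace, cf_name, server_list, keys, num_procs):
    """Fan the keys out over num_procs processes and time the whole run."""
    jobs = [(keyspace, cf_name, server_list, chunk)
            for chunk in chunk_keys(keys, num_procs)]
    procs = multiprocessing.Pool(num_procs)
    start = time.time()
    try:
        done = sum(procs.map(worker, jobs))
    finally:
        procs.close()
        procs.join()
    elapsed = time.time() - start
    return done, elapsed, queries_per_second(done, elapsed)
```

Because each process has its own interpreter, the GIL no longer serializes
the client-side work; one process per core (8 on the node described above)
is a reasonable starting point to try.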

pycassa_client.py
Description: Binary data
