Re: Cassandra Performance Benchmarking.

Tyler Hobbs Fri, 18 Jan 2013 11:16:50 -0800

The fact that it's still exactly 521 seconds is very suspicious.  I can't
debug your script over the mailing list, but do some sanity checks to make
sure there's not a bottleneck somewhere you don't expect.
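
One such sanity check, as a rough sketch only: run the same workload with
1, 8, and 32 threads and see whether the wall-clock time changes at all.
The names timed_run and fake_fetch are made up for illustration and are not
part of pycassa. fake_fetch just sleeps so the script runs without a
cluster; the assumption is that you would pass your shared cf.get instead.

```python
from __future__ import print_function
import threading
import time


def timed_run(fetch, keys, num_threads):
    """Split keys across num_threads threads, call fetch(key) for every key,
    and return the wall-clock seconds taken by the whole run."""
    # Round-robin split so every thread gets roughly the same number of keys.
    chunks = [keys[i::num_threads] for i in range(num_threads)]

    def worker(chunk):
        for key in chunk:
            fetch(key)

    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in chunks]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


def fake_fetch(key):
    # Hypothetical stand-in for cf.get(key): sleeps 1 ms to imitate a
    # network round trip. Replace with the real call when benchmarking.
    time.sleep(0.001)


if __name__ == "__main__":
    keys = ["key%d" % i for i in range(200)]
    for n in (1, 8, 32):
        elapsed = timed_run(fake_fetch, keys, n)
        print("%2d threads: %.3f s, %.0f reads/s"
              % (n, elapsed, len(keys) / elapsed))
```

If the elapsed time stays flat no matter how many threads you use, the limit
is probably somewhere outside the per-thread loop (pool size, the GIL, or
the client machine itself).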



On Fri, Jan 18, 2013 at 12:44 PM, Pradeep Kumar Mantha <pradeep...@gmail.com
> wrote:

> Hi,
>
> Thanks Tyler.
>
> Below is the *global* connection pool I am trying to use, where
> server_list contains the IPs of all 12 DataNodes I am using and
> pool_size is the number of threads. I set timeout to 60 just to
> avoid connection retry errors.
>
> pool = pycassa.ConnectionPool('Blast',
> server_list=server_list,pool_size=32,timeout=60)
>
>
> It seems the performance is still stuck at 521 seconds, compared with 177
> seconds for cassandra-cli.
>
> Am I still missing something?
>
> thanks
> Pradeep
>
>
>
> On Fri, Jan 18, 2013 at 7:12 AM, Tyler Hobbs <ty...@datastax.com> wrote:
> > You just need to increase the ConnectionPool size to handle the number of
> > threads you have using it concurrently.  Set the pool_size kwarg to at
> > least the number of threads you're using.
> >
> >
> > On Thu, Jan 17, 2013 at 6:46 PM, Pradeep Kumar Mantha <
> pradeep...@gmail.com>
> > wrote:
> >>
> >> Thanks Tyler.
> >>
> >> I just moved the pool and cf which store the connection pool and CF
> >> information to have global scope.
> >>
> >> Increased the server_list values from 1 to 4. (I think I can increase
> >> them to a maximum of 12, since I have 12 data nodes.)
> >>
> >> When I created 8 threads using the Python threading package, I saw the
> >> error below.
> >>
> >> Exception in thread Thread-3:
> >> Traceback (most recent call last):
> >>   File
> >> "/usr/common/usg/python/2.7.1-20110310/lib64/python2.7/threading.py",
> >> line 530, in __bootstrap_inner
> >>     self.run()
> >>   File "my_cc.py", line 20, in run
> >>     start_cassandra_client(self.name)
> >>   File "my_cc.py", line 33, in start_cassandra_client
> >>     cf.get(key)
> >>   File
> >>
> "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/columnfamily.py",
> >> line 652, in get
> >>     read_consistency_level or self.read_consistency_level)
> >>   File
> >>
> "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/pool.py",
> >> line 553, in execute
> >>     conn = self.get()
> >>   File
> >>
> "/global/homes/p/pmantha/mypython_repo/lib/python2.7/site-packages/pycassa/pool.py",
> >> line 536, in get
> >>     raise NoConnectionAvailable(message)
> >> NoConnectionAvailable: ConnectionPool limit of size 5 reached, unable
> >> to obtain connection after 30 seconds
> >>
> >>
> >> Please have a look at the script attached.. and let me know if I need
> >> to change something.. Please bear with me, if I do something terribly
> >> wrong..
> >>
> >> I am running the script on an 8-processor node.
> >>
> >> thanks
> >> pradeep
> >>
> >> On Thu, Jan 17, 2013 at 4:18 PM, Tyler Hobbs <ty...@datastax.com>
> wrote:
> >> > ConnectionPools and ColumnFamilies are thread-safe in pycassa, and
> >> > it's best to share them across multiple threads.  Of course, when you
> >> > do that, make sure to make the ConnectionPool large enough to support
> >> > all of the threads making queries concurrently.  I'm also not sure if
> >> > you're just omitting this, but pycassa's ConnectionPool will only open
> >> > connections to servers you explicitly include in server_list; there's
> >> > no autodiscovery of other nodes going on.
> >> >
> >> > Depending on your network latency, you'll top out on python
> >> > performance with a fairly low number of threads due to the GIL.  It's
> >> > best to use multiple processes if you really want to benchmark
> >> > something.
> >> >
> >> >
> >> > On Thu, Jan 17, 2013 at 6:05 PM, Pradeep Kumar Mantha
> >> > <pradeep...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Thanks. I would like to benchmark cassandra with our application so
> >> >> that we understand the details of how the actual benchmarking is
> >> >> done. Not sure how easy it would be to integrate YCSB with our
> >> >> application.
> >> >>
> >> >> So, I am trying different client interfaces to cassandra.
> >> >>
> >> >> For a 12-data-node Cassandra cluster and 1 client node running 32
> >> >> threads (each issuing X queries), I found:
> >> >>
> >> >> cassandra-cli     took 133 seconds
> >> >> pycassa took 521 seconds.
> >> >>
> >> >> Here is the pycassa code that each thread runs to issue the
> >> >> queries:
> >> >>
> >> >> def start_cassandra_client(Threadname):
> >> >>         pool = pycassa.ConnectionPool('Blast',
> >> >> server_list=['xxx.xx.xx.xx'])
> >> >>         cf = pycassa.ColumnFamily(pool, 'Blast_NR')
> >> >>         inp_file=open("pycassa_100%_query")
> >> >>         for key in inp_file:
> >> >>                 key=key.strip()
> >> >>                 cf.get(key)
> >> >>
> >> >> Do Java clients like Hector/Astyanax help here? I am more
> >> >> comfortable with Python than Java, and our existing application is
> >> >> also in Python.
> >> >>
> >> >> thanks
> >> >> pradeep
> >> >>
> >> >>
> >> >> On Thu, Jan 17, 2013 at 2:08 PM, Edward Capriolo
> >> >> <edlinuxg...@gmail.com>
> >> >> wrote:
> >> >> > Wow, you managed to do a load test through the cassandra-cli. There
> >> >> > should be a merit badge for that.
> >> >> >
> >> >> > You should use the built in stress tool or YCSB.
> >> >> >
> >> >> > The CLI has to do much more string conversion than a normal client
> >> >> > would, and it is not built for performance. You will definitely get
> >> >> > better numbers through other means.
> >> >> >
> >> >> > On Thu, Jan 17, 2013 at 2:10 PM, Pradeep Kumar Mantha
> >> >> > <pradeep...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am trying to maximize the number of read queries executed per
> >> >> >> second.
> >> >> >>
> >> >> >> Here is my cluster configuration.
> >> >> >>
> >> >> >> Replication - Default
> >> >> >> 12 Data Nodes.
> >> >> >> 16 Client Nodes - used for querying.
> >> >> >>
> >> >> >> Each client node executes 32 threads, and each thread executes
> >> >> >> 76896 read queries using the cassandra-cli tool; i.e., all the read
> >> >> >> queries are stored in a file, and that file is given to the
> >> >> >> cassandra-cli tool (using the -f option), which is executed by a
> >> >> >> thread. So the total number of queries for the 16 client nodes is
> >> >> >> 16 * 32 * 76896.
> >> >> >>
> >> >> >> The read queries on each client node are submitted at the same
> >> >> >> time. The time taken for 16 * 32 * 76896 read queries is nearly
> >> >> >> 742 seconds, which is nearly 53k transactions/second.
> >> >> >>
> >> >> >> I would like to know if there is any other way/tool through which
> >> >> >> I can improve the number of transactions/second.
> >> >> >> Is the performance affected by the cassandra-cli tool?
> >> >> >>
> >> >> >> thanks
> >> >> >> pradeep
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Tyler Hobbs
> >> > DataStax
> >
> >
> >
> >
> > --
> > Tyler Hobbs
> > DataStax
>



-- 
Tyler Hobbs
DataStax <http://datastax.com/>
