Hey, I'm querying around 500,000 rows that I need to pull into a Pandas
DataFrame for processing.  Currently, testing this on a single Cassandra
node, it takes around 21 seconds:

https://gist.github.com/sontek/4ca95f5c5aa539663eaf
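
For context, a stripped-down version of that query path looks something
like this (keyspace/table names here are placeholders; the gist has the
full code):

    import pandas as pd
    from cassandra.cluster import Cluster
    from cassandra.query import dict_factory

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_keyspace')   # placeholder keyspace
    session.row_factory = dict_factory         # rows come back as dicts

    # Pull everything in one query and hand the rows straight to Pandas.
    rows = session.execute('SELECT * FROM my_table')   # placeholder table
    df = pd.DataFrame(list(rows))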

I tried introducing multiprocessing so I could run 4 query processes at a
time, and that got it down to 14 seconds:

https://gist.github.com/sontek/542f13307ef9679c0094
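
The sketch below shows one way to do that kind of split, carving the
Murmur3 token ring into four non-overlapping ranges so each process scans
a quarter of the ring (simplified; keyspace/table/partition-key names are
placeholders, and the gist has my actual code):

    import multiprocessing

    import pandas as pd
    from cassandra.cluster import Cluster
    from cassandra.query import dict_factory

    MIN_TOKEN = -2**63       # Murmur3Partitioner token range
    MAX_TOKEN = 2**63 - 1

    def fetch_token_range(bounds):
        start, end = bounds
        # Each worker opens its own connection; driver sessions don't
        # survive fork().
        cluster = Cluster(['127.0.0.1'])
        session = cluster.connect('my_keyspace')   # placeholder keyspace
        session.row_factory = dict_factory
        rows = list(session.execute(
            'SELECT * FROM my_table '              # placeholder table/key
            'WHERE token(my_pk) >= %s AND token(my_pk) <= %s',
            (start, end)))
        cluster.shutdown()
        return rows

    def split_token_ring(n):
        # Carve the full ring into n non-overlapping inclusive ranges.
        step = (MAX_TOKEN - MIN_TOKEN) // n
        return [(MIN_TOKEN + i * step,
                 MAX_TOKEN if i == n - 1 else MIN_TOKEN + (i + 1) * step - 1)
                for i in range(n)]

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)
        chunks = pool.map(fetch_token_range, split_token_ring(4))
        df = pd.DataFrame([row for chunk in chunks for row in chunk])

One caveat with this approach: a quarter of the token ring isn't
necessarily a quarter of the rows, so the processes can finish unevenly.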

Although shaving off 7 seconds is great, it still isn't where I'd like to
be performance-wise; for this many rows I'd really like to get the query
time down to a max of 1-2 seconds.

What kinds of optimizations can I make to improve read performance when
querying a large set of data?  And will the timing improve linearly as I
add more nodes?

This is what the schema looks like currently:

https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4


I'm not tied to the current schema at all; it's mostly just a replication
of what we have in SQL Server.  I'm more interested in what I can change
to make querying it faster.

Thanks,
John
