You also might want to go through a thread here with the subject "High latencies for simple queries".
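The single-JSON-value model suggested in the quoted reply below can be sketched as follows. This is a minimal sketch, not the thread's actual code: the `pack_row`/`unpack_value` helpers are hypothetical, and only the column names (`survey_id`, `hour_created`, `respondent_id`, `json_value`) come from the thread.

```python
import json

def pack_row(survey_id, hour_created, respondent_id, **answers):
    """Collapse the many independent answer columns into one JSON string,
    so a row is just (survey_id, hour_created, respondent_id, json_value)."""
    return (survey_id, hour_created, respondent_id, json.dumps(answers))

def unpack_value(json_value):
    """Restore the answer columns from the stored blob after a SELECT."""
    return json.loads(json_value)

# Round-trip example with made-up answer columns:
row = pack_row(42, "2015-04-22T13:00", 7, q1="yes", q2=3)
assert unpack_value(row[3]) == {"q1": "yes", "q2": 3}
```

Reading one blob column per row avoids deserializing each column individually on the client, at the cost of losing per-column queries, so it only fits if all columns are always read together, as the suggestion notes.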
On Wed, Apr 22, 2015 at 1:55 PM, Anishek Agarwal <anis...@gmail.com> wrote:

> I think these will help speed up:
>
> - removing compression
> - you have a lot of independent columns mentioned. If you are always going
>   to query all of them together, one other thing that will help is to have
>   a full JSON (or some custom object representation) of the value data and
>   change the model to just have survey_id, hour_created, respondent_id,
>   *json_value*
>
> On Wed, Apr 22, 2015 at 1:09 PM, John Anderson <son...@gmail.com> wrote:
>
>> Hey, I'm looking at querying around 500,000 rows that I need to pull into
>> a Pandas data frame for processing. Currently, testing this on a single
>> Cassandra node, it takes around 21 seconds:
>>
>> https://gist.github.com/sontek/4ca95f5c5aa539663eaf
>>
>> I tried introducing multiprocessing so I could use 4 processes at a time
>> to query this, and I got it down to 14 seconds:
>>
>> https://gist.github.com/sontek/542f13307ef9679c0094
>>
>> Although shaving off 7 seconds is great, it still isn't really where I
>> would like to be performance-wise; for this many rows I'd really like to
>> get down to a maximum of 1-2 seconds of query time.
>>
>> What types of optimizations can I make to improve the read performance
>> when querying a large set of data? Will this timing speed up linearly as
>> I add more nodes?
>>
>> This is what the schema looks like currently:
>>
>> https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4
>>
>> I'm not tied to the current schema at all; it's mostly just a replication
>> of what we have in SQL Server. I'm more interested in what I can change
>> to make querying it faster.
>>
>> Thanks,
>> John
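The 4-process split described above is usually done by partitioning the token ring into contiguous ranges and querying each range in parallel. Below is a self-contained sketch under stated assumptions: the token bounds are those of the default Murmur3 partitioner, and `fetch_range` is a hypothetical stand-in for a real cassandra-driver query of the form `SELECT ... WHERE token(pk) > ? AND token(pk) <= ?` (here it just returns the range width so the sketch runs without a cluster).

```python
import multiprocessing

# Murmur3 partitioner token bounds (assumed; this is the Cassandra default).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

def token_ranges(n):
    """Split the full token ring into n contiguous (start, end] ranges."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    edges = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
    return list(zip(edges[:-1], edges[1:]))

def fetch_range(bounds):
    """Placeholder worker: a real version would open a Session and run the
    token-bounded CQL query for this slice of the ring."""
    start, end = bounds
    return end - start  # pretend result, proportional to the slice width

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        results = pool.map(fetch_range, token_ranges(4))
    # Four partial results to concatenate into one Pandas frame.
    assert len(results) == 4
```

One caveat worth noting: with a replication factor below the node count, each process should ideally route its range to a replica that owns it, otherwise the coordinator overhead can eat much of the parallelism gain.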