From your description, it sounds like you have a single partition key with millions of clustered values in the same partition. That's a very wide partition, and you are very likely causing a lot of memory pressure on your Cassandra node (especially at 4G) while trying to execute the query. Although the hard upper limit is 2 billion values per partition key, the practical limit is much lower, often more like 100k. Wide partitions also mean you can't take advantage of Cassandra's distributed nature for reads: only one node is involved in each read, so that read performs the same on a million-node cluster as it would on a single node.

If bounding by area is a common task, it might make sense to put area, or at least part of it, into the partition key (bucket by area / 10 or / 100 or something) just to distribute the data around your cluster a little better. It makes your query path a little more involved, but it buys you parallelism: you can execute the queries for all area buckets in a given request simultaneously, and if your cluster is large enough, typically only one node is involved per bucket. A rough sketch of what that could look like is below.
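To make that concrete, here is a rough, untested sketch of the bucketed table and a fan-out read using the Java driver you're already on. The results_by_bucket table, the area_bucket column, and the bucket width of 100 are made-up illustrations, not recommendations:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;

public class BucketedAreaQuery {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("images");

        // Hypothetical bucketed variant of your table, e.g.:
        // CREATE TABLE results_by_bucket (
        //     image_caseid varchar, area_bucket int, area float, uuid uuid,
        //     ...remaining columns as in your current table...,
        //     PRIMARY KEY ((image_caseid, area_bucket), area, uuid));
        PreparedStatement ps = session.prepare(
            "SELECT * FROM results_by_bucket"
          + " WHERE image_caseid = ? AND area_bucket = ? AND area > ? AND area < ?");

        String caseId = "TCGA-HN-A2NL-01Z-00-DX1";
        float minArea = 20f;
        float maxArea = 100f;
        int bucketWidth = 100;   // assumed bucket size, i.e. area_bucket = (int) (area / 100)

        // One partition per bucket, so the reads can run in parallel and
        // typically land on different nodes.
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (int b = (int) (minArea / bucketWidth); b <= (int) (maxArea / bucketWidth); b++) {
            futures.add(session.executeAsync(ps.bind(caseId, b, minArea, maxArea)));
        }

        for (ResultSetFuture future : futures) {
            for (Row row : future.getUninterruptibly()) {
                // process row, e.g. row.getUUID("uuid"), row.getFloat("area")
            }
        }

        cluster.close();
    }
}

The bucket width is a trade-off: too small and a wide area range costs many round trips, too large and you're back to one giant partition.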
I wonder what your write pattern is like when filling in the data for a given case ID. Are you appending to the same partition key over a long period of time? If so, you may be scattering the data for a given partition key over a large number of SSTables, which slows down reads dramatically. If you're using size-tiered compaction, run nodetool compact on that table, wait for the node to settle down (0 outstanding/pending tasks in nodetool compactionstats), and then see if performance improves. (You may also be able to use nodetool cfhistograms to see how many SSTables are typically involved in a read, but if all your queries are timing out, I'm not sure whether that will be an accurate reflection.)

> It may fetch different data each time from billions of rows.
> My expectations were that Cassandra can handle a million rows easily.

I have a data set several orders of magnitude larger than what you're talking about WRT your final data size, and with appropriate query and storage patterns, Cassandra can definitely handle this kind of data.

One final note: your column names are pretty long. You pay to store each column name every time you store a value for that column. On small data sets it doesn't matter, but at billions of rows it starts to add up. There's a negligible (but nonzero) performance cost, but over time you may find that you have to scale out just because you're filling up disks. See
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html

On Wed, Mar 18, 2015 at 6:19 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> Cassandra can certainly handle millions and even billions of rows, but...
> it is a very clear anti-pattern to design a single query to return more
> than a relatively small number of rows except through paging. How small?
> Low hundreds is probably a reasonable limit. It is also an anti-pattern to
> filter or analyze a large number of rows in a single query - that's why
> there are so many crazy restrictions and the requirement to use ALLOW
> FILTERING - to reinforce that Cassandra is designed for short and
> performant queries, not large-scale retrieval of a large number of rows.
> As a general rule, the use of ALLOW FILTERING is an anti-pattern and a
> yellow flag that you are doing something wrong.
>
> As a minor point, check your partition key - you should try to "bucket"
> rows that will tend to be accessed together, so that they have locality
> and can be fetched together.
>
> Rather than using a raw x and y coordinate range, consider indexing by a
> "chunk" number; then you can query by chunk number for direct access to
> the partition and row key, without the need for inequality filtering.
>
>
> -- Jack Krupansky
>
> On Wed, Mar 18, 2015 at 3:22 AM, Mehak Mehta <meme...@cs.stonybrook.edu>
> wrote:
>
>> Hi Jens,
>>
>> I have tried with a fetch size of 10000 and still it's not giving any
>> results.
>> My expectations were that Cassandra can handle a million rows easily.
>>
>> Is there any mistake in the way I am defining the keys or querying them?
>>
>> Thanks
>> Mehak
>>
>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>
>>> Hi,
>>>
>>> Try setting the fetch size before querying. Assuming you don't set it
>>> too high, and you don't have too many tombstones, that should do it.
>>>
>>> Cheers,
>>> Jens
>>>
>>> –
>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>
>>>
>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <meme...@cs.stonybrook.edu>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a requirement to fetch a million rows as the result of my query,
>>>> which is giving timeout errors.
>>>> I am fetching results by selecting on clustering columns, so why are the
>>>> queries taking so long? I can change the timeout settings, but I need
>>>> the data to be fetched faster as per my requirement.
>>>>
>>>> My table definition is:
>>>> CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar,
>>>> analysis_execution_uuid uuid, x double, y double, loc varchar, w double,
>>>> h double, normalized varchar, type varchar, filehost varchar, filename
>>>> varchar, image_uuid uuid, image_uri varchar, image_caseid varchar,
>>>> image_mpp_x double, image_mpp_y double, image_width double,
>>>> image_height double, objective double, cancer_type varchar, Area float,
>>>> submit_date timestamp, points list<double>,
>>>> PRIMARY KEY ((image_caseid), Area, uuid));
>>>>
>>>> Here each row is uniquely identified by its uuid. But since my data is
>>>> generally queried based upon image_caseid, I have made that the
>>>> partition key.
>>>> I am currently using the DataStax Java driver to fetch the results, but
>>>> the query is taking a lot of time, resulting in timeout errors:
>>>>
>>>> Exception in thread "main"
>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting
>>>> for server response))
>>>>   at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>   at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>   at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>   at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>   at QueryDB.queryArea(TestQuery.java:59)
>>>>   at TestQuery.main(TestQuery.java:35)
>>>> Caused by:
>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting
>>>> for server response))
>>>>   at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>   at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> The same query also fails on the console, even with a limit of 2000
>>>> rows:
>>>>
>>>> cqlsh:images> select count(*) from results where
>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000;
>>>> errors={}, last_host=127.0.0.1
>>>>
>>>> Thanks and Regards,
>>>> Mehak
>>>>
>>>
>>
>
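P.S. On the fetch-size suggestion from Jens above: assuming you're on Cassandra 2.0+ (so the driver can use native-protocol paging), a rough, untested sketch with the same 2.x Java driver that appears in your stack trace would be:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedAreaQuery {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("images");

        // 1000 rows per page is just an illustration; lower it if you still
        // see timeouts, raise it if round trips dominate.
        Statement stmt = new SimpleStatement(
            "SELECT * FROM results"
          + " WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1'"
          + " AND Area > 20 AND Area < 100")
            .setFetchSize(1000);

        // Iterating the result set fetches subsequent pages transparently,
        // so no single response has to carry the whole million-row result.
        for (Row row : session.execute(stmt)) {
            // process row
        }

        cluster.close();
    }
}

Paging keeps each response small, but it won't by itself fix the wide-partition and multi-SSTable issues described above.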