Re: Timeout error in fetching million rows as results using clustering keys

Ali Akhtar Wed, 18 Mar 2015 03:07:49 -0700

4 GB also seems small for the kind of load you are trying to handle
(billions of rows).


I would also try adding more nodes to the cluster.
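
On the PRIMARY KEY(caseId, x, y) idea quoted further down in this thread, here is a rough, untested sketch of what that could look like. The table name results_by_xy, the reduced column list, and the x bounds are made up for illustration; only image_caseid, x, y, uuid and Area come from the original table.

```cql
-- Hypothetical table: the name and the trimmed column set are assumptions.
-- uuid is kept as the last clustering column so two objects with the same
-- (x, y) do not overwrite each other.
CREATE TABLE images.results_by_xy (
    image_caseid varchar,
    x double,
    y double,
    uuid uuid,
    Area float,
    PRIMARY KEY ((image_caseid), x, y, uuid)
);

-- A range on the first clustering column reads one contiguous slice
-- of the partition. The bounds 1000 and 2000 are placeholder values.
SELECT x, y, uuid, Area
FROM images.results_by_xy
WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1'
  AND x >= 1000 AND x < 2000;
```

As far as I know, in Cassandra 2.x once a clustering column has a range restriction, the clustering columns after it cannot be restricted in the same query. So the y bounds would have to be filtered on the client, or x would need to be bucketed into tiles. Treat this as a starting point, not a tested design.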

On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> Yeah, it may be that the process is being limited by swap. This page:
>
>
> https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42
>
> Lines 42 - 48 list a few settings that you could try out for increasing /
> reducing the memory limits (assuming you're on Linux).
>
> Also, are you using an SSD? If so, make sure the IO scheduler is noop or
> deadline.
>
> On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
> wrote:
>
>> Currently the Cassandra Java process is using 1% of CPU (8% is in use
>> overall) and 14.3% of memory (out of 4 GB total).
>> As you can see, there is not much load from other processes.
>>
>> Should I try changing the default memory parameters in the Cassandra settings?
>>
>> On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> What's your memory / CPU usage at? And how much ram + cpu do you have on
>>> this server?
>>>
>>>
>>>
>>> On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
>>> wrote:
>>>
>>>> Currently there is only a single node, which I am querying directly, with
>>>> around 150000 rows. The full data set will be in the billions of rows
>>>> per node. The code works only for a fetch size of 100/200, and each
>>>> consecutive fetch takes around 5-10 seconds.
>>>>
>>>> I have a parallel script which is inserting data while I am reading
>>>> it. When I stopped the script, it worked for 500/1000 but not for more
>>>> than that.
>>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>> wrote:
>>>>
>>>>> If even 500-1000 isn't working, then your Cassandra node might not be
>>>>> up.
>>>>>
>>>>> 1) Try running nodetool status from a shell on your Cassandra server
>>>>> to make sure the nodes are up.
>>>>>
>>>>> 2) Are you calling this on the same server where Cassandra is running?
>>>>> It's trying to connect to localhost. If you're running it on a different
>>>>> server, try passing in the IP address of your Cassandra server.
>>>>>
>>>>> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <
>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>
>>>>>> The data won't change much, but the queries will be different.
>>>>>> I am not working on the rendering tool myself, so I don't know many
>>>>>> details about it.
>>>>>>
>>>>>> Also, as you suggested, I tried to fetch the data in pages of 500 or 1000
>>>>>> with the Java driver's automatic paging.
>>>>>> It fails when the number of records is high (around 100000) with the
>>>>>> following error:
>>>>>>
>>>>>> Exception in thread "main"
>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>>>>>> tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting
>>>>>> for server response))
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How often does the data change?
>>>>>>>
>>>>>>> I would still recommend caching of some kind, but without knowing
>>>>>>> more details (how often the data changes, what you're doing with the
>>>>>>> 1m rows after getting them, etc.) I can't recommend a solution.
>>>>>>>
>>>>>>> I did see your other thread. I would also vote for Elasticsearch /
>>>>>>> Solr; they are better suited to the kind of analytics you seem to be
>>>>>>> doing. Cassandra is more for storing data; it isn't all that great for
>>>>>>> complex queries / analytics.
>>>>>>>
>>>>>>> If you want to stick with Cassandra, you might have better luck if you
>>>>>>> made your range columns part of the primary key, so something like
>>>>>>> PRIMARY KEY(caseId, x, y)
>>>>>>>
>>>>>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <
>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>
>>>>>>>> The rendering tool renders a portion of a very large image. It may
>>>>>>>> fetch different data each time from billions of rows.
>>>>>>>> So I don't think I can cache such large results, since the same results
>>>>>>>> will rarely be fetched again.
>>>>>>>>
>>>>>>>> Also, do you know how I can do 2D range queries using Cassandra?
>>>>>>>> Some other users suggested that I use Solr.
>>>>>>>> But is there any way I can achieve that without using any other
>>>>>>>> technology?
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sorry, meant to say "that way when you have to render, you can
>>>>>>>>> just display the latest cache."
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I would probably do this in a background thread and cache the
>>>>>>>>>> results, that way when you have to render, you can just cache the
>>>>>>>>>> latest results.
>>>>>>>>>>
>>>>>>>>>> I don't know why Cassandra doesn't seem to be able to fetch large
>>>>>>>>>> batch sizes. I've also run into these timeouts, but reducing the
>>>>>>>>>> batch size to 2k seemed to work for me.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <
>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have a UI which needs this data for rendering,
>>>>>>>>>>> so the efficiency of pulling this data matters a lot. It should be
>>>>>>>>>>> fetched within a minute.
>>>>>>>>>>> Is there a way to achieve such efficiency?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <
>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m
>>>>>>>>>>>> rows, it seems like the difference would only be a few minutes.
>>>>>>>>>>>> Do you have to do this all the time, or only once in a while?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <
>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, it works for 1000 but not for more than that.
>>>>>>>>>>>>> How can I fetch all the rows efficiently this way?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <
>>>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you tried a smaller fetch size, such as 2k - 5k?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <
>>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have tried a fetch size of 10000; it's still not giving
>>>>>>>>>>>>>>> any results.
>>>>>>>>>>>>>>> My expectation was that Cassandra can handle a million
>>>>>>>>>>>>>>> rows easily.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any mistake in the way I am defining the keys or
>>>>>>>>>>>>>>> querying them?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Mehak
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <
>>>>>>>>>>>>>>> jens.ran...@tink.se> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Try setting fetchsize before querying. Assuming you don't
>>>>>>>>>>>>>>>> set it too high, and you don't have too many tombstones,
>>>>>>>>>>>>>>>> that should do it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> –
>>>>>>>>>>>>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <
>>>>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have a requirement to fetch a million rows as the result of
>>>>>>>>>>>>>>>>> my query, which is giving timeout errors.
>>>>>>>>>>>>>>>>> I am fetching results by selecting on clustering columns, so
>>>>>>>>>>>>>>>>> why are the queries taking so long? I can change the timeout
>>>>>>>>>>>>>>>>> settings, but I need the data to be fetched faster as per my
>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>>>>>> CREATE TABLE images.results (
>>>>>>>>>>>>>>>>>   uuid uuid, analysis_execution_id varchar,
>>>>>>>>>>>>>>>>>   analysis_execution_uuid uuid, x double, y double, loc varchar,
>>>>>>>>>>>>>>>>>   w double, h double, normalized varchar, type varchar,
>>>>>>>>>>>>>>>>>   filehost varchar, filename varchar, image_uuid uuid,
>>>>>>>>>>>>>>>>>   image_uri varchar, image_caseid varchar, image_mpp_x double,
>>>>>>>>>>>>>>>>>   image_mpp_y double, image_width double, image_height double,
>>>>>>>>>>>>>>>>>   objective double, cancer_type varchar, Area float,
>>>>>>>>>>>>>>>>>   submit_date timestamp, points list<double>,
>>>>>>>>>>>>>>>>>   PRIMARY KEY ((image_caseid), Area, uuid));
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here each row is uniquely identified by its uuid. But since my
>>>>>>>>>>>>>>>>> data is generally queried by image_caseid, I have made it the
>>>>>>>>>>>>>>>>> partition key.
>>>>>>>>>>>>>>>>> I am currently using the DataStax Java API to fetch the
>>>>>>>>>>>>>>>>> results. But the query is taking a lot of time, resulting in
>>>>>>>>>>>>>>>>> timeout errors:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exception in thread "main"
>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException:
>>>>>>>>>>>>>>>>> All host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>>>>>>>>>>>>>>> waiting for server response))
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>>>>>>>>>>>>>>  at QueryDB.queryArea(TestQuery.java:59)
>>>>>>>>>>>>>>>>>  at TestQuery.main(TestQuery.java:35)
>>>>>>>>>>>>>>>>> Caused by:
>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException:
>>>>>>>>>>>>>>>>> All host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>>>>>>>>>>>>>>> waiting for server response))
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>>>>>>>>>>>>>>  at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>>>>>>>>>>>>>>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>>>>>>>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, when I try the same query on the console, even with a
>>>>>>>>>>>>>>>>> limit of 2000 rows, I get:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> cqlsh:images> select count(*) from results where
>>>>>>>>>>>>>>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20
>>>>>>>>>>>>>>>>> limit 2000;
>>>>>>>>>>>>>>>>> errors={}, last_host=127.0.0.1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>>>>> Mehak
