Yes, I have a cluster with 10 nodes in total, but I am just testing with one node currently. The total data across all nodes will exceed 5 billion rows. But I may have more memory available on the other nodes.
On Wed, Mar 18, 2015 at 6:06 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> 4 GB also seems small for the kind of load you are trying to handle (billions of rows, etc.).
> I would also try adding more nodes to the cluster.

> On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>> Yeah, it may be that the process is being limited by swap. This page:
>> https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42
>> Lines 42-48 list a few settings that you could try out for increasing / reducing the memory limits (assuming you're on Linux).
>> Also, are you using an SSD? If so, make sure the IO scheduler is noop or deadline.

>> On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>> Currently the Cassandra java process is taking 1% of CPU (8% is in use in total) and 14.3% of memory (out of 4 GB in total).
>>> As you can see, there is not much load from other processes.
>>> Should I try changing the default memory parameters in the Cassandra settings?

>>> On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>> What's your memory / CPU usage at? And how much RAM + CPU do you have on this server?

>>>> On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>> Currently there is only a single node, which I am calling directly, with around 150000 rows. The full data will be around billions of rows per node.
>>>>> The code is working only for fetch sizes of 100/200, and each consecutive fetch is taking around 5-10 seconds.
>>>>> I have a parallel script which is inserting data while I am reading it. When I stopped the script, it worked for 500/1000 but not for more than that.

>>>>> On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>> If even 500-1000 isn't working, then your Cassandra node might not be up.
>>>>>> 1) Try running nodetool status from a shell on your Cassandra server, and make sure the nodes are up.
>>>>>> 2) Are you calling this on the same server where Cassandra is running? It's trying to connect to localhost. If you're running it on a different server, try passing in the direct IP of your Cassandra server.

>>>>>> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>> The data won't change much, but the queries will be different.
>>>>>>> I am not working on the rendering tool myself, so I don't know many details about it.
>>>>>>> Also, as you suggested, I tried to fetch the data in pages of 500 or 1000 with the Java driver's auto pagination.
>>>>>>> It fails when the number of records is high (around 100000) with the following error:
>>>>>>> Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
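The "Timed out waiting for server response" above is the driver's per-request read timeout; with auto pagination, each page is one read request. A minimal sketch of such a paged read with the DataStax Java driver 2.x (assumed from the stack traces in this thread; the contact point, keyspace, and case id are taken from the thread, while the class name PagedRead and the page size are illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class PagedRead {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("images");

            // Read one partition's rows page by page. Each page is a single
            // read request, so a small page is unlikely to hit the timeout.
            Statement stmt = new SimpleStatement(
                    "SELECT uuid, x, y FROM results"
                  + " WHERE image_caseid = 'TCGA-HN-A2NL-01Z-00-DX1'");
            stmt.setFetchSize(500);

            long count = 0;
            // The driver fetches the next page transparently whenever the
            // iterator crosses a page boundary.
            for (Row row : session.execute(stmt)) {
                count++;
            }
            System.out.println("rows read: " + count);
            cluster.close();
        }
    }

A smaller page trades more round trips for a smaller, safer read per request.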
>>>>>>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>>>> How often does the data change?
>>>>>>>> I would still recommend caching of some kind, but without knowing more details (how often the data is changing, what you're doing with the 1m rows after getting them, etc.) I can't recommend a solution.
>>>>>>>> I did see your other thread. I would also vote for Elasticsearch / Solr; they are better suited for the kind of analytics you seem to be doing. Cassandra is more for storing data; it isn't all that great for complex queries / analytics.
>>>>>>>> If you want to stick with Cassandra, you might have better luck if you made your range columns part of the primary key, so something like PRIMARY KEY(caseId, x, y).
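A minimal sketch of the kind of table that suggestion points at, applied to the schema from later in this thread (the table name results_by_xy is hypothetical, only a subset of the columns is shown, and Cassandra 2.x semantics are assumed):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class XyTable {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("images");

            // Hypothetical variant of the results table with x and y as
            // clustering columns (subset of columns shown); uuid stays in
            // the key so rows remain unique within a case.
            session.execute(
                "CREATE TABLE IF NOT EXISTS results_by_xy ("
              + " image_caseid varchar, x double, y double, uuid uuid,"
              + " w double, h double,"
              + " PRIMARY KEY ((image_caseid), x, y, uuid))");

            // A range on x is now served in clustering order, without
            // ALLOW FILTERING:
            //   SELECT * FROM results_by_xy
            //   WHERE image_caseid = ? AND x >= ? AND x < ?;
            // Caveat: CQL allows a range on only the first unrestricted
            // clustering column, so the y bounds of a 2D window still have
            // to be applied client-side (or via Solr, as discussed here).
            cluster.close();
        }
    }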
>>>>>>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>>>> The rendering tool renders a portion of a very large image. It may fetch different data each time from billions of rows.
>>>>>>>>> So I don't think I can cache such large results, since the same results will rarely be fetched again.
>>>>>>>>> Also, do you know how I can do 2D range queries using Cassandra? Some other users suggested using Solr, but is there any way I can achieve that without using any other technology?

>>>>>>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>>>>>> Sorry, meant to say "that way when you have to render, you can just display the latest cache."

>>>>>>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>>>>>>> I would probably do this in a background thread and cache the results, that way when you have to render, you can just cache the latest results.
>>>>>>>>>>> I don't know why Cassandra can't seem to fetch large batch sizes; I've also run into these timeouts, but reducing the batch size to 2k seemed to work for me.

>>>>>>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>>>>>>> We have a UI which needs this data for rendering, so the efficiency of pulling this data matters a lot. It should be fetched within a minute.
>>>>>>>>>>>> Is there a way to achieve such efficiency?

>>>>>>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it seems like the difference would only be a few minutes. Do you have to do this all the time, or only once in a while?

>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>>>>>>>>> Yes, it works for 1000 but not for more than that.
>>>>>>>>>>>>>> How can I fetch all rows efficiently using this?

>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>>>>>>>>>>>>>>> Have you tried a smaller fetch size, such as 5k - 2k?

>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>>> I have tried with a fetch size of 10000 and it is still not giving any results.
>>>>>>>>>>>>>>>> My expectation was that Cassandra could handle a million rows easily.
>>>>>>>>>>>>>>>> Is there any mistake in the way I am defining the keys or querying them?
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Mehak

>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <jens.ran...@tink.se> wrote:

>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> Try setting the fetch size before querying. Assuming you don't set it too high, and you don't have too many tombstones, that should do it.
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>>>>> – Sent from Mailbox <https://www.dropbox.com/mailbox>

>>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <meme...@cs.stonybrook.edu> wrote:

>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> I have a requirement to fetch a million rows as the result of my query, and it is giving timeout errors.
>>>>>>>>>>>>>>>>>> I am fetching results by selecting on clustering columns, so why are the queries taking so long? I can change the timeout settings, but I need the data to be fetched faster to meet my requirement.
>>>>>>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>>>>>>> CREATE TABLE images.results (uuid uuid, analysis_execution_id varchar, analysis_execution_uuid uuid, x double, y double, loc varchar, w double, h double, normalized varchar, type varchar, filehost varchar, filename varchar, image_uuid uuid, image_uri varchar, image_caseid varchar, image_mpp_x double, image_mpp_y double, image_width double, image_height double, objective double, cancer_type varchar, Area float, submit_date timestamp, points list<double>, PRIMARY KEY ((image_caseid), Area, uuid));
>>>>>>>>>>>>>>>>>> Here each row is uniquely identified by its uuid, but since my data is generally queried by image_caseid, I have made that the partition key.
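With that layout, a slice on Area for one case stays inside a single partition. A minimal sketch of issuing that slice with the DataStax Java driver (2.x assumed; the bounds and case id come from the cqlsh query at the end of this message, and the class name AreaQuery is illustrative):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class AreaQuery {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("images");

            // Area is the first clustering column, so this slice needs no
            // ALLOW FILTERING and touches only the one partition.
            PreparedStatement ps = session.prepare(
                    "SELECT uuid, area, x, y FROM results"
                  + " WHERE image_caseid = ? AND area > ? AND area < ?");

            // area was declared as float, so bind floats, not doubles.
            BoundStatement bound = ps.bind("TCGA-HN-A2NL-01Z-00-DX1", 20.0f, 100.0f);
            bound.setFetchSize(1000); // one page per read request

            long n = 0;
            for (Row row : session.execute(bound)) {
                n++; // e.g. row.getUUID("uuid"), row.getFloat("area")
            }
            System.out.println("rows in slice: " + n);
            cluster.close();
        }
    }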
But the query is taking a lot of time resulting in >>>>>>>>>>>>>>>>>> timeout errors: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Exception in thread "main" >>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: >>>>>>>>>>>>>>>>>> All host(s) >>>>>>>>>>>>>>>>>> tried for query failed (tried: localhost/127.0.0.1:9042 >>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed >>>>>>>>>>>>>>>>>> out waiting for >>>>>>>>>>>>>>>>>> server response)) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52) >>>>>>>>>>>>>>>>>> at QueryDB.queryArea(TestQuery.java:59) >>>>>>>>>>>>>>>>>> at TestQuery.main(TestQuery.java:35) >>>>>>>>>>>>>>>>>> Caused by: >>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: >>>>>>>>>>>>>>>>>> All host(s) >>>>>>>>>>>>>>>>>> tried for query failed (tried: localhost/127.0.0.1:9042 >>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed >>>>>>>>>>>>>>>>>> out waiting for >>>>>>>>>>>>>>>>>> server response)) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >>>>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Also when I try the same query on console even while >>>>>>>>>>>>>>>>>> using limit of 2000 rows: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> cqlsh:images> select count(*) from results where >>>>>>>>>>>>>>>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and >>>>>>>>>>>>>>>>>> Area>20 limit 2000; >>>>>>>>>>>>>>>>>> errors={}, last_host=127.0.0.1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks and Regards, >>>>>>>>>>>>>>>>>> Mehak >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >