Currently the Cassandra java process is taking 1% of CPU (8% total in use) and 14.3% of memory (out of 4 GB total). As you can see, there is not much load from other processes.
Should I try changing the default memory parameters in the Cassandra settings?

On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> What's your memory / CPU usage at? And how much RAM + CPU do you have on
> this server?
>
> On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
> wrote:
>
>> Currently there is only a single node, which I am calling directly, with
>> around 150,000 rows. The full data will be around billions of rows per
>> node. The code works only for fetch sizes of 100/200, and each
>> consecutive fetch takes around 5-10 seconds.
>>
>> I have a parallel script inserting the data while I am reading it. When
>> I stopped the script, it worked for 500/1000 but not more than that.
>>
>> On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> If even 500-1000 isn't working, then your Cassandra node might not be
>>> up.
>>>
>>> 1) Try running nodetool status from a shell on your Cassandra server,
>>> and make sure the nodes are up.
>>>
>>> 2) Are you calling this on the same server where Cassandra is running?
>>> It's trying to connect to localhost. If you're running it on a
>>> different server, try passing in the direct IP of your Cassandra
>>> server.
>>>
>>> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
>>> wrote:
>>>
>>>> The data won't change much, but the queries will be different.
>>>> I am not working on the rendering tool myself, so I don't know many
>>>> details about it.
>>>>
>>>> Also, as you suggested, I tried to fetch the data with a fetch size of
>>>> 500 or 1000 using the Java driver's auto pagination.
>>>> It fails when the number of records is high (around 100,000) with the
>>>> following error:
>>>>
>>>> Exception in thread "main"
>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>>>> host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>> waiting for server response))
>>>>
>>>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>> wrote:
>>>>
>>>>> How often does the data change?
>>>>>
>>>>> I would still recommend caching of some kind, but without knowing
>>>>> more details (how often the data is changing, what you're doing with
>>>>> the 1m rows after getting them, etc.) I can't recommend a solution.
>>>>>
>>>>> I did see your other thread. I would also vote for Elasticsearch /
>>>>> Solr; they are more suited for the kind of analytics you seem to be
>>>>> doing. Cassandra is more for storing data; it isn't all that great
>>>>> for complex queries / analytics.
>>>>>
>>>>> If you want to stick with Cassandra, you might have better luck if
>>>>> you made your range columns part of the primary key, so something
>>>>> like PRIMARY KEY (caseId, x, y).
>>>>>
>>>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta
>>>>> <meme...@cs.stonybrook.edu> wrote:
>>>>>
>>>>>> The rendering tool renders a portion of a very large image. It may
>>>>>> fetch different data each time from billions of rows, so I don't
>>>>>> think I can cache such large results, since the same results will
>>>>>> rarely be fetched again.
>>>>>>
>>>>>> Also, do you know how I can do 2D range queries using Cassandra?
>>>>>> Some other users suggested using Solr, but is there any way I can
>>>>>> achieve that without using any other technology?
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry, meant to say "that way when you have to render, you can just
>>>>>>> display the latest cache."
>>>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would probably do this in a background thread and cache the
>>>>>>>> results; that way when you have to render, you can just cache the
>>>>>>>> latest results.
>>>>>>>>
>>>>>>>> I don't know why Cassandra can't seem to fetch large batch sizes.
>>>>>>>> I've also run into these timeouts, but reducing the batch size to
>>>>>>>> 2k seemed to work for me.
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta
>>>>>>>> <meme...@cs.stonybrook.edu> wrote:
>>>>>>>>
>>>>>>>>> We have a UI which needs this data for rendering, so the
>>>>>>>>> efficiency of pulling this data matters a lot. It should be
>>>>>>>>> fetched within a minute.
>>>>>>>>> Is there a way to achieve such efficiency?
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar
>>>>>>>>> <ali.rac...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m rows,
>>>>>>>>>> it seems like the difference would only be a few minutes. Do you
>>>>>>>>>> have to do this all the time, or only once in a while?
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta
>>>>>>>>>> <meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, it works for 1000 but not more than that.
>>>>>>>>>>> How can I fetch all rows efficiently?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar
>>>>>>>>>>> <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Have you tried a smaller fetch size, such as 5k - 2k?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta
>>>>>>>>>>>> <meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried with a fetch size of 10000 and it's still not
>>>>>>>>>>>>> giving any results.
>>>>>>>>>>>>> My expectation was that Cassandra could handle a million rows
>>>>>>>>>>>>> easily.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any mistake in the way I am defining the keys or
>>>>>>>>>>>>> querying them?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Mehak
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil
>>>>>>>>>>>>> <jens.ran...@tink.se> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Try setting the fetch size before querying. Assuming you
>>>>>>>>>>>>>> don't set it too high, and you don't have too many
>>>>>>>>>>>>>> tombstones, that should do it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta
>>>>>>>>>>>>>> <meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a requirement to fetch a million rows as the result
>>>>>>>>>>>>>>> of my query, which is giving timeout errors.
>>>>>>>>>>>>>>> I am fetching results by selecting on clustering columns,
>>>>>>>>>>>>>>> so why are the queries taking so long? I can change the
>>>>>>>>>>>>>>> timeout settings, but I need the data to be fetched faster
>>>>>>>>>>>>>>> as per my requirement.
>>>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> CREATE TABLE images.results (uuid uuid,
>>>>>>>>>>>>>>> analysis_execution_id varchar, analysis_execution_uuid uuid,
>>>>>>>>>>>>>>> x double, y double, loc varchar, w double, h double,
>>>>>>>>>>>>>>> normalized varchar, type varchar, filehost varchar,
>>>>>>>>>>>>>>> filename varchar, image_uuid uuid, image_uri varchar,
>>>>>>>>>>>>>>> image_caseid varchar, image_mpp_x double, image_mpp_y
>>>>>>>>>>>>>>> double, image_width double, image_height double, objective
>>>>>>>>>>>>>>> double, cancer_type varchar, Area float, submit_date
>>>>>>>>>>>>>>> timestamp, points list<double>,
>>>>>>>>>>>>>>> PRIMARY KEY ((image_caseid), Area, uuid));
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Each row is uniquely identified by its uuid, but since my
>>>>>>>>>>>>>>> data is generally queried by image_caseid, I have made that
>>>>>>>>>>>>>>> the partition key.
>>>>>>>>>>>>>>> I am currently using the DataStax Java driver to fetch the
>>>>>>>>>>>>>>> results.
>>>>>>>>>>>>>>> But the query is taking a long time, resulting in timeout
>>>>>>>>>>>>>>> errors:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exception in thread "main"
>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException:
>>>>>>>>>>>>>>> All host(s) tried for query failed (tried:
>>>>>>>>>>>>>>> localhost/127.0.0.1:9042
>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed
>>>>>>>>>>>>>>> out waiting for server response))
>>>>>>>>>>>>>>>   at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>>>>>>>>>>>>   at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>>>>>>>>>>>>   at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>>>>>>>>>>>>   at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>>>>>>>>>>>>   at QueryDB.queryArea(TestQuery.java:59)
>>>>>>>>>>>>>>>   at TestQuery.main(TestQuery.java:35)
>>>>>>>>>>>>>>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
>>>>>>>>>>>>>>> All host(s) tried for query failed (tried:
>>>>>>>>>>>>>>> localhost/127.0.0.1:9042
>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed
>>>>>>>>>>>>>>> out waiting for server response))
>>>>>>>>>>>>>>>   at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>>>>>>>>>>>>   at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
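[Editor's note: the batched fetching discussed in this thread can be sketched independently of the driver. In the real code the page size would be set with Statement.setFetchSize(...) on the DataStax Java driver and rows would stream from a ResultSet; the stand-in iterator and all names below are illustrative only.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class PagedFetch {
    // Consume an iterator in fixed-size pages -- the same access pattern the
    // driver's auto-pagination gives you once a fetch size is set. Returns
    // the number of pages delivered to the callback.
    static <T> int consumeInPages(Iterator<T> rows, int pageSize,
                                  Consumer<List<T>> onPage) {
        int pages = 0;
        List<T> page = new ArrayList<>(pageSize);
        while (rows.hasNext()) {
            page.add(rows.next());
            if (page.size() == pageSize) {
                onPage.accept(page);        // e.g. hand this page to the renderer
                page = new ArrayList<>(pageSize);
                pages++;
            }
        }
        if (!page.isEmpty()) {              // flush the final partial page
            onPage.accept(page);
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in for a ResultSet: 150,000 dummy rows, pages of 1000,
        // matching the row count and fetch sizes mentioned in the thread.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 150_000; i++) rows.add(i);
        int pages = consumeInPages(rows.iterator(), 1000, p -> {});
        System.out.println(pages); // prints 150
    }
}
```

Each page stays small enough that no single server round-trip has to materialize the whole result, which is what the per-request timeout above is punishing.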
>>>>>>>>>>>>>>> Also, when I try the same query on the console, it fails
>>>>>>>>>>>>>>> even with a limit of 2000 rows:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cqlsh:images> select count(*) from results where
>>>>>>>>>>>>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and
>>>>>>>>>>>>>>> Area>20 limit 2000;
>>>>>>>>>>>>>>> errors={}, last_host=127.0.0.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>>> Mehak
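[Editor's note: one way to act on the smaller-batch advice in this thread, given the (image_caseid, Area, uuid) key, is to split a wide Area range into narrower slices and issue one bounded clustering-column query per slice (e.g. "... AND Area >= ? AND Area < ?"). This slicing idea is a sketch, not something spelled out verbatim in the thread; the bounds and slice count are illustrative.]

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSlicer {
    // Split [lo, hi) into n contiguous sub-ranges. Each sub-range becomes a
    // separate bounded query, so no single request has to stream the whole
    // result before the server-side timeout fires.
    static List<double[]> slices(double lo, double hi, int n) {
        List<double[]> out = new ArrayList<>();
        double step = (hi - lo) / n;
        for (int i = 0; i < n; i++) {
            double a = lo + i * step;
            double b = (i == n - 1) ? hi : lo + (i + 1) * step; // exact upper bound
            out.add(new double[] { a, b });
        }
        return out;
    }

    public static void main(String[] args) {
        // The thread's Area range 20..100, split into 8 slices of width 10.
        List<double[]> s = slices(20, 100, 8);
        System.out.println(s.size());                          // prints 8
        System.out.println(s.get(0)[0] + ".." + s.get(0)[1]);  // prints 20.0..30.0
    }
}
```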