Re: Timeout error in fetching million rows as results using clustering keys

Mehak Mehta Wed, 18 Mar 2015 02:33:08 -0700

Currently there is only a single node, which I am calling directly, with around
150,000 rows. The full data will run to billions of rows per node.
The code works only for fetch sizes of 100/200. Also, each consecutive fetch
takes around 5-10 secs.


I have a parallel script that inserts data while I am reading it.
When I stopped the script, it worked for 500/1000 but not more than that.
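
For reference, the batch-by-batch fetching discussed in this thread can be
modelled without a cluster. The following is only a minimal sketch under stated
assumptions, not driver code: a TreeMap stands in for one partition sorted by a
single clustering value, and each page resumes after the last key of the
previous page. The names KeysetPagingSketch, samplePartition, fetchPage and
countInRange are hypothetical, and the sketch assumes clustering values are
unique. With the (Area, uuid) clustering key from the table in this thread, the
resume position would have to carry both columns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for one partition whose rows are sorted by a clustering value. */
final class KeysetPagingSketch {

    /** Builds a toy partition with clustering values 1.0 .. n (hypothetical test data). */
    static NavigableMap<Double, String> samplePartition(int n) {
        NavigableMap<Double, String> partition = new TreeMap<>();
        for (int i = 1; i <= n; i++) {
            partition.put((double) i, "row-" + i);
        }
        return partition;
    }

    /** Returns up to pageSize keys strictly between the two bounds, in ascending order. */
    static List<Double> fetchPage(NavigableMap<Double, String> partition,
                                  double lowerExclusive, double upperExclusive, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Double> page = new ArrayList<>();
        for (Double key : partition.subMap(lowerExclusive, false, upperExclusive, false).keySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    /** Walks the whole range one page at a time, resuming after the last key seen. */
    static int countInRange(NavigableMap<Double, String> partition,
                            double lowerExclusive, double upperExclusive, int pageSize) {
        int total = 0;
        double cursor = lowerExclusive;
        while (true) {
            List<Double> page = fetchPage(partition, cursor, upperExclusive, pageSize);
            total += page.size();
            if (page.size() < pageSize) {
                return total; // a short page means the range is exhausted
            }
            cursor = page.get(page.size() - 1); // next page starts after this key
        }
    }

    public static void main(String[] args) {
        NavigableMap<Double, String> partition = samplePartition(150);
        System.out.println(countInRange(partition, 20, 100, 10));
    }
}
```

Only one page is held in memory at a time, so the page size bounds the work done
per request regardless of how many rows the range contains.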



On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

>  If even 500-1000 isn't working, then your cassandra node might not be up.
>
> 1) Try running nodetool status from shell on your cassandra server, make
> sure the nodes are up.
>
> 2) Are you calling this on the same server where cassandra is running? It's
> trying to connect to localhost. If you're running it on a different
> server, try passing in the direct ip of your cassandra server.
>
> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
> wrote:
>
>> The data won't change much, but the queries will be different.
>> I am not working on the rendering tool myself, so I don't know many
>> details about it.
>>
>> Also, as you suggested, I tried fetching the data with a fetch size of 500
>> or 1000 using the Java driver's automatic pagination.
>> It fails when the number of records is high (around 100000) with the
>> following error:
>>
>> Exception in thread "main"
>> com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s)
>> tried for query failed (tried: localhost/127.0.0.1:9042
>> (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for
>> server response))
>>
>>
>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> How often does the data change?
>>>
>>> I would still recommend caching of some kind, but without knowing more
>>> details (how often the data is changing, what you're doing with the 1m rows
>>> after getting them, etc.) I can't recommend a solution.
>>>
>>> I did see your other thread. I would also vote for elasticsearch / solr;
>>> they are more suited to the kind of analytics you seem to be doing.
>>> Cassandra is more for storing data; it isn't all that great for complex
>>> queries / analytics.
>>>
>>> If you want to stick to cassandra, you might have better luck if you
>>> made your range columns part of the primary key, so something like PRIMARY
>>> KEY(caseId, x, y)
>>>
>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <meme...@cs.stonybrook.edu>
>>> wrote:
>>>
>>>> The rendering tool renders a portion of a very large image. It may fetch
>>>> different data each time from billions of rows.
>>>> So I don't think I can cache such large results, since the same results
>>>> will rarely be fetched again.
>>>>
>>>> Also, do you know how I can do 2D range queries using Cassandra? Some
>>>> other users suggested that I use Solr.
>>>> But is there any way I can achieve that without using any other
>>>> technology?
>>>>
>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry, meant to say "that way when you have to render, you can just
>>>>> display the latest cache."
>>>>>
>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <ali.rac...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I would probably do this in a background thread and cache the
>>>>>> results, that way when you have to render, you can just cache the latest
>>>>>> results.
>>>>>>
>>>>>> I don't know why Cassandra can't seem to be able to fetch large batch
>>>>>> sizes, I've also run into these timeouts but reducing the batch size to 2k
>>>>>> seemed to work for me.
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <
>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>
>>>>>>> We have a UI that needs this data for rendering.
>>>>>>> So the efficiency of pulling this data matters a lot. It should be
>>>>>>> fetched within a minute.
>>>>>>> Is there a way to achieve such efficiency?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m rows, it
>>>>>>>> seems like the difference would only be a few minutes. Do you have to do
>>>>>>>> this all the time, or only once in a while?
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <
>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>
>>>>>>>>> Yes, it works for 1000 but not more than that.
>>>>>>>>> How can I fetch all rows using this efficiently?
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Have you tried a smaller fetch size, such as 5k - 2k ?
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <
>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>
>>>>>>>>>>> I have tried a fetch size of 10000, but it's still not giving any
>>>>>>>>>>> results.
>>>>>>>>>>> My expectation was that Cassandra can handle a million rows
>>>>>>>>>>> easily.
>>>>>>>>>>>
>>>>>>>>>>> Is there any mistake in the way I am defining the keys or
>>>>>>>>>>> querying them?
>>>>>>>>>>> Thanks
>>>>>>>>>>> Mehak
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <
>>>>>>>>>>> jens.ran...@tink.se> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Try setting fetchsize before querying. Assuming you don't set
>>>>>>>>>>>> it too high, and you don't have too many tombstones, that should do it.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Jens
>>>>>>>>>>>>
>>>>>>>>>>>> –
>>>>>>>>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <
>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a requirement to fetch a million rows as the result of my query,
>>>>>>>>>>>>> which is giving timeout errors.
>>>>>>>>>>>>> I am fetching results by selecting on clustering columns, so
>>>>>>>>>>>>> why are the queries taking so long? I can change the timeout settings, but I
>>>>>>>>>>>>> need the data to be fetched faster as per my requirement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>> CREATE TABLE images.results (
>>>>>>>>>>>>>     uuid uuid,
>>>>>>>>>>>>>     analysis_execution_id varchar, analysis_execution_uuid uuid,
>>>>>>>>>>>>>     x double, y double, loc varchar, w double, h double,
>>>>>>>>>>>>>     normalized varchar, type varchar,
>>>>>>>>>>>>>     filehost varchar, filename varchar,
>>>>>>>>>>>>>     image_uuid uuid, image_uri varchar, image_caseid varchar,
>>>>>>>>>>>>>     image_mpp_x double, image_mpp_y double,
>>>>>>>>>>>>>     image_width double, image_height double,
>>>>>>>>>>>>>     objective double, cancer_type varchar, Area float,
>>>>>>>>>>>>>     submit_date timestamp, points list<double>,
>>>>>>>>>>>>>     PRIMARY KEY ((image_caseid), Area, uuid)
>>>>>>>>>>>>> );
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here each row is uniquely identified by its uuid. But since my
>>>>>>>>>>>>> data is generally queried by image_caseid, I have made that the
>>>>>>>>>>>>> partition key.
>>>>>>>>>>>>> I am currently using the DataStax Java API to fetch the results.
>>>>>>>>>>>>> But the query takes a long time, resulting in timeout errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
>>>>>>>>>>>>>  at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>>>>>>>>>>  at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>>>>>>>>>>  at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>>>>>>>>>>  at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>>>>>>>>>>  at QueryDB.queryArea(TestQuery.java:59)
>>>>>>>>>>>>>  at TestQuery.main(TestQuery.java:35)
>>>>>>>>>>>>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
>>>>>>>>>>>>>  at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>>>>>>>>>>  at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>>>>>>>>>>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>>>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, the same query fails on the console, even with a
>>>>>>>>>>>>> limit of 2000 rows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> cqlsh:images> select count(*) from results where image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and Area>20 limit 2000;
>>>>>>>>>>>>> errors={}, last_host=127.0.0.1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>> Mehak
