Thanks Bryan. I believe I have a different problem with the DataStax 2.1.6 driver. My problem is not that I make huge selects; it seems to occur on some inserts. I insert MANY rows, and with version 2.1.6 of the driver I seem to be losing some records.
But thanks anyway, I will remember your mail when I bump into the select problem.

Cheers

Jean

On 15 Jun 2015, at 19:13, Bryan Holladay <holla...@longsight.com> wrote:

There's your problem: you're using the DataStax java driver :) I just ran into this issue in the last week and it was incredibly frustrating. If you are doing a simple loop on a "select *" query, then the DataStax java driver will only process 2^31 rows (i.e. the Java Integer max, 2,147,483,647) before it stops without any error or output in the logs. The fact that you said you only had about 2 billion rows but are seeing missing data is a red flag. I found the only way around this is to do your "select *" in chunks based on the token range (see this gist for an example: https://gist.github.com/baholladay/21eb4c61ea8905302195 ). Just loop for every 100 million rows and make a new query: "select * from TABLE where token(key) > lastToken".

Thanks,
Bryan

On Mon, Jun 15, 2015 at 12:50 PM, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:

Dear all,

I identified a bit more closely the root cause of my missing data. The problem occurs when I use

    <dependency>
        <groupId>com.datastax.cassandra</groupId>
        <artifactId>cassandra-driver-core</artifactId>
        <version>2.1.6</version>
    </dependency>

on my client against Cassandra 2.1.6. I did not have the problem when I was using driver 2.1.4 with C* 2.1.4. Interestingly enough, I don't have the problem with driver 2.1.4 against C* 2.1.6!

So as far as I can locate the problem, I would say that version 2.1.6 of the driver is not working properly and is losing some of my records.

——————

As far as my tombstones are concerned, I don't understand their origin. I removed every place in my code where I delete items, and I do not use TTL anywhere (I don't need this feature in my project). And yet I have many tombstones building up.
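Bryan's token-range workaround can be sketched in plain Java. This is only an illustration (the class and method names are invented, and a real scan would execute each chunk through the driver as "SELECT * FROM tbl WHERE token(key) > ? AND token(key) <= ?"); it just shows how the Murmur3 token ring can be split into contiguous chunks:

```java
import java.math.BigInteger;

public class TokenRangeSplitter {

    // Murmur3Partitioner tokens span the whole signed 64-bit range.
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /**
     * Returns chunk boundaries t0 < t1 < ... < tn covering the full ring,
     * so each "select *" chunk reads far fewer than 2^31 rows.
     * BigInteger avoids the overflow of computing MAX - MIN in a long.
     */
    static long[] boundaries(int chunks) {
        BigInteger span = MAX.subtract(MIN); // 2^64 - 1
        long[] out = new long[chunks + 1];
        for (int i = 0; i <= chunks; i++) {
            out[i] = MIN.add(span.multiply(BigInteger.valueOf(i))
                                 .divide(BigInteger.valueOf(chunks)))
                        .longValueExact();
        }
        return out;
    }

    public static void main(String[] args) {
        long[] b = boundaries(4);
        for (int i = 0; i + 1 < b.length; i++) {
            // each sub-range would become one driver query:
            //   SELECT * FROM tbl WHERE token(key) > b[i] AND token(key) <= b[i+1]
            System.out.println("chunk " + i + ": (" + b[i] + ", " + b[i + 1] + "]");
        }
    }
}
```

Each sub-range is then scanned independently, so no single driver iteration ever approaches the 2^31-row limit Bryan describes.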
Is there another origin for tombstones besides TTLs and deleting items? Could the compaction of LeveledCompactionStrategy be the origin of them?

@Carlos thanks for your guidance.

Kind regards

Jean

On 15 Jun 2015, at 11:17, Carlos Rolo <r...@pythian.com> wrote:

Hi Jean,

The problem behind that warning is that you are reading too many tombstones per request. If you do have tombstones without doing DELETEs, it is probably because you TTL'ed the data when inserting (by mistake? Or did you set default_time_to_live on your table?). You can use nodetool cfstats to see how many tombstones per read slice you have. This is probably also the cause of your missing data: the data was tombstoned, so it is not available.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Mon, Jun 15, 2015 at 10:54 AM, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:

Hi,

I have reloaded the data in my cluster of 3 nodes, RF: 2. I have loaded about 2 billion rows into one table. I use LeveledCompactionStrategy on my table. I use version 2.1.6. I use the default cassandra.yaml; only the IP addresses for the seeds and the throughput have been changed. I loaded my data with simple INSERT statements. This took a bit more than one day to load the data, and one more day to compact the data on all nodes. For me this is quite acceptable since I should not be doing this again. I have done this with previous versions like 2.1.3 and others, and I basically had absolutely no problems. In the log files on the client side I see no warnings and no errors. On the node side I see many WARNINGs, all related to tombstones, but there are no ERRORs.
My problem is that I see *many missing records* in the DB, and I have never observed this with previous versions.

1) Is this a known problem?
2) Do you have any idea how I could track down this problem?
3) What is the meaning of this WARNING (the only type of ERROR | WARN I could find)?

WARN [SharedPool-Worker-2] 2015-06-15 10:12:00,866 SliceQueryFilter.java:319 - Read 2990 live and 16016 tombstone cells in gttdata.alltrades_co_rep_pcode for key: D:07 (see tombstone_warn_threshold). 5000 columns were requested, slices=[388:201001-388:201412:!]

4) Is it possible to have tombstones when we make no DELETE statements?

I'm lost… Thanks for your help.

--
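For what it's worth, the numbers in that WARN line can be read directly: the slice scanned 16016 tombstone cells against only 2990 live cells. A tiny sketch of the arithmetic (assuming the stock cassandra.yaml, where tombstone_warn_threshold defaults to 1000 in 2.1):

```java
public class TombstoneWarnRatio {
    public static void main(String[] args) {
        // figures copied from the WARN line above
        int live = 2990;
        int tombstones = 16016;
        // default tombstone_warn_threshold in cassandra.yaml (Cassandra 2.1)
        int warnThreshold = 1000;

        // fraction of scanned cells that were tombstones
        double deadFraction = (double) tombstones / (live + tombstones);
        System.out.printf("tombstones = %d (%.1f%% of cells scanned)%n",
                tombstones, 100 * deadFraction);
        System.out.println("exceeds warn threshold: " + (tombstones > warnThreshold));
    }
}
```

So roughly 84% of the cells touched by that slice were tombstones, which is why Cassandra flags the read.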