Re: Phantom growth resulting in automatic node shutdown

2018-04-19 Thread horschi
Did you check the number of files in your data folder before & after the
restart?

I have seen cases where cassandra would keep creating sstables, which
disappeared on restart.
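A minimal sketch for taking that snapshot before and after the restart; the data directory path is an assumption, adjust it to match data_file_directories in your cassandra.yaml:

```shell
#!/bin/sh
# Snapshot the sstable file count and on-disk usage; run once before and
# once after the restart, then diff the two outputs.
# DATA_DIR is an assumption: adjust to your data_file_directories setting.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"

echo "sstable data files: $(find "$DATA_DIR" -type f -name '*-Data.db' 2>/dev/null | wc -l)"
echo "on-disk usage:      $(du -sh "$DATA_DIR" 2>/dev/null | cut -f1)"
```

If the file count drops sharply across the restart while the live data set did not change, that points at sstable files that vanished on restart, matching the behaviour described above.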

regards,
Christian


On Thu, Apr 19, 2018 at 12:18 PM, Fernando Neves 
wrote:

> I am facing one issue with our Cassandra cluster.
>
> Details: Cassandra 3.0.14, 12 nodes, 7.4TB(JBOD) disk size in each node,
> ~3.5TB used physical data in each node, ~42TB whole cluster and default
> compaction setup. This size stays roughly the same because some tables are
> dropped after the retention period.
>
> Issue: Nodetool status does not show the correct used size in its output.
> The reported used size keeps increasing without limit until the node shuts
> down automatically, or until our sequential scheduled restart (a workaround
> we run 3 times a week). After the restart, nodetool shows the correct used
> space, but only for a few days.
> Did anybody have similar problem? Is it a bug?
>
> Stackoverflow: https://stackoverflow.com/questions/49668692/cassandra-nodetool-status-is-not-showing-correct-used-space
>
>
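One way to make the mismatch concrete is to compare the load Cassandra reports against what the filesystem sees. A rough sketch, assuming a stock install (nodetool on the PATH, default data directory, and the usual "Load" line format in nodetool info):

```shell
#!/bin/sh
# Compare nodetool's reported load against actual filesystem usage.
# Paths and the "Load" line format are assumptions for a default install.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data}"

reported=$(nodetool info 2>/dev/null | awk -F': *' '/^Load/ {print $2}')
actual=$(du -sh "$DATA_DIR" 2>/dev/null | cut -f1)

echo "nodetool reports : ${reported:-n/a}"
echo "filesystem shows : ${actual:-n/a}"
```

If the reported value keeps growing while du stays flat, the problem is in Cassandra's bookkeeping rather than on disk.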


Re: Repair of 5GB data vs. disk throughput does not make sense

2018-04-26 Thread horschi
Hi Thomas,

I don't think I have seen compaction ever being faster.

For me, tables with small values usually run at around 5 MB/s with a single
compaction. With larger blobs (a few KB per blob) I have seen 16 MB/s. Both
with "nodetool setcompactionthroughput 0".

I don't think it's disk related either. I think parsing the data simply
saturates the CPU, or perhaps the issue is GC related? But I have never dug
into it; I just observed low IO-wait percentages in top.

regards,
Christian




On Thu, Apr 26, 2018 at 7:39 PM, Jonathan Haddad  wrote:

> I can't say for sure, because I haven't measured it, but I've seen a
> combination of readahead + large chunk size with compression cause serious
> issues with read amplification, although I'm not sure if or how it would
> apply here.  Likely depends on the size of your partitions and the
> fragmentation of the sstables, although at only 5GB I'm really surprised to
> hear 32GB read in, that seems a bit absurd.
>
> Definitely something to dig deeper into.
>
> On Thu, Apr 26, 2018 at 5:02 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Hello,
>>
>>
>>
>> yet another question/issue with repair.
>>
>>
>>
>> Cassandra 2.1.18, 3 nodes, RF=3, vnode=256, data volume ~ 5G per node
>> only. A repair (nodetool repair -par) issued on a single node at this data
>> volume takes around 36min with an AVG of ~ 15MByte/s disk throughput
>> (read+write) for the entire time-frame, thus processing ~ 32GByte from a
>> disk perspective so ~ 6 times of the real data volume reported by nodetool
>> status. Does this make any sense? This is with 4 compaction threads and
>> compaction throughput = 64. Similar results doing this test a few times,
>> where most/all inconsistent data should be already sorted out by previous
>> runs.
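The figures above can be sanity-checked with a little shell arithmetic (all numbers are taken from the mail itself, nothing re-measured here):

```shell
#!/bin/sh
# ~15 MB/s sustained over 36 minutes vs. ~5 GB of live data per node.
mbps=15          # observed avg disk throughput, read+write
minutes=36       # repair wall-clock time
data_gb=5        # live data per node per nodetool status

total_mb=$(( mbps * minutes * 60 ))
echo "bytes moved:   $(( total_mb / 1024 )) GB"      # roughly 31-32 GB
echo "amplification: ~$(( total_mb / 1024 / data_gb ))x the live data"
```

So the reported numbers are internally consistent; the open question is why repair touches roughly six times the live data set at all.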
>>
>>
>>
>> I know there is e.g. reaper, but the above is a simple use case simply
>> after a single failed node recovers beyond the 3h hinted handoff window.
>> How should this finish in a timely manner for > 500G on a recovering node?
>>
>>
>>
>> I have to admit this is with NFS as storage. I know, NFS might not be the
>> best idea, but with the above test at ~ 5GB data volume, we see an IOPS
>> rate at ~ 700 at a disk latency of ~ 15ms, thus I wouldn’t treat it as that
>> bad. This all is using/running Cassandra on-premise (at the customer, so
>> not hosted by us), so while we can make recommendations storage-wise (of
>> course preferring local disks), it can and will happen that NFS is in use.
>>
>>
>>
>> Why we are using -par in combination with NFS is a different story and
>> related to this issue: https://issues.apache.org/
>> jira/browse/CASSANDRA-8743. Without switching from sequential to
>> parallel repair, we basically kill Cassandra.
>>
>>
>>
>> Throughput-wise, I also don’t think it is related to NFS, cause we see
>> similar repair throughput values with AWS EBS (gp2, SSD based) running
>> regular repairs on small-sized CFs.
>>
>>
>>
>> Thanks for any input.
>>
>> Thomas
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>>
>


CAS operation does not return value on failure

2016-05-04 Thread horschi
Hi,

I am doing some testing on CAS operations and I am frequently having the
issue that my resultset says wasApplied()==false, but it does not contain
any value.


This behaviour of course leads to the following Exception when I try to
read it:

Caused by: java.lang.IllegalArgumentException: value is not a column
defined in this metadata
at
com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
at
com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
at
com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:68)
at
com.datastax.driver.core.AbstractGettableData.getBytes(AbstractGettableData.java:131)



My questions now are:

Is it to be expected that a failing CAS operation sometimes does this?

if yes: Shouldn't there be a possibility on the driver side to handle this
in a better way, e.g. add a "hasColumn()" method or something to the ResultSet?

if no: Is that perhaps a symptom of a greater issue in Cassandra?


kind regards,
Christian

PS: I also appreciate general feedback on the entire C* CAS topic :-)


Re: CAS operation does not return value on failure

2016-05-05 Thread horschi
Hi Jack,

I thought that it is Cassandra that fills in the value on CAS failures. So
the question whether it is expected to get wasApplied()==false without any
value in the ResultSet should belong here.

So my question for this mailing list would be:

Is it correct behaviour that C* returns wasApplied()==false but no value?
My expectation was that there is always a value in such a case.

kind regards,
Christian


On Wed, May 4, 2016 at 6:00 PM, Jack Krupansky 
wrote:

> Probably better to ask this on the Java driver user list.
>
>
> -- Jack Krupansky
>
> On Wed, May 4, 2016 at 11:46 AM, horschi  wrote:
>
>> Hi,
>>
>> I am doing some testing on CAS operations and I am frequently having the
>> issue that my resultset says wasApplied()==false, but it does not contain
>> any value.
>>
>>
>> This behaviour of course leads to the following Exception when I try to
>> read it:
>>
>> Caused by: java.lang.IllegalArgumentException: value is not a column
>> defined in this metadata
>> at
>> com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>> at
>> com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>> at
>> com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:68)
>> at
>> com.datastax.driver.core.AbstractGettableData.getBytes(AbstractGettableData.java:131)
>>
>>
>>
>> My questions now are:
>>
>> Is it to be expected that a failing CAS operation sometimes does this?
>>
>> if yes: Shouldn't there a possibility on the driver side to handle this
>> in a better was, e.g. add a "hasColumn()" method or something to the
>> ResultSet?
>>
>> if no: Is that perhaps a symptom to a greater issue in cassandra?
>>
>>
>> kind regards,
>> Christian
>>
>> PS: I also appreciate general feedback on the entire C* CAS topic :-)
>>
>>
>>
>


Re: CAS operation does not return value on failure

2016-05-09 Thread horschi
Hi Jack,

sorry to keep you busy :-)

There definitely is a column named "value" in the table. And most of the
time this codepath works fine, even when my CAS update fails. But in very
rare cases I get a ResultSet that contains applied=false but does not
contain any value column.


I just ran my test again and found an empty ResultSet for the following
query:

delete from "Lock" where lockname=:lockname and id=:id if value=:value

--> ResultSet contains only [applied]=false, but no lockname, id or value.


Am I correct in my assumption that this should not be?

kind regards,
Christian



On Fri, May 6, 2016 at 1:20 AM, Jack Krupansky 
wrote:

> "value" in that message is the name of a column that is expected to be in
> your table schema - the message is simply complaining that you have no
> column named "value" in that table.
> The error concerns the table schema, not any actual data in either the
> statement or the table.
>
> "metadata" is simply referring to your table schema.
>
> Does your table schema have a "value" column?
> Does your prepared statement refer to a "value" column, or are you
> supplying that name when executing the prepared statement?
>
> The "datastax.driver.core" in the exception trace class names indicates
> that the error is detected in the Java driver, not Cassandra.
>
>
>
> -- Jack Krupansky
>
> On Thu, May 5, 2016 at 6:45 PM, horschi  wrote:
>
>> Hi Jack,
>>
>> I thought that it is Cassandra that fills the value on CAS failures. So
>> the question if it is to be expected to have wasApplied()==false and not
>> have any value in the ResultSet should belong here.
>>
>> So my question for this mailing list would be:
>>
>> Is it correct behaviour that C* returns wasApplied()==false but not any
>> value? My expectation was that there always is a value in such a case.
>>
>> kind regards,
>> Christian
>>
>>
>> On Wed, May 4, 2016 at 6:00 PM, Jack Krupansky 
>> wrote:
>>
>>> Probably better to ask this on the Java driver user list.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Wed, May 4, 2016 at 11:46 AM, horschi  wrote:
>>>
>>>> Hi,
>>>>
>>>> I am doing some testing on CAS operations and I am frequently having
>>>> the issue that my resultset says wasApplied()==false, but it does not
>>>> contain any value.
>>>>
>>>>
>>>> This behaviour of course leads to the following Exception when I try to
>>>> read it:
>>>>
>>>> Caused by: java.lang.IllegalArgumentException: value is not a column
>>>> defined in this metadata
>>>> at
>>>> com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>>>> at
>>>> com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>>>> at
>>>> com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:68)
>>>> at
>>>> com.datastax.driver.core.AbstractGettableData.getBytes(AbstractGettableData.java:131)
>>>>
>>>>
>>>>
>>>> My questions now are:
>>>>
>>>> Is it to be expected that a failing CAS operation sometimes does this?
>>>>
>>>> if yes: Shouldn't there a possibility on the driver side to handle this
>>>> in a better was, e.g. add a "hasColumn()" method or something to the
>>>> ResultSet?
>>>>
>>>> if no: Is that perhaps a symptom to a greater issue in cassandra?
>>>>
>>>>
>>>> kind regards,
>>>> Christian
>>>>
>>>> PS: I also appreciate general feedback on the entire C* CAS topic :-)
>>>>
>>>>
>>>>
>>>
>>
>


Re: CAS operation does not return value on failure

2016-05-09 Thread horschi
I just retried with Cassandra 3.0.5 and it performs much better. Not a
single one of these illegal results.

I guess my recommendation for anyone using CAS is: Upgrade to >= 3.x :-)

On Wed, May 4, 2016 at 5:46 PM, horschi  wrote:

> Hi,
>
> I am doing some testing on CAS operations and I am frequently having the
> issue that my resultset says wasApplied()==false, but it does not contain
> any value.
>
>
> This behaviour of course leads to the following Exception when I try to
> read it:
>
> Caused by: java.lang.IllegalArgumentException: value is not a column
> defined in this metadata
> at
> com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
> at
> com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
> at
> com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:68)
> at
> com.datastax.driver.core.AbstractGettableData.getBytes(AbstractGettableData.java:131)
>
>
>
> My questions now are:
>
> Is it to be expected that a failing CAS operation sometimes does this?
>
> if yes: Shouldn't there a possibility on the driver side to handle this in
> a better was, e.g. add a "hasColumn()" method or something to the ResultSet?
>
> if no: Is that perhaps a symptom to a greater issue in cassandra?
>
>
> kind regards,
> Christian
>
> PS: I also appreciate general feedback on the entire C* CAS topic :-)
>
>
>


Re: CAS operation does not return value on failure

2016-05-09 Thread horschi
Update: It was actually the driver update (from 2.1.9 to 3.0.1) that solved
the issue. I reverted my C* server back to 2.2 and my test is still ok.

On Mon, May 9, 2016 at 1:28 PM, horschi  wrote:

> I just retried with Cassandra 3.0.5 and it performs much better. Not a
> single of these illegal results.
>
> I guess my recommendation for anyone using CAS is: Upgrade to >= 3.x :-)
>
> On Wed, May 4, 2016 at 5:46 PM, horschi  wrote:
>
>> Hi,
>>
>> I am doing some testing on CAS operations and I am frequently having the
>> issue that my resultset says wasApplied()==false, but it does not contain
>> any value.
>>
>>
>> This behaviour of course leads to the following Exception when I try to
>> read it:
>>
>> Caused by: java.lang.IllegalArgumentException: value is not a column
>> defined in this metadata
>> at
>> com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>> at
>> com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>> at
>> com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:68)
>> at
>> com.datastax.driver.core.AbstractGettableData.getBytes(AbstractGettableData.java:131)
>>
>>
>>
>> My questions now are:
>>
>> Is it to be expected that a failing CAS operation sometimes does this?
>>
>> if yes: Shouldn't there a possibility on the driver side to handle this
>> in a better was, e.g. add a "hasColumn()" method or something to the
>> ResultSet?
>>
>> if no: Is that perhaps a symptom to a greater issue in cassandra?
>>
>>
>> kind regards,
>> Christian
>>
>> PS: I also appreciate general feedback on the entire C* CAS topic :-)
>>
>>
>>
>


C* 2.2.7 ?

2016-06-21 Thread horschi
Hi,

are there any plans to release 2.2.7 any time soon?

kind regards,
Christian


Re: C* 2.2.7 ?

2016-06-29 Thread horschi
Awesome! There is a lot of good stuff in 2.2.7 :-)

On Wed, Jun 29, 2016 at 5:37 PM, Tyler Hobbs  wrote:

> 2.2.7 just got tentatively tagged yesterday.  So, there should be a vote
> on releasing it shortly.
>
> On Wed, Jun 29, 2016 at 8:24 AM, Dominik Keil 
> wrote:
>
>> +1
>>
>> there are some bug fixes we might be, or surely are, affected by, and the
>> change log has become quite large already. Mind voting on 2.2.7 soon?
>>
>>
>> Am 21.06.2016 um 15:31 schrieb horschi:
>>
>> Hi,
>>
>> are there any plans to release 2.2.7 any time soon?
>>
>> kind regards,
>> Christian
>>
>>
>> --
>> *Dominik Keil*
>> Phone: + 49 (0) 621 150 207 31
>> Mobile: + 49 (0) 151 626 602 14
>>
>> Movilizer GmbH
>> Konrad-Zuse-Ring 30
>> 68163 Mannheim
>> Germany
>>
>> movilizer.com
>>
>> *Reinvent Your Mobile Enterprise*
>>
>> *-Movilizer is moving*
>> After June 27th 2016 Movilizer's new headquarter will be
>>
>>
>>
>>
>> *EASTSITE VIII, Konrad-Zuse-Ring 30, 68163 Mannheim*
>>
>>
>> *Be the first to know:*
>> Twitter <https://twitter.com/Movilizer> | LinkedIn
>> <https://www.linkedin.com/company/movilizer-gmbh> | Facebook
>> <https://www.facebook.com/Movilizer> | stack overflow
>> <http://stackoverflow.com/questions/tagged/movilizer>
>>
>> Company's registered office: Mannheim HRB: 700323 / Country Court:
>> Mannheim Managing Directors: Alberto Zamora, Jörg Bernauer, Oliver Lesche
>> Please inform us immediately if this e-mail and/or any attachment was
>> transmitted incompletely or was not intelligible.
>>
>> This e-mail and any attachment is for authorized use by the intended
>> recipient(s) only. It may contain proprietary material, confidential
>> information and/or be subject to legal privilege. It should not be
>> copied, disclosed to, retained or used by any other party. If you are not
>> an intended recipient then please promptly delete this e-mail and any
>> attachment and all copies and inform the sender.
>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>


Re: [RELEASE] Apache Cassandra 3.0.8 released

2016-07-07 Thread horschi
Same for 2.2.7.

On Thu, Jul 7, 2016 at 10:49 AM, Julien Anguenot 
wrote:

> Hey,
>
> The Debian packages do not seem to have been published. Normal?
>
> Thank you.
>
>J.
>
> On Jul 6, 2016, at 4:20 PM, Jake Luciani  wrote:
>
> The Cassandra team is pleased to announce the release of Apache Cassandra
> version 3.0.8.
>
> Apache Cassandra is a fully distributed database. It is the right choice
> when you need scalability and high availability without compromising
> performance.
>
>  http://cassandra.apache.org/
>
> Downloads of source and binary distributions are listed in our download
> section:
>
>  http://cassandra.apache.org/download/
>
> This version is a bug fix release[1] on the 3.0 series. As always, please
> pay
> attention to the release notes[2] and Let us know[3] if you were to
> encounter
> any problem.
>
> Enjoy!
>
> [1]: http://goo.gl/DQpe4d (CHANGES.txt)
> [2]: http://goo.gl/UISX1K (NEWS.txt)
> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>
>
>


Re: [RELEASE] Apache Cassandra 3.0.8 released

2016-07-08 Thread horschi
2.2.7 also works. Thanks!

On Fri, Jul 8, 2016 at 9:15 AM, Julien Anguenot  wrote:

> All good now. Thanks.
>
>J.
>
> --
> Julien Anguenot (@anguenot)
>
> On Jul 8, 2016, at 4:10 AM, Jake Luciani  wrote:
>
> Sorry, I totally missed that.  Uploading now.
>
> On Thu, Jul 7, 2016 at 4:51 AM, horschi  wrote:
>
>> Same for 2.2.7.
>>
>> On Thu, Jul 7, 2016 at 10:49 AM, Julien Anguenot 
>> wrote:
>>
>>> Hey,
>>>
>>> The Debian packages do not seem to have been published. Normal?
>>>
>>> Thank you.
>>>
>>>J.
>>>
>>> On Jul 6, 2016, at 4:20 PM, Jake Luciani  wrote:
>>>
>>> The Cassandra team is pleased to announce the release of Apache Cassandra
>>> version 3.0.8.
>>>
>>> Apache Cassandra is a fully distributed database. It is the right choice
>>> when you need scalability and high availability without compromising
>>> performance.
>>>
>>>  http://cassandra.apache.org/
>>>
>>> Downloads of source and binary distributions are listed in our download
>>> section:
>>>
>>>  http://cassandra.apache.org/download/
>>>
>>> This version is a bug fix release[1] on the 3.0 series. As always,
>>> please pay
>>> attention to the release notes[2] and Let us know[3] if you were to
>>> encounter
>>> any problem.
>>>
>>> Enjoy!
>>>
>>> [1]: http://goo.gl/DQpe4d (CHANGES.txt)
>>> [2]: http://goo.gl/UISX1K (NEWS.txt)
>>> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>>>
>>>
>>>
>>
>
>


Re: Stale value appears after consecutive TRUNCATE

2016-08-08 Thread horschi
Hi Yuji,

can you reproduce the behaviour with a single node?

The reason I ask is because I probably have the same issue with my
automated tests (which run truncate between every test), which run on my
local laptop.

Maybe around 5 tests randomly fail out of my 1800. I can see that the
failed tests sometimes show data from other tests, which I think must be
because of a failed truncate. This behaviour seems to be CQL only, or at
least has gotten worse with CQL. I did not experience this with Thrift.

regards,
Christian



On Mon, Aug 8, 2016 at 7:34 AM, Yuji Ito  wrote:

> Hi all,
>
> I have a question about clearing table and commit log replay.
> After some tables were truncated consecutively, I got some stale values.
> This problem doesn't occur when I clear keyspaces with DROP (and CREATE).
>
> I'm testing the following test with node failure.
> Some stale values appear at checking phase.
>
> Test iteration:
> 1. initialize tables as below
> 2. request a lot of read/write concurrently
> 3. check all records
> 4. repeat from the beginning
>
> I use C* 2.2.6. There are 3 nodes (replication_factor: 3).
> Each node kills cassandra process at random intervals and restarts it
> immediately.
>
> My initialization:
> 1. clear tables with TRUNCATE
> 2. INSERT initial records
> 3. check if all values are correct
>
> If any phase fails (because of node failure), the initialization starts
> all over again.
> So, tables are sometimes truncated consecutively.
> Though the check in the initialization is OK, stale data appears when I
> execute "SELECT * FROM mykeyspace.mytable;" after a lot of requests are
> completed.
>
> The problem is likely to occur when the ReplayPosition's value in
> "truncated_at" is initialized as below after an empty table is truncated.
>
> Column Family ID: truncated_at
> ----: 0x0156597cd4c7
> (this value was acquired just after phase 1 in my initialization)
>
> I guess some unexpected replays occur.
> Does anyone know the behavior?
>
> Thanks,
> Yuji
>


Re: Stale value appears after consecutive TRUNCATE

2016-08-10 Thread horschi
Hi Yuji,

ok, perhaps you are seeing a different issue than I do.

Have you tried with durable_writes=False? If the issue is caused by the
commitlog, then it should work if you disable durable_writes.

Cheers,
Christian



On Tue, Aug 9, 2016 at 3:04 PM, Yuji Ito  wrote:

> Thanks Christian
>
> can you reproduce the behaviour with a single node?
>
> I tried my test with a single node. But I can't.
>
> This behaviour is seems to be CQL only, or at least has gotten worse with
>> CQL. I did not experience this with Thrift.
>
> I truncate tables with CQL. I've never tried with Thrift.
>
> I think that my problem can happen when truncating even succeeds.
> That's because I check all records after truncating.
>
> I checked the source code.
> ReplayPosition.segment and position become -1 and 0 (ReplayPosition.NONE)
> in discardSSTables() at truncating a table when there is no SSTable.
> I guess that ReplayPosition.segment shouldn't be -1 at truncating a table
> in this case.
> replayMutation() can request unexpected replay mutations because of this
> segment's value.
>
> Is there anyone familiar with truncate and replay?
>
> Regards,
> Yuji
>
>
> On Mon, Aug 8, 2016 at 6:36 PM, horschi  wrote:
>
>> Hi Yuji,
>>
>> can you reproduce the behaviour with a single node?
>>
>> The reason I ask is because I probably have the same issue with my
>> automated tests (which run truncate between every test), which run on my
>> local laptop.
>>
>> Maybe around 5 tests randomly fail out of my 1800. I can see that the
>> failed tests sometimes show data from other tests, which I think must be
>> because of a failed truncate. This behaviour is seems to be CQL only, or at
>> least has gotten worse with CQL. I did not experience this with Thrift.
>>
>> regards,
>> Christian
>>
>>
>>
>> On Mon, Aug 8, 2016 at 7:34 AM, Yuji Ito  wrote:
>>
>>> Hi all,
>>>
>>> I have a question about clearing table and commit log replay.
>>> After some tables were truncated consecutively, I got some stale values.
>>> This problem doesn't occur when I clear keyspaces with DROP (and CREATE).
>>>
>>> I'm testing the following test with node failure.
>>> Some stale values appear at checking phase.
>>>
>>> Test iteration:
>>> 1. initialize tables as below
>>> 2. request a lot of read/write concurrently
>>> 3. check all records
>>> 4. repeat from the beginning
>>>
>>> I use C* 2.2.6. There are 3 nodes (replication_factor: 3).
>>> Each node kills cassandra process at random intervals and restarts it
>>> immediately.
>>>
>>> My initialization:
>>> 1. clear tables with TRUNCATE
>>> 2. INSERT initial records
>>> 3. check if all values are correct
>>>
>>> If any phase fails (because of node failure), the initialization starts
>>> all over again.
>>> So, tables are sometimes truncated consecutively.
>>> Though the check in the initialization is OK, stale data appears when I
>>> execute "SELECT * FROM mykeyspace.mytable;" after a lot of requests are
>>> completed.
>>>
>>> The problem is likely to occur when the ReplayPosition's value in
>>> "truncated_at" is initialized as below after an empty table is truncated.
>>>
>>> Column Family ID: truncated_at
>>> ----: 0x
>>> 0156597cd4c7
>>> (this value was acquired just after phase 1 in my initialization)
>>>
>>> I guess some unexpected replays occur.
>>> Does anyone know the behavior?
>>>
>>> Thanks,
>>> Yuji
>>>
>>
>>
>


Re: Stale value appears after consecutive TRUNCATE

2016-08-25 Thread horschi
Hi Yuji,

I tried your script a couple of times. I did not experience any stale
values. (On my Linux laptop)

regards,
Ch

On Mon, Aug 15, 2016 at 7:29 AM, Yuji Ito  wrote:

> Hi,
>
> I can reproduce the problem with the following script.
> I got rows which should be truncated.
> If truncating is executed only once, the problem doesn't occur.
>
> The test for multi nodes (replication_factor:3, kill & restart C*
> processes in all nodes) can also reproduce it.
>
> test script:
> 
>
> ip=xxx.xxx.xxx.xxx
>
> echo "0. prepare a table"
> cqlsh $ip -e "drop keyspace testdb;"
> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '1'};"
> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>
> echo "1. insert rows"
> for key in $(seq 1 10)
> do
> cqlsh $ip -e "insert into testdb.testtbl (key, val) values($key, 1000)
> IF NOT EXISTS;" >> /dev/null 2>&1
> done
>
> echo "2. truncate the table twice"
> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
>
> echo "3. kill C* process"
> ps auxww | grep "CassandraDaemon" | awk '{if ($13 ~ /cassand/) print $2}'
> | xargs sudo kill -9
>
> echo "4. restart C* process"
> sudo /etc/init.d/cassandra start
> sleep 20
>
> echo "5. check the table"
> cqlsh $ip -e "select * from testdb.testtbl;"
>
> 
>
> test result:
> 
>
> 0. prepare a table
> 1. insert rows
> 2. truncate the table twice
> Consistency level set to ALL.
> Consistency level set to ALL.
> 3. kill C* process
> 4. restart C* process
> Starting Cassandra: OK
> 5. check the table
>
>  key | val
> -+--
>5 | 1000
>   10 | 1000
>1 | 1000
>8 | 1000
>    2 | 1000
>4 | 1000
>7 | 1000
>6 | 1000
>9 | 1000
>3 | 1000
>
> (10 rows)
>
> 
>
>
> Thanks Christian,
>
> I tried with durable_writes=False.
> It failed. I guessed this failure was caused by another problem.
> I use SimpleStrategy.
> A keyspace using the SimpleStrategy isn't permitted to use
> durable_writes=False.
>
>
> Regards,
> Yuji
>
> On Thu, Aug 11, 2016 at 12:41 AM, horschi  wrote:
>
>> Hi Yuji,
>>
>> ok, perhaps you are seeing a different issue than I do.
>>
>> Have you tried with durable_writes=False? If the issue is caused by the
>> commitlog, then it should work if you disable durable_writes.
>>
>> Cheers,
>> Christian
>>
>>
>>
>> On Tue, Aug 9, 2016 at 3:04 PM, Yuji Ito  wrote:
>>
>>> Thanks Christian
>>>
>>> can you reproduce the behaviour with a single node?
>>>
>>> I tried my test with a single node. But I can't.
>>>
>>> This behaviour is seems to be CQL only, or at least has gotten worse
>>>> with CQL. I did not experience this with Thrift.
>>>
>>> I truncate tables with CQL. I've never tried with Thrift.
>>>
>>> I think that my problem can happen when truncating even succeeds.
>>> That's because I check all records after truncating.
>>>
>>> I checked the source code.
>>> ReplayPosition.segment and position become -1 and 0
>>> (ReplayPosition.NONE) in dscardSSTables() at truncating a table when there
>>> is no SSTable.
>>> I guess that ReplayPosition.segment shouldn't be -1 at truncating a
>>> table in this case.
>>> replayMutation() can request unexpected replay mutations because of this
>>> segment's value.
>>>
>>> Is there anyone familiar with truncate and replay?
>>>
>>> Regards,
>>> Yuji
>>>
>>>
>>> On Mon, Aug 8, 2016 at 6:36 PM, horschi  wrote:
>>>
>>>> Hi Yuji,
>>>>
>>>> can you reproduce the behaviour with a single node?
>>>>
>>>> The reason I ask is because I probably have the same issue with my
>>>> automated tests (which run truncate between every test), which run on my
>>>> local laptop.
>>>>
>>>> Maybe around 5 tests randomly fail out of my 1800. I can see that the
>>>> failed tests sometimes show data from other tests, which I think must be
>>>> because of a failed truncate. This behaviour seems to be CQL only, or at
>>>> least has gotten worse with CQL. I did not experience this with Thrift.

Re: Stale value appears after consecutive TRUNCATE

2016-08-25 Thread horschi
(running C* 2.2.7)

On Thu, Aug 25, 2016 at 11:10 AM, horschi  wrote:

> Hi Yuji,
>
> I tried your script a couple of times. I did not experience any stale
> values. (On my Linux laptop)
>
> regards,
> Ch
>
> On Mon, Aug 15, 2016 at 7:29 AM, Yuji Ito  wrote:
>
>> Hi,
>>
>> I can reproduce the problem with the following script.
>> I got rows which should be truncated.
>> If truncating is executed only once, the problem doesn't occur.
>>
>> The test for multi nodes (replication_factor:3, kill & restart C*
>> processes in all nodes) can also reproduce it.
>>
>> test script:
>> 
>>
>> ip=xxx.xxx.xxx.xxx
>>
>> echo "0. prepare a table"
>> cqlsh $ip -e "drop keyspace testdb;"
>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
>> 'SimpleStrategy', 'replication_factor': '1'};"
>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>>
>> echo "1. insert rows"
>> for key in $(seq 1 10)
>> do
>> cqlsh $ip -e "insert into testdb.testtbl (key, val) values($key,
>> 1000) IF NOT EXISTS;" >> /dev/null 2>&1
>> done
>>
>> echo "2. truncate the table twice"
>> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
>> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
>>
>> echo "3. kill C* process"
>> ps auxww | grep "CassandraDaemon" | awk '{if ($13 ~ /cassand/) print $2}'
>> | xargs sudo kill -9
>>
>> echo "4. restart C* process"
>> sudo /etc/init.d/cassandra start
>> sleep 20
>>
>> echo "5. check the table"
>> cqlsh $ip -e "select * from testdb.testtbl;"
>>
>> 
>>
>> test result:
>> 
>>
>> 0. prepare a table
>> 1. insert rows
>> 2. truncate the table twice
>> Consistency level set to ALL.
>> Consistency level set to ALL.
>> 3. kill C* process
>> 4. restart C* process
>> Starting Cassandra: OK
>> 5. check the table
>>
>>  key | val
>> -+--
>>5 | 1000
>>   10 | 1000
>>1 | 1000
>>8 | 1000
>>2 | 1000
>>4 | 1000
>>7 | 1000
>>6 | 1000
>>9 | 1000
>>3 | 1000
>>
>> (10 rows)
>>
>> 
>>
>>
>> Thanks Christian,
>>
>> I tried with durable_writes=False.
>> It failed. I guessed this failure was caused by another problem.
>> I use SimpleStrategy.
>> A keyspace using the SimpleStrategy isn't permitted to use
>> durable_writes=False.
>>
>>
>> Regards,
>> Yuji
>>
>> On Thu, Aug 11, 2016 at 12:41 AM, horschi  wrote:
>>
>>> Hi Yuji,
>>>
>>> ok, perhaps you are seeing a different issue than I do.
>>>
>>> Have you tried with durable_writes=False? If the issue is caused by the
>>> commitlog, then it should work if you disable durable_writes.
>>>
>>> Cheers,
>>> Christian
>>>
>>>
>>>
>>> On Tue, Aug 9, 2016 at 3:04 PM, Yuji Ito  wrote:
>>>
>>>> Thanks Christian
>>>>
>>>> can you reproduce the behaviour with a single node?
>>>>
>>>> I tried my test with a single node. But I can't.
>>>>
>>>> This behaviour is seems to be CQL only, or at least has gotten worse
>>>>> with CQL. I did not experience this with Thrift.
>>>>
>>>> I truncate tables with CQL. I've never tried with Thrift.
>>>>
>>>> I think that my problem can happen when truncating even succeeds.
>>>> That's because I check all records after truncating.
>>>>
>>>> I checked the source code.
>>>> ReplayPosition.segment and position become -1 and 0
>>>> (ReplayPosition.NONE) in discardSSTables() when truncating a table while there
>>>> is no SSTable.
>>>> I guess that ReplayPosition.segment shouldn't be -1 when truncating a
>>>> table in this case.
>>>> replayMutation() can request unexpected replay mutations because of
>>>> this segment's value.
>>>>
>>>> Is there anyone familiar with truncate and replay?
>>>>
>>>> Regards,
>>>> Yuji
>>>>
>>>>
>>>> On Mon, Aug 8, 2016 at 6:36 PM, horschi  wrote:
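The ReplayPosition behaviour described above can be sketched in a few lines. This is a toy model, not Cassandra source; the names merely mirror the classes mentioned in the thread:

```python
# Toy model of the commit-log replay filter discussed above (not Cassandra
# source): a mutation is replayed only if its position is after the table's
# recorded truncation point. If truncate records ReplayPosition.NONE
# (segment=-1, position=0) because no sstable existed, every mutation sorts
# after it, so the truncated rows reappear on restart.
from collections import namedtuple

ReplayPosition = namedtuple("ReplayPosition", ["segment", "position"])
NONE = ReplayPosition(-1, 0)  # stands in for ReplayPosition.NONE

def should_replay(mutation_pos, truncated_at):
    # tuples compare lexicographically: segment first, then position
    return mutation_pos > truncated_at

# normal truncate: mutations before the truncation point stay discarded
t = ReplayPosition(5, 1024)
assert not should_replay(ReplayPosition(4, 0), t)
assert should_replay(ReplayPosition(6, 0), t)

# truncate recorded as NONE: every mutation replays -> stale rows come back
assert should_replay(ReplayPosition(0, 0), NONE)
```

This matches the symptom in the test script: rows inserted before the second truncate are replayed from the commit log after the kill/restart.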

Re: Stale value appears after consecutive TRUNCATE

2016-08-25 Thread horschi
Nope, still don't get stale values. (I just ran your script 3 times)

On Thu, Aug 25, 2016 at 12:36 PM, Yuji Ito  wrote:

> Thank you for testing, Christian
>
> What did you set commitlog_sync in cassandra.yaml?
> I set commitlog_sync batch (window 2ms) as below.
>
> commitlog_sync: batch
> commitlog_sync_batch_window_in_ms: 2
>
> The problem didn't occur when setting commitlog_sync to periodic (default).
>
> regards,
> yuji
>
>
> On Thu, Aug 25, 2016 at 6:11 PM, horschi  wrote:
>
>> (running C* 2.2.7)
>>
>> On Thu, Aug 25, 2016 at 11:10 AM, horschi  wrote:
>>
>>> Hi Yuji,
>>>
>>> I tried your script a couple of times. I did not experience any stale
>>> values. (On my Linux laptop)
>>>
>>> regards,
>>> Ch
>>>
>>> On Mon, Aug 15, 2016 at 7:29 AM, Yuji Ito  wrote:
>>>
>>>> Hi,
>>>>
>>>> I can reproduce the problem with the following script.
>>>> I got rows which should be truncated.
>>>> If truncating is executed only once, the problem doesn't occur.
>>>>
>>>> The test for multi nodes (replication_factor:3, kill & restart C*
>>>> processes in all nodes) can also reproduce it.
>>>>
>>>> test script:
>>>> 
>>>>
>>>> ip=xxx.xxx.xxx.xxx
>>>>
>>>> echo "0. prepare a table"
>>>> cqlsh $ip -e "drop keyspace testdb;"
>>>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
>>>> 'SimpleStrategy', 'replication_factor': '1'};"
>>>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val
>>>> int);"
>>>>
>>>> echo "1. insert rows"
>>>> for key in $(seq 1 10)
>>>> do
>>>> cqlsh $ip -e "insert into testdb.testtbl (key, val) values($key,
>>>> 1000) IF NOT EXISTS;" >> /dev/null 2>&1
>>>> done
>>>>
>>>> echo "2. truncate the table twice"
>>>> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
>>>> cqlsh $ip -e "consistency all; truncate table testdb.testtbl"
>>>>
>>>> echo "3. kill C* process"
>>>> ps auxww | grep "CassandraDaemon" | awk '{if ($13 ~ /cassand/) print
>>>> $2}' | xargs sudo kill -9
>>>>
>>>> echo "4. restart C* process"
>>>> sudo /etc/init.d/cassandra start
>>>> sleep 20
>>>>
>>>> echo "5. check the table"
>>>> cqlsh $ip -e "select * from testdb.testtbl;"
>>>>
>>>> 
>>>>
>>>> test result:
>>>> 
>>>>
>>>> 0. prepare a table
>>>> 1. insert rows
>>>> 2. truncate the table twice
>>>> Consistency level set to ALL.
>>>> Consistency level set to ALL.
>>>> 3. kill C* process
>>>> 4. restart C* process
>>>> Starting Cassandra: OK
>>>> 5. check the table
>>>>
>>>>  key | val
>>>> -+--
>>>>5 | 1000
>>>>   10 | 1000
>>>>1 | 1000
>>>>8 | 1000
>>>>2 | 1000
>>>>4 | 1000
>>>>7 | 1000
>>>>6 | 1000
>>>>9 | 1000
>>>>3 | 1000
>>>>
>>>> (10 rows)
>>>>
>>>> 
>>>>
>>>>
>>>> Thanks Christian,
>>>>
>>>> I tried with durable_writes=False.
>>>> It failed. I guessed this failure was caused by another problem.
>>>> I use SimpleStrategy.
>>>> A keyspace using the SimpleStrategy isn't permitted to use
>>>> durable_writes=False.
>>>>
>>>>
>>>> Regards,
>>>> Yuji
>>>>
>>>> On Thu, Aug 11, 2016 at 12:41 AM, horschi  wrote:
>>>>
>>>>> Hi Yuji,
>>>>>
>>>>> ok, perhaps you are seeing a different issue than I do.
>>>>>
>>>>> Have you tried with durable_writes=False? If the issue is caused by
>>>>> the commitlog, then it should work if you disable durable_writes.
>>>>>
>>>>> Cheers,
>>>>> Christian
>>>>>
>>>>>
>>>>>
>>>>>

Re: Java Driver - Specifying parameters for an IN() query?

2016-10-11 Thread horschi
Hi Ali,

do you perhaps want 'Select * from my_table WHERE pk = ? And ck IN ?' ?
(Without the brackets around the question mark)

regards,
Ch

On Tue, Oct 11, 2016 at 3:14 PM, Ali Akhtar  wrote:

> If I wanted to create an accessor, and have a method which does a query
> like this:
>
> 'Select * from my_table WHERE pk = ? And ck IN (?)'
>
> And there were multiple options that could go inside the IN() query, how
> can I specify that? Will it e.g, let me pass in an array as the 2nd
> variable?
>
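A toy renderer makes the distinction visible: with `IN ?` the whole collection is one bound value, while `IN (?)` expects exactly one scalar. This is illustrative only — it is not the DataStax driver's API, which binds lists through prepared statements:

```python
# Naive illustration of positional binding (NOT the DataStax driver):
# renders bound values into '?' markers just to show the expansion.
def bind(query, params):
    parts = query.split("?")
    assert len(parts) == len(params) + 1, "marker/parameter count mismatch"
    rendered = []
    for prefix, value in zip(parts, params):
        if isinstance(value, (list, tuple)):
            # a collection bound to a single marker becomes the whole IN list
            text = "(" + ", ".join(repr(v) for v in value) + ")"
        else:
            text = repr(value)
        rendered.append(prefix + text)
    return "".join(rendered) + parts[-1]

# 'ck IN ?' with a list bound as ONE value expands to the full IN list
q = bind("SELECT * FROM my_table WHERE pk = ? AND ck IN ?", [1, [10, 20]])
assert q == "SELECT * FROM my_table WHERE pk = 1 AND ck IN (10, 20)"
```

With `IN (?)` the second marker would accept only a single scalar, which is why the bracket-less form is the one that takes a collection.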


Re: Speeding up schema generation during tests

2016-10-19 Thread horschi
Have you tried starting Cassandra with -Dcassandra.unsafesystem=true ?


On Wed, Oct 19, 2016 at 9:31 AM, DuyHai Doan  wrote:

> As I said, when I bootstrap the server and create some keyspace, sometimes
> the schema is not fully initialized and when the test code tries to insert
> data, it fails.
>
> I did not have time to dig into the source code to find the root cause,
> maybe it's something really stupid and simple to fix. If you want to
> investigate and try out my CassandraDaemon server, I'd be happy to get
> feedbacks
>
> On Wed, Oct 19, 2016 at 9:22 AM, Ali Akhtar  wrote:
>
>> Thanks. I've disabled durable writes but this is still pretty slow (about
>> 10 seconds).
>>
>> What issues did you run into with your impl?
>>
>> On Wed, Oct 19, 2016 at 12:15 PM, DuyHai Doan 
>> wrote:
>>
>>> There are a lot of pre-flight checks when starting the Cassandra server
>>> and they take time.
>>>
>>> For integration testing, I have developed a modified CassandraDaemon
>>> here that removes most of those checks:
>>>
>>> https://github.com/doanduyhai/Achilles/blob/master/achilles-embedded/src/main/java/info/archinnov/achilles/embedded/AchillesCassandraDaemon.java
>>>
>>> The problem is that I ran into weird scenarios where a
>>> keyspace wasn't created in a timely manner, so I stopped using this impl for
>>> the moment; just look at it and do whatever you want.
>>>
>>> Another idea for testing is to disable durable writes to speed up
>>> mutations (CREATE KEYSPACE ... WITH durable_writes = false)
>>>
>>> On Wed, Oct 19, 2016 at 3:24 AM, Ali Akhtar 
>>> wrote:
>>>
 Is there a way to speed up the creation of keyspace + tables during
 integration tests? I am using an RF of 1, with SimpleStrategy, but it still
 takes up to 10-15 seconds.

>>>
>>>
>>
>


Re: Speeding up schema generation during tests

2016-10-23 Thread horschi
You have to manually do "nodetool flush && nodetool flush system" before
shutdown, otherwise Cassandra might break. With that it is working nicely.

On Sun, Oct 23, 2016 at 3:40 PM, Ali Akhtar  wrote:

> I'm using https://github.com/jsevellec/cassandra-unit and haven't come
> across any race issues or problems. Cassandra-unit takes care of creating
> the schema before it runs the tests.
>
> On Sun, Oct 23, 2016 at 6:17 PM, DuyHai Doan  wrote:
>
>> Ok I have added -Dcassandra.unsafesystem=true and my tests are broken.
>>
>> The reason is that I create some schemas before executing tests.
>>
>> When unsafesystem is enabled, Cassandra does not block on the schema flush, so
>> you may run into race conditions where the test starts using the created
>> schema before it has been fully flushed to disk:
>>
>> See C* source code here: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/schema/SchemaKeyspace.java#L278-L282
>>
>> static void flush()
>> {
>> if (!DatabaseDescriptor.isUnsafeSystem())
>> ALL.forEach(table -> FBUtilities.waitOnFuture(getSchemaCFS(table).forceFlush()));
>> }
>>
>> I don't know how it worked out for you but it didn't for me...
>>
>> On Wed, Oct 19, 2016 at 9:45 AM, DuyHai Doan 
>> wrote:
>>
>>> Ohh didn't know such system property exist, nice idea!
>>>
>>> On Wed, Oct 19, 2016 at 9:40 AM, horschi  wrote:
>>>
>>>> Have you tried starting Cassandra with -Dcassandra.unsafesystem=true ?
>>>>
>>>>
>>>> On Wed, Oct 19, 2016 at 9:31 AM, DuyHai Doan 
>>>> wrote:
>>>>
>>>>> As I said, when I bootstrap the server and create some keyspace,
>>>>> sometimes the schema is not fully initialized and when the test code tries
>>>>> to insert data, it fails.
>>>>>
>>>>> I did not have time to dig into the source code to find the root
>>>>> cause, maybe it's something really stupid and simple to fix. If you want 
>>>>> to
>>>>> investigate and try out my CassandraDaemon server, I'd be happy to get
>>>>> feedbacks
>>>>>
>>>>> On Wed, Oct 19, 2016 at 9:22 AM, Ali Akhtar 
>>>>> wrote:
>>>>>
>>>>>> Thanks. I've disabled durable writes but this is still pretty slow
>>>>>> (about 10 seconds).
>>>>>>
>>>>>> What issues did you run into with your impl?
>>>>>>
>>>>>> On Wed, Oct 19, 2016 at 12:15 PM, DuyHai Doan 
>>>>>> wrote:
>>>>>>
>>>>>>> There are a lot of pre-flight checks when starting the Cassandra
>>>>>>> server and they take time.
>>>>>>>
>>>>>>> For integration testing, I have developed a modified
>>>>>>> CassandraDaemon here that removes most of those checks:
>>>>>>>
>>>>>>> https://github.com/doanduyhai/Achilles/blob/master/achilles-embedded/src/main/java/info/archinnov/achilles/embedded/AchillesCassandraDaemon.java
>>>>>>>
>>>>>>> The problem is that I ran into weird scenarios where a
>>>>>>> keyspace wasn't created in a timely manner, so I stopped using this impl
>>>>>>> for
>>>>>>> the moment; just look at it and do whatever you want.
>>>>>>>
>>>>>>> Another idea for testing is to disable durable writes to speed up
>>>>>>> mutations (CREATE KEYSPACE ... WITH durable_writes = false)
>>>>>>>
>>>>>>> On Wed, Oct 19, 2016 at 3:24 AM, Ali Akhtar 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there a way to speed up the creation of keyspace + tables during
>>>>>>>> integration tests? I am using an RF of 1, with SimpleStrategy, but it 
>>>>>>>> still
>>>>>>>> takes up to 10-15 seconds.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
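One way around the race described in this thread — tests using a schema before it is fully initialized — is to poll for readiness in test setup instead of sleeping a fixed time. A generic sketch (the schema check shown in the comment is illustrative, not a specific driver API):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.05):
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# In a real test setup you would poll something like:
#   wait_until(lambda: "testdb" in fetch_keyspace_names())   # hypothetical helper
# Here a stand-in predicate flips to ready on its second call:
calls = {"n": 0}
def fake_schema_ready():
    calls["n"] += 1
    return calls["n"] >= 2

assert wait_until(fake_schema_ready, timeout=1.0)
assert not wait_until(lambda: False, timeout=0.2)
```

Polling like this keeps test setup robust whether or not flags such as `-Dcassandra.unsafesystem=true` change how eagerly schema changes are flushed.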


Re: All subsequent CAS requests time out after heavy use of new CAS feature

2016-12-15 Thread horschi
Hi,

I would like to warm up this old thread. I did some debugging and found out
that the timeouts are coming from StorageProxy.proposePaxos()
- callback.isFullyRefused() returns false and therefore triggers a
WriteTimeout.

Looking at my ccm cluster logs, I can see that two replica nodes return
different results in their ProposeVerbHandler. In my opinion the
coordinator should not throw a Exception in such a case, but instead retry
the operation.

What do the CAS/Paxos experts on this list say to this? Feel free to
instruct me to do further tests/code changes. I'd be glad to help.

Log:

node1/logs/system.log:WARN  [SharedPool-Worker-5] 2016-12-15 14:48:36,896
PaxosState.java:124 - Rejecting proposal for
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node1/logs/system.log-Row: id=@ | value=) because inProgress
is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
key=locktest_ 1 columns=[[] | [value]]
--
node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15 14:48:36,980
StorageProxy.java:506 - proposePaxos:
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node1/logs/system.log-Row: id=@ | value=)//1//0
--
node2/logs/system.log:WARN  [SharedPool-Worker-7] 2016-12-15 14:48:36,969
PaxosState.java:117 - Accepting proposal:
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node2/logs/system.log-Row: id=@ | value=)
--
node3/logs/system.log:WARN  [SharedPool-Worker-2] 2016-12-15 14:48:36,897
PaxosState.java:124 - Rejecting proposal for
Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
columns=[[] | [value]]
node3/logs/system.log-Row: id=@ | value=) because inProgress
is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
key=locktest_ 1 columns=[[] | [value]]


kind regards,
Christian


On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers  wrote:

> My thinking was that due to the size of the data that there maybe I/O
> issues. But it sounds more like you're competing for locks and hit a
> deadlock issue.
>
> Regards,
> Denise
> Cell - (860)989-3431 <(860)%20989-3431>
>
> Sent from mi iPhone
>
> On Apr 15, 2016, at 9:00 AM, horschi  wrote:
>
> Hi Denise,
>
> in my case its a small blob I am writing (should be around 100 bytes):
>
>  CREATE TABLE "Lock" (
>  lockname varchar,
>  id varchar,
>  value blob,
>  PRIMARY KEY (lockname, id)
>  ) WITH COMPACT STORAGE
>  AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
> 'chunk_length_kb' : '8' };
>
> You ask because large values are known to cause issues? Anything special
> you have in mind?
>
> kind regards,
> Christian
>
>
>
>
> On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers  wrote:
>
>> Also, what type of data were you reading/writing?
>>
>> Regards,
>> Denise
>>
>> Sent from mi iPad
>>
>> On Apr 15, 2016, at 8:29 AM, horschi  wrote:
>>
>> Hi Jan,
>>
>> were you able to resolve your Problem?
>>
>> We are trying the same and also see a lot of WriteTimeouts:
>> WriteTimeoutException: Cassandra timeout during write query at
>> consistency SERIAL (2 replica were required but only 1 acknowledged the
>> write)
>>
>> How many clients were competing for a lock in your case? In our case its
>> only two :-(
>>
>> cheers,
>> Christian
>>
>>
>> On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli 
>> wrote:
>>
>>> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen <
>>> jan.algermis...@nordsc.com> wrote:
>>>
>>>> I am experimenting with C* 2.0 ( and today's java-driver 2.0 snapshot)
>>>> for implementing distributed locks.
>>>>
>>>
>>> [ and I'm experiencing the problem described in the subject ... ]
>>>
>>>
>>>> Any idea how to approach this problem?
>>>>
>>>
>>> 1) Upgrade to 2.0.1 release.
>>> 2) Try to reproduce symptoms.
>>> 3) If able to, file a JIRA at https://issues.apache.org/
>>> jira/secure/Dashboard.jspa including repro steps
>>> 4) Reply to this thread with the JIRA ticket URL
>>>
>>> =Rob
>>>
>>>
>>>
>>
>>
>


Re: All subsequent CAS requests time out after heavy use of new CAS feature

2016-12-23 Thread horschi
Update: I replaced all quorum reads on that table with serial reads, and now
these errors occur less often. Somehow quorum reads on CAS values cause most of
these WTEs (WriteTimeoutExceptions).

Also I found two tickets on that topic:
https://issues.apache.org/jira/browse/CASSANDRA-9328
https://issues.apache.org/jira/browse/CASSANDRA-8672

On Thu, Dec 15, 2016 at 3:14 PM, horschi  wrote:

> Hi,
>
> I would like to warm up this old thread. I did some debugging and found
> out that the timeouts are coming from StorageProxy.proposePaxos()
> - callback.isFullyRefused() returns false and therefore triggers a
> WriteTimeout.
>
> Looking at my ccm cluster logs, I can see that two replica nodes return
> different results in their ProposeVerbHandler. In my opinion the
> coordinator should not throw a Exception in such a case, but instead retry
> the operation.
>
> What do the CAS/Paxos experts on this list say to this? Feel free to
> instruct me to do further tests/code changes. I'd be glad to help.
>
> Log:
>
> node1/logs/system.log:WARN  [SharedPool-Worker-5] 2016-12-15 14:48:36,896
> PaxosState.java:124 - Rejecting proposal for 
> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
> node1/logs/system.log-Row: id=@ | value=) because
> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
> key=locktest_ 1 columns=[[] | [value]]
> --
> node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15 14:48:36,980
> StorageProxy.java:506 - proposePaxos: 
> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
> node1/logs/system.log-Row: id=@ | value=)//1//0
> --
> node2/logs/system.log:WARN  [SharedPool-Worker-7] 2016-12-15 14:48:36,969
> PaxosState.java:117 - Accepting proposal: 
> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
> node2/logs/system.log-Row: id=@ | value=)
> --
> node3/logs/system.log:WARN  [SharedPool-Worker-2] 2016-12-15 14:48:36,897
> PaxosState.java:124 - Rejecting proposal for 
> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
> node3/logs/system.log-Row: id=@ | value=) because
> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
> key=locktest_ 1 columns=[[] | [value]]
>
>
> kind regards,
> Christian
>
>
> On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers  wrote:
>
>> My thinking was that due to the size of the data that there maybe I/O
>> issues. But it sounds more like you're competing for locks and hit a
>> deadlock issue.
>>
>> Regards,
>> Denise
>> Cell - (860)989-3431 <(860)%20989-3431>
>>
>> Sent from mi iPhone
>>
>> On Apr 15, 2016, at 9:00 AM, horschi  wrote:
>>
>> Hi Denise,
>>
>> in my case its a small blob I am writing (should be around 100 bytes):
>>
>>  CREATE TABLE "Lock" (
>>  lockname varchar,
>>  id varchar,
>>  value blob,
>>  PRIMARY KEY (lockname, id)
>>  ) WITH COMPACT STORAGE
>>  AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
>> 'chunk_length_kb' : '8' };
>>
>> You ask because large values are known to cause issues? Anything special
>> you have in mind?
>>
>> kind regards,
>> Christian
>>
>>
>>
>>
>> On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers  wrote:
>>
>>> Also, what type of data were you reading/writing?
>>>
>>> Regards,
>>> Denise
>>>
>>> Sent from mi iPad
>>>
>>> On Apr 15, 2016, at 8:29 AM, horschi  wrote:
>>>
>>> Hi Jan,
>>>
>>> were you able to resolve your Problem?
>>>
>>> We are trying the same and also see a lot of WriteTimeouts:
>>> WriteTimeoutException: Cassandra timeout during write query at
>>> consistency SERIAL (2 replica were required but only 1 acknowledged the
>>> write)
>>>
>>> How many clients were competing for a lock in your case? In our case its
>>> only two :-(
>>>
>>> cheers,
>>> Christian
>>>
>>>
>>> On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli 
>>> wrote:
>>>
>>>> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen <
>>>> jan.algermis...@nordsc.com> wrote:
>>>>
>>>>> I am experimenting with C* 2.0 ( and today's java-driver 2.0 snapshot)
>>>>> for implementing distributed locks.
>>>>>
>>>>
>>>> [ and I'm experiencing the problem described in the subject ... ]
>>>>
>>>>
>>>>> Any idea how to approach this problem?
>>>>>
>>>>
>>>> 1) Upgrade to 2.0.1 release.
>>>> 2) Try to reproduce symptoms.
>>>> 3) If able to, file a JIRA at https://issues.apache.org/jira
>>>> /secure/Dashboard.jspa including repro steps
>>>> 4) Reply to this thread with the JIRA ticket URL
>>>>
>>>> =Rob
>>>>
>>>>
>>>>
>>>
>>>
>>
>
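The coordinator behaviour debugged in this thread — a 1-accept/1-reject split surfacing as a WriteTimeout — can be modelled in a few lines. This is a sketch of the decision only, not Cassandra's code:

```python
# Sketch of the accept-quorum decision around StorageProxy.proposePaxos():
# with RF=3 a proposal needs 2 accepts to succeed, and it only counts as a
# clean rejection if every reply refused (isFullyRefused()). A one-accept /
# one-reject split is neither, and surfaces to the client as a WriteTimeout.
def propose_outcome(accepts, rejects, required=2):
    if accepts >= required:
        return "ACCEPTED"
    if accepts == 0 and rejects > 0:
        return "REJECTED"        # fully refused: the round can be retried
    return "WRITE_TIMEOUT"       # mixed/partial responses, as in the logs above

assert propose_outcome(2, 0) == "ACCEPTED"
assert propose_outcome(0, 2) == "REJECTED"
assert propose_outcome(1, 1) == "WRITE_TIMEOUT"  # node2 accepted, node1/node3 rejected
```

This is why the suggestion above is to retry rather than throw: a split vote is an indeterminate round, not a definitive failure.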


Re: All subsequent CAS requests time out after heavy use of new CAS feature

2016-12-24 Thread horschi
Oh yes it is, like Counters :-)


On Sat, Dec 24, 2016 at 4:02 AM, Edward Capriolo 
wrote:

> Anecdotally, CAS works differently than the typical Cassandra workload. If
> you run a stress instance against 3 nodes on one host, you typically run
> into CPU issues, but if you are doing a CAS workload you see things timing
> out before you hit 100% CPU. It is a strange beast.
>
> On Fri, Dec 23, 2016 at 7:28 AM, horschi  wrote:
>
>> Update: I replaced all quorum reads on that table with serial reads, and
>> now these errors occur less often. Somehow quorum reads on CAS values cause most of
>> these WTEs.
>>
>> Also I found two tickets on that topic:
>> https://issues.apache.org/jira/browse/CASSANDRA-9328
>> https://issues.apache.org/jira/browse/CASSANDRA-8672
>>
>> On Thu, Dec 15, 2016 at 3:14 PM, horschi  wrote:
>>
>>> Hi,
>>>
>>> I would like to warm up this old thread. I did some debugging and found
>>> out that the timeouts are coming from StorageProxy.proposePaxos()
>>> - callback.isFullyRefused() returns false and therefore triggers a
>>> WriteTimeout.
>>>
>>> Looking at my ccm cluster logs, I can see that two replica nodes return
>>> different results in their ProposeVerbHandler. In my opinion the
>>> coordinator should not throw a Exception in such a case, but instead retry
>>> the operation.
>>>
>>> What do the CAS/Paxos experts on this list say to this? Feel free to
>>> instruct me to do further tests/code changes. I'd be glad to help.
>>>
>>> Log:
>>>
>>> node1/logs/system.log:WARN  [SharedPool-Worker-5] 2016-12-15
>>> 14:48:36,896 PaxosState.java:124 - Rejecting proposal for
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
>>> columns=[[] | [value]]
>>> node1/logs/system.log-Row: id=@ | value=) because
>>> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4,
>>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>>> --
>>> node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15
>>> 14:48:36,980 StorageProxy.java:506 - proposePaxos:
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
>>> columns=[[] | [value]]
>>> node1/logs/system.log-Row: id=@ | value=)//1//0
>>> --
>>> node2/logs/system.log:WARN  [SharedPool-Worker-7] 2016-12-15
>>> 14:48:36,969 PaxosState.java:117 - Accepting proposal:
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
>>> columns=[[] | [value]]
>>> node2/logs/system.log-Row: id=@ | value=)
>>> --
>>> node3/logs/system.log:WARN  [SharedPool-Worker-2] 2016-12-15
>>> 14:48:36,897 PaxosState.java:124 - Rejecting proposal for
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_ 1
>>> columns=[[] | [value]]
>>> node3/logs/system.log-Row: id=@ | value=) because
>>> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4,
>>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>>>
>>>
>>> kind regards,
>>> Christian
>>>
>>>
>>> On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers  wrote:
>>>
>>>> My thinking was that due to the size of the data that there maybe I/O
>>>> issues. But it sounds more like you're competing for locks and hit a
>>>> deadlock issue.
>>>>
>>>> Regards,
>>>> Denise
>>>> Cell - (860)989-3431 <(860)%20989-3431>
>>>>
>>>> Sent from mi iPhone
>>>>
>>>> On Apr 15, 2016, at 9:00 AM, horschi  wrote:
>>>>
>>>> Hi Denise,
>>>>
>>>> in my case its a small blob I am writing (should be around 100 bytes):
>>>>
>>>>  CREATE TABLE "Lock" (
>>>>  lockname varchar,
>>>>  id varchar,
>>>>  value blob,
>>>>  PRIMARY KEY (lockname, id)
>>>>  ) WITH COMPACT STORAGE
>>>>  AND COMPRESSION = { 'sstable_compression' :
>>>> 'SnappyCompressor', 'chunk_length_kb' : '8' };
>>>>
>>>> You ask because large values are known to cause issues? Anything
>>>> special you have in mind?
>>>>
>>>> kind regards,
>>>> Christian
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 15, 2016 at 2:42 PM, Denise Roger

Re: Truncate really slow

2015-07-01 Thread horschi
Hi,

you have to enable -Dcassandra.unsafesystem=true in cassandra-env.sh. Also
disable durable writes for your CFs.

This should speed things up and should reduce IOWait dramatically.

kind regards,
Christian

On Wed, Jul 1, 2015 at 11:52 PM, Robert Wille  wrote:

> I have two test clusters, both 2.0.15. One has a single node and one has
> three nodes. Truncate on the three-node cluster is really slow, but is
> quite fast on the single-node cluster. My test cases truncate tables before
> each test, and > 95% of the time in my test cases is spent truncating
> tables on the 3-node cluster. Auto-snapshotting is off.
>
> I know there’s some coordination that has to occur when a truncate
> happens, but it seems really excessive. Almost one second to truncate each
> table with an otherwise idle cluster.
>
> Any thoughts?
>
> Thanks in advance
>
> Robert
>
>


auto_bootstrap=false broken?

2015-08-04 Thread horschi
Hi everyone,

I'll just ask my question as provocative as possible ;-)

Isn't auto_bootstrap=false broken the way it is currently implemented?

What currently happens:
New node starts with auto_bootstrap=false and it starts serving reads
immediately without having any data.

Would the following be more correct:
- New node should stay in a joining state
- Operator loads data (e.g. using nodetool rebuild or putting in backupped
files or whatever)
- Operator has to manually switch from joining into normal state using
nodetool (only then it will start serving reads)

Wouldn't this behaviour be more consistent?

kind regards,
Christian


Re: auto_bootstrap=false broken?

2015-08-04 Thread horschi
Hi Paulo,

thanks for your feedback, but I think this is not what I am looking for.

Starting with join_ring=false does not take any tokens in the ring. And the
"nodetool join" afterwards will again do token-selection and data loading
in one step.

I would like to separate these steps:
1. assign tokens
2. have the node in a joining state, so that I can copy in data
3. mark the node as ready


I just saw that perhaps write_survey could be misused for that.

Did anyone ever use write_survey for such a partial bootstrapping?
Do I have to worry about data-loss when using multiple write_survey nodes
in one cluster?

kind regards,
Christian



On Tue, Aug 4, 2015 at 2:24 PM, Paulo Motta 
wrote:

> Hello Christian,
>
> You may use the start-up parameter -Dcassandra.join_ring=false if you
> don't want the node to join the ring on startup. More about this parameter
> here:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCUtility_t.html
>
> You can later join the ring via nodetool join command:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsJoin.html
>
> auto_bootstrap=false is typically used to bootstrap new datacenters or
> clusters, or nodes with data already on it before starting the process.
>
> Cheers,
>
> Paulo
>
> 2015-08-04 8:50 GMT-03:00 horschi :
>
>> Hi everyone,
>>
>> I'll just ask my question as provocative as possible ;-)
>>
>> Isn't auto_bootstrap=false broken the way it is currently implemented?
>>
>> What currently happens:
>> New node starts with auto_bootstrap=false and it starts serving reads
>> immediately without having any data.
>>
>> Would the following be more correct:
>> - New node should stay in a joining state
>> - Operator loads data (e.g. using nodetool rebuild or putting in
>> backupped files or whatever)
>> - Operator has to manually switch from joining into normal state using
>> nodetool (only then it will start serving reads)
>>
>> Wouldn't this behaviour be more consistent?
>>
>> kind regards,
>> Christian
>>
>
>


Re: auto_bootstrap=false broken?

2015-08-04 Thread horschi
Hi Aeljami,

thanks for the ticket. I'll keep an eye on it.

I can't get the survey to work at all on 2.0 (I am not getting any schema
on the survey node). So I guess the survey is not going to be a solution
for now.

kind regards,
Christian


On Tue, Aug 4, 2015 at 3:29 PM,  wrote:

> I had problems with write_survey.
>
> I opened a bug :  https://issues.apache.org/jira/browse/CASSANDRA-9934
>
>
>
> *From:* horschi [mailto:hors...@gmail.com]
> *Sent:* Tuesday, 4 August 2015 15:20
> *To:* user@cassandra.apache.org
> *Subject:* Re: auto_bootstrap=false broken?
>
>
>
> Hi Paulo,
>
>
>
> thanks for your feedback, but I think this is not what I am looking for.
>
>
>
> Starting with join_ring does not take any tokens in the ring. And the
> "nodetool join" afterwards will again do token-selection and data loading
> in one step.
>
>
>
> I would like to separate these steps:
>
> 1. assign tokens
>
> 2. have the node in a joining state, so that I can copy in data
>
> 3. mark the node as ready
>
>
>
>
>
> I just saw that perhaps write_survey could be misused for that.
>
>
>
> Did anyone ever use write_survey for such a partial bootstrapping?
>
> Do I have to worry about data-loss when using multiple write_survey nodes
> in one cluster?
>
>
>
> kind regards,
>
> Christian
>
>
>
>
>
>
>
> On Tue, Aug 4, 2015 at 2:24 PM, Paulo Motta 
> wrote:
>
> Hello Christian,
>
> You may use the start-up parameter -Dcassandra.join_ring=false if you
> don't want the node to join the ring on startup. More about this parameter
> here:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCUtility_t.html
>
> You can later join the ring via nodetool join command:
> http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsJoin.html
>
> auto_bootstrap=false is typically used to bootstrap new datacenters or
> clusters, or nodes with data already on it before starting the process.
>
> Cheers,
>
> Paulo
>
>
>
> 2015-08-04 8:50 GMT-03:00 horschi :
>
> Hi everyone,
>
>
>
> I'll just ask my question as provocative as possible ;-)
>
>
>
> Isn't auto_bootstrap=false broken the way it is currently implemented?
>
>
>
> What currently happens:
>
> New node starts with auto_bootstrap=false and it starts serving reads
> immediately without having any data.
>
>
>
> Would the following be more correct:
>
> - New node should stay in a joining state
>
> - Operator loads data (e.g. using nodetool rebuild or putting in backupped
> files or whatever)
>
> - Operator has to manually switch from joining into normal state using
> nodetool (only then it will start serving reads)
>
>
>
> Wouldn't this behaviour be more consistent?
>
>
>
> kind regards,
>
> Christian
>
>
>
>
>
> _
>
> Ce message et ses pieces jointes peuvent contenir des informations 
> confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu 
> ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
> electroniques etant susceptibles d'alteration,
> Orange decline toute responsabilite si ce message a ete altere, deforme ou 
> falsifie. Merci.
>
> This message and its attachments may contain confidential or privileged 
> information that may be protected by law;
> they should not be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and delete 
> this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have been 
> modified, changed or falsified.
> Thank you.
>
>


Re: auto_bootstrap=false broken?

2015-08-04 Thread horschi
Hi Robert,

sorry for the confusion. Perhaps write_survey is not my solution
(unfortunately I can't get it to work, so I don't really know). I just
thought that it *could* be my solution.


What I actually want:
I want to be able to start a new node, without it starting to serve reads
prematurely. I want cassandra to wait for me to confirm "everything is ok,
now serve reads".



Possible solutions so far:

A) When starting a new node with auto_bootstrap=false, I get a node
that has no data, but serves reads. In my opinion it would be cleaner if it
stayed in a joining state where it only receives writes.

B) Disabling join_ring on my new node does nothing. The new node will not
have a token. I can't see it in "nodetool status". Therefore I assume it
will not receive any writes.

C) write_survey unfortunately does not seem to work for me: My new node,
which I start with survey-mode, gets writes from other nodes and shows as
"joining" in the ring. Which is good! But it does not get a schema, so it
throws exceptions when receiving these writes. I assume it's just a bug in
2.0.




Disclaimer: I am using C* 2.0, with which I can't get the desired behaviour
(or at least I don't know how).

kind regards,
Christian




On Tue, Aug 4, 2015 at 7:12 PM, Robert Coli  wrote:

> On Tue, Aug 4, 2015 at 6:19 AM, horschi  wrote:
>
>> I would like to separate these steps:
>> 1. assign tokens
>> 2. have the node in a joining state, so that I can copy in data
>> 3. mark the node as ready
>>
>
>
>> Did anyone ever use write_survey for such a partial bootstrapping?
>>
>
> What you're asking doesn't make sense to me.
>
> What does "partial bootstrap" mean? Where are you getting the data from?
> How are you "copying in data" and why do you need the node to be "in a
> joining state" to do that?
>
> https://issues.apache.org/jira/browse/CASSANDRA-6961
>
> Explains a method by which you can repair a partially joined node. In what
> way does this differ from what you want?
>
> =Rob
>
>
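The three-step flow proposed in this thread (assign tokens, stay joining while data is loaded, then mark the node ready) can be pictured as a tiny state machine. This is purely illustrative, not Cassandra code:

```python
# Illustrative-only model of the proposed bootstrap flow:
# tokens assigned -> JOINING (writes only) -> operator confirms -> NORMAL
class Node:
    def __init__(self):
        # after token assignment, before the operator's sign-off
        self.state = "JOINING"

    def serves_reads(self):
        return self.state == "NORMAL"

    def accepts_writes(self):
        # a joining node still receives writes, so no data is lost
        return self.state in ("JOINING", "NORMAL")

    def operator_confirm(self):
        # step 3: the operator marks the node ready after loading data
        self.state = "NORMAL"

n = Node()
assert n.accepts_writes() and not n.serves_reads()  # safe window to copy data in
n.operator_confirm()
assert n.serves_reads()
```

The complaint about auto_bootstrap=false is precisely that it skips the JOINING window: the node goes straight to serving reads with no data.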


Re: auto_bootstrap=false broken?

2015-08-04 Thread horschi
Hi Jonathan,

unless you specify auto_bootstrap=false :)

kind regards,
Christian

On Tue, Aug 4, 2015 at 7:54 PM, Jonathan Haddad  wrote:

> You're trying to solve a problem that doesn't exist.  Cassandra only
> starts serving reads when it's ready.
>
> On Tue, Aug 4, 2015 at 10:51 AM horschi  wrote:
>
>> Hi Robert,
>>
>> sorry for the confusion. Perhaps write_survey is not my solution
>> (unfortunately I can't get it to work, so I don't really know). I just
>> thought that it *could* be my solution.
>>
>>
>> What I actually want:
>> I want to be able to start a new node, without it starting to serve reads
>> prematurely. I want cassandra to wait for me to confirm "everything is ok,
>> now serve reads".
>>
>>
>>
>> Possible solutions so far:
>>
>> A) When starting a new node with auto_bootstrap=false, then I get a node
>> that has no data, but serves reads. In my opinion it would be cleaner if it
>> would stay in a joining state where it only receives writes.
>>
>> B) Disabling join_ring on my new node does nothing. The new node will not
>> have a token. I can't see it in "nodetool status". Therefore I assume it
>> will not receive any writes.
>>
>> C) write_survey unfortunately does not seem to work for me: My new node,
>> which I start in survey mode, gets writes from other nodes and shows as
>> "joining" in the ring. Which is good! But it does not get a schema, so it
>> throws exceptions when receiving these writes. I assume it's just a bug in
>> 2.0.
>>
>>
>>
>>
>> Disclaimer: I am using C* 2.0, with which I can't get the desired
>> behaviour (or at least I don't know how).
>>
>> kind regards,
>> Christian
>>
>>
>>
>>
>> On Tue, Aug 4, 2015 at 7:12 PM, Robert Coli  wrote:
>>
>>> On Tue, Aug 4, 2015 at 6:19 AM, horschi  wrote:
>>>
>>>> I would like to separate these steps:
>>>> 1. assign tokens
>>>> 2. have the node in a joining state, so that I can copy in data
>>>> 3. mark the node as ready
>>>>
>>>
>>>
>>>> Did anyone ever use write_survey for such a partial bootstrapping?
>>>>
>>>
>>> What you're asking doesn't make sense to me.
>>>
>>> What does "partial bootstrap" mean? Where are you getting the data from?
>>> How are you "copying in data" and why do you need the node to be "in a
>>> joining state" to do that?
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-6961
>>>
>>> Explains a method by which you can repair a partially joined node. In
>>> what way does this differ from what you want?
>>>
>>> =Rob
>>>
>>>
>>


Re: auto_bootstrap=false broken?

2015-08-05 Thread horschi
Hi Rob,

let me try to give examples why auto_bootstrap=false is dangerous:

Just yesterday I had the issue that we wanted to set up a new DC:
Unfortunately we had one application that used CL.ONE (because it only
queries static data and is read heavy). That application stopped working
after we brought up the new DC, because it was querying against the new
nodes. We are now changing it to LOCAL_ONE, then it should be OK. But
nevertheless: I think it would have been cleaner if the new nodes had not
served reads in the first place. Instead, the operations people have to
worry about the applications using the correct CL.


Another, more general, issue with auto_bootstrap=false: When adding a new
node to an existing cluster, you are basically lowering your CL by one.
RF=3 with quorum will read from two nodes. One might be the new,
not-yet-bootstrapped node, which has no data. Then you are relying on a
single node to be 100% consistent.


So what I am trying to say is: Every time you use auto_bootstrap=false, you
are entering a dangerous path. And I think this could be fixed if
auto_bootstrap=false left the node in a write-only state. Then the
operator could still decide to override it with nodetool.


Disclaimer: I am using C* 2.0.

kind regards,
Christian



On Tue, Aug 4, 2015 at 10:02 PM, Robert Coli  wrote:

> On Tue, Aug 4, 2015 at 11:40 AM, horschi  wrote:
>
>> unless you specify auto_bootstrap=false :)
>>
>
> ... so why are you doing that?
>
> Two experts are confused as to what you're trying to do; why do you think
> you need to do it?
>
> =Rob
>
>


Re: auto_bootstrap=false broken?

2015-08-06 Thread horschi
Hi Rob,



> Your asking the wrong nodes for data in the rebuild-a-new-DC case does not
> indicate a problem with the auto_bootstrap false + rebuild paradigm.
>

The node is "wrong" only because it's currently bootstrapping. So imho
Cassandra should not serve any reads in such a case.



>
> What makes you think you should be using it in any case other than
> rebuild-new-DC or "I restored my node's data with tablesnap"?
>

These two cases are my example use-cases. I don't know if there are any
other cases; perhaps there are.



>
> As the bug I pasted you indicates, if you want to repair a node before
> having it join, use join_ring=false, repair it, and then join.
>
> https://issues.apache.org/jira/browse/CASSANDRA-6961
>
> In what way does this functionality not meet your needs?
>

I see two reasons why join_ring=false does not help with my issue or I
misunderstand:

- When I set up a new node and I start it with join_ring=false, then it
does not get any tokens, right? (I tried it, and that was what I got)  How
can I run "nodetool repair" when the node doesn't have any tokens?

- A node that is not joining the ring will not receive any writes, right?
So if I run repair in an unjoined state for X hours, then I will miss X
hours' worth of data afterwards.


The only solution I see seems to be write_survey. I will do some tests with
it, once 2.0.17 is out. I will post my results :-)

kind regards,
Christian


Re: auto_bootstrap=false broken?

2015-08-07 Thread horschi
Hi Jeff,


You’re trying to force your view onto an established ecosystem.
>
It is not my intent to force anyone to do anything. I apologize if my title
was too provocative. I just wanted to clickbait ;-)


It’s not “wrong only because its currently bootstrapping”, it’s not
> bootstrapping at all, you told it not to bootstrap.
>
Let me correct myself. It should be: "it's wrong because it isn't
bootstrapped". But that does not change what I am proposing: It still
should not serve reads.


‘auto_bootstrap’ is the knob that tells cassandra whether or not you want
> to stream data from other replicas when you join the ring. Period. That’s
> all it does. If you set it to false, you’re telling cassandra it already
> has the data. The switch implies nothing else. There is no option to “join
> the ring but don’t serve reads until I tell you it’s ready”, and changing
> auto-bootstrap to be that is unlikely to ever happen.
>
I know that it does only that. But I would have made a different design
decision (to not serve reads in such a state).



> Don’t want to serve reads? Disable thrift and native proto, start the with
> auto-bootstrap set to whatever you want but thrift and native proto
> disabled, then enable thrift and native proto again to enable reads from
> clients when ready. Until then, make sure you’re using a consistency level
> appropriate for your requirements.
>
Of course it can be worked around. I just think it's error-prone to do that
manually. That is why I was proposing a change.



> You’re mis-using a knob that doesn’t do what you think it does, and is
> unlikely to ever be changed to do what you think it should.
>
I wanted to change the definition of what auto_bootstrap=false is. I don't
know if that makes it better or worse ;-)


I hope I did not consume too much of your time. Thanks for all the
responses. I will experiment a bit with write_survey and see if it already
does what I need.

kind regards,
Christian


Re: auto_bootstrap=false broken?

2015-08-07 Thread horschi
Hi Cyril,

thanks for backing me up. I'm under siege from all sides here ;-)


That something we're trying to do too. However disabling clients
> connections (closing thrift and native ports) does not prevent other nodes
> (acting as a coordinators) to request it ... Honestly we'd like to restart
> a node that need to deploy HH and to make it serve reads only when it's
> done. And to be more precise, we know when it's done and don't need it to
> work by itself (automatically).
>
If you use auto_bootstrap=false in the same DC, then I think you are
screwed. Afaik auto_bootstrap=false can only work in a new DC, where you
can control reads via LOCAL consistency levels.


kind regards,
Christian


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-07 Thread horschi
Hi Tom,

this sounds very much like my thread: "auto_bootstrap=false broken?"

Did you try booting the new node with survey-mode? I wanted to try this,
but I am waiting for 2.0.17 to come out (survey mode is broken in earlier
versions). Imho survey mode is what you (and me too) want: start a node,
accepting writes, but not serving reads. I have not tested it yet, but I
think it should work.

Also the manual join mentioned in CASSANDRA-9667 sounds very interesting.

kind regards,
Christian

On Mon, Sep 7, 2015 at 10:11 PM, Tom van den Berge 
wrote:

> Running nodetool rebuild on a node that was started with join_ring=false
> does not work, unfortunately. The nodetool command returns immediately,
> after a message appears in the log that the streaming of data has started.
> After that, nothing happens.
>
> Tom
>
>
> On Fri, Sep 12, 2014 at 5:47 PM, Robert Coli  wrote:
>
>> On Fri, Sep 12, 2014 at 6:57 AM, Tom van den Berge 
>> wrote:
>>
>>> Wouldn't it be far more efficient if a node that is rebuilding itself is
>>> responsible for not accepting reads until the rebuild is complete? E.g. by
>>> marking it as "Joining", similar to a node that is being bootstrapped?
>>>
>>
>> Yes, and Cassandra 2.0.7 and above contain this long desired
>> functionality.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-6961
>>
>> I presume that one can also run a rebuild in this state, though I haven't
>> tried. Driftx gives it an 80% chance... try it and see and let us know? :D
>>
>> =Rob
>>
>>
>
>
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-07 Thread horschi
Hi Tom,

"one drawback: the node joins the cluster as soon as the bootstrapping
begins."
I am not sure I understand this correctly. It will get tokens, but not load
data, if you combine it with auto_bootstrap=false.

How I see it: You should be able to start all the new nodes in the new DC
with auto_bootstrap=false and write_survey=true. Then you should have a new
DC with nodes that have tokens but no data. Then you can start rebuild on
all new nodes. During this process, the new nodes should get writes, but
not serve reads.

Disclaimer: I have not tested the combination of the two!



"It turns out that join_ring=false in this scenario does not solve this
problem"
I also don't see how join_ring would help here. (Actually I have no clue
where you would ever need that option.)


"A workaround could be to ensure that only LOCAL_* CL is used by all
clients, but even then I'm seeing failed queries, because they're
mysteriously routed to the new DC every now and then."
Yes, it works fine if you don't make any mistakes. Keep in mind you also
have to make sure your driver does not connect against the other DC. But I
agree with you: it's a workaround for this scenario. To me this does not
feel correct.



"Currently I'm trying to auto_bootstrap my new DC. The good thing is that
it doesn't accept reads from other DCs."
The joining state actually works perfectly: it is a state where nodes take
writes, but do not serve reads. It would be really cool if you could boot a
node into the joining state. Actually, write_survey should basically be the
same.

kind regards,
Christian

PS: I would love to see the results, if you perform any tests on the
write-survey. Please share it here on the mailing list :-)



On Mon, Sep 7, 2015 at 11:10 PM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> Hi Christian,
>
> No, I never tried survey mode. I didn't know it until now, but form the
> info I was able to find it looks like it is meant for a different purpose.
> Maybe it can be used to bootstrap a new DC, though.
>
> On the other hand, the auto_bootstrap=false + rebuild scenario seems to be
> designed to do exactly what I need, except that it has one drawback: the
> node joins the cluster as soon as the bootstrapping begins.
>
> It turns out that join_ring=false in this scenario does not solve this
> problem, since nodetool rebuild does not do anything if C* is started with
> this option.
>
> A workaround could be to ensure that only LOCAL_* CL is used by all
> clients, but even then I'm seeing failed queries, because they're
> mysteriously routed to the new DC every now and then.
>
> Currently I'm trying to auto_bootstrap my new DC. The good thing is that
> it doesn't accept reads from other DCs. The bad thing is that a) I can't
> choose where it streams its data from, and b) the two nodes I've been
> trying to bootstrap crashed when they were almost finished...
>
>
>
> On Mon, Sep 7, 2015 at 10:22 PM, horschi  wrote:
>
>> Hi Tom,
>>
>> this sounds very much like my thread: "auto_bootstrap=false broken?"
>>
>> Did you try booting the new node with survey-mode? I wanted to try this,
>> but I am waiting for 2.0.17 to come out (survey mode is broken in earlier
>> versions). Imho survey mode is what you (and me too) want: start a node,
>> accepting writes, but not serving reads. I have not tested it yet, but I
>> think it should work.
>>
>> Also the manual join mentioned in CASSANDRA-9667 sounds very interesting.
>>
>> kind regards,
>> Christian
>>
>> On Mon, Sep 7, 2015 at 10:11 PM, Tom van den Berge 
>> wrote:
>>
>>> Running nodetool rebuild on a node that was started with join_ring=false
>>> does not work, unfortunately. The nodetool command returns immediately,
>>> after a message appears in the log that the streaming of data has started.
>>> After that, nothing happens.
>>>
>>> Tom
>>>
>>>
>>> On Fri, Sep 12, 2014 at 5:47 PM, Robert Coli 
>>> wrote:
>>>
>>>> On Fri, Sep 12, 2014 at 6:57 AM, Tom van den Berge 
>>>> wrote:
>>>>
>>>>> Wouldn't it be far more efficient if a node that is rebuilding itself
>>>>> is responsible for not accepting reads until the rebuild is complete? E.g.
>>>>> by marking it as "Joining", similar to a node that is being bootstrapped?
>>>>>
>>>>
>>>> Yes, and Cassandra 2.0.7 and above contain this long desired
>>>> functionality.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6961
>>>>
>>>> I presume that one can also run a rebuild in this state, though I
>>>> haven't tried. Driftx gives it an 80% chance... try it and see and let us
>>>> know? :D
>>>>
>>>> =Rob
>>>>
>>>>
>>>
>>>
>>>
>>
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread horschi
Hi Tom,

"The idea of join_ring=false is that other nodes are not aware of the new
node, and therefore never send requests to it. The new node can then be
repaired"
Nicely explained, but I still see the issue that this node would not
receive writes during that time. So after the repair the node would still
miss data.
Again, what is needed is either a joining state or a write-survey mode that
disables reads but still accepts writes.



"To set up a new DC, I was hoping that you could also rebuild (instead of a
repair) a new node while join_ring=false, but that seems not to work."
Correct. The node does not get any tokens with join_ring=false. And again,
your node won't receive any writes while you are rebuilding. Therefore you
will have outdated data at the point when you are done rebuilding.


kind regards,
Christian





On Tue, Sep 8, 2015 at 10:00 AM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> "one drawback: the node joins the cluster as soon as the bootstrapping
>> begins."
>> I am not sure I understand this correctly. It will get tokens, but not
>> load data, if you combine it with auto_bootstrap=false.
>>
> Joining the cluster means that all other nodes become aware of the new
> node, and therefore it might receive reads. And yes, it will not have any
> data, because auto_bootstrap=false.
>
>
>
>> How I see it: You should be able to start all the new nodes in the new DC
>> with auto_bootstrap=false and write_survey=true. Then you should have a new
>> DC with nodes that have tokens but no data. Then you can start rebuild on
>> all new nodes. During this process, the new nodes should get writes, but
>> not serve reads.
>>
> Maybe you're right.
>
>
>>
>> "It turns out that join_ring=false in this scenario does not solve this
>> problem"
>> I also don't see how joing_ring would help here. (Actually I have no clue
>> where you would ever need that option)
>>
> The idea of join_ring=false is that other nodes are not aware of the new
> node, and therefore never send requests to it. The new node can then be
> repaired (see https://issues.apache.org/jira/browse/CASSANDRA-6961). To
> set up a new DC, I was hoping that you could also rebuild (instead of a
> repair) a new node while join_ring=false, but that seems not to work.
>
>>
>>
>> "Currently I'm trying to auto_bootstrap my new DC. The good thing is that
>> it doesn't accept reads from other DCs."
>> The joining state actually works perfectly: it is a state where nodes
>> take writes, but do not serve reads. It would be really cool if you
>> could boot a node into the joining state. Actually, write_survey should
>> basically be the same.
>>
> It would be great if you could select the DC from where it's bootstrapped,
> similar to nodetool rebuild. I'm currently bootstrapping a node in
> San-Jose. It decides to stream all data from another DC in Amsterdam, while
> we also have another DC in San-Jose, right next to it. Streaming data
> across the Atlantic takes a lot more time :(
>
>
>
>>
>> kind regards,
>> Christian
>>
>> PS: I would love to see the results, if you perform any tests on the
>> write-survey. Please share it here on the mailing list :-)
>>
>>
>>
>> On Mon, Sep 7, 2015 at 11:10 PM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> No, I never tried survey mode. I didn't know it until now, but form the
>>> info I was able to find it looks like it is meant for a different purpose.
>>> Maybe it can be used to bootstrap a new DC, though.
>>>
>>> On the other hand, the auto_bootstrap=false + rebuild scenario seems to
>>> be designed to do exactly what I need, except that it has one drawback: the
>>> node joins the cluster as soon as the bootstrapping begins.
>>>
>>> It turns out that join_ring=false in this scenario does not solve this
>>> problem, since nodetool rebuild does not do anything if C* is started with
>>> this option.
>>>
>>> A workaround could be to ensure that only LOCAL_* CL is used by all
>>> clients, but even then I'm seeing failed queries, because they're
>>> mysteriously routed to the new DC every now and then.
>>>
>>> Currently I'm trying to auto_bootstrap my new DC. The good thing is that
>>> it doesn't accept reads from other DCs. The bad thing is that a) I can't
>>> choose where it streams its data from, and b) the two nodes I've been
>>> trying to bootstrap crashed when they were almost finished...

Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-08 Thread horschi
Hi Robert,

I tried to set up a new node with join_ring=false once. In my test that
node did not pick a token in the ring. I assume running repair or rebuild
would not do anything in that case: No tokens = no data. But I must admit:
I have not tried running rebuild.

Is a new node with join_ring=false supposed to pick tokens? From driftx's
comment in CASSANDRA-6961 I take it that it should not.

Tom: What does "nodetool status" say after you started the new node
with join_ring=false?
In my test I got a node that was not in the ring at all.

kind regards,
Christian



On Tue, Sep 8, 2015 at 9:05 PM, Robert Coli  wrote:

>
>
> On Tue, Sep 8, 2015 at 1:39 AM, horschi  wrote:
>
>> "The idea of join_ring=false is that other nodes are not aware of the
>> new node, and therefore never send requests to it. The new node can then be
>> repaired"
>> Nicely explained, but I still see the issue that this node would not
>> receive writes during that time. So after the repair the node would still
>> miss data.
>> Again, what is needed is either some joining-state or write-survey that
>> allows disabling reads, but still accepts writes.
>>
>
> https://issues.apache.org/jira/browse/CASSANDRA-6961
> "
> We can *almost* set join_ring to false, then repair, and then join the
> ring to narrow the window (actually, you can do this and everything
> succeeds because the node doesn't know it's a member yet, which is probably
> a bit of a bug.) If instead we modified this to put the node in hibernate,
> like replace_address does, it could work almost like replace, except you
> could run a repair (manually) while in the hibernate state, and then flip
> to normal when it's done.
> "
>
> Since 2.0.7, you should be able to use join_ring=false + repair to do the
> operation this thread discusses.
>
> Has anyone here tried and found it wanting? If so, in what way?
>
> For the record, I find various statements in this thread confusing and
> likely to be wrong :
>
> " And again, your node won't receive any writes while you are rebuilding.
>>  "
>
>
> If your RF has been increased in the new DC, sure you will, you'll get the
> writes you're supposed to get because of your RF? The challenge with
> rebuild is premature reads from the new DC, not losing writes?
>
> Running nodetool rebuild on a node that was started with join_ring=false
>> does not work, unfortunately. The nodetool command returns immediately,
>> after a message appears in the log that the streaming of data has started.
>> After that, nothing happens.
>
>
> Per driftx, the author of CASSANDRA-6961, this sounds like a bug. If you
> can repro, please file a JIRA and let the list know the URL.
>
> =Rob
>
>
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-10 Thread horschi
Hi Rob,

regarding 1-3:
Thank you for the step-by-step explanation :-) My mistake was to use
join_ring=false during the initial start already. It now works for me as
it's supposed to. Nevertheless it does not do what I want, as it does not
take writes during the time of repair/rebuild: Running an 8-hour repair
will lead to 8 hours of missing data.


regarding 1-6:
This is what we did. And it works of course. Our issue was just that we had
some global QUORUMs hidden somewhere, which the operator was not aware of.
Therefore it would have been nice if the ops guy could prevent these reads
by himself.




Another issue I think the current bootstrapping process has: Doesn't it
practically reduce the RF for old data by one? (With old data I mean any
data that was written before the bootstrap).

Let me give an example:

Let's assume I have a cluster of nodes 1, 2 and 3 with RF=3. And let's
assume a single write on node 2 got lost. So this particular write is only
available on nodes 1 and 3.

Now I add node 4, which takes the range in such a way that node 1 will not
own that previously written key any more. Also assume that the new node
loads its data from node 2.

This means we have a cluster where the previously mentioned write is only
on node 3. (Node 1 is not responsible for the key any more and node 4
loaded its data from the wrong node)

Any quorum read that hits nodes 2 & 4 will not return the column. So this
means we effectively lowered the CL/RF.
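
This effective consistency loss can be sketched as a small simulation. The
node numbers are the hypothetical ones from the example above; this is a
pure-Java illustration, not anything Cassandra-specific:

```java
import java.util.*;

public class QuorumDemo {
    public static void main(String[] args) {
        // RF=3, replicas of the key before bootstrap: nodes 1, 2, 3.
        // The write was lost on node 2, so only nodes 1 and 3 hold it.
        Set<Integer> hasWrite = new HashSet<>(Arrays.asList(1, 3));

        // Node 4 joins and streams its data from node 2 (which lacks the
        // write); node 1 gives up the range. New replica set: 2, 3, 4.
        List<Integer> replicas = Arrays.asList(2, 3, 4);

        // Enumerate every possible quorum (2 out of 3 replicas) and check
        // whether at least one member of the quorum has the write.
        for (int i = 0; i < replicas.size(); i++) {
            for (int j = i + 1; j < replicas.size(); j++) {
                boolean seesWrite = hasWrite.contains(replicas.get(i))
                        || hasWrite.contains(replicas.get(j));
                System.out.println("quorum {" + replicas.get(i) + ","
                        + replicas.get(j) + "} sees write: " + seesWrite);
            }
        }
    }
}
```

One of the three possible quorums ({2,4}) misses the write, which is
exactly the RF/CL reduction described above.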


Therefore what I would like to be able to do is:
- Add new node 4, but leave it in a joining state. (This means it gets all
the writes but does not serve reads.)
- Do "nodetool rebuild"
- New node should not serve reads yet. And node 1 should not yet give up
its ranges to node 4.
- Do "nodetool repair", to ensure consistency.
- Finish bootstrap. Now node1 should not be responsible for the range and
node4 should become eligible for reads.


regards,
Christian




On Tue, Sep 8, 2015 at 11:51 PM, Robert Coli  wrote:

> On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:
>
>> I tried to set up a new node with join_ring=false once. In my test that
>> node did not pick a token in the ring. I assume running repair or rebuild
>> would not do anything in that case: No tokens = no data. But I must admit:
>> I have not tried running rebuild.
>>
>
> I admit I haven't been following this thread closely, perhaps I have
> missed what exactly it is you're trying to do.
>
> It's possible you'd need to :
>
> 1) join the node with auto_bootstrap=false
> 2) immediately stop it
> 3) re-start it with join_ring=false
>
> To actually use repair or rebuild in this way.
>
> However, if your goal is to create a new data-center and rebuild a node
> there without any risk of reading from that node while creating the new
> data center, you can just :
>
> 1) create nodes in new data-center, with RF=0 for that DC
> 2) change RF in that DC
> 3) run rebuild on new data-center nodes
> 4) while doing so, don't talk to new data-center coordinators from your
> client
> 5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center reads
> from your client
> 6) modulo the handful of current bugs which make 5) currently imperfect
>
> What problem are you encountering with this procedure? If it's this ...
>
> I've learned from experience that the node immediately joins the cluster,
>> and starts accepting reads (from other DCs) for the range it owns.
>
>
> This seems to be the incorrect assumption at the heart of the confusion.
> You "should" be able to prevent this behavior entirely via correct use of
> ConsistencyLevel and client configuration.
>
> In an ideal world, I'd write a detailed blog post explaining this... :/ in
> my copious spare time...
>
> =Rob
>
>
>


Re: Is it possible to bootstrap the 1st node of a new DC?

2015-09-10 Thread horschi
Hi Samuel,

thanks a lot for the jira link. Another reason to upgrade to 2.1 :-)

regards,
Christian



On Thu, Sep 10, 2015 at 1:28 PM, Samuel CARRIERE 
wrote:

> Hi Christian,
> The problem you mention (violation of consistency) is a real one. If I have
> understood correctly, it is resolved in Cassandra 2.1 (see CASSANDRA-2434).
> Regards,
> Samuel
>
>
> horschi  a écrit sur 10/09/2015 12:41:41 :
>
> > De : horschi 
> > A : user@cassandra.apache.org,
> > Date : 10/09/2015 12:42
> > Objet : Re: Is it possible to bootstrap the 1st node of a new DC?
> >
> > Hi Rob,
> >
> > regarding 1-3:
> > Thank you for the step-by-step explanation :-) My mistake was to use
> > join_ring=false during the initial start already. It now works for me
> > as it's supposed to. Nevertheless it does not do what I want, as it does
> > not take writes during the time of repair/rebuild: Running an 8-hour
> > repair will lead to 8 hours of missing data.
> >
> > regarding 1-6:
> > This is what we did. And it works of course. Our issue was just that
> > we had some global-QUORUMS hidden somewhere, which the operator was
> > not aware of. Therefore it would have been nice if the ops guy could
> > prevent these reads by himself.
> >
> >
> > Another issue I think the current bootstrapping process has: Doesn't
> > it practically reduce the RF for old data by one? (With old data I
> > mean any data that was written before the bootstrap).
> >
> > Let me give an example:
> >
> > Let's assume I have a cluster of nodes 1, 2 and 3 with RF=3. And let's
> > assume a single write on node 2 got lost. So this particular write
> > is only available on node 1 and 3.
> >
> > Now I add node 4, which takes the range in such a way that node 1
> > will not own that previously written key any more. Also assume that
> > the new node loads its data from node 2.
> >
> > This means we have a cluster where the previously mentioned write is
> > only on node 3. (Node 1 is not responsible for the key any more and
> > node 4 loaded its data from the wrong node)
> >
> > Any quorum read that hits nodes 2 & 4 will not return the column. So
> > this means we effectively lowered the CL/RF.
> >
> > Therefore what I would like to be able to do is:
> > - Add new node 4, but leave it in a joining state. (This means it
> > gets all the writes but does not serve reads.)
> > - Do "nodetool rebuild"
> > - New node should not serve reads yet. And node 1 should not yet
> > give up its ranges to node 4.
> > - Do "nodetool repair", to ensure consistency.
> > - Finish bootstrap. Now node1 should not be responsible for the
> > range and node4 should become eligible for reads.
> >
> > regards,
> > Christian
> >
> > On Tue, Sep 8, 2015 at 11:51 PM, Robert Coli 
> wrote:
> > On Tue, Sep 8, 2015 at 2:39 PM, horschi  wrote:
> > I tried to set up a new node with join_ring=false once. In my test
> > that node did not pick a token in the ring. I assume running repair
> > or rebuild would not do anything in that case: No tokens = no data.
> > But I must admit: I have not tried running rebuild.
> >
> > I admit I haven't been following this thread closely, perhaps I have
> > missed what exactly it is you're trying to do.
> >
> > It's possible you'd need to :
> >
> > 1) join the node with auto_bootstrap=false
> > 2) immediately stop it
> > 3) re-start it with join_ring=false
> >
> > To actually use repair or rebuild in this way.
> >
> > However, if your goal is to create a new data-center and rebuild a
> > node there without any risk of reading from that node while creating
> > the new data center, you can just :
> >
> > 1) create nodes in new data-center, with RF=0 for that DC
> > 2) change RF in that DC
> > 3) run rebuild on new data-center nodes
> > 4) while doing so, don't talk to new data-center coordinators from your
> client
> > 5) and also use LOCAL_ONE/LOCAL_QUORUM to avoid cross-data-center
> > reads from your client
> > 6) modulo the handful of current bugs which make 5) currently imperfect
> >
> > What problem are you encountering with this procedure? If it's this ...
> >
> > I've learned from experience that the node immediately joins the
> > cluster, and starts accepting reads (from other DCs) for the range it
> owns.
> >
> > This seems to be the incorrect assumption at the heart of the
> > confusion. You "should" be able to prevent this behavior entirely
> > via correct use of ConsistencyLevel and client configuration.
> >
> > In an ideal world, I'd write a detailed blog post explaining this...
> > :/ in my copious spare time...
> >
> > =Rob
> >
>


Re: memory usage problem of Metadata.tokenMap.tokenToHost

2015-09-22 Thread horschi
Hi Joseph,

I think 2000 keyspaces might be just too much. Fewer keyspaces (and CFs)
will probably work much better.

kind regards,
Christian


On Tue, Sep 22, 2015 at 9:29 AM, joseph gao  wrote:

> Hi, anybody could help me?
>
> 2015-09-21 0:47 GMT+08:00 joseph gao :
>
>> ps: that's the code in the Java driver, in Metadata.TokenMap.build:
>>
>> for (KeyspaceMetadata keyspace : keyspaces)
>> {
>> ReplicationStrategy strategy = keyspace.replicationStrategy();
>> Map<Token, Set<Host>> ksTokens = (strategy == null)
>> ? makeNonReplicatedMap(tokenToPrimary)
>> : strategy.computeTokenToReplicaMap(tokenToPrimary, ring);
>>
>> tokenToHosts.put(keyspace.getName(), ksTokens);
>>
>> tokenToPrimary is the same for all keyspaces, ring is the same, and if the
>> strategy is the same, strategy.computeTokenToReplicaMap returns the 'same'
>> map but as a different object (because every call returns a new HashMap).
>>
>> 2015-09-21 0:22 GMT+08:00 joseph gao :
>>
>>> cassandra: 2.1.7
>>> java driver: datastax java driver 2.1.6
>>>
>>> Here is the problem:
>>>My application uses 2000+ keyspaces, and will dynamically create
>>> keyspaces and tables. In the Java client, the
>>> Metadata.tokenMap.tokenToHost map then uses about 1 GB of memory, which
>>> causes a lot of full GCs.
>>>As I see, the key of the tokenToHost is keyspace, and the value is a
>>> tokenId_to_replicateNodes map.
>>>
>>>While trying to solve this problem, I found something I am not sure
>>> about: all keyspaces have the same 'tokenId_to_replicateNodes' map.
>>> The replication strategy of all my keyspaces is SimpleStrategy with a
>>> replication factor of 3.
>>>
>>> So would it be possible, when keyspaces use the same strategy, for the
>>> tokenToHost values to share a single map? That would greatly reduce the
>>> memory usage.
>>>
>>>  thanks a lot
>>>
>>> --
>>> --
>>> Joseph Gao
>>> PhoneNum:15210513582
>>> QQ: 409343351
>>>
>>
>>
>>
>>
>
>
>
>
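Since all keyspaces here use SimpleStrategy with the same replication factor, the sharing joseph asks about could be sketched as a cache keyed by the replication configuration, so identical strategies reuse one map. The class and method names below are illustrative stand-ins, not the actual driver internals:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: keyspaces with identical replication settings share
// one token->replicas map instead of each holding its own copy.
public class TokenMapCache {
    // Key stands in for (strategy class, replication factor).
    private final Map<String, Map<Integer, String>> byStrategy = new HashMap<>();

    public Map<Integer, String> replicaMapFor(String strategyKey) {
        // computeIfAbsent builds the map once per distinct strategy;
        // every keyspace with the same strategy gets the same instance.
        return byStrategy.computeIfAbsent(strategyKey, this::computeMap);
    }

    private Map<Integer, String> computeMap(String strategyKey) {
        // Placeholder for strategy.computeTokenToReplicaMap(tokenToPrimary, ring).
        Map<Integer, String> m = new HashMap<>();
        m.put(0, "host-a");
        return m;
    }
}
```

With 2000 keyspaces but only one distinct replication configuration, this collapses 2000 HashMaps into a single shared one.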


Re: 3k sstables during a repair incremental !!

2016-02-10 Thread horschi
Hi Jean,

which Cassandra version do you use?

Incremental repair got much better in 2.2 (for us at least).

kind regards,
Christian

On Wed, Feb 10, 2016 at 2:33 PM, Jean Carlo  wrote:
> Hello guys!
>
> I am testing incremental repair in my Cassandra cluster. I am doing my test over
> these tables
>
> CREATE TABLE pns_nonreg_bench.cf3 (
> s text,
> sp int,
> d text,
> dp int,
> m map,
> t timestamp,
> PRIMARY KEY (s, sp, d, dp)
> ) WITH CLUSTERING ORDER BY (sp ASC, d ASC, dp ASC)
>
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
> AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.SnappyCompressor'}
>
> CREATE TABLE pns_nonreg_bench.cf1 (
> ise text PRIMARY KEY,
> int_col int,
> text_col text,
> ts_col timestamp,
> uuid_col uuid
> ) WITH bloom_filter_fp_chance = 0.01
>  AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
> AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.SnappyCompressor'}
>
> table cf1
> Space used (live): 665.7 MB
> table cf2
> Space used (live): 697.03 MB
>
> It happens that when I do repair -inc -par on these tables, cf2 hits a peak
> of 3k sstables. When the repair finishes, it takes 30 min or more to finish
> all the compactions and return to 6 sstables.
>
> I am a little concerned that this will happen in production. Is it normal?
>
> Saludos
>
> Jean Carlo
>
> "The best way to predict the future is to invent it" Alan Kay


Re: 3k sstables during a repair incremental !!

2016-02-10 Thread horschi
Hi Jean,

we had the same issue, but on SizeTieredCompaction. During repair the
number of SSTables and pending compactions were exploding.

It not only affected latencies; at some point Cassandra ran out of heap.

After the upgrade to 2.2 things got much better.

regards,
Christian


On Wed, Feb 10, 2016 at 2:46 PM, Jean Carlo  wrote:
> Hi Horschi !!!
>
> I have 2.1.12. But I think it is something related to the leveled compaction
> strategy. It is impressive that we went from 6 sstables to 3k sstables.
> I think this will affect the latency in production because of the number of
> compactions going on
>
>
>
> Best regards
>
> Jean Carlo
>
> "The best way to predict the future is to invent it" Alan Kay


Re: 3k sstables during a repair incremental !!

2016-02-10 Thread horschi
btw: I am not saying incremental Repair in 2.1 is broken, but ... ;-)

On Wed, Feb 10, 2016 at 2:59 PM, horschi  wrote:
> Hi Jean,
>
> we had the same issue, but on SizeTieredCompaction. During repair the
> number of SSTables and pending compactions were exploding.
>
> It not only affected latencies, at some point Cassandra ran out of heap.
>
> After the upgrade to 2.2 things got much better.
>
> regards,
> Christian
>
>
> On Wed, Feb 10, 2016 at 2:46 PM, Jean Carlo  wrote:
>> Hi Horschi !!!
>>
>> I have 2.1.12. But I think it is something related to the leveled compaction
>> strategy. It is impressive that we went from 6 sstables to 3k sstables.
>> I think this will affect the latency in production because of the number of
>> compactions going on
>>
>>
>>
>> Best regards
>>
>> Jean Carlo
>>
>> "The best way to predict the future is to invent it" Alan Kay


Low compactionthroughput blocks reads?

2016-02-26 Thread horschi
Hi,

I just had a weird behaviour on one of our Cassandra nodes, which I would
like to share:

Short version:
My pending reads went up from ~0 to the hundreds when I reduced the
compactionthroughput from 16 to 2.


Long version:

One of our more powerful nodes had a few pending reads, while the other
ones didn't. So far nothing special.

Strangely neither CPU, nor IO Wait, nor disk-ops/s, nor C*-heap was
particularly high. So I was wondering.

That machine had two compactions and a validation (incremental) running, so
I set the compactionthroughput to 2. To my surprise I saw the pending reads
go up into the hundreds within 5-10 seconds. Setting the compactionthroughput
back to 16 brought the pending reads back to 0 (or at least close to zero).

I kept the compactionthroughput on 2 for less than a minute. So the issue
is not compactions falling behind.

I was able to reproduce this behaviour 5-10 times. The pending reads went
up every time I *de*creased the compactionthroughput. I watched the pending
reads while the compactionthroughput was on 16, and I never observed even a
two-digit pending read count at compactionthroughput 16.

Unfortunately the machine does not show this behaviour any more. Also it
was only a single machine.



Our setup:
C* 2.2.5 with 256 vnodes + 9 nodes + incremental repair + 6GB heap


My question:
Did someone else ever observe such a behaviour?

Is it perhaps possible that the read-path shares a lock with
repair/compaction that waits on ThrottledReader while holding that lock?


kind regards,
Christian
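The lock-sharing suspicion at the end is easy to model in isolation. This toy program (purely an assumed scenario, not Cassandra code) shows how a throttle sleep taken while holding a shared lock turns directly into read latency, even though the machine is otherwise idle:

```java
import java.util.concurrent.locks.ReentrantLock;

// Toy model: a "compaction" thread sleeps (throttles) while holding a lock
// that the "read" path also needs, so the read stalls for the sleep duration.
public class ThrottleUnderLockDemo {

    // Returns how long (ms) a read had to wait behind the throttled holder.
    public static long readWaitMillis() throws InterruptedException {
        final ReentrantLock shared = new ReentrantLock();
        Thread compactor = new Thread(() -> {
            shared.lock();
            try {
                Thread.sleep(200); // throttle sleep *while holding the lock*
            } catch (InterruptedException ignored) {
            } finally {
                shared.unlock();
            }
        });
        compactor.start();
        Thread.sleep(50); // let the compactor grab the lock first
        long t0 = System.nanoTime();
        shared.lock();    // the "read": blocked until the sleep ends
        shared.unlock();
        compactor.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("read waited ~" + readWaitMillis() + " ms");
    }
}
```

If the read path really does sit behind a lock that a ThrottledReader sleeps under, lowering the throughput cap lengthens the sleeps and would produce exactly the pending-read spike described above.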


Dynamic TTLs / limits still not working in 2.2 ?

2016-03-08 Thread horschi
Hi,

according to CASSANDRA-4450
 it should be fixed,
but I still can't use dynamic TTLs or limits in my CQL queries.

Query:
update mytable set data=:data where ts=:ts and randkey=:randkey using ttl
:timetolive

Exception:
Caused by: com.datastax.driver.core.exceptions.SyntaxError: line 1:138
missing EOF at 'using' (...:ts and randkey=:randkey [using] ttl...)
at com.datastax.driver.core.Responses$Error.asException(Responses.java:100)

I am using Cassandra 2.2 (using Datastax java driver 2.1.9) and I still see
this, even though the Jira ticket states fixVersion 2.0.

Has anyone used this successfully? Am I doing something wrong or is there
still a bug?

kind regards,
Christian


Tickets:
https://datastax-oss.atlassian.net/browse/JAVA-54
https://issues.apache.org/jira/browse/CASSANDRA-4450


Re: Dynamic TTLs / limits still not working in 2.2 ?

2016-03-08 Thread horschi
Hi Nick,

I will try your workaround. Thanks a lot.

I was not expecting the Java driver to have a bug, because the Jira
ticket (JAVA-54) says "not a problem". So I assumed there was nothing to
do to support it :-)

kind regards,
Christian

On Tue, Mar 8, 2016 at 2:56 PM, Nicholas Wilson  wrote:

> Hi Christian,
>
>
> I ran into this problem last month; after some chasing I thought it was
> possibly a bug in the Datastax driver, which I'm also using. The CQL
> protocol itself supports dynamic TTLs fine.
>
>
> One workaround that seems to work is to use an unnamed bind marker for the
> TTL ('?') and then set it using the "[ttl]" reserved name as the bind
> marker name ('setLong("[ttl]", myTtl)'), which will set the correct field
> in the bound statement.
>
>
> Best,
>
> Nick​
>
>
> --
> *From:* horschi 
> *Sent:* 08 March 2016 13:52
> *To:* user@cassandra.apache.org
> *Subject:* Dynamic TTLs / limits still not working in 2.2 ?
>
> Hi,
>
> according to CASSANDRA-4450
> <https://issues.apache.org/jira/browse/CASSANDRA-4450> it should be
> fixed, but I still can't use dynamic TTLs or limits in my CQL queries.
>
> Query:
> update mytable set data=:data where ts=:ts and randkey=:randkey using ttl
> :timetolive
>
> Exception:
> Caused by: com.datastax.driver.core.exceptions.SyntaxError: line 1:138
> missing EOF at 'using' (...:ts and randkey=:randkey [using] ttl...)
> at com.datastax.driver.core.Responses$Error.asException(Responses.java:100)
>
> I am using Cassandra 2.2 (using Datastax java driver 2.1.9) and I still
> see this, even though the Jira ticket states fixVersion 2.0.
>
> Has anyone used this successfully? Am I doing something wrong or is there
> still a bug?
>
> kind regards,
> Christian
>
>
> Tickets:
> https://datastax-oss.atlassian.net/browse/JAVA-54
> https://issues.apache.org/jira/browse/CASSANDRA-4450
>
>
>


Re: Dynamic TTLs / limits still not working in 2.2 ?

2016-03-08 Thread horschi
Oh, I just realized I made a mistake with the TTL query:

The TTL has to be specified before the SET clause. Like this:
update mytable using ttl :timetolive set data=:data where ts=:ts and
randkey=:randkey

And this of course works nicely. Sorry for the confusion.
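For reference, a small helper capturing the corrected clause order (USING TTL before SET). This only assembles the CQL string; with the Datastax Java driver the :timetolive marker then binds like any other named parameter (e.g. bound.setInt("timetolive", ttl)):

```java
// Builds the corrected UPDATE: in CQL the USING TTL clause must come
// between the table name and SET, not after the WHERE clause.
public class TtlUpdate {
    public static String forTable(String table) {
        return "UPDATE " + table
             + " USING TTL :timetolive"
             + " SET data = :data"
             + " WHERE ts = :ts AND randkey = :randkey";
    }
}
```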


Nevertheless, I don't think this is the issue with my "select ... limit"
queries. But I will verify this and also try the workaround.



On Tue, Mar 8, 2016 at 3:08 PM, horschi  wrote:

> Hi Nick,
>
> I will try your workaround. Thanks a lot.
>
> I was not expecting the Java-Driver to have a bug, because in the Jira
> Ticket (JAVA-54) it says "not a problem". So i assumed there is nothing to
> do to support it :-)
>
> kind regards,
> Christian


Re: Dynamic TTLs / limits still not working in 2.2 ?

2016-03-08 Thread horschi
Ok, I just realized the parameter should not be called ":limit", since "limit" is a reserved CQL keyword :-)

Also I upgraded my Java Driver from 2.1.6 to 2.1.9.

Both, TTL and limit, work fine now. Sorry again for the confusion.

cheers,
Christian


On Tue, Mar 8, 2016 at 3:19 PM, horschi  wrote:

> Oh, I just realized I made a mistake with the TTL query:
>
> The TTL has to be specified before the set. Like this:
> update mytable using ttl :timetolive set data=:data where ts=:ts and
> randkey=:randkey
>
> And this of course works nicely. Sorry for the confusion.
>
>
> Nevertheless, I don't think this is the issue with my "select ... limit"
> queries. But I will verify this and also try the workaround.
>


Re: All subsequent CAS requests time out after heavy use of new CAS feature

2016-04-15 Thread horschi
Hi Jan,

were you able to resolve your Problem?

We are trying the same and also see a lot of WriteTimeouts:
WriteTimeoutException: Cassandra timeout during write query at consistency
SERIAL (2 replica were required but only 1 acknowledged the write)

How many clients were competing for a lock in your case? In our case it's
only two :-(

cheers,
Christian


On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli  wrote:

> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen <
> jan.algermis...@nordsc.com> wrote:
>
>> I am experimenting with C* 2.0 ( and today's java-driver 2.0 snapshot)
>> for implementing distributed locks.
>>
>
> [ and I'm experiencing the problem described in the subject ... ]
>
>
>> Any idea how to approach this problem?
>>
>
> 1) Upgrade to 2.0.1 release.
> 2) Try to reproduce symptoms.
> 3) If able to, file a JIRA at
> https://issues.apache.org/jira/secure/Dashboard.jspa including repro steps
> 4) Reply to this thread with the JIRA ticket URL
>
> =Rob
>
>
>


Re: All subsequent CAS requests time out after heavy use of new CAS feature

2016-04-15 Thread horschi
Hi Denise,

in my case it's a small blob I am writing (should be around 100 bytes):

 CREATE TABLE "Lock" (
 lockname varchar,
 id varchar,
 value blob,
 PRIMARY KEY (lockname, id)
 ) WITH COMPACT STORAGE
 AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
'chunk_length_kb' : '8' };

You ask because large values are known to cause issues? Anything special
you have in mind?

kind regards,
Christian




On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers  wrote:

> Also, what type of data were you reading/writing?
>
> Regards,
> Denise
>
> Sent from mi iPad
>
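The contract the Lock table relies on is Cassandra's compare-and-set (e.g. an INSERT ... IF NOT EXISTS executed at SERIAL). Off-cluster, those semantics can be modeled with putIfAbsent; this is an illustrative model of the contract only, not driver code:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative CAS-lock semantics: exactly one writer "wins" creation of
// the row, everyone else observes that it already exists.
public class CasLock {
    private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

    // Returns true iff this owner won the lock (no previous holder).
    public boolean tryAcquire(String lockname, String owner) {
        return locks.putIfAbsent(lockname, owner) == null;
    }

    // Releases only if we still own the lock (conditional delete).
    public boolean release(String lockname, String owner) {
        return locks.remove(lockname, owner);
    }
}
```

One caveat the in-memory model hides: on a real cluster a WriteTimeoutException leaves the outcome unknown, so the client has to re-read the lock row to find out whether its insert actually applied before retrying.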


Re: Cassandra Evaluation/ Benchmarking: Throughput not scaling as expected neither latency showing good numbers

2012-07-17 Thread horschi
When they say "linear scalibility" they mean "throughput scales with the
amount of machines in your cluster".

Try adding more machines to your cluster and measure the thoughput. I'm
pretty sure you'll see linear scalibility.

regards,
Christian


On Tue, Jul 17, 2012 at 6:13 AM, Code Box  wrote:

> I am doing Cassandra Benchmarking using YCSB for evaluating the best
> performance for my application which will be both read and write intensive.
> I have set up a three cluster environment on EC2 and i am using YCSB in the
> same availability region as a client. I have tried various combinations of
> tuning cassandra parameters like FSync ( Setting to batch and periodic ),
> Increasing the number of rpc_threads, increasing number of concurrent reads
> and concurrent writes, write consistency one and Quorum i am not getting
> very great results and also i do not see a linear graph in terms of
> scalability that is if i increase the number of clients i do not see an
> increase in the throughput.
>
> Here are some sample numbers that i got :-
>
> *Test 1:-  Write Consistency set to Quorum Write Proportion = 100%. FSync
> = Batch and Window = 0ms*
>
> ThreadsThroughput ( write per sec ) Avg Latency (ms)TP95(ms) TP99(ms)
> Min(ms)Max(ms)
>
>
>  102149 3.1984 51.499291   1004070 23.82870 2.22602004151 45.96571301.7
> 1242 300419764.68 1154222.09 216
>
>
> If you look at the numbers the number of threads do not increase the
> throughput. Also the latency values are not that great. I am using fsync
> set to batch and with 0 ms window.
>
> *Test 2:- ** Write Consistency set to Quorum Write Proportion = 100%.
> FSync = Periodic and Window = 1000 ms*
>
> 1803 1.23712 1.012312.9Q 10015944 5.343925 1.21579.1Q 200196309.047 1970
> 1.17 1851Q
> Are these numbers expected numbers or does Cassandra perform better ? Am i
> missing something ?
>


Re: are counters stable enough for production?

2012-09-18 Thread horschi
"The repair of taking the highest value of two inconsistent might cause
getting higher values?"


If a counter counts backwards (and therefore has negative values), would repair
still choose the larger value? Or does Cassandra take the higher
absolute value? This would result in an undercount in case of an error
instead of an overcount.

To go further, would it maybe be an idea to count everything twice? Once as
a positive value and once as a negative value. When reading the counters, the
application could just compare the negative and positive counters to get an
error margin.

Has anybody tried something like this? I assume most people would rather
have an under- than an overcount.

cheers,
Christian
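A sketch of the double-count idea from the message above (client-side model only; in Cassandra, pos and neg would be two counter columns updated together):

```java
// Mirror every increment into a positive and a negative counter; if no
// update was lost or applied twice, the two must sum to exactly zero.
public class MirroredCounter {
    private long positive; // models: UPDATE counters SET pos = pos + ?
    private long negative; // models: UPDATE counters SET neg = neg - ?

    public void increment(long delta) {
        positive += delta;
        negative -= delta;
    }

    public long value() {
        return positive;
    }

    // Zero when consistent; any other value is the error margin the
    // application can inspect before trusting the count.
    public long drift() {
        return positive + negative;
    }
}
```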


Re: Cassandra vs Couchbase benchmarks

2012-10-01 Thread horschi
Hi Andy,

things I find odd:

- Replica count = 1 for Mongo and Couchbase. How is that a realistic benchmark?
I always want at least 2 replicas for my data. Maybe that's just me.
- On the Mongo config slide they said they disabled journaling. Why would you
disable all the safety mechanisms that you would want in a production
environment? Maybe they should have added /dev/null to their benchmark ;-)
- I don't see the replica count for Cassandra in the slides. Also the CL is not
specified. Imho the important stuff is missing from the Cassandra
configuration.
- In the goals section it said "more data than RAM". But they only have
12GB data per node, with 15GB of RAM per node!

I am very interested in a recent cassandra-benchmark, but I find this
benchmark very disappointing.

cheers,
Christian


On Mon, Oct 1, 2012 at 5:05 PM, Andy Cobley
wrote:

> There are some interesting results in the benchmarks below:
>
> http://www.slideshare.net/renatko/couchbase-performance-benchmarking
>
> Without starting a flame war etc, I'm interested if these results should
> be considered "Fair and Balanced" or if the methodology is flawed in some
> way ? (for instance is the use of Amazon EC2 sensible for Cassandra
> deployment) ?
>
> Andy
>
>
>
> The University of Dundee is a Scottish Registered Charity, No. SC015096.
>
>
>


Re: repair, compaction, and tombstone rows

2012-11-02 Thread horschi
Hi Sylvain,

might I ask why repair cannot simply ignore anything that is older than
gc-grace? (like Aaron proposed)  I agree that repair should not process any
tombstones or anything. But in my mind it sounds reasonable to make repair
ignore timed-out data. Because the timestamp is created on the client,
there is no reason to repair these, right?

We are using TTLs quite heavily and I noticed that every repair
increases the load of all nodes by 1-2 GB, where each node has about
20-30GB of data. I don't know if this increases with the data volume. The
data is mostly time-series data.
I even noticed an increase when running two repairs directly after each
other. So even when data was just repaired, there is still data being
transferred. I assume this is due to some columns timing out within that
timeframe and the entire row being repaired.

regards,
Christian

On Thu, Nov 1, 2012 at 9:43 AM, Sylvain Lebresne wrote:

> > Is this a feature or a bug?
>
> Neither really. Repair doesn't do any gcable tombstone collection and
> it would be really hard to change that (besides, it's not his job). So
> if you when you run repair there is sstable with tombstone that could
> be collected but are not yet, then yes, they will be streamed. Now the
> theory is that compaction will run often enough that gcable tombstone
> will be collected in a reasonably timely fashion and so you will never
> have lots of such tombstones in general (making the fact that repair
> stream them largely irrelevant). That being said, in practice, I don't
> doubt that there is a few scenario like your own where this still can
> lead to doing too much useless work.
>
> I believe the main problem is that size tiered compaction has a
> tendency to not compact the largest sstables very often. Meaning that
> you could have large sstable with mostly gcable tombstone sitting
> around. In the upcoming Cassandra 1.2,
> https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
> Until then, if you are no afraid of a little bit of scripting, one
> option could be before running a repair to run a small script that
> would check the creation time of your sstable. If an sstable is old
> enough (for some value of that that depends on what is the TTL you use
> on all your columns), you may want to force a compaction (using the
> JMX call forceUserDefinedCompaction()) of that sstable. The goal being
> to get read of a maximum of outdated tombstones before running the
> repair (you could also alternatively run a major compaction prior to
> the repair, but major compactions have a lot of nasty effect so I
> wouldn't recommend that a priori).
>
> --
> Sylvain
>


Re: repair, compaction, and tombstone rows

2012-11-02 Thread horschi
> IIRC, tombstone timestamps are written by the server, at compaction
> time. Therefore if you have RF=X, you have X different timestamps
> relative to GCGraceSeconds. I believe there was another thread about
> two weeks ago in which Sylvain detailed the problems with what you are
> proposing, when someone else asked approximately the same question.
>
Oh yes, I forgot about the thread. I assume you are talking about:
http://grokbase.com/t/cassandra/user/12ab6pbs5n/unnecessary-tombstones-transmission-during-repair-process

I think these are multiple issues that correlate with each other:

1) Repair uses the local timestamp of DeletedColumns for Merkle tree
calculation. This is what the other thread was about.
Alexey claims that this was fixed by some other commit:
https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch
But honestly, I don't see how this solves it. I understand how Alexey's patch
a few messages before would solve it (by overriding the updateDigest method
in DeletedColumn)

2) ExpiringColumns should not be used for merkle tree calculation if they
are timed out.
I checked LazilyCompactedRow and saw that it does not exclude any timed-out
columns. It loops over all columns and calls updateDigest on them. Without
any condition. Imho ExpiringColumn.updateDigest() should check its own
isMarkedForDelete() first before making any digest changes (we cannot simply
call isMarkedForDelete from LazilyCompactedRow because we don't want this for
DeletedColumns).

3) Cassandra should not create tombstones for expiring columns.
I am not 100% sure, but it looks to me like Cassandra creates tombstones
for expired ExpiringColumns. This makes me wonder if we could delete
expired columns directly. The digests for an ExpiringColumn and a DeletedColumn
can never match, due to the different implementations. So there will
always be a repair if compactions are not synchronous across nodes.
Imho it should be valid to delete ExpiringColumns directly, because the TTL
is given by the client and should pass on all nodes at the same time.

All together should reduce over-repair.


Merkle trees are an optimization, what they trade for this
> optimization is over-repair.
>
> (FWIW, I agree that, if possible, this particular case of over-repair
> would be nice to eliminate.)
>
Of course, rather over-repair than corrupt something.


Re: repair, compaction, and tombstone rows

2012-11-03 Thread horschi
Sure, created CASSANDRA-4905. I understand that these tombstones will be
still streamed though. Thats fine with me.

Do you mind if I ask where you stand on making...
- ... ExpiringColumn not create any tombstones? Imo this could be safely
done if the columns TTL is >= gcgrace. That way it is ensured that repair
ran and any previous un-TTLed columns were overwritten.
- ... ExpiringColumn not add local timestamp to digest?

Cheers,
Christian


On Sat, Nov 3, 2012 at 8:37 PM, Sylvain Lebresne wrote:

> On Fri, Nov 2, 2012 at 10:46 AM, horschi  wrote:
> > might I ask why repair cannot simply ignore anything that is older than
> > gc-grace? (like Aaron proposed)
>
> Well, actually the merkle tree computation could probably ignore
> gcable tombstones without much problem, which might not be such a bad
> idea and would probably solve much of your problem. However, when we
> stream the ranges that needs to be repaired, we stream sub-part of the
> sstables without deserializing them, so we can't exclude the gcable
> tombstones in that phase (that is, it's a technical reason, but
> deserializing data would be inefficient). Meaning that if we won't
> guarantee that you won't ever stream gcable tombstones.
>
> But excluding gcable tombstones from the merkle-tree computation is a
> good idea. Would you mind opening a JIRA ticket?
>
> --
> Sylvain
>


Re: repair, compaction, and tombstone rows

2012-11-05 Thread horschi
> - ... ExpiringColumn not create any tombstones? Imo this could be safely
> > done if the columns TTL is >= gcgrace.
>
> Yes, if the TTL >= gcgrace this would be safe and I'm pretty sure we
> use to have a ticket for that (can't find it back with a quick search
> but JIRA search suck and I didn't bother long). But basically we
> decided to not do it for now for 2 reasons:...
>
The only ticket I found that looks anything similar is CASSANDRA-4565. I have
my doubts that you meant that one :-)

I don't know what your approach was back then, but maybe it could be solved
quite easily: when creating tombstones for ExpiringColumns, we could use
the ExpiringColumn.timestamp to set the DeletedColumn.localDeletionTime.
So instead of using the deletion time of the ExpiringColumn, we use the
creation time.

In the ExpiringColumn class this would look like this:

public static Column create(ByteBuffer name, ByteBuffer value, long timestamp,
                            int timeToLive, int localExpirationTime,
                            int expireBefore, IColumnSerializer.Flag flag)
{
    if (localExpirationTime >= expireBefore || flag == IColumnSerializer.Flag.PRESERVE_SIZE)
        return new ExpiringColumn(name, value, timestamp, timeToLive, localExpirationTime);
    // the column is now expired, we can safely return a simple tombstone
    return new DeletedColumn(name, timestamp / 1000, timestamp); // uses the creation timestamp of the ExpiringColumn
    // return new DeletedColumn(name, localExpirationTime, timestamp); // old code
}

Imo this makes tombstones of DeletedColumns live only as long as they need
to be:
In case you specify ExpiringColumn.TTL > 10 days, then the created
DeletedColumn would have a timestamp that is >10 days in the past, which
makes it obsolete for gc right away. With ttl=5 days the tombstone stays for
5 days, enough for either the ExpiringColumn or the tombstone to be repaired.
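The lifetime arithmetic above can be sketched as a standalone model (illustrative code only, not Cassandra's implementation; the class and method names are made up):

```java
public class TombstoneLifetime {
    // Model of the proposal above: the expired column's tombstone carries its
    // *creation* time as deletion time, so it only has to survive for
    // whatever part of gc_grace the TTL has not already used up.
    static long remainingTombstoneSeconds(long ttlSeconds, long gcGraceSeconds) {
        return Math.max(0, gcGraceSeconds - ttlSeconds);
    }

    public static void main(String[] args) {
        long day = 24 * 3600;
        // TTL (12 days) > gc_grace (10 days): the tombstone is gcable right away
        System.out.println(remainingTombstoneSeconds(12 * day, 10 * day)); // 0
        // TTL = 5 days: the tombstone has to stay another 5 days
        System.out.println(remainingTombstoneSeconds(5 * day, 10 * day) / day); // 5
    }
}
```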





> > - ... ExpiringColumn not add local timestamp to digest?
>
> As I said in a previous thread, I don't see what the problem is here.
> The timestamp is not local to the node, it is assigned once and for
> all by the coordinator at insert time. I can agree that it's not
> really useful per se to the digest, but I don't think it matters in
> any case.
>
Oh sorry, you're right, I mixed something up there. It's DeletedColumn that
has the local timestamp (as value). It takes a localDeletionTime (which is
supplied by RowMutation.delete) and uses that as the value for the
DeletedColumn. This value is used by Column to update the digest.



Sorry for not letting this go, but I think there are some low hanging
fruits here.

cheers,
Christian


Re: repair, compaction, and tombstone rows

2012-11-05 Thread horschi
That's true, we could just create an already gcable tombstone. It's a bit
> of an abuse of the localDeletionTime but why not. Honestly a good part of
> the reason we haven't done anything yet is because we never really had
> anything for which tombstones of expired columns were a big pain point.
> Again, feel free to open a ticket (but what we should do is retrieve the
> ttl from the localExpirationTime when creating the tombstone, not using the
> creation time (partly because that creation time is a user provided
> timestamp so we can't use it, and because we must still keep tombstones if
> the ttl < gcGrace)).
>

Created CASSANDRA-4917. I changed the example implementation to use
(localExpirationTime-timeToLive) for the tombstone. I agree this is not the
biggest itch to scratch. But it might save a few seeks here and there :-)


Did you also have a look at DeletedColumn? It uses the updateDigest
implementation from its parent class, which also applies the value to the
digest. Unfortunately the value is the localDeletionTime, which is being
generated on each node individually, right? (at RowMutation.delete)
The resolution of the time is low, so there is a good chance the timestamps
will match on all nodes, but that is nothing to rely on.
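The concern is easy to demonstrate with a simplified model of the digest (illustrative only — the real code lives in Column.updateDigest, and the byte layout below is made up; the point stands regardless: if a node-local deletion time feeds the digest, replicas whose clocks differed by one second disagree):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class TombstoneDigest {
    // Simplified stand-in for Column.updateDigest: hash name, value and
    // timestamp. For a DeletedColumn the *value* is the node-local deletion
    // time, which is the problematic part.
    static byte[] digest(String name, int localDeletionTime, long timestamp) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update(ByteBuffer.allocate(4).putInt(0, localDeletionTime)); // DeletedColumn value
            md.update(ByteBuffer.allocate(8).putLong(0, timestamp));
            return md.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // same delete, but the two replicas stamped it one wall-clock second apart
        byte[] nodeA = digest("col", 1352100000, 1352100000123L);
        byte[] nodeB = digest("col", 1352100001, 1352100000123L);
        // digests differ -> repair treats the replicas as out of sync
        System.out.println(Arrays.equals(nodeA, nodeB)); // false
    }
}
```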


cheers,
Christian


Re: repair, compaction, and tombstone rows

2012-11-06 Thread horschi
Hi Bryan,

As the OP of this thread,

sorry for stealing your thread btw ;-)



> it is a big itch for my use case.  Repair ends up streaming tens of
> gigabytes of data which has expired TTL and has been compacted away on some
> nodes but not yet on others.  The wasted work is not nice plus it drives up
> the memory usage (for bloom filters, indexes, etc) of all nodes since there
> are many more rows to track than planned.  Disabling the periodic repair
> lowered the per-node load by 100GB which was all dead data in my case.


What is the issue with your setup? Do you use TTLs, or do you think it's due
to DeletedColumns? Was your intention to push the idea of removing
localDeletionTime from DeletedColumn.updateDigest?


cheers,
Christian



>
> On Mon, Nov 5, 2012 at 5:12 PM, horschi  wrote:
>
>>
>>
>> That's true, we could just create an already gcable tombstone. It's a bit
>>> of an abuse of the localDeletionTime but why not. Honestly a good part of
>>> the reason we haven't done anything yet is because we never really had
>>> anything for which tombstones of expired columns were a big pain point.
>>> Again, feel free to open a ticket (but what we should do is retrieve the
>>> ttl from the localExpirationTime when creating the tombstone, not using the
>>> creation time (partly because that creation time is a user provided
>>> timestamp so we can't use it, and because we must still keep tombstones if
>>> the ttl < gcGrace)).
>>>
>>
>> Created CASSANDRA-4917. I changed the example implementation to use
>> (localExpirationTime-timeToLive) for the tombstone. I agree this is not the
>> biggest itch to scratch. But it might save a few seeks here and there :-)
>>
>>
>> Did you also have a look at DeletedColumn? It uses the updateDigest
>> implementation from its parent class, which also applies the value to the
>> digest. Unfortunately the value is the localDeletionTime, which is being
>> generated on each node individually, right? (at RowMutation.delete)
>> The resolution of the time is low, so there is a good chance the
>> timestamps will match on all nodes, but that is nothing to rely on.
>>
>>
>> cheers,
>> Christian
>>
>>
>>
>>
>>
>


Re: repair, compaction, and tombstone rows

2012-11-06 Thread horschi
I don't know enough about the code level implementation to comment on the
> validity of the fix.  My main issue is that we use a lot of TTL columns and
> in many cases all columns have a TTL that is less than gc_grace.  The
> problem arises when the columns are gc-able and are compacted away on one
> node but not on all replicas, the periodic repair process ends up copying
> all the garbage columns & rows back to all other replicas.  It consumes a
> lot of repair resources and makes rows stick around for much longer than
> they really should which consumes even more cluster resources.
>
You can set gc_grace to 0 if you never manually delete any of them. You
only need tombstones if you do manual deletes.

Otherwise the two tickets should improve your situation.
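Setting gc_grace to 0 as suggested above looks like this in the thrift-era cassandra-cli syntax used elsewhere in this thread (MyCF is a placeholder name; exact option names vary by version):

```
update column family MyCF with gc_grace = 0;
```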


Re: Driver consistency issue

2018-02-27 Thread horschi
Hi Abhishek & everyone else,

might it be related to https://issues.apache.org/jira/browse/CASSANDRA-7868
?

regards,
Christian



On Tue, Feb 27, 2018 at 12:46 PM, Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in> wrote:

> Hi,
>
> Not always. I am getting this exception randomly. (One observation: mostly
> I get this exception when I add a new node to the cluster.)
>
> On Tue, Feb 27, 2018 at 4:29 PM, Nicolas Guyomar <
> nicolas.guyo...@gmail.com> wrote:
>
>> Hi,
>>
>> Adding the java-driver ML for further questions, because this does look
>> like a bug
>>
>> Are you able to reproduce it in a clean environment using the same C*
>> version and driver version?
>>
>>
>> On 27 February 2018 at 10:05, Abhishek Kumar Maheshwari <
>> abhishek.maheshw...@timesinternet.in> wrote:
>>
>>> Hi Alex,
>>>
>>> I have only one DC (named DC1) and only one keyspace, so I don't think
>>> either scenario is possible. (Yes, in my case QUORUM is equivalent to ALL.)
>>>
>>> cqlsh> SELECT * FROM system_schema.keyspaces  where
>>> keyspace_name='adlog' ;
>>>
>>>  keyspace_name | durable_writes | replication
>>> ---------------+----------------+--------------------------------------------------------
>>>
>>>  adlog         |           True | {'DC1': '2', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
>>>
>>>
>>> On Tue, Feb 27, 2018 at 2:27 PM, Oleksandr Shulgin <
>>> oleksandr.shul...@zalando.de> wrote:
>>>
 On Tue, Feb 27, 2018 at 9:45 AM, Abhishek Kumar Maheshwari <
 abhishek.maheshw...@timesinternet.in> wrote:

>
> i have a KeySpace in Cassandra (cassandra version 3.0.9- total 12
> Servers )With below definition:
>
> {'DC1': '2', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
>
> Some time i am getting below exception
>
> [snip]

> Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException:
> Cassandra timeout during write query at consistency QUORUM (3 replica were
> required but only 2 acknowledged the write)
> at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
> at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
> at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:525)
> at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1077)
>
> why is it waiting for an acknowledgement from a 3rd server when the
> replication factor is 2?
>

 I see two possibilities:

 1) The data in this keyspace is replicated to another DC, so there is
 also 'DC2': '2', for example, but you didn't show it.  In this case QUORUM
 requires more than 2 nodes.
 2) The write was targeting a table in a different keyspace than you
 think.

 In any case QUORUM (or LOCAL_QUORUM) with RF=2 is equivalent to ALL.
 Not sure why you would use it in the first place.

 For consistency levels involving quorum you want to go with RF=3 in a
 single DC.  For multi DC you should think if you want QUORUM or EACH_QUORUM
 for your writes and figure out the RFs from that.

 Cheers,
 --
 Alex


>>>
>>>
>>> --
>>>
>>> *Thanks & Regards,*
>>> *Abhishek Kumar Maheshwari*
>>> *+91-805591 (Mobile)*
>>>
>>> Times Internet Ltd. | A Times of India Group Company
>>>
>>> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>>>
>>> *P** Please do not print this email unless it is absolutely necessary.
>>> Spread environmental awareness.*
>>>
>>
>>
>
>
> --
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91-805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
> *P** Please do not print this email unless it is absolutely necessary.
> Spread environmental awareness.*
>
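Alex's arithmetic above (QUORUM with RF=2 behaving like ALL) can be sketched as a standalone model of the quorum formula, (replication factor / 2) + 1 with integer division:

```java
public class QuorumMath {
    // quorum = (replication_factor / 2) + 1, using integer division
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorum(2)); // 2 -> every replica must answer: QUORUM == ALL
        System.out.println(quorum(3)); // 2 -> one replica may be down
        System.out.println(quorum(4)); // 3
    }
}
```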


Re: How do you run integration tests for your cassandra code?

2014-10-13 Thread horschi
Hi Kevin,

I run my tests against my locally running Cassandra instance. I am not
using any framework, but simply truncate all my tables after/before each
test, which I am quite happy with.

You have to enable the unsafeSystem property, disable durable writes on the
CFs and disable auto-snapshot in the yaml for it to be fast.
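A sketch of those three settings (ks is a placeholder keyspace name; exact syntax varies by Cassandra version):

```
# cassandra-env.sh -- allow non-durable system keyspace writes
JVM_OPTS="$JVM_OPTS -Dcassandra.unsafesystem=true"

# cassandra.yaml -- don't snapshot on TRUNCATE/DROP
auto_snapshot: false

# durable writes are a keyspace-level property (cqlsh):
ALTER KEYSPACE ks WITH durable_writes = false;
```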

kind regards,
Christian


On Mon, Oct 13, 2014 at 9:50 PM, Kevin Burton  wrote:

> Curious to see if any of you have an elegant solution here.
>
> Right now I'm using cassandra-unit:
>
> https://github.com/jsevellec/cassandra-unit
>
> for my integration tests.
>
> The biggest problem is that it doesn't support shutdown, so I can't stop
> or clean up after cassandra between tests.
>
> I have other Java daemons that have the same problem.  For example,
> ActiveMQ doesn’t clean up after itself.
>
> I was *thinking* of using docker or vagrant to startup a daemon in a
> container, then shut it down between tests.
>
> But this seems difficult to setup and configure … as well as being not
> amazingly portable.
>
> Another solution is to use a test suite, and a setUp/tearDown that drops
> all tables created by a test.   This way you’re still on the same cassandra
> instance, but the tables are removed for each pass.
>
> Anyone have an elegant solution to this?
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
> 
>
>


Query returning tombstones

2015-04-29 Thread horschi
Hi,

did anybody ever raise a feature request for selecting tombstones in
CQL/thrift?

It would be nice if I could use CQLSH to see where my tombstones are coming
from. This would be much more convenient than using sstable2json.

Maybe someone can point me to an existing jira-ticket, but I also
appreciate any other feedback :-)

regards,
Christian


Re: Query returning tombstones

2015-05-03 Thread horschi
Hi Jens,

thanks a lot for the link! Your ticket seems very similar to my request.

kind regards,
Christian


On Sat, May 2, 2015 at 2:25 PM, Jens Rantil  wrote:

> Hi Christian,
>
> I just know Sylvain explicitly stated he wasn't a fan of exposing
> tombstones here:
> https://issues.apache.org/jira/browse/CASSANDRA-8574?focusedCommentId=14292063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292063
>
> Cheers,
> Jens
>
> On Wed, Apr 29, 2015 at 12:43 PM, horschi  wrote:
>
>> Hi,
>>
>> did anybody ever raise a feature request for selecting tombstones in
>> CQL/thrift?
>>
>> It would be nice if I could use CQLSH to see where my tombstones are
>> coming from. This would be much more convenient than using sstable2json.
>>
>> Maybe someone can point me to an existing jira-ticket, but I also
>> appreciate any other feedback :-)
>>
>> regards,
>> Christian
>>
>>
>
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
> Facebook <https://www.facebook.com/#!/tink.se> Linkedin
> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>  Twitter <https://twitter.com/tink>
>


Re: How to minimize Cassandra memory usage for test environment?

2015-06-09 Thread horschi
Hi Eax,

are you truncating/dropping tables between tests? Are your issues perhaps
related to that?

If you are, you should disable autoSnapshots and enable -DunsafeSystem=true
to make it run smoother.

kind regards,
Christian

On Tue, Jun 9, 2015 at 11:25 AM, Jason Wee  wrote:

> for a start, maybe you can see the setting use by raspberry pi project,
> for instance
> http://ac31004.blogspot.com/2012/05/apache-cassandra-on-raspberry-pi.html
>
> you can look at these two files, to tune down the settings for test
> environment.
> cassandra-env.sh
> cassandra.yaml
>
> hth
>
> jason
>
> On Tue, Jun 9, 2015 at 3:59 PM, Eax Melanhovich  wrote:
>
>> Hello.
>>
>> We are running integration tests, using real Cassandra (not a mock)
>> under Vagrant. MAX_HEAP_SIZE is set to 500M. As I discovered, lower
>> value causes 'out of memory' after some time.
>>
>> Could memory usage be decreased somehow? Developers don't usually have
>> a lot of free RAM and performance obviously is not an issue in this
>> case.
>>
>> --
>> Best regards,
>> Eax Melanhovich
>> http://eax.me/
>>
>
>


Murmur Long.MIN_VALUE token allowed?

2013-12-04 Thread horschi
Hi,

I just realized that I can move a node to Long.MIN_VALUE:

127.0.0.1  rack1   Up Normal  1011.58 KB  100.00%
-9223372036854775808

Is that really a valid token for Murmur3Partitioner ?

I thought that Long.MIN_VALUE (like -1 for Random) is not a regular token.
Shouldn't it only be used for token-range-scans?

kind regards,
Christian


Re: Murmur Long.MIN_VALUE token allowed?

2013-12-10 Thread horschi
Hi Aaron,

thanks for your response. But that is exactly what scares me:
RandomPartitioner.MIN is -1, which is not a valid token :-)

And my feeling gets worse when I look at Murmur3Partitioner.normalize().
This one explicitly excludes Long.MIN_VALUE by changing it to
Long.MAX_VALUE.
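For reference, the behaviour described above as a standalone sketch (a model of the normalize step, not the Murmur3Partitioner class itself):

```java
public class MurmurNormalize {
    // Long.MIN_VALUE is folded onto Long.MAX_VALUE so the minimum token
    // never occurs as a regular hash value; everything else passes
    // through unchanged.
    static long normalize(long hash) {
        return hash == Long.MIN_VALUE ? Long.MAX_VALUE : hash;
    }

    public static void main(String[] args) {
        System.out.println(normalize(Long.MIN_VALUE) == Long.MAX_VALUE); // true
        System.out.println(normalize(42L)); // 42
    }
}
```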

I think I'll just avoid it in the future. Better safe than sorry...

cheers,
Christian


On Tue, Dec 10, 2013 at 8:24 AM, Aaron Morton wrote:

> AFAIK any value that is a valid output from murmor3 is a valid token.
>
> The Murmur3Partitioner set’s min and max to long min and max…
>
> public static final LongToken MINIMUM = new LongToken(Long.MIN_VALUE);
> public static final long MAXIMUM = Long.MAX_VALUE;
>
> Cheers
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 5/12/2013, at 12:38 am, horschi  wrote:
>
> Hi,
>
> I just realized that I can move a node to Long.MIN_VALUE:
>
> 127.0.0.1  rack1   Up Normal  1011.58 KB  100.00%
> -9223372036854775808
>
> Is that really a valid token for Murmur3Partitioner ?
>
> I thought that Long.MIN_VALUE (like -1 for Random) is not a regular token.
> Shouldn't it only be used for token-range-scans?
>
> kind regards,
> Christian
>
>
>


Offline migration: Random->Murmur

2013-12-23 Thread horschi
Hi list,

has anyone ever tried to migrate a cluster from Random to Murmur?

We would like to do so, to have a more standardized setup. I wrote a small
(yet untested) utility, which should be able to read SSTable files from
disk and write them into a cassandra cluster using Hector. This migration
would be offline of course and would only work for smaller clusters.

Any thoughts on the topic?

kind regards,
Christian

PS: The reason for doing so is not "performance". It is to simplify
operational stuff for the years to come. :-)
import java.io.File;
import java.io.FilenameFilter;
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import me.prettyprint.cassandra.serializers.ByteBufferSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

import org.apache.cassandra.config.CFMetaData;
import org.apache.cassandra.config.Config;
import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.config.KSMetaData;
import org.apache.cassandra.config.Schema;
import org.apache.cassandra.db.Column;
import org.apache.cassandra.db.CounterColumn;
import org.apache.cassandra.db.DeletedColumn;
import org.apache.cassandra.db.ExpiringColumn;
import org.apache.cassandra.db.OnDiskAtom;
import org.apache.cassandra.db.columniterator.OnDiskAtomIterator;
import org.apache.cassandra.db.filter.QueryFilter;
import org.apache.cassandra.db.filter.QueryPath;
import org.apache.cassandra.db.filter.SliceQueryFilter;
import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.io.sstable.Descriptor;
import org.apache.cassandra.io.sstable.SSTableMetadata;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.io.sstable.SSTableScanner;

/**
 * 
 * @author Christian Spriegel
 *
 */
public class BulkMigrator
{
private static Cluster  cluster;
private static String   keyspaceName = null;
private static Keyspace keyspace;

public static void main(String[] args) throws Exception
{
    //Config.setClientMode(true);
    Config.setLoadYaml(false);
    if (args.length != 1)
    {
        System.out.println("java -jar BulkMigrator.jar <path>");
        return;
    }

    String path = args[0];
    File dir = new File(path);
    if (!dir.exists() || !dir.isDirectory())
    {
        System.out.println("Path is not a directory: " + path);
        return;
    }

    cluster = HFactory.getOrCreateCluster("TestCluster", new CassandraHostConfigurator("localhost:9160"));
    FilenameFilter filter = new FilenameFilter()
    {
        @Override
        public boolean accept(File dir, String name)
        {
            if (!name.endsWith("-Data.db"))
                return false; // ignore non data files
            if (name.substring(0, name.length() - 8).contains("."))
                return false; // ignore secondary indexes
            return true;
        }
    };
    for (File f : dir.listFiles(filter))
    {
        System.out.println("Found file " + f + " ... ");
        Descriptor desc = Descriptor.fromFilename(dir, f.getName()).left;
        initCF(desc.ksname);
        System.out.println("Loaded descriptor " + desc + " ...");
        SSTableMetadata sstableMetadata = SSTableMetadata.serializer.deserialize(desc).left;
        DatabaseDescriptor.setPartitioner((IPartitioner) (Class.forName(sstableMetadata.partitioner).newInstance()));
        System.out.println("Using partitioner " + DatabaseDescriptor.getPartitionerName());
        SSTableReader ssreader = SSTableReader.open(desc);
        int numkeys = (int) (ssreader.estimatedKeys());
        System.out.println("Opened reader " + ssreader + " with " + numkeys + " keys");

        SliceQueryFilter atomfilter = new SliceQueryFilter(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, Integer.MAX_VALUE);
        String cfname = desc.cfname;
        SSTableScanner rrar = ssreader.getScanner(new QueryFilter(null, new QueryPath(cfname), atomfilter));

        Mutator<ByteBuffer> mutator = HFactory.createMutator(keyspace, ByteBufferSerializer.get());
        int keyi = 0;
        long bufsize = 0L;
        while (rrar.hasNext())
        {
            OnDiskAtomIterator odai = rrar.next();
            ByteBuffer rowkey = odai.getKey(

Re: Offline migration: Random->Murmur

2013-12-23 Thread horschi
Interesting you even dare to do a live migration :-)

Do you do all Murmur-writes with the timestamp from the "Random"-data? So
that all migrated data is written with timestamps from the past.



On Mon, Dec 23, 2013 at 3:59 PM, Rahul Menon  wrote:

> Christian,
>
> I have been planning to migrate my cluster from random to murmur3 in a
> similar manner. I intend to use pycassa to read and then write to the newer
> cluster. My only concern would be ensuring the consistency of already
> migrated data as the cluster ( with random ) would be constantly serving
> the production traffic. I was able to do this on a non prod cluster, but
> production is a different game.
>
> I would also like to hear more about this, especially if someone was able
> to successfully do this.
>
> Thanks
> Rahul
>
>
> On Mon, Dec 23, 2013 at 6:45 PM, horschi  wrote:
>
>> Hi list,
>>
>> has anyone ever tried to migrate a cluster from Random to Murmur?
>>
>> We would like to do so, to have a more standardized setup. I wrote a
>> small (yet untested) utility, which should be able to read SSTable files
>> from disk and write them into a cassandra cluster using Hector. This
>> migration would be offline of course and would only work for smaller
>> clusters.
>>
>> Any thoughts on the topic?
>>
>> kind regards,
>> Christian
>>
>> PS: The reason for doing so is not "performance". It is to simplify
>> operational stuff for the years to come. :-)
>>
>
>


Re: Cassandra unit testing becoming nearly impossible: suggesting alternative.

2013-12-25 Thread horschi
Hi Ed,

my opinion on unit testing with C* is: Use the real database, not any
embedded crap :-)

All you need are fast truncates, by which I mean:
JVM_OPTS="$JVM_OPTS -Dcassandra.unsafesystem=true"
and
auto_snapshot: false

This setup works really nice for me (C* 1.1 and 1.2, have not tested 2.0
yet).

Imho this setup is better for multiple reasons:
- No extra classpath issues
- Faster: Running JUnits and C* in one JVM would require a really large
heap (for me at least).
- Faster: No Cassandra startup everytime I run my tests.

The only downside is that developers must change the properties in their
configs.

cheers,
Christian



On Tue, Dec 24, 2013 at 9:31 PM, Edward Capriolo wrote:

> I am not sure there how many people have been around developing Cassandra
> for as long as I have, but the state of all the client libraries and the
> cassandra server is WORD_I_DONT_WANT_TO_SAY.
>
> Here is an example of something I am seeing:
> ERROR 14:59:45,845 Exception in thread Thread[Thrift:5,5,main]
> java.lang.AbstractMethodError:
> org.apache.thrift.ProcessFunction.isOneway()Z
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:51)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> DEBUG 14:59:51,654 retryPolicy for schema_triggers is 0.99
>
> In short: If you are new to cassandra and only using the newest client I
> am sure everything is peachy for you.
>
> For people that have been using Cassandra for a while it is harder to
> "jump ship" when something better comes along. You need sometimes to
> support both hector and astyanax, it happens.
>
> For a while I have been using hector. Even when not using hector as an API,
> the one nice thing I got from hector was a simple EmbeddedServer that
> would clean up after itself. Hector seems badly broken at the moment. I
> have no idea how the current versions track with anything out there in the
> cassandra world.
>
> For a while I played with https://github.com/Netflix/astyanax, which has
> it's own version and schemes and dependent libraries. (astyanax has some
> packaging error that forces me into maven3)
>
> Enter cassandra 2.0, which forces you onto Java 7. Besides that it has
> its own kit of things it seems to want.
>
> I am guessing since hector's embedded server does not work, I should go
> to https://github.com/jsevellec/cassandra-unit not sure...really...how
> anyone does this anymore. I am sure I could dive into the source code and
> figure this out, but I would just rather have a stable piece of code that
> brings up the embedded server that "just works" and "continues working".
>
> I can not seem to get this working right either. (since it includes hector
> I see from the pom)
>
> Between thrift, cassandra, client x, it is almost impossible to build a
> sane classpath, and that is not even counting the fact that people have
> their own classpath issues (with guava mismatches etc).
>
> I think the only sane thing to do is start shipping cassandra-embedded
> like this:
>
> https://github.com/kstyrc/embedded-redis
>
> In other words package embedded-cassandra as a binary. Don't force the
> client/application developer to bring cassandra on the classpath and fight
> with mismatches in thrift/guava etc. That or provide a completely shaded
> cassandra server for embedded testing. As it stands now trying to support a
> setup that uses more than one client or works with multiple versions of
> cassandra is major pita.  (aka library x compiled against 1.2.0 library y
> compiled against 2.0.3)
>
> Does anyone have any thoughts on this, or tried something similar?
>
> Edward
>
>


Re: Possible optimization: avoid creating tombstones for TTLed columns if updates to TTLs are disallowed

2014-01-28 Thread horschi
Hi Donald,

I reported the ticket you mentioned, so I kinda feel like I should
answer this :-)


 I presume the point is that GCable tombstones can still do work
> (preventing spurious writing from nodes that were down) but only until the
> data is flushed to disk.
>
I am not sure I understand this correctly. Could you rephrase that sentence?



> If the effective TTL exceeds gc_grace_seconds then the tombstone will be
> deleted anyway.
>
It's not even written (since CASSANDRA-4917). There is no delete on the
tombstone in that case.



>  It occurred to me that if you never update the TTL of a column, then
> there should be no need for tombstones at all:  any replicas will have the
> same TTL.  So there'd be no risk of missed deletes.  You wouldn't even need
> GCable tombstones
>
I think so too. There should be no need for a tombstone at all if the
following conditions are given:
- column was not deleted manually, but timed out by itself
- column was not updated in the last gc_grace days

If I am not mistaken, the second point would even be necessary for
CASSANDRA-4917 to be able to handle changing TTLs correctly: I think the
current implementation might break if a column gets updated with a smaller
TTL, or to be more precise when (new.creationdate + new.ttl) <
(old.creationdate + old.ttl) && new.ttl < gc_grace


Imho, for any further tombstone-optimization to work, compaction would have
to be smarter: I think it should be able to track max(old.creationdate +
old.ttl, new.creationdate + new.ttl) when merging columns. I have no idea if
that is possible though.
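The proposed bookkeeping could look like this (hypothetical — no such hook exists in compaction today):

```java
public class ExpiryMerge {
    // When merging two versions of a column, remember the latest instant at
    // which *any* version could still be live; a tombstone would only be safe
    // to drop once that horizon (plus gc_grace) has passed.
    static long mergedExpiryHorizon(long oldCreated, long oldTtl,
                                    long newCreated, long newTtl) {
        return Math.max(oldCreated + oldTtl, newCreated + newTtl);
    }

    public static void main(String[] args) {
        // older write (expires at 600) outlives a newer, shorter-TTL overwrite (expires at 250)
        System.out.println(mergedExpiryHorizon(100, 500, 200, 50)); // 600
    }
}
```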


>
> So, if - and it's a big if - a table disallowed updates to TTL, then you
> could really optimize deletion of TTLed columns: you could do away with
> tombstones entirely.   If a table allows updates to TTL then it's possible
> a different node will have the row without the TTL and the tombstone would
> be needed.
>
I am not sure I understand this. My "thrift" understanding of cassandra is
that you cannot update the TTL, you can just update an entire column. Also
each column has its own TTL. There is no TTL on the row.


cheers,
Christian


Re: Expired column showing up

2014-02-14 Thread horschi
Hi Mahesh,

is it possible you are creating columns with a long TTL, then update these
columns with a smaller TTL?

kind regards,
Christian


On Fri, Feb 14, 2014 at 3:45 PM, mahesh rajamani
wrote:

> Hi,
>
> I am using Cassandra 2.0.2 version. On a wide row (approx. 1 columns),
> I expire a few columns by setting TTL to 1 second. At times these columns
> show up during slice queries.
>
> When I have this issue, running count and get commands for that row using
> the Cassandra cli gives different column counts.
>
> But once I run flush and compact, the issue goes off and expired columns
> don't show up.
>
> Can someone provide some help on this issue.
>
> --
> Regards,
> Mahesh Rajamani
>


Re: Expired column showing up

2014-02-17 Thread horschi
Hi Mahesh,

the problem is that every column is only tombstoned for as long as the
original column was valid.

So if the last update was only valid for 1 second, then the tombstone will
also be valid for 1 second! If the previous value was valid for a longer
time, that old value might reappear.
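A minimal model of that behaviour (illustrative only — real purging also involves gc_grace and compaction timing): the older value can resurface whenever it would outlive the shadow cast by the newer, shorter-lived write.

```java
public class ReappearCheck {
    // Per the explanation above: the tombstone left by the last (expired)
    // write is only good for that write's own TTL, so a previous value with
    // a longer TTL can outlive it until flush + compaction clean things up.
    static boolean mayReappear(long previousTtlSeconds, long lastTtlSeconds) {
        return previousTtlSeconds > lastTtlSeconds;
    }

    public static void main(String[] args) {
        System.out.println(mayReappear(3600, 1)); // true  (long-TTL column overwritten with TTL=1)
        System.out.println(mayReappear(1, 3600)); // false (overwrite outlives the old value)
    }
}
```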

Maybe you can explain why you are doing this?

kind regards,
Christian



On Mon, Feb 17, 2014 at 6:18 PM, mahesh rajamani
wrote:

> Christian,
>
> Yes. Is it a problem?  Can you explain what happens in this scenario?
>
> Thanks
> Mahesh
>
>
> On Fri, Feb 14, 2014 at 3:07 PM, horschi  wrote:
>
>> Hi Mahesh,
>>
>> is it possible you are creating columns with a long TTL, then update
>> these columns with a smaller TTL?
>>
>> kind regards,
>> Christian
>>
>>
>> On Fri, Feb 14, 2014 at 3:45 PM, mahesh rajamani <
>> rajamani.mah...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Cassandra 2.0.2 version. On a wide row (approx. 1
>>> columns), I expire few column by setting TTL as 1 second. At times these
>>> columns show up during slice query.
>>>
>>> When I have this issue, running count and get commands for that row
>>> using Cassandra cli it gives different column counts.
>>>
>>> But once I run flush and compact, the issue goes off and expired columns
>>> don't show up.
>>>
>>> Can someone provide some help on this issue.
>>>
>>> --
>>> Regards,
>>> Mahesh Rajamani
>>>
>>
>>
>
>
> --
> Regards,
> Mahesh Rajamani
>


Does NetworkTopologyStrategy in Cassandra 2.0 work?

2014-04-22 Thread horschi
Hi,

is it possible that NetworkTopologyStrategy does not work with Cassandra
2.0 any more?

I just updated my Dev Cluster to 2.0.7 and got UnavailableExceptions for
CQL&Thrift queries on my already existing column families, even though all
(two) nodes were up. Changing to SimpleStrategy fixed the issue.

Also I cannot switch back to NetworkTopologyStrategy:

[default@unknown] update keyspace MYKS with placement_strategy =
'NetworkTopologyStrategy';
Error constructing replication strategy class

[default@unknown] update keyspace MYKS with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy';
Error constructing replication strategy class


This does not seem to be something I encountered with 1.2 before. Can
anyone tell me which one is broken here, Cassandra or myself? :-)

cheers,
Christian


Re: Does NetworkTopologyStrategy in Cassandra 2.0 work?

2014-04-22 Thread horschi
Ok, it seems 2.0 now is simply stricter about datacenter names. I simply
had to change the datacenter name to match the name in "nodetool ring":

update keyspace MYKS with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {datacenter1 : 2};

So the schema was wrong, but 1.2 did not care about it.

cheers,
Christian


On Tue, Apr 22, 2014 at 1:51 PM, horschi  wrote:

> Hi,
>
> is it possible that NetworkTopologyStrategy does not work with Cassandra
> 2.0 any more?
>
> I just updated my Dev Cluster to 2.0.7 and got UnavailableExceptions for
> CQL&Thrift queries on my already existing column families, even though all
> (two) nodes were up. Changing to SimpleStrategy fixed the issue.
>
> Also I cannot switch back to NetworkTopologyStrategy:
>
> [default@unknown] update keyspace MYKS with placement_strategy =
> 'NetworkTopologyStrategy';
> Error constructing replication strategy class
>
> [default@unknown] update keyspace MYKS with placement_strategy =
> 'org.apache.cassandra.locator.NetworkTopologyStrategy';
> Error constructing replication strategy class
>
>
> This does not seem to be something I encountered with 1.2 before. Can
> anyone tell me which one is broken here, Cassandra or myself? :-)
>
> cheers,
> Christian
>


Cassandra 2.0.8 MemoryMeter goes crazy

2014-06-14 Thread horschi
Hi everyone,

this week we upgraded one of our Systems from Cassandra 1.2.16 to 2.0.8.
All 3 nodes were upgraded. SStables are upgraded.

Unfortunately, we are now experiencing that Cassandra starts to hang every
10 hours or so.

We can see the MemoryMeter being very active, every time it is hanging.
Both in tpstats and in the system.log:

 INFO [MemoryMeter:1] 2014-06-14 19:24:09,488 Memtable.java (line 481)
CFS(Keyspace='MDS', ColumnFamily='ResponsePortal') liveRatio is 64.0
(just-counted was 64.0).  calculation took 0ms for 0 cells

This line is logged hundreds of times per second (!) while Cassandra is
hanging. The CPU is 100% busy.

Interestingly, this is only logged for this particular column family. This CF
is used as a queue, which only contains a few entries (datafiles are about
4kb, only ~100 keys, usually 1-2 active, 98-99 tombstones).

Table: ResponsePortal
SSTable count: 1
Space used (live), bytes: 4863
Space used (total), bytes: 4863
SSTable Compression Ratio: 0.9545454545454546
Number of keys (estimate): 128
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 1
Local read count: 0
Local read latency: 0.000 ms
Local write count: 5
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used, bytes: 176
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 50
Compacted partition mean bytes: 50
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0


Table: ResponsePortal
SSTable count: 1
Space used (live), bytes: 4765
Space used (total), bytes: 5777
SSTable Compression Ratio: 0.75
Number of keys (estimate): 128
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 12
Local read count: 0
Local read latency: 0.000 ms
Local write count: 1096
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used, bytes: 16
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 50
Compacted partition mean bytes: 50
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0


Has anyone ever seen this or has an idea what could be wrong? It seems that
2.0 cannot handle this column family as well as 1.2 could.

Any hints on what could be wrong are greatly appreciated :-)

Cheers,
Christian


Re: Cassandra 2.0.8 MemoryMeter goes crazy

2014-06-16 Thread horschi
Hi again,

before people start replying here: I just reported a Jira ticket:
https://issues.apache.org/jira/browse/CASSANDRA-7401

I think Memtable.maybeUpdateLiveRatio() needs some love.
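
For context, liveRatio is the memtable's estimated ratio of JVM heap usage to serialized data size, and Cassandra caps the estimate at 64.0, which is the value being logged over and over above. With zero cells the measurement has nothing to work with, matching the "calculation took 0ms for 0 cells" spam. A toy illustration of the idea, not Cassandra's actual code:

```python
MAX_LIVE_RATIO = 64.0  # upper bound applied to the estimate

def live_ratio(heap_bytes: int, serialized_bytes: int) -> float:
    """Estimate heap overhead per serialized byte, guarding the empty case."""
    if serialized_bytes == 0:
        # Nothing measured yet: fall back to the (pessimistic) cap.
        return MAX_LIVE_RATIO
    return min(heap_bytes / serialized_bytes, MAX_LIVE_RATIO)

print(live_ratio(0, 0))      # empty memtable: pinned at the cap
print(live_ratio(1024, 64))  # 1 KiB of heap for 64 serialized bytes
```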

kind regards,
Christian



On Sat, Jun 14, 2014 at 10:02 PM, horschi  wrote:

> Hi everyone,
>
> this week we upgraded one of our Systems from Cassandra 1.2.16 to 2.0.8.
> All 3 nodes were upgraded. SStables are upgraded.
>
> Unfortunately, we are now experiencing that Cassandra starts to hang every
> 10 hours or so.
>
> We can see the MemoryMeter being very active, every time it is hanging.
> Both in tpstats and in the system.log:
>
>  INFO [MemoryMeter:1] 2014-06-14 19:24:09,488 Memtable.java (line 481)
> CFS(Keyspace='MDS', ColumnFamily='ResponsePortal') liveRatio is 64.0
> (just-counted was 64.0).  calculation took 0ms for 0 cells
>
> This line is logged hundreds of times per second (!) when Cassandra is
> down. CPU is a 100% busy.
>
> Interestingly this is only logged for this particular Columnfamily. This
> CF is used as a queue, which only contains a few entries (datafiles are
> about 4kb, only ~100 keys, usually 1-2 active, 98-99 tombstones).
>
> Table: ResponsePortal
> SSTable count: 1
> Space used (live), bytes: 4863
> Space used (total), bytes: 4863
> SSTable Compression Ratio: 0.9545454545454546
> Number of keys (estimate): 128
> Memtable cell count: 0
> Memtable data size, bytes: 0
> Memtable switch count: 1
> Local read count: 0
> Local read latency: 0.000 ms
> Local write count: 5
> Local write latency: 0.000 ms
> Pending tasks: 0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used, bytes: 176
> Compacted partition minimum bytes: 43
> Compacted partition maximum bytes: 50
> Compacted partition mean bytes: 50
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
>
> Table: ResponsePortal
> SSTable count: 1
> Space used (live), bytes: 4765
> Space used (total), bytes: 5777
> SSTable Compression Ratio: 0.75
> Number of keys (estimate): 128
> Memtable cell count: 0
> Memtable data size, bytes: 0
> Memtable switch count: 12
> Local read count: 0
> Local read latency: 0.000 ms
> Local write count: 1096
> Local write latency: 0.000 ms
> Pending tasks: 0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.0
> Bloom filter space used, bytes: 16
> Compacted partition minimum bytes: 43
> Compacted partition maximum bytes: 50
> Compacted partition mean bytes: 50
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
>
> Has anyone ever seen this or has an idea what could be wrong? It seems
> that 2.0 cannot handle this column family as well as 1.2 could.
>
> Any hints on what could be wrong are greatly appreciated :-)
>
> Cheers,
> Christian
>


Re: Cassandra 2.0.8 MemoryMeter goes crazy

2014-06-16 Thread horschi
Hi Robert,

sorry, I am using our own internal terminology :-)

The entire cluster was upgraded. All 3 nodes of that cluster are on 2.0.8
now.

About the issue:
To me it looks like there is something wrong in the Memtable class. Some
very special edge case on CFs that are updated rarely. I can't say if it is
new to 2.0 or if it already existed in 1.2.

About running mixed versions:
I thought running mixed versions is ok. Running repair with mixed versions
is not though. Right?

kind regards,
Christian



On Mon, Jun 16, 2014 at 7:50 PM, Robert Coli  wrote:

> On Sat, Jun 14, 2014 at 1:02 PM, horschi  wrote:
>
>> this week we upgraded one of our Systems from Cassandra 1.2.16 to 2.0.8.
>> All 3 nodes were upgraded. SStables are upgraded.
>>
>
> One of your *clusters* or one of your *systems*?
>
> Running with split major versions is not supported.
>
> =Rob
>


Re: MemtablePostFlusher and FlushWriter

2014-07-15 Thread horschi
I have seen this behaviour when commitlog files got deleted (or permissions
were set to read-only).

MemtablePostFlusher is the stage that marks the commitlog as flushed. When
its tasks fail, it usually means there is something wrong with the commitlog
files.

Check your logfiles for any commitlog related errors.
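
A quick way to spot them is to scan the system log for error lines mentioning the commitlog. A minimal sketch (the log lines below are made up for illustration; in practice point it at your real system.log):

```python
def find_commitlog_errors(lines):
    """Return log lines that look like commitlog-related failures."""
    return [
        line for line in lines
        if ("ERROR" in line or "Exception" in line)
        and "commitlog" in line.lower()
    ]

# Illustrative sample; in practice read e.g. /var/log/cassandra/system.log.
sample_log = [
    "INFO  [FlushWriter:2] Completed flushing Memtable",
    "ERROR [COMMIT-LOG-ALLOCATOR] Failed writing CommitLog segment: Permission denied",
    "INFO  [MemtablePostFlusher:1] Discarding obsolete commitlog segment",
]

errors = find_commitlog_errors(sample_log)
print(errors)
```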

regards,
Christian


On Tue, Jul 15, 2014 at 7:03 PM, Kais Ahmed  wrote:

> Hi all,
>
> I have a small cluster (2 nodes RF 2)  running with C* 2.0.6 on I2 Extra
> Large (AWS) with SSD disk,
> the nodetool tpstats shows many MemtablePostFlusher pending and
> FlushWriter All time blocked.
>
> The two nodes have the default configuration. All CF use size-tiered
> compaction strategy.
>
> There are 10 times more reads than writes (1300 reads/s and 150 writes/s).
>
>
> ubuntu@node1 :~$ nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
> All time blocked
> MemtablePostFlusher   1  1158 159590
> 0 0
> FlushWriter   0 0  11568
> 0  1031
>
> ubuntu@node1:~$ nodetool compactionstats
> pending tasks: 90
> Active compaction remaining time :n/a
>
>
> ubuntu@node2:~$ nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
> All time blocked
> MemtablePostFlusher   1  1020  50987
> 0 0
> FlushWriter   0 0   6672
> 0   948
>
>
> ubuntu@node2:~$ nodetool compactionstats
> pending tasks: 89
> Active compaction remaining time :n/a
>
> I think there is something wrong, thank you for your help.
>
>


Re: MemtablePostFlusher and FlushWriter

2014-07-16 Thread horschi
Hi Ahmed,

this exception is caused by creating rows with a key length of more than
64KB. Your key seems to be 394920 bytes long.

Keys and column-names are limited to 64kb. Only values may be larger.

I cannot say for sure if this is the cause of your high MemtablePostFlusher
pending count, but I would say it is possible.
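
The 64KB limit comes from the 2-byte length prefix used when keys and column names are serialized ("writeWithShortLength" in the stack trace), giving a maximum of 65535 bytes. A simple client-side guard could look like this (illustrative, not Cassandra's actual code):

```python
MAX_KEY_LENGTH = 65535  # 2-byte unsigned short length prefix

def validate_key(key: bytes) -> None:
    """Reject row keys that exceed the serialization limit."""
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError(
            f"key length {len(key)} exceeds limit {MAX_KEY_LENGTH}")

validate_key(b"user:12345")       # fine
try:
    validate_key(b"x" * 394920)   # the length from the stack trace
except ValueError as e:
    print("rejected:", e)
```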

kind regards,
Christian

PS: I still use good old thrift lingo.






On Wed, Jul 16, 2014 at 3:14 PM, Kais Ahmed  wrote:

> Hi chris, christan,
>
> Thanks for reply, i'm not using DSE.
>
> I have in the log files, this error that appear two times.
>
> ERROR [FlushWriter:3456] 2014-07-01 18:25:33,607 CassandraDaemon.java
> (line 196) Exception in thread Thread[FlushWriter:3456,5,main]
> java.lang.AssertionError: 394920
> at
> org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:342)
> at
> org.apache.cassandra.db.ColumnIndex$Builder.maybeWriteRowHeader(ColumnIndex.java:201)
> at
> org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:188)
> at
> org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:133)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.rawAppend(SSTableWriter.java:202)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:187)
> at
> org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:365)
> at
> org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:318)
> at
> org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
>
> It's the same error than this link
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201305.mbox/%3cbay169-w52699dd7a1c0007783f8d8a8...@phx.gbl%3E
> ,
> with the same configuration 2 nodes RF 2 with SimpleStrategy.
>
> Hope this help.
>
> Thanks,
>
>
>
> 2014-07-16 1:49 GMT+02:00 Chris Lohfink :
>
> The MemtablePostFlusher is also used for flushing non-cf backed (solr)
>> indexes.  Are you using DSE and solr by chance?
>>
>> Chris
>>
>> On Jul 15, 2014, at 5:01 PM, horschi  wrote:
>>
>> I have seen this behaviour when commitlog files got deleted (or
>> permissions were set to read only).
>>
>> MemtablePostFlusher is the stage that marks the Commitlog as flushed.
>> When they fail it usually means there is something wrong with the commitlog
>> files.
>>
>> Check your logfiles for any commitlog related errors.
>>
>> regards,
>> Christian
>>
>>
>> On Tue, Jul 15, 2014 at 7:03 PM, Kais Ahmed  wrote:
>>
>>> Hi all,
>>>
>>> I have a small cluster (2 nodes RF 2)  running with C* 2.0.6 on I2 Extra
>>> Large (AWS) with SSD disk,
>>> the nodetool tpstats shows many MemtablePostFlusher pending and
>>> FlushWriter All time blocked.
>>>
>>> The two nodes have the default configuration. All CF use size-tiered
>>> compaction strategy.
>>>
>>> There are 10 times more reads than writes (1300 reads/s and 150
>>> writes/s).
>>>
>>>
>>> ubuntu@node1 :~$ nodetool tpstats
>>> Pool NameActive   Pending  Completed   Blocked
>>> All time blocked
>>> MemtablePostFlusher   1  1158 159590
>>> 0 0
>>> FlushWriter   0 0  11568
>>> 0  1031
>>>
>>> ubuntu@node1:~$ nodetool compactionstats
>>> pending tasks: 90
>>> Active compaction remaining time :n/a
>>>
>>>
>>> ubuntu@node2:~$ nodetool tpstats
>>> Pool NameActive   Pending  Completed   Blocked
>>> All time blocked
>>> MemtablePostFlusher   1  1020  50987
>>> 0 0
>>> FlushWriter   0 0   6672
>>> 0   948
>>>
>>>
>>> ubuntu@node2:~$ nodetool compactionstats
>>> pending tasks: 89
>>> Active compaction remaining time :n/a
>>>
>>> I think there is something wrong, thank you for your help.
>>>
>>>
>>
>>
>


Re: MemtablePostFlusher and FlushWriter

2014-07-17 Thread horschi
Hi Ahmed,

For that, you should increase the flush queue size setting in your
cassandra.yaml.

kind regards,
Christian
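
The setting in question (in the 1.x/2.0 yaml) is memtable_flush_queue_size, which defaults to 4; the value 8 below is just an example:

```yaml
# Number of full memtables allowed to wait for a flush writer.
# Raising this reduces "All time blocked" counts on FlushWriter,
# at the cost of holding more memtables on the heap during bursts.
memtable_flush_queue_size: 8
```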



On Thu, Jul 17, 2014 at 10:54 AM, Kais Ahmed  wrote:

> Thanks christian,
>
> I'll check on my side.
>
> Have you an idea about FlushWriter 'All time blocked'
>
> Thanks,
>
>
> 2014-07-16 16:23 GMT+02:00 horschi :
>
> Hi Ahmed,
>>
>> this exception is caused by you creating rows with a key-length of more
>> than 64kb. Your key is 394920 bytes long it seems.
>>
>> Keys and column-names are limited to 64kb. Only values may be larger.
>>
>> I cannot say for sure if this is the cause of your high
>> MemtablePostFlusher pending count, but I would say it is possible.
>>
>> kind regards,
>> Christian
>>
>> PS: I still use good old thrift lingo.
>>
>>
>>
>>
>>
>>
>> On Wed, Jul 16, 2014 at 3:14 PM, Kais Ahmed  wrote:
>>
>>> Hi chris, christan,
>>>
>>> Thanks for reply, i'm not using DSE.
>>>
>>> I have in the log files, this error that appear two times.
>>>
>>> ERROR [FlushWriter:3456] 2014-07-01 18:25:33,607 CassandraDaemon.java
>>> (line 196) Exception in thread Thread[FlushWriter:3456,5,main]
>>> java.lang.AssertionError: 394920
>>> at
>>> org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:342)
>>> at
>>> org.apache.cassandra.db.ColumnIndex$Builder.maybeWriteRowHeader(ColumnIndex.java:201)
>>> at
>>> org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:188)
>>> at
>>> org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:133)
>>> at
>>> org.apache.cassandra.io.sstable.SSTableWriter.rawAppend(SSTableWriter.java:202)
>>> at
>>> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:187)
>>> at
>>> org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:365)
>>> at
>>> org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:318)
>>> at
>>> org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
>>> at
>>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>>
>>>
>>> It's the same error than this link
>>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201305.mbox/%3cbay169-w52699dd7a1c0007783f8d8a8...@phx.gbl%3E
>>> ,
>>> with the same configuration 2 nodes RF 2 with SimpleStrategy.
>>>
>>> Hope this help.
>>>
>>> Thanks,
>>>
>>>
>>>
>>> 2014-07-16 1:49 GMT+02:00 Chris Lohfink :
>>>
>>> The MemtablePostFlusher is also used for flushing non-cf backed (solr)
>>>> indexes.  Are you using DSE and solr by chance?
>>>>
>>>> Chris
>>>>
>>>> On Jul 15, 2014, at 5:01 PM, horschi  wrote:
>>>>
>>>> I have seen this behaviour when commitlog files got deleted (or
>>>> permissions were set to read only).
>>>>
>>>> MemtablePostFlusher is the stage that marks the Commitlog as flushed.
>>>> When they fail it usually means there is something wrong with the commitlog
>>>> files.
>>>>
>>>> Check your logfiles for any commitlog related errors.
>>>>
>>>> regards,
>>>> Christian
>>>>
>>>>
>>>> On Tue, Jul 15, 2014 at 7:03 PM, Kais Ahmed  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a small cluster (2 nodes RF 2)  running with C* 2.0.6 on I2
>>>>> Extra Large (AWS) with SSD disk,
>>>>> the nodetool tpstats shows many MemtablePostFlusher pending and
>>>>> FlushWriter All time blocked.
>>>>>
>>>>> The two nodes have the default configuration. All CF use size-tiered
>>>>> compaction strategy.
>>>>>
>>>>> There are 10 times more reads than writes (1300 reads/s and 150
>>>>> writes/s).
>>>>>
>>>>>
>>>>> ubuntu@node1 :~$ nodetool tpstats
>>>>> Pool NameActive   Pending  Completed
>>>>> Blocked  All time blocked
>>>>> MemtablePostFlusher   1  1158 159590
>>>>> 0 0
>>>>> FlushWriter   0 0  11568
>>>>> 0  1031
>>>>>
>>>>> ubuntu@node1:~$ nodetool compactionstats
>>>>> pending tasks: 90
>>>>> Active compaction remaining time :n/a
>>>>>
>>>>>
>>>>> ubuntu@node2:~$ nodetool tpstats
>>>>> Pool NameActive   Pending  Completed
>>>>> Blocked  All time blocked
>>>>> MemtablePostFlusher   1  1020  50987
>>>>> 0 0
>>>>> FlushWriter   0 0   6672
>>>>> 0   948
>>>>>
>>>>>
>>>>> ubuntu@node2:~$ nodetool compactionstats
>>>>> pending tasks: 89
>>>>> Active compaction remaining time :n/a
>>>>>
>>>>> I think there is something wrong, thank you for your help.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Repair does not fix inconsistency

2013-04-04 Thread horschi
Hi Michal,

Let's say the tombstone on one of the nodes (X) is gcable and was not
> compacted (purged) so far. After it was created we re-created this row, but
> due some problems it was written only to the second node (Y), so we have
> "live" data on node Y which is newer than the gcable tombstone on replica
> node X. Some time ago we did NOT repair our cluster for a  while (well,
> pretty long while), so it's possible that such situation happened.
>
> My concern is: will AntiEntropy ignore this tombstone only, or basically
> everything related to the row key that this tombstone was created for?
>
It will only ignore the tombstone (which should have been repaired in a
previous repair anyway, assuming you run repairs within gc_grace). Any
newer columns (overwriting the tombstone) would still be alive and would
not be ignored.

The only way for CASSANDRA-4905 to make any difference is to not run repair
within gc_grace. With the patch it would not repair these old tombstones
any more. But in that case you should simply increase gc_grace and not undo
the patch :-)
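
The timing rule can be sketched numerically: a tombstone becomes purgeable only once its local deletion time plus gc_grace has elapsed, which is exactly why repair must complete within that window (a simplified model, not Cassandra's actual code):

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # the default gc_grace_seconds (10 days)

def is_purgeable(local_deletion_time: int, now: int,
                 gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A tombstone may be compacted away once gc_grace has elapsed."""
    return local_deletion_time + gc_grace <= now

now = 1_000_000_000
fresh = now - 8 * 24 * 3600    # deleted 8 days ago: must still be replicated
stale = now - 11 * 24 * 3600   # deleted 11 days ago: eligible for purge
print(is_purgeable(fresh, now), is_purgeable(stale, now))
```

If repair has not run by the time a tombstone becomes purgeable, a replica that never saw the delete can resurrect the row.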



"When I query (cqlsh) some rows by key (CL is default = ONE) I _always_ get
a correct result.  However, when I query it by indexed column, it returns
nothing."
This looks to me more like a secondary index issue. If you say the access
via rowkey is always correct, then the repair works fine. I think there
might be something wrong with your secondary index then.


Cheers,
Christian


Re: Repair does not fix inconsistency

2013-04-04 Thread horschi
Hi,

This was my first thought too, but if you take a look at the logs I
> attached to previous e-mail, you'll notice that query "by key"
> (no-index.log) retrieves data from BOTH replicas, while the "by indexed
> column" one (index.log) talks only to one of them (too bad it's the one
> that contains tombstone only - 1:7). In the first case it is possible to
> "resolve" the conflict and return the proper result, while in the second
> case it's impossible because tombstone is the only thing that is returned
> for this key.
>
Sorry, I did not look into the logs. That's the first time I'm seeing the
trace, btw. :-)

Does CQL not allow CL=ONE queries? Why does it ask two nodes for the key,
when you say that you are using CL=default=1? I'm a bit confused here (I'm
a thrift user).

But thinking about your theory some more: I think CASSANDRA-4905 might make
reappearing columns more common (only if you do not run repair within
gc_grace of course). Before CASSANDRA-4905 the tombstones would be repaired
even after gc_grace, so it was a bit more forgiving. It was never
guaranteed that the inconsistency would be repaired though.

I think you should have increased gc-grace or run repair within the 10 days.


The repair bit makes sense now in my head, unlike the CQL CL :-)

cheers,
Christian


Re: Repair does not fix inconsistency

2013-04-04 Thread horschi
> Well... Strange. We have such problem with 6 users, but there's only ONE
> tombstone (created 8 days ago, so it's not gcable yet) in all the SSTables
> on 2:1 node - checked using sstable2json.
> Moreover, this tombstone DOES NOT belong to the row key I'm using for
> tests, because this user was NOT ever removed / replaced.
> So now I have no bloody idea how C* can see a tombstone for this key when
> running a query. Maybe it's a index problem then?
>
Yes, maybe there are two issues here: repair not running and maybe really
some index-thing.

Maybe you can try a CL=ONE with cassandra-cli? So that we can see how it
works without index.


it tells me about "Key cache hit for  sstable" for SSTables 23 & 24, but I
> don't have such SSTables for Users CF. However, I have SSTables like this
> for index.
>
I think the index SSTables and the data SSTables are compacted separately,
so their generation numbers can differ from the data files even though they
are flushed together. (Anybody feel free to correct me on this.)


Re: Repair does not fix inconsistency

2013-04-04 Thread horschi
Repair is fine - all the data seem to be in SSTables. I've checked it and
> while index tells me that I have 1 tombstone and 0 live cells for a key, I
> can _see_, thanks to sstable2json, that I have 3 "live cells" (assuming a
> cell is an entry in SSTable) and 0 tombstones. After being confused for the
> most of the day, now I'm almost sure it's a problem with index (re)building.

I'm glad to hear that. I feared my ticket might be responsible for your
data loss. I could not live with the guilt ;-) Seriously: I'm glad we can
rule out the repair change.



> The same: for key-based query it returns correct result no matter if I use
> CL=ONE or TWO (or stronger). When querying by indexed column it works for
> CL=TWO or more, but returns no data for CL=ONE.

Yes, if it works with CL=one, then it must be the index. Check the
mailing-list, I think someone else posted something similar the other day.


cheers,
Christian


Re: MySQL Cluster performing faster than Cassandra cluster on single table

2013-04-16 Thread horschi
Hi Hannah,

mysql-cluster is a in-memory database.

In-memory is fast. But I don't think you'd ever be able to store hundreds of
gigabytes of data on a node, which is something you can do with Cassandra.

If your dataset is small, then maybe NDB is the better choice for you. I
myself will not touch it with a stick any more, I hate it with a passion.
But this might depend on the use case :-)

regards,
Christian

On Tue, Apr 16, 2013 at 12:56 PM, jrdn hannah  wrote:

> Hi,
>
> I was wondering if anybody here had any insight into this.
>
> I was running some tests on cassandra and mysql performance, with a two
> node and three node cassandra cluster, and a five node mysql cluster (mgmt,
> 2 x api, 2 x data).
>
> On the cassandra 2 node cluster vs mysql cluster, I was getting a couple
> of strange results. For example, on updating a single table in MySQL, with
> the equivalent super column in Cassandra, I was getting results of 0.231 ms
> for MySQL and 1.248ms for Cassandra to perform the update 1000 times.
>
> Could anybody help explain why this is the case?
>
> Thanks,
> Hannah


Re: MySQL Cluster performing faster than Cassandra cluster on single table

2013-04-16 Thread horschi
Ah, I see, that makes sense. Have you got a source for the storing of
> hundreds of gigabytes? And does Cassandra not store anything in memory?
>
It stores bloom filters and index-samples in memory. But they are much
smaller than the actual data and they can be configured.
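
To get a feel for why the bloom filters stay small relative to the data, the textbook Bloom filter sizing formula (the generic formula, not Cassandra's exact implementation) gives roughly 1.2 MB per million keys at a 1% false-positive rate:

```python
import math

def bloom_filter_size(n_keys: int, fp_rate: float):
    """Return (total bits, optimal hash count) for a Bloom filter."""
    bits = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = round(bits / n_keys * math.log(2))
    return bits, hashes

bits, hashes = bloom_filter_size(1_000_000, 0.01)
print(bits // 8, "bytes,", hashes, "hash functions")
```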


>
> Yeah, my dataset is small at the moment - perhaps I should have chosen
> something larger for the work I'm doing (University dissertation), however,
> it is far too late to change now!
>
On paper mysql-cluster looks great. But in daily use it's not as nice as
Cassandra (where you have machines dying, networks splitting, etc.).

cheers,
Christian


Re: (unofficial) Community Poll for Production Operators : Repair

2013-05-15 Thread horschi
Hi Alain,

have you had a look at the following tickets?

CASSANDRA-4905 - Repair should exclude gcable tombstones from merkle-tree
computation
CASSANDRA-4932 - Agree on a gcbefore/expirebefore value for all replica
during validation compaction
CASSANDRA-4917 - Optimize tombstone creation for ExpiringColumns
CASSANDRA-5398 - Remove localTimestamp from merkle-tree calculation (for
tombstones)

Imho these should reduce the over-repair to some degree. Especially when
using TTL. Some of them are already fixed in 1.2. The rest will (hopefully)
follow :-)

cheers,
Christian


On Wed, May 15, 2013 at 10:27 AM, Alain RODRIGUEZ wrote:

> Rob, I was wondering something. Are you a commiter working on improving
> the repair or something similar ?
>
> Anyway, if a commiter (or any other expert) could give us some feedback on
> our comments (Are we doing well or not, whether things we observe are
> normal or unexplained, what is going to be improved in the future about
> repair...)
>
> I am always interested on hearing about how things work and whether I am
> doing well or not.
>
> Alain
>
>


Cassandra 1.1.11 does not always show filename of corrupted files

2013-05-30 Thread horschi
Hi,

we had some hard-disk issues this week, which caused some datafiles to
become corrupt, which was reported during compaction. My approach to fixing
this was to delete the corrupted files and run repair. That sounded easy at
first, but unfortunately C* 1.1.11 sometimes does not show which datafile is
causing the exception.

How do you handle such cases? Do you delete the entire CF or do you look up
the compaction-started message and delete the files being involved?

In my opinion, the stack trace should always show the filename of the file
which could not be read. Does anybody know if there were already changes to
the logging since 1.1.11?
CASSANDRA-2261 does not seem to have fixed the exception-handling part.
Were there perhaps changes in 1.2 with the new disk-failure handling?

cheers,
Christian
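
Note that when a corrupt file is named in the log, all component files of that sstable (same generation number) need to be removed together, not just the -Data.db file. A small sketch of grouping 1.1-style component filenames by generation (the filenames are made up for illustration):

```python
import re
from collections import defaultdict

# 1.1-era component name: <Keyspace>-<CF>-<version>-<generation>-<Component>.db
PATTERN = re.compile(r"^(?P<ks>[^-]+)-(?P<cf>[^-]+)-\w+-(?P<gen>\d+)-\w+\.db$")

def group_by_generation(filenames):
    """Map sstable generation number -> its component files."""
    groups = defaultdict(list)
    for name in filenames:
        m = PATTERN.match(name)
        if m:
            groups[int(m.group("gen"))].append(name)
    return dict(groups)

files = [
    "MyKS-MyCF-hd-23-Data.db",
    "MyKS-MyCF-hd-23-Index.db",
    "MyKS-MyCF-hd-23-Filter.db",
    "MyKS-MyCF-hd-24-Data.db",
]
print(group_by_generation(files)[23])
```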

PS: Here are some examples I found in my logs:

*Bad behaviour:*
ERROR [ValidationExecutor:1] 2013-05-29 13:26:09,121
AbstractCassandraDaemon.java (line 132) Exception in thread
Thread[ValidationExecutor:1,1,main]
java.io.IOError: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
at
org.apache.cassandra.db.compaction.PrecompactedRow.(PrecompactedRow.java:99)
at
org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
at
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
at
org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
at
org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
at
org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at
com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at
org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:726)
at
org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:69)
at
org.apache.cassandra.db.compaction.CompactionManager$9.call(CompactionManager.java:457)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
at
org.apache.cassandra.io.compress.SnappyCompressor.uncompress(SnappyCompressor.java:94)
at
org.apache.cassandra.io.compress.CompressedRandomAccessReader.decompressChunk(CompressedRandomAccessReader.java:90)
at
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:71)
at
org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:302)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:397)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
at
org.apache.cassandra.utils.BytesReadTracker.readFully(BytesReadTracker.java:95)
at
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:401)
at
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:363)
at
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:114)
at
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:144)
at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:234)
at
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:112)
... 19 more

*Also bad behaviour:*
ERROR [CompactionExecutor:1] 2013-05-29 13:12:58,896
AbstractCassandraDaemon.java (line 132) Exception in thread
Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.IOException: java.util.zip.DataFormatException:
incomplete dynamic bit lengths tree
at
org.apache.cassandra.db.compaction.PrecompactedRow.merge

Re: Cassandra 1.1.11 does not always show filename of corrupted files

2013-05-31 Thread horschi
Ok, looking at the code I can see that 1.2 fixes the issue:
try
{
    validBufferBytes = metadata.compressor().uncompress(
        compressed.array(), 0, chunk.length, buffer, 0);
}
catch (IOException e)
{
    throw new CorruptBlockException(getPath(), chunk);
}

So that's nice :-)

But does nobody else find the old behaviour annoying? Has nobody ever wanted
to identify the broken files?

cheers,
Christian

On Thu, May 30, 2013 at 7:11 PM, horschi  wrote:

> Hi,
>
> we had some hard-disk issues this week, which caused some datafiles to get
> corrupt, which was reported by the compaction. My approach to fix this was
> to delete the corrupted files and run repair. That sounded easy at first,
> but unfortunately C* 1.1.11 sometimes does not show which datafile is
> causing the exception.
>
> How do you handle such cases? Do you delete the entire CF or do you look
> up the compaction-started message and delete the files being involved?
>
> In my opinion the Stacktrace should always show the filename of the file
> which could not be read. Does anybody know if there were already changes to
> the logging since 1.1.11? 
> CASSANDRA-2261<https://issues.apache.org/jira/browse/CASSANDRA-2261>does not 
> seem to have fixed the Exceptionhandling part. Were there perhaps
> changes in 1.2 with the new disk-failure handling?
>
> cheers,
> Christian
>
> PS: Here are some examples I found in my logs:
>
> *Bad behaviour:*
> ERROR [ValidationExecutor:1] 2013-05-29 13:26:09,121
> AbstractCassandraDaemon.java (line 132) Exception in thread
> Thread[ValidationExecutor:1,1,main]
> java.io.IOError: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
> at
> org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:116)
> at
> org.apache.cassandra.db.compaction.PrecompactedRow.(PrecompactedRow.java:99)
> at
> org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:176)
> at
> org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:83)
> at
> org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:68)
> at
> org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
> at
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
> at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
> at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
> at
> com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
> at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
> at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
> at
> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:726)
> at
> org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:69)
> at
> org.apache.cassandra.db.compaction.CompactionManager$9.call(CompactionManager.java:457)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
> at
> org.apache.cassandra.io.compress.SnappyCompressor.uncompress(SnappyCompressor.java:94)
> at
> org.apache.cassandra.io.compress.CompressedRandomAccessReader.decompressChunk(CompressedRandomAccessReader.java:90)
> at
> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:71)
> at
> org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:302)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:397)
> at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
> at
> org.apache.cassandra.utils.BytesReadTracker.readFully(BytesReadTracker.java:95)
> at
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:401)
> at
> org.apache.cassandr

Re: Compacted data returns with repair?

2013-06-04 Thread horschi
Hi,

this sounds like the following issue:

https://issues.apache.org/jira/browse/CASSANDRA-4905

cheers,
Ch

On Tue, Jun 4, 2013 at 5:50 PM, André Cruz  wrote:

> Hello.
>
> I deleted a lot of data from one of my CFs, waited the gc_grace_period,
> and as the compactions were deleting the data slowly, ran a major
> compaction on that CF. It reduced the size to what I expected. I did not
> run a major compaction on the other 2 nodes (RF = 3) before repairs took
> place and now the CF is again jumping in size on the node I ran the major
> compaction. Is this expected? Does the repair get the data back from the
> other nodes even though it should be gone?
>
> I should add I'm using Cassandra 1.1.5.
>
> Thanks,
> André


Re: Cassandra optimizations for multi-core machines

2013-06-05 Thread horschi
Hi,

Cassandra is heavily multithreaded. If the load demands it will make use of
your 8 cores.

I don't know the startup code, but I would assume it would be parallelized
if necessary/possible. AFAIK there were already optimizations made to
reduce the startup time, so I would assume any optimizations left
would not be easy.

cheers,
Christian Spriegel



On Wed, Jun 5, 2013 at 11:04 PM, srmore  wrote:

> Hello All,
> We are thinking of going with Cassandra on a 8 core machine, are there any
> optimizations that can help us here ?
>
> I have seen that during startup stage Cassandra uses only one core, is
> there a way we can speed up the startup process ?
>
> Thanks !
>


C* 1.2.5 AssertionError in ColumnSerializer:40

2013-07-01 Thread horschi
Hi,

using C* 1.2.5 I just found a weird AssertionError in our logfiles:

...
 INFO [OptionalTasks:1] 2013-07-01 09:15:43,608 MeteredFlusher.java (line
58) flushing high-traffic column family CFS(Keyspace='Monitoring',
ColumnFamily='cfDateOrderedMessages') (estimated 5242880 bytes)
 INFO [OptionalTasks:1] 2013-07-01 09:15:43,609 ColumnFamilyStore.java
(line 630) Enqueuing flush of
Memtable-cfDateOrderedMessages@2147245119(4616888/5242880
serialized/live bytes, 23714 ops)
 INFO [FlushWriter:9] 2013-07-01 09:15:43,610 Memtable.java (line 461)
Writing Memtable-cfDateOrderedMessages@2147245119(4616888/5242880
serialized/live bytes, 23714 ops)
ERROR [FlushWriter:9] 2013-07-01 09:15:44,145 CassandraDaemon.java (line
192) Exception in thread Thread[FlushWriter:9,5,main]
java.lang.AssertionError
at
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:40)
at
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:30)
at
org.apache.cassandra.db.OnDiskAtom$Serializer.serializeForSSTable(OnDiskAtom.java:62)
at
org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:181)
at
org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:133)
at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:185)
at
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:489)
at
org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:448)
at
org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


I looked into the code and it seems to be coming from the following code:

public void serialize(IColumn column, DataOutput dos) throws IOException
{
assert column.name().remaining() > 0; // crash
ByteBufferUtil.writeWithShortLength(column.name(), dos);
try
{...
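For context, that assertion fires when the column-name ByteBuffer has no remaining bytes, i.e. an empty column name somehow reached the memtable. A minimal illustration (not Cassandra code):

```java
import java.nio.ByteBuffer;

// Illustrative only: the assert in ColumnSerializer checks
// column.name().remaining() > 0, so a zero-length name buffer trips it.
class EmptyNameDemo {
    public static void main(String[] args) {
        ByteBuffer name = ByteBuffer.wrap(new byte[0]); // zero-length name
        System.out.println(name.remaining() > 0);       // false — would trip the assert
    }
}
```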


Does anybody have an idea why this is happening? The machine has some
issues with its disks, but flush shouldn't be affected by bad disks, right?
I can rule out that this memtable was filled by a bad commitlog.

Thanks,
Christian


Re: TTL, Tombstones, and gc_grace

2013-07-25 Thread horschi
Hi Michael,

yes, you should never lose a delete, because there are no real deletes. No
matter what version you are using.

btw: There is actually a ticket that builds an optimization on top of that
assumption: CASSANDRA-4917. Basically, if TTL>gc_grace then do not create
tombstones for expiring columns. This works because the columns disappear
anyway once the TTL is over.
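The rule from the ticket boils down to a one-line check; a toy sketch (names are illustrative, not the actual Cassandra code):

```java
// Sketch of the CASSANDRA-4917 idea (illustrative only): an expiring
// column needs a tombstone on expiry only if its TTL is shorter than
// gc_grace — otherwise every replica expires it on its own, so nothing
// can be "un-deleted".
class TombstoneRule {
    static boolean needsTombstone(int ttlSeconds, int gcGraceSeconds) {
        return ttlSeconds <= gcGraceSeconds;
    }

    public static void main(String[] args) {
        // 1-day TTL, 10-day gc_grace: tombstone still needed
        System.out.println(needsTombstone(86400, 864000));
        // 10-day TTL, 1-day gc_grace: no tombstone needed
        System.out.println(needsTombstone(864000, 86400));
    }
}
```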

cheers,
Christian


On Thu, Jul 25, 2013 at 3:24 PM, Michael Theroux wrote:

> Hello,
>
> Quick question on Cassandra, TTLs, tombstones, and GC grace.  If we have a
> column family whose only mechanism of deleting columns is utilizing TTLs,
> is repair really necessary to make tombstones consistent, and therefore
> would it be safe to set the gc grace period of the column family to a very
> low value?
>
> I ask because of this blog post based on Cassandra .7:
> http://www.datastax.com/dev/blog/whats-new-cassandra-07-expiring-columns.
>
> "The first time the expired column is compacted, it is transformed into a
> tombstone. This transformation frees some disk space: the size of the value
> of the expired column. From that moment on, the column is a normal
> tombstone and follows the tombstone rules: it will be totally removed by
> compaction (including minor ones in most cases since Cassandra 0.6.6) after
> GCGraceSeconds."
>
> Since tombstones are not written using a replicated write, but instead
> written during compaction, theoretically, it shouldn't be possible to lose
> a tombstone?  Or is this blog post inaccurate for later versions of
> cassandra?  We are using cassandra 1.1.11.
>
> Thanks,
> -Mike
>
>
>


Re: About column family

2013-07-25 Thread horschi
With 1.2.7 you can use -Dcassandra.unsafesystem. That will speed up cf
creation. So you will get in even more trouble even faster!



On Tue, Jul 23, 2013 at 12:23 PM, bjbylh  wrote:

> Hi all:
> i have two questions to ask:
> 1, how many column families can be created in a cluster? Is there a limit
> on the number?
> 2, it spends 2-5 seconds to create a new cf while the cluster contains
> about 1 cfs (if the cluster is empty, it spends about 0.5s). Is it
> normal? How to improve the efficiency of creating a cf?
> btw C* is 1.2.4.
> thanks a lot.
>
> Sent from Samsung Mobile
>


Re: How often to run `nodetool repair`

2013-08-01 Thread horschi
> TTL is effectively DELETE; you need to run a repair once every
> gc_grace_seconds. If you don't, data might un-delete itself.
>

The undelete part is not true. btw: With CASSANDRA-4917 TTLed columns will
not even create a tombstone (assuming ttl > gc_grace).

The rest of your mail I agree with :-)


Re: TTL and gc_grace_Seconds

2013-09-18 Thread horschi
Hi Christopher,

in 2.0 gc_grace should be capped by TTL anyway: see CASSANDRA-4917

cheers,
Christian



On Wed, Sep 18, 2013 at 4:29 PM, Christopher Wirt wrote:

> I have a column family that contains time series events; all columns have
> a 24-hour TTL and gc_grace_seconds is currently 20 days. There is a
> TimeUUID in part of the key.
>
>
> It takes 15 days to repair the entire ring.
>
>
> Consistency is not my main worry. Speed is. We currently write to this CF
> at LOCAL_QUORUM and read at ONE.
>
>
> Is there any reason to have gc_grace_seconds higher than the TTL? Feels
> like I’m just wasting resources all over given my consistency and speed
> requirements. 
>
> Second question.
>
> Does anyone vary their read repair ratio throughout the day, i.e. at peak
> turn off read repairs, turn to 0.7 for the graveyard shift?
>
> Cheers,
>
> Chris
>