Re: nodetool repair caused high disk space usage

2011-08-20 Thread Philippe
Péter,
In our case they get created exclusively during repairs. compactionstats
showed a huge number of sstable build compactions.
On Aug 20, 2011 1:23 AM, "Peter Schuller" 
wrote:
>> Is there any chance that the entire file from the source node got streamed to
>> the destination node even though only a small amount of data in the file from
>> the source node was supposed to be streamed to the destination node?
>
> Yes, but the thing that's annoying me is that even if so, you should
> not be seeing a 40 GB -> hundreds of GB increase even if all
> neighbors sent all their data.
>
> Can you check system.log for references to these sstables to see when
> and under what circumstances they got written?
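That log check can be sketched as follows (a minimal illustration; the log path and sstable name are placeholders for the actual files repair created on your node):

```python
def sstable_mentions(log_text, sstable_name):
    """Return (line_number, line) pairs for log lines mentioning an sstable."""
    return [(i, line) for i, line in enumerate(log_text.splitlines(), 1)
            if sstable_name in line]

# Tiny inline sample standing in for /var/log/cassandra/system.log (path
# and sstable name are hypothetical -- substitute your own):
sample = (
    "INFO [FlushWriter:1] 2011-08-18 07:23:17,822 Memtable.java (line 164) "
    "Completed flushing /cassandra/data/system/LocationInfo-f-66-Data.db (80 bytes)\n"
    "INFO [main] 2011-08-18 07:23:17,744 CommitLog.java (line 163) Log replay complete\n"
)
for lineno, line in sstable_mentions(sample, "LocationInfo-f-66"):
    print(lineno, line)
```

The timestamps and thread names on the matching lines tell you whether the sstables were written by a flush, a compaction, or an incoming stream.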
>
> --
> / Peter Schuller (@scode on twitter)


Re: node restart taking too long

2011-08-20 Thread Yan Chunlu
any suggestion? thanks!

On Fri, Aug 19, 2011 at 10:26 PM, Yan Chunlu  wrote:

> the log file shows as follows; not sure what 'Couldn't find cfId=1000'
> means (Google just returned useless results):
>
>
> INFO [main] 2011-08-18 07:23:17,688 DatabaseDescriptor.java (line 453)
> Found table data in data directories. Consider using JMX to call
> org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
>  INFO [main] 2011-08-18 07:23:17,705 CommitLogSegment.java (line 50)
> Creating new commitlog segment
> /cassandra/commitlog/CommitLog-1313670197705.log
>  INFO [main] 2011-08-18 07:23:17,716 CommitLog.java (line 155) Replaying
> /cassandra/commitlog/CommitLog-1313670030512.log
>  INFO [main] 2011-08-18 07:23:17,734 CommitLog.java (line 314) Finished
> reading /cassandra/commitlog/CommitLog-1313670030512.log
>  INFO [main] 2011-08-18 07:23:17,744 CommitLog.java (line 163) Log replay
> complete
>  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 364)
> Cassandra version: 0.7.4
>  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 365) Thrift
> API version: 19.4.0
>  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 378) Loading
> persisted ring state
>  INFO [main] 2011-08-18 07:23:17,766 StorageService.java (line 414)
> Starting up server gossip
>  INFO [main] 2011-08-18 07:23:17,771 ColumnFamilyStore.java (line 1048)
> Enqueuing flush of Memtable-LocationInfo@832310230(29 bytes, 1 operations)
>  INFO [FlushWriter:1] 2011-08-18 07:23:17,772 Memtable.java (line 157)
> Writing Memtable-LocationInfo@832310230(29 bytes, 1 operations)
>  INFO [FlushWriter:1] 2011-08-18 07:23:17,822 Memtable.java (line 164)
> Completed flushing /cassandra/data/system/LocationInfo-f-66-Data.db (80
> bytes)
>  INFO [CompactionExecutor:1] 2011-08-18 07:23:17,823 CompactionManager.java
> (line 396) Compacting
> [SSTableReader(path='/cassandra/data/system/LocationInfo-f-63-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-64-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-65-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-66-Data.db')]
>  INFO [main] 2011-08-18 07:23:17,853 StorageService.java (line 478) Using
> saved token 113427455640312821154458202477256070484
>  INFO [main] 2011-08-18 07:23:17,854 ColumnFamilyStore.java (line 1048)
> Enqueuing flush of Memtable-LocationInfo@18895884(53 bytes, 2 operations)
>  INFO [FlushWriter:1] 2011-08-18 07:23:17,854 Memtable.java (line 157)
> Writing Memtable-LocationInfo@18895884(53 bytes, 2 operations)
> ERROR [MutationStage:28] 2011-08-18 07:23:18,246
> RowMutationVerbHandler.java (line 86) Error in row mutation
> org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
> cfId=1000
> at
> org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
> at
> org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
> at
> org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:50)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:636)
>  INFO [GossipStage:1] 2011-08-18 07:23:18,255 Gossiper.java (line 623) Node
> /node1 has restarted, now UP again
> ERROR [ReadStage:1] 2011-08-18 07:23:18,254
> DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
> java.lang.IllegalArgumentException: Unknown ColumnFamily prjcache in
> keyspace prjkeyspace
> at
> org.apache.cassandra.config.DatabaseDescriptor.getComparator(DatabaseDescriptor.java:966)
> at
> org.apache.cassandra.db.ColumnFamily.getComparatorFor(ColumnFamily.java:388)
> at
> org.apache.cassandra.db.ReadCommand.getComparator(ReadCommand.java:93)
> at
> org.apache.cassandra.db.SliceByNamesReadCommand.&lt;init&gt;(SliceByNamesReadCommand.java:44)
> at
> org.apache.cassandra.db.SliceByNamesReadCommandSerializer.deserialize(SliceByNamesReadCommand.java:110)
> at
> org.apache.cassandra.db.ReadCommandSerializer.deserialize(ReadCommand.java:122)
> at
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:67)
>
>
>
> On Fri, Aug 19, 2011 at 5:44 AM, aaron morton wrote:
>
>> Look in the logs to find out why the migration did not get to node2.
>>
>> Otherwise yes you can drop those files.
>>
>> Cheers
>>
>>   -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 18/08/2011, at 11:25 PM, Yan Chunlu wrote:
>>
>> just found out that after making changes via cassandra-cli, the schema change
>> didn't reach node2, and node2 became unreachable.
>>
>> I did as this document:
>> http://wiki.apache.org/cassandra/FAQ#schema_disagreement
>>
>> but after that I just got two 

Re: 0.7.4: Replication assertion error after removetoken, removetoken force and a restart

2011-08-20 Thread Anand Somani
0.7.4/ 3 node cluster/ RF -3 /Quorum read/write

After I re-introduced a corrupted node, I followed the process listed on the
operations wiki for handling failures (thanks to the folks on the mailing list
for helping me).
Still doing a cleanup on one node at this point. But I noticed that I am
seeing the same exception appear 10-12 times a minute on an existing
node (not the new one). I think it started around the removetoken.

How do I solve this, should I just restart this node? Any other
cleanups/resets I need to do?

Thanks


On Thu, Apr 28, 2011 at 2:26 AM, aaron morton wrote:

> I *think* that code is used when one node tells the others via gossip that it
> is removing a token that is not its own. The node that receives the information
> via gossip does some work and then replies to the first node with a
> REPLICATION_FINISHED message, which is the node I assume the error is
> happening on.
>
> Have you been doing any moves / removes or additions of tokens/nodes?
>
> Thanks
> Aaron
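The failure mode described above can be sketched in miniature (a hypothetical model, not Cassandra's actual code): a REPLICATION_FINISHED confirmation arrives after the pending-removal state has already been cleared, for example by `removetoken force`, and an assertion trips.

```python
class StorageServiceSketch:
    """Toy model of the replication-confirmation bookkeeping (hypothetical)."""

    def __init__(self):
        # endpoints we still expect a REPLICATION_FINISHED confirmation from
        self.replicating_nodes = set()

    def start_removal(self, endpoint):
        self.replicating_nodes.add(endpoint)

    def force_removal(self):
        # 'removetoken force' stops waiting and clears the pending set
        self.replicating_nodes.clear()

    def confirm_replication(self, endpoint):
        # models the assertion in the stack trace: a confirmation arrives,
        # but no token removal is in progress any more
        assert self.replicating_nodes, "no token removal in progress"
        self.replicating_nodes.discard(endpoint)

demo = StorageServiceSketch()
demo.start_removal("10.xxx.0.189")
demo.force_removal()                         # operator forces the removal
try:
    demo.confirm_replication("10.xxx.0.189") # late REPLICATION_FINISHED
except AssertionError as e:
    print("AssertionError:", e)
```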
>
> On 28 Apr 2011, at 08:39, Alexis Lê-Quôc wrote:
>
> > Hi,
> >
> > I've been getting the following lately, every few seconds.
> >
> > 2011-04-27T20:21:18.299885+00:00 10.202.61.193 [MiscStage: 97] Error
> > in ThreadPoolExecutor
> > 2011-04-27T20:21:18.299885+00:00 10.202.61.193 java.lang.AssertionError
> > 2011-04-27T20:21:18.300038+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.service.StorageService.confirmReplication(StorageService.java:1872)
> > 2011-04-27T20:21:18.300038+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.streaming.ReplicationFinishedVerbHandler.doVerb(ReplicationFinishedVerbHandler.java:38)
> > 2011-04-27T20:21:18.300047+00:00 10.202.61.193 10.202.61.193   at
> >
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
> > 2011-04-27T20:21:18.300047+00:00 10.202.61.193 10.202.61.193   at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > 2011-04-27T20:21:18.300055+00:00 10.202.61.193 10.202.61.193   at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > 2011-04-27T20:21:18.300055+00:00 10.202.61.193 10.202.61.193   at
> > java.lang.Thread.run(Thread.java:636)
> > 2011-04-27T20:21:18.300555+00:00 10.202.61.193 [MiscStage: 97] Fatal
> > exception in thread Thread[MiscStage:97,5,main]
> >
> > I see it coming from
> > 32 public class ReplicationFinishedVerbHandler implements IVerbHandler
> > 33 {
> > 34 private static Logger logger =
> > LoggerFactory.getLogger(ReplicationFinishedVerbHandler.class);
> > 35
> > 36 public void doVerb(Message msg, String id)
> > 37 {
> > 38 StorageService.instance.confirmReplication(msg.getFrom());
> > 39 Message response =
> > msg.getInternalReply(ArrayUtils.EMPTY_BYTE_ARRAY);
> > 40 if (logger.isDebugEnabled())
> > 41 logger.debug("Replying to " + id + "@" + msg.getFrom());
> > 42 MessagingService.instance().sendReply(response, id,
> msg.getFrom());
> > 43 }
> > 44 }
> >
> > Before I dig deeper in the code, has anybody dealt with this before?
> >
> > Thanks,
> >
> > --
> > Alexis Lê-Quôc
>
>


Re: node restart taking too long

2011-08-20 Thread Peter Schuller
> the log file shows as follows; not sure what 'Couldn't find cfId=1000'
> means (Google just returned useless results):

Those are an indication that the schema is wrong on the node: it is
receiving reads and writes from other nodes for column families it
does not know about.

I don't know, without investigating, why the instructions from the
wiki don't work, though. You did the procedure of restarting the node
with the migrations/schema removed, right?
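That procedure (from the schema-disagreement FAQ linked earlier in the thread) boils down to moving the Schema* and Migrations* sstables out of the system keyspace directory while the node is stopped, then restarting so the node re-fetches the schema. A sketch of the file-moving step; the paths here are throwaway demo directories, not a real data directory:

```python
import shutil
from pathlib import Path

def stash_schema_sstables(system_dir, stash_dir):
    """Move Schema* and Migrations* sstables aside; only with the node STOPPED."""
    system_dir, stash_dir = Path(system_dir), Path(stash_dir)
    stash_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in ("Schema*", "Migrations*"):
        for f in system_dir.glob(pattern):
            shutil.move(str(f), str(stash_dir / f.name))
            moved.append(f.name)
    return moved

# Demo against a throwaway directory standing in for /cassandra/data/system:
demo = Path("/tmp/demo_system")
demo.mkdir(parents=True, exist_ok=True)
for name in ("Schema-f-1-Data.db", "Migrations-f-1-Data.db",
             "LocationInfo-f-66-Data.db"):
    (demo / name).touch()
print(sorted(stash_schema_sstables(demo, "/tmp/demo_stash")))
```

Note that only the schema-related sstables are touched; LocationInfo and all user data stay in place.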

-- 
/ Peter Schuller (@scode on twitter)


Re: node restart taking too long

2011-08-20 Thread Peter Schuller
Can you post the complete Cassandra log starting with the initial
start-up of the node after having removed schema/migrations?

-- 
/ Peter Schuller (@scode on twitter)


Re: Occasionally getting old data back with ConsistencyLevel.ALL

2011-08-20 Thread Peter Schuller
> Do you mean the cassandra log, or just logging in the script itself?

The script itself. I.e, some "independent" verification that the line
of code after the insert is in fact running, just in case there's some
kind of silent failure.

Sounds like you've tried to address that with the e-mails, though.

I suppose it boils down to: Either there is something wrong in your
environment/code, or Cassandra does have a bug. If the latter, it
would probably be helpful if you could try to reproduce it in your
environment in a way which can be shared - such as a script that does
writes and reads back to confirm the write made it. Or maybe just
adding more explicit logging to your script (even if it causes some
log flooding) to "prove" that a write truly happened.

-- 
/ Peter Schuller (@scode on twitter)


Re: nodetool repair caused high disk space usage

2011-08-20 Thread Peter Schuller
> In our case they get created exclusively during repairs. compactionstats
> showed a huge number of sstable build compactions.

Do you have an indication that at least the disk space is in fact
consistent with the amount of data being streamed between the nodes? I
think you had 90 GB -> ~450 GB with RF=3, right? That still sounds like a
lot, assuming repairs are not running concurrently (and compactions are
able to run after a repair before the next repair of a neighbor
starts).
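One way to see why that "sounds like a lot" is a back-of-envelope bound under pessimistic assumptions (my own sketch, not anything asserted in the thread): even if each of the RF-1 = 2 replica neighbors streamed a full copy of the shared ranges during one repair, a node's load would at most roughly triple, which still falls short of a 90 GB -> ~450 GB jump.

```python
def worst_case_after_one_repair(load_gb, rf):
    # Pessimistic assumption: every one of the (rf - 1) neighbors streams a
    # full copy of the shared ranges, and nothing is compacted away yet.
    return load_gb + (rf - 1) * load_gb

print(worst_case_after_one_repair(90, 3))
```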

-- 
/ Peter Schuller (@scode on twitter)


Re: node restart taking too long

2011-08-20 Thread Jonathan Ellis
This means you should upgrade, because we've fixed bugs about ignoring
deleted CFs since 0.7.4.

On Fri, Aug 19, 2011 at 9:26 AM, Yan Chunlu  wrote:
> the log file shows as follows, not sure what does 'Couldn't find cfId=1000'
> means(google just returned useless results):
>
> [startup log snipped; identical to the log quoted earlier in this thread]
>
>
> On Fri, Aug 19, 2011 at 5:44 AM, aaron morton 
> wrote:
>>
>> Look in the logs to find out why the migration did not get to node2.
>> Otherwise yes you can drop those files.
>> Cheers
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> On 18/08/2011, at 11:25 PM, Yan Chunlu wrote:
>>
>> just found out that after making changes via cassandra-cli, the schema change
>> didn't reach node2, and node2 became unreachable.
>> I did as this
>> document:http://wiki.apache.org/cassandra/FAQ#sche

Re: Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-20 Thread Anand Somani
Thanks for the help; this seems to have worked, except that while adding the
new node we assigned the same token to a different IP (an operational-script
goof-up) and brought the node up, so the other nodes just logged that a new IP
had taken over the token.


   - So we brought it down, fixed it, and it all came up fine.
   - Ran removetoken; it did not finish.
   - So we ran removetoken force; that seemed to work.
   - Cleaned up the nodes.
   - Everything from the ring perspective appeared OK on all nodes, except for
     this error message (which, based on some threads, seemed like it would go
     away), reported in this thread =>
     http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/0-7-4-Replication-assertion-error-after-removetoken-removetoken-force-and-a-restart-td6311082.html
   - So I restarted the one node that was complaining (this was not the node
     that was replaced).
   - But once this node was restarted, the ring command on it showed the old
     single-token IP (the one we removed).
   - So I am running the removetoken again; it has been running for about 2-3
     hours now.

the ring shows


113427455640312821154458202477256070485
10.xxx.0.184   Up Normal  829.73 GB   33.33%
0
10.xxx.0.185   Up Normal  576.09 GB   33.33%
56713727820156410577229101238628035241
10.xxx.0.189   Down   Leaving 139.73 KB   0.00%
56713727820156410577229101238628035242
10.xxx.0.188   Up Normal  697.41 GB   33.33%
113427455640312821154458202477256070485

What are my choices here? How do I clean up the ring? The other two nodes show
the ring as fine (they are not even aware of .189).

Thanks
Anand


On Fri, Aug 19, 2011 at 11:53 AM, Anand Somani  wrote:

> ok I will go with the IP change strategy and keep you posted. Not going to
> manually copy any data, just bring up the node and let it bootstrap.
>
> Thanks
>
>
> On Fri, Aug 19, 2011 at 11:46 AM, Peter Schuller <
> peter.schul...@infidyne.com> wrote:
>
>> > (Yes, this should definitely be easier. Maybe the most generally
>> > useful fix would be for Cassandra to support a node joining the ring
>> > in "write-only" mode. This would be useful in other cases, such as
>> > when you're trying to temporarily off-load a node by disabling
>> > gossip.)
>>
>> I knew I had read discussions before:
>>
>>   https://issues.apache.org/jira/browse/CASSANDRA-2568
>>
>> --
>> / Peter Schuller (@scode on twitter)
>>
>
>


Re: node restart taking too long

2011-08-20 Thread Yan Chunlu
that could be the reason; I did a nodetool repair (unfinished; the data size
grew roughly 6x, 30 GB vs 170 GB) and there are probably some unclean
sstables on that node.

however, an upgrade is tough work for me right now.  could nodetool scrub
help?  or should I decommission the node and join it again?


On Sun, Aug 21, 2011 at 5:56 AM, Jonathan Ellis  wrote:

> This means you should upgrade, because we've fixed bugs about ignoring
> deleted CFs since 0.7.4.
>
> On Fri, Aug 19, 2011 at 9:26 AM, Yan Chunlu  wrote:
> > the log file shows as follows, not sure what does 'Couldn't find
> cfId=1000'
> > means(google just returned useless results):
> >
> > [startup log snipped; identical to the log quoted earlier in this thread]

Re: node restart taking too long

2011-08-20 Thread Jonathan Ellis
I'm not sure what problem you're trying to solve.  The exception you
pasted should stop once your clients are no longer trying to use the
dropped CF.

On Sat, Aug 20, 2011 at 10:09 PM, Yan Chunlu  wrote:
> that could be the reason, I did nodetool repair(unfinished, data size
> increased 6 times bigger 30G vs 170G) and there should be some unclean
> sstables on that node.
> however upgrade it a tough work for me right now.  could the nodetool scrub
> help?  or decommission the node and join it again?
>
> On Sun, Aug 21, 2011 at 5:56 AM, Jonathan Ellis  wrote:
>>
>> This means you should upgrade, because we've fixed bugs about ignoring
>> deleted CFs since 0.7.4.
>>
>> On Fri, Aug 19, 2011 at 9:26 AM, Yan Chunlu  wrote:
>> > the log file shows as follows, not sure what does 'Couldn't find
>> > cfId=1000'
>> > means(google just returned useless results):
>> >
>> > [startup log snipped; identical to the log quoted earlier in this thread]

Re: nodetool repair caused high disk space usage

2011-08-20 Thread Philippe
>
> Do you have an indication that at least the disk space is in fact
> consistent with the amount of data being streamed between the nodes? I
> think you had 90 -> ~ 450 gig with RF=3, right? Still sounds like a
> lot assuming repairs are not running concurrently (and compactions are
> able to run after a repair before the next repair of a neighbor
> starts).
>
Hi Peter,
When a repair was running on the 40 GB keyspace, I'd usually see range repairs
for up to a couple thousand ranges for each CF. If a range covers only a
handful of keys, that's a very small amount of data being moved around.
However, at the time I hadn't noticed that there were multiple repairs
running concurrently on the same nodes and on their neighbors, so I suppose my
experience is invalid for pinning down a possible bug. But I suspect it will
help someone along the way, because they'll have multiple repairs going on
too, and I now have a much better understanding of what's going on myself.

I've reloaded all my data into the cluster now; the load is 140 GB on each node,
and I've been able to run a repair on each CF that comes out almost 100%
consistent. I'm now starting to run the daily repair crons again to see whether
things go out of whack or not.