Re: node restart taking too long

Yan Chunlu Sat, 20 Aug 2011 20:10:10 -0700

That could be the reason. I ran nodetool repair (it never finished, and the
data size grew about six times, from 30G to 170G), so there are probably
some unclean sstables on that node.


However, upgrading is a tough job for me right now. Could nodetool scrub
help? Or should I decommission the node and join it again?
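
To see which column families hold the extra data left by the unfinished
repair, a minimal sketch like the one below could sum the sstable data files
per column family. It is not from the thread and is not a Cassandra tool. It
assumes the file naming seen in the quoted log
(e.g. LocationInfo-f-66-Data.db) and a keyspace directory under
/cassandra/data; the function names sstable_sizes and report are
hypothetical.

```python
import os
import re
from collections import defaultdict

# Matches data files named like "LocationInfo-f-66-Data.db"
# (naming assumed from the log lines quoted in this thread).
DATA_FILE = re.compile(r"^(?P<cf>.+)-[a-z]+-\d+-Data\.db$")


def sstable_sizes(keyspace_dir):
    """Return {column_family: (file_count, total_bytes)} for *-Data.db files."""
    totals = defaultdict(lambda: [0, 0])
    for name in os.listdir(keyspace_dir):
        match = DATA_FILE.match(name)
        if not match:
            continue  # skip index, filter, and any other files
        entry = totals[match.group("cf")]
        entry[0] += 1
        entry[1] += os.path.getsize(os.path.join(keyspace_dir, name))
    return {cf: (count, size) for cf, (count, size) in totals.items()}


def report(keyspace_dir):
    """Print column families, largest first."""
    sizes = sstable_sizes(keyspace_dir)
    for cf, (count, size) in sorted(sizes.items(), key=lambda kv: -kv[1][1]):
        print(f"{cf:30s} {count:5d} files {size / 2**30:8.2f} GB")
```

Calling report("/cassandra/data/system") on each node and comparing the
output would show which column families account for the difference between
the nodes.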


On Sun, Aug 21, 2011 at 5:56 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> This means you should upgrade, because we've fixed bugs about ignoring
> deleted CFs since 0.7.4.
>
> On Fri, Aug 19, 2011 at 9:26 AM, Yan Chunlu <springri...@gmail.com> wrote:
> > The log file shows the following. I am not sure what 'Couldn't find
> > cfId=1000' means (Google just returned useless results):
> >
> > INFO [main] 2011-08-18 07:23:17,688 DatabaseDescriptor.java (line 453)
> Found
> > table data in data directories. Consider using JMX to call
> > org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
> >  INFO [main] 2011-08-18 07:23:17,705 CommitLogSegment.java (line 50)
> > Creating new commitlog segment
> > /cassandra/commitlog/CommitLog-1313670197705.log
> >  INFO [main] 2011-08-18 07:23:17,716 CommitLog.java (line 155) Replaying
> > /cassandra/commitlog/CommitLog-1313670030512.log
> >  INFO [main] 2011-08-18 07:23:17,734 CommitLog.java (line 314) Finished
> > reading /cassandra/commitlog/CommitLog-1313670030512.log
> >  INFO [main] 2011-08-18 07:23:17,744 CommitLog.java (line 163) Log replay
> > complete
> >  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 364)
> > Cassandra version: 0.7.4
> >  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 365)
> Thrift
> > API version: 19.4.0
> >  INFO [main] 2011-08-18 07:23:17,756 StorageService.java (line 378)
> Loading
> > persisted ring state
> >  INFO [main] 2011-08-18 07:23:17,766 StorageService.java (line 414)
> Starting
> > up server gossip
> >  INFO [main] 2011-08-18 07:23:17,771 ColumnFamilyStore.java (line 1048)
> > Enqueuing flush of Memtable-LocationInfo@832310230(29 bytes, 1
> operations)
> >  INFO [FlushWriter:1] 2011-08-18 07:23:17,772 Memtable.java (line 157)
> > Writing Memtable-LocationInfo@832310230(29 bytes, 1 operations)
> >  INFO [FlushWriter:1] 2011-08-18 07:23:17,822 Memtable.java (line 164)
> > Completed flushing /cassandra/data/system/LocationInfo-f-66-Data.db (80
> > bytes)
> >  INFO [CompactionExecutor:1] 2011-08-18 07:23:17,823
> CompactionManager.java
> > (line 396) Compacting
> >
> [SSTableReader(path='/cassandra/data/system/LocationInfo-f-63-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-64-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-65-Data.db'),SSTableReader(path='/cassandra/data/system/LocationInfo-f-66-Data.db')]
> >  INFO [main] 2011-08-18 07:23:17,853 StorageService.java (line 478) Using
> > saved token 113427455640312821154458202477256070484
> >  INFO [main] 2011-08-18 07:23:17,854 ColumnFamilyStore.java (line 1048)
> > Enqueuing flush of Memtable-LocationInfo@18895884(53 bytes, 2
> operations)
> >  INFO [FlushWriter:1] 2011-08-18 07:23:17,854 Memtable.java (line 157)
> > Writing Memtable-LocationInfo@18895884(53 bytes, 2 operations)
> > ERROR [MutationStage:28] 2011-08-18 07:23:18,246
> RowMutationVerbHandler.java
> > (line 86) Error in row mutation
> > org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't
> find
> > cfId=1000
> >     at
> >
> org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
> >     at
> >
> org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
> >     at
> >
> org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:50)
> >     at
> >
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
> >     at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >     at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >     at java.lang.Thread.run(Thread.java:636)
> >  INFO [GossipStage:1] 2011-08-18 07:23:18,255 Gossiper.java (line 623)
> Node
> > /node1 has restarted, now UP again
> > ERROR [ReadStage:1] 2011-08-18 07:23:18,254
> > DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
> > java.lang.IllegalArgumentException: Unknown ColumnFamily prjcache in
> > keyspace prjkeyspace
> >     at
> >
> org.apache.cassandra.config.DatabaseDescriptor.getComparator(DatabaseDescriptor.java:966)
> >     at
> >
> org.apache.cassandra.db.ColumnFamily.getComparatorFor(ColumnFamily.java:388)
> >     at
> > org.apache.cassandra.db.ReadCommand.getComparator(ReadCommand.java:93)
> >     at
> >
> org.apache.cassandra.db.SliceByNamesReadCommand.<init>(SliceByNamesReadCommand.java:44)
> >     at
> >
> org.apache.cassandra.db.SliceByNamesReadCommandSerializer.deserialize(SliceByNamesReadCommand.java:110)
> >     at
> >
> org.apache.cassandra.db.ReadCommandSerializer.deserialize(ReadCommand.java:122)
> >     at
> > org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:67)
> >
> >
> > On Fri, Aug 19, 2011 at 5:44 AM, aaron morton <aa...@thelastpickle.com>
> > wrote:
> >>
> >> Look in the logs to find out why the migration did not get to node2.
> >> Otherwise, yes, you can drop those files.
> >> Cheers
> >> -----------------
> >> Aaron Morton
> >> Freelance Cassandra Developer
> >> @aaronmorton
> >> http://www.thelastpickle.com
> >> On 18/08/2011, at 11:25 PM, Yan Chunlu wrote:
> >>
> >> I just found out that a schema change made via cassandra-cli didn't
> >> reach node2, and node2 became unreachable.
> >> I followed this document:
> >> http://wiki.apache.org/cassandra/FAQ#schema_disagreement
> >> but after that I ended up with two schema versions:
> >>
> >>
> >> ddcada52-c96a-11e0-99af-3bd951658d61: [node1, node3]
> >> 2127b2ef-6998-11e0-b45b-3bd951658d61: [node2]
> >>
> >> Is it enough to delete the Schema* and Migrations* sstables and restart
> >> the node?
> >>
> >>
> >> On Thu, Aug 18, 2011 at 5:13 PM, Yan Chunlu <springri...@gmail.com>
> wrote:
> >>>
> >>> thanks a lot for  all the help!  I have gone through the steps and
> >>> successfully brought up the node2 :)
> >>>
> >>> On Thu, Aug 18, 2011 at 10:51 AM, Boris Yen <yulin...@gmail.com>
> wrote:
> >>> > Because the file only preserves the "key" of each record, not the
> >>> > whole record.
> >>> > Records for those saved keys are loaded into Cassandra during
> >>> > startup.
> >>> >
> >>> > On Wed, Aug 17, 2011 at 5:52 PM, Yan Chunlu <springri...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> But the files in saved_caches are relatively small:
> >>> >>
> >>> >> Would that cause the load problem?
> >>> >>
> >>> >>  ls  -lh  /cassandra/saved_caches/
> >>> >> total 32M
> >>> >> -rw-r--r-- 1 cass cass 2.9M 2011-08-12 19:53
> >>> >> cass-CommentSortsCache-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 2.9M 2011-08-17 04:29
> >>> >> cass-CommentSortsCache-RowCache
> >>> >> -rw-r--r-- 1 cass cass 2.7M 2011-08-12 18:50
> cass-CommentVote-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 140K 2011-08-12 19:53
> >>> >> cass-device_images-KeyCache
> >>> >> -rw-r--r-- 1 cass cass  33K 2011-08-12 18:51 cass-Hide-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 4.6M 2011-08-12 19:53 cass-images-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 2.6M 2011-08-12 19:53
> cass-LinksByUrl-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 2.5M 2011-08-12 18:50 cass-LinkVote-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 7.5M 2011-08-12 18:50 cass-cache-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 3.7M 2011-08-12 21:51 cass-cache-RowCache
> >>> >> -rw-r--r-- 1 cass cass 1.8M 2011-08-12 18:51 cass-Save-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 111K 2011-08-12 19:50
> >>> >> cass-SavesByAccount-KeyCache
> >>> >> -rw-r--r-- 1 cass cass  864 2011-08-12 19:49
> cass-VotesByDay-KeyCache
> >>> >> -rw-r--r-- 1 cass cass 249K 2011-08-12 19:49
> cass-VotesByLink-KeyCache
> >>> >> -rw-r--r-- 1 cass cass   28 2011-08-14 12:50
> >>> >> system-HintsColumnFamily-KeyCache
> >>> >> -rw-r--r-- 1 cass cass    5 2011-08-14 12:50
> >>> >> system-LocationInfo-KeyCache
> >>> >> -rw-r--r-- 1 cass cass   54 2011-08-13 13:30
> >>> >> system-Migrations-KeyCache
> >>> >> -rw-r--r-- 1 cass cass   76 2011-08-13 13:30 system-Schema-KeyCache
> >>> >>
> >>> >> On Wed, Aug 17, 2011 at 4:31 PM, aaron morton
> >>> >> <aa...@thelastpickle.com>
> >>> >> wrote:
> >>> >> > If you have a node that cannot start up due to issues loading the
> >>> >> > saved
> >>> >> > cache delete the files in the saved_cache directory before
> starting
> >>> >> > it.
> >>> >> >
> >>> >> > The settings to save the row and key cache are per CF. You can
> >>> >> > change
> >>> >> > them with an update column family statement via the CLI when
> >>> >> > attached to any
> >>> >> > node. You may then want to check the saved_caches directory and
> >>> >> > delete any
> >>> >> > files that are left (not sure if they are automatically deleted).
> >>> >> >
> >>> >> > i would recommend:
> >>> >> > - stop node 2
> >>> >> > - delete its saved_cache
> >>> >> > - make the schema change via another node
> >>> >> > - startup node 2
> >>> >> >
> >>> >> > Cheers
> >>> >> >
> >>> >> > -----------------
> >>> >> > Aaron Morton
> >>> >> > Freelance Cassandra Developer
> >>> >> > @aaronmorton
> >>> >> > http://www.thelastpickle.com
> >>> >> >
> >>> >> > On 17/08/2011, at 2:59 PM, Yan Chunlu wrote:
> >>> >> >
> >>> >> >> Does this need to be cluster-wide, or could I just modify the
> >>> >> >> caches on one node? I cannot connect to the node with
> >>> >> >> cassandra-cli; it says "connection refused":
> >>> >> >>
> >>> >> >>
> >>> >> >> [default@unknown] connect node2/9160;
> >>> >> >> Exception connecting to node2/9160. Reason: Connection refused.
> >>> >> >>
> >>> >> >>
> >>> >> >> So if I change the cache size via other nodes, how would node2 be
> >>> >> >> notified of the change? Would killing Cassandra and starting it
> >>> >> >> again make it update the schema?
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Wed, Aug 17, 2011 at 5:59 AM, Teijo Holzer
> >>> >> >> <thol...@wetafx.co.nz>
> >>> >> >> wrote:
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> yes, we saw exactly the same messages. We got rid of these by
> >>> >> >>> doing
> >>> >> >>> the
> >>> >> >>> following:
> >>> >> >>>
> >>> >> >>> * Set all row & key caches in your CFs to 0 via cassandra-cli
> >>> >> >>> * Kill Cassandra
> >>> >> >>> * Remove all files in the saved_caches directory
> >>> >> >>> * Start Cassandra
> >>> >> >>> * Slowly bring back row & key caches (if desired, we left them
> >>> >> >>> off)
> >>> >> >>>
> >>> >> >>> Cheers,
> >>> >> >>>
> >>> >> >>>        T.
> >>> >> >>>
> >>> >> >>> On 16/08/11 23:35, Yan Chunlu wrote:
> >>> >> >>>>
> >>> >> >>>>  I saw a lot of SliceQueryFilter messages after changing the log
> >>> >> >>>> level to DEBUG. It seems that even bringing up a new node would
> >>> >> >>>> be faster than starting the old one. It is weird.
> >>> >> >>>>
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:49,213 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:225@1313068845474382
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:49,245 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:453@1310999270198313
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:49,251 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:26@1313199902088827
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:49,576 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:157@1313097239332314
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,674 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> >>> >> >>>> 76616c7565:false:41729@1313190821826229
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,811 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:6@1313174157301203
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,867 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:98@1312011362250907
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,881 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:42@1313201711997005
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,910 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:96@1312939986190155
> >>> >> >>>> DEBUG [main] 2011-08-16 06:32:50,954 SliceQueryFilter.java
> (line
> >>> >> >>>> 123)
> >>> >> >>>> collecting 0 of 2147483647:
> 76616c7565:false:621@1313192538616112
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> On Tue, Aug 16, 2011 at 7:32 PM, Yan Chunlu
> >>> >> >>>> <springri...@gmail.com
> >>> >> >>>> <mailto:springri...@gmail.com>> wrote:
> >>> >> >>>>
> >>> >> >>>>    But it seems the row cache setting is cluster-wide. How will
> >>> >> >>>>    changing the row cache affect read speed?
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>    On Mon, Aug 15, 2011 at 7:33 AM, Jonathan Ellis
> >>> >> >>>> <jbel...@gmail.com
> >>> >> >>>>    <mailto:jbel...@gmail.com>> wrote:
> >>> >> >>>>
> >>> >> >>>>        Or leave row cache enabled but disable cache saving (and
> >>> >> >>>> remove the
> >>> >> >>>>        one already on disk).
> >>> >> >>>>
> >>> >> >>>>        On Sun, Aug 14, 2011 at 5:05 PM, aaron morton
> >>> >> >>>> <aa...@thelastpickle.com
> >>> >> >>>>        <mailto:aa...@thelastpickle.com>> wrote:
> >>> >> >>>>         >  INFO [main] 2011-08-14 09:24:52,198
> >>> >> >>>> ColumnFamilyStore.java
> >>> >> >>>> (line 547)
> >>> >> >>>>         > completed loading (1744370 ms; 200000 keys) row cache
> >>> >> >>>> for
> >>> >> >>>> COMMENT
> >>> >> >>>>         >
> >>> >> >>>>         > It's taking 29 minutes to load 200,000 rows into the row
> >>> >> >>>>         > cache. That's a pretty big row cache; I would suggest
> >>> >> >>>>         > reducing or disabling it. Background:
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra
> >>> >> >>>>         >
> >>> >> >>>>         > and server can not afford the load then crashed.
> after
> >>> >> >>>> come
> >>> >> >>>> back,
> >>> >> >>>>        node 3 can
> >>> >> >>>>         > not return for more than 96 hours
> >>> >> >>>>         >
> >>> >> >>>>         > Crashed how ?
> >>> >> >>>>         > You may be seeing
> >>> >> >>>> https://issues.apache.org/jira/browse/CASSANDRA-2280
> >>> >> >>>>         > Watch nodetool compactionstats to see when the Merkle
> >>> >> >>>> tree
> >>> >> >>>> build
> >>> >> >>>>        finishes
> >>> >> >>>>         > and nodetool netstats to see which CF's are
> streaming.
> >>> >> >>>>         > Cheers
> >>> >> >>>>         > -----------------
> >>> >> >>>>         > Aaron Morton
> >>> >> >>>>         > Freelance Cassandra Developer
> >>> >> >>>>         > @aaronmorton
> >>> >> >>>>         > http://www.thelastpickle.com
> >>> >> >>>>         > On 15 Aug 2011, at 04:23, Yan Chunlu wrote:
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         > I have 3 nodes with RF=3. While I was repairing node3, a
> >>> >> >>>>         > lot of data seemed to be generated, and the server could
> >>> >> >>>>         > not handle the load and crashed. After restarting, node 3
> >>> >> >>>>         > has not come back for more than 96 hours.
> >>> >> >>>>         >
> >>> >> >>>>         > With 34GB of data, node 2 could restart and be back online
> >>> >> >>>>         > within 1 hour.
> >>> >> >>>>         >
> >>> >> >>>>         > I am not sure what's wrong with node3. Should I restart
> >>> >> >>>>         > node 3 again?
> >>> >> >>>>         > thanks!
> >>> >> >>>>         >
> >>> >> >>>>         > Address   Status State   Load        Owns    Token
> >>> >> >>>>         >                                               113427455640312821154458202477256070484
> >>> >> >>>>         > node1     Up     Normal  34.11 GB    33.33%  0
> >>> >> >>>>         > node2     Up     Normal  31.44 GB    33.33%  56713727820156410577229101238628035242
> >>> >> >>>>         > node3     Down   Normal  177.55 GB   33.33%  113427455640312821154458202477256070484
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         > the log shows it is still going on, not sure why it is so
> >>> >> >>>>         > slow:
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         >  INFO [main] 2011-08-14 08:55:47,734
> SSTableReader.java
> >>> >> >>>> (line
> >>> >> >>>> 154)
> >>> >> >>>>        Opening
> >>> >> >>>>         > /cassandra/data/COMMENT
> >>> >> >>>>         >  INFO [main] 2011-08-14 08:55:47,828
> >>> >> >>>> ColumnFamilyStore.java
> >>> >> >>>> (line 275)
> >>> >> >>>>         > reading saved cache
> >>> >> >>>> /cassandra/saved_caches/COMMENT-RowCache
> >>> >> >>>>         >  INFO [main] 2011-08-14 09:24:52,198
> >>> >> >>>> ColumnFamilyStore.java
> >>> >> >>>> (line 547)
> >>> >> >>>>         > completed loading (1744370 ms; 200000 keys) row cache
> >>> >> >>>> for
> >>> >> >>>> COMMENT
> >>> >> >>>>         >  INFO [main] 2011-08-14 09:24:52,299
> >>> >> >>>> ColumnFamilyStore.java
> >>> >> >>>> (line 275)
> >>> >> >>>>         > reading saved cache
> >>> >> >>>> /cassandra/saved_caches/COMMENT-RowCache
> >>> >> >>>>         >  INFO [CompactionExecutor:1] 2011-08-14 10:24:55,480
> >>> >> >>>>        CacheWriter.java (line
> >>> >> >>>>         > 96) Saved COMMENT-RowCache (200000 items) in 2535 ms
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>         >
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>        --
> >>> >> >>>>        Jonathan Ellis
> >>> >> >>>>        Project Chair, Apache Cassandra
> >>> >> >>>>        co-founder of DataStax, the source for professional
> >>> >> >>>> Cassandra
> >>> >> >>>> support
> >>> >> >>>>        http://www.datastax.com
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >
> >>> >> >
> >>> >
> >>> >
> >>>
> >>
> >>
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
