This has happened before the save caches files were not compatible between 0.6 and 0.7. I have ran into this a couple other times before. The good news is the save key cache is just an optimization, you can blow it away and it is not usually a big deal.
On Fri, Apr 5, 2013 at 2:55 PM, Arya Goudarzi <gouda...@gmail.com> wrote: > Here is a chunk of bloom filter sstable skip messages from the node I > enabled DEBUG on: > > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39459 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39483 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39332 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39335 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39438 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39478 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39456 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39469 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,451 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39334 > DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,451 SSTableReader.java (line > 737) Bloom filter allows skipping sstable 39406 > > This is the last chunk of log before C* gets stuck, right before I stop > the process, remove key caches and start again (This is from another node > that I upgraded 2 days ago): > > INFO [SSTableBatchOpen:2] 2013-04-03 01:59:39,769 SSTableReader.java (line > 166) Opening > /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-316499 > (5273270 bytes) > INFO [SSTableBatchOpen:1] 2013-04-03 01:59:39,858 SSTableReader.java > (line 166) Opening > /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-314755 > (5264359 bytes) > INFO [SSTableBatchOpen:2] 2013-04-03 01:59:39,894 SSTableReader.java > (line 166) Opening > /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-314762 > (5260887 bytes) > INFO [SSTableBatchOpen:1] 2013-04-03 01:59:39,980 SSTableReader.java > (line 166) Opening > /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-315886 > (5262864 bytes) > INFO [OptionalTasks:1] 2013-04-03 01:59:40,298 AutoSavingCache.java (line > 112) reading saved cache > /var/lib/cassandra/saved_caches/cardspring_production-UniqueIndexes-KeyCache > > > I finally upgrade all 12 nodes in our test environment yesterday. This > issue seemed to exists on 7 nodes out of 12 nodes. They didn't alway get > stuck on the same CF loading its saved KeyCache. > > > On Fri, Apr 5, 2013 at 9:56 AM, aaron morton <aa...@thelastpickle.com>wrote: > >> skipping sstable due to bloom filter debug messages >> >> What were these messages? >> >> Do you have the logs from the start up ? >> >> Cheers >> >> ----------------- >> Aaron Morton >> Freelance Cassandra Consultant >> New Zealand >> >> @aaronmorton >> http://www.thelastpickle.com >> >> On 4/04/2013, at 6:11 AM, Arya Goudarzi <gouda...@gmail.com> wrote: >> >> Hi, >> >> I have upgraded 2 nodes out of a 12 mode test cluster from 1.1.10 to >> 1.2.3. During startup while tailing C*'s system.log, I observed a series of >> SSTable batch load messages and skipping sstable due to bloom filter debug >> messages which is normal for startup, but when it reached loading saved key >> caches, it gets stuck forever. The I/O wait stays high in the CPU graph and >> I/O ops are sent to disk, but C* never passes that step of loading the key >> cache file successfully. The saved key cache file was about 75MB on one >> node and 125MB on the other node and they were for different CFs. >> >> <image.jpeg> >> >> The CPU I/O wait constantly stayed at 40%~ while system.log was stuck at >> loading one saved key cache file. I have marked that on the graph above. >> The workaround was to delete the saved cache files and things loaded fine >> (See marked Normal Startup). >> >> These machines are m1.xlarge EC2 instances. And this issue happened on >> both nodes upgraded. This did not happen during exercise of upgrade from >> 1.1.6 to 1.2.2 using the same snapshot. >> >> Should I raise a JIRA? >> >> -Arya >> >> >> >