Thanks to driftx from cassandra IRC channel for helping out. This was resolved by increasing the rpc timeout for the bootstrap process.
On Wed, Sep 15, 2010 at 11:43 AM, Gurpreet Singh <gurpreet.si...@gmail.com>wrote: > This problem still stays unresolved despite numerous restarts to the > cluster. I cant seem to find a way out of this one, and I am not really > looking for a workaround, kinda need this to work if i need to go to > production. > > Turned on the ALL logging in log4j, and now I see the following exception > (EOFException) on the destination. After receiving each file, it seems to be > throwing this exception. The transfer is successful except for this > exception. The source successful declares the transfer complete. But the > destination does not move out of the bootstrapping mode, and just sits > there. > > DEBUG [Thread-15] 2010-09-15 10:56:59,767 IncomingStreamReader.java (line > 65) Receiving stream: finished reading chunk, awaiting more > DEBUG [Thread-15] 2010-09-15 10:56:59,767 IncomingStreamReader.java (line > 87) Removing stream context > /data/cassandra/datadir/cassandradb/userdata/user_list_items-tmp-1-Index.db:522051369 > DEBUG [Thread-15] 2010-09-15 10:56:59,767 StreamCompletionHandler.java > (line 73) Sending a streaming finished message with > org.apache.cassandra.streaming.completedfilesta...@54828e7 to IP1 > TRACE [Thread-15] 2010-09-15 10:56:59,769 IncomingTcpConnection.java (line > 82) eof reading from socket; closing > java.io.EOFException > at java.io.DataInputStream.readInt(Unknown Source) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:59) > DEBUG [Thread-16] 2010-09-15 10:56:59,812 IncomingStreamReader.java (line > 51) Receiving stream > DEBUG [Thread-16] 2010-09-15 10:56:59,812 IncomingStreamReader.java (line > 54) Creating file for > /data/cassandra/datadir/cassandradb/userdata/user_list_items-tmp-1-Filter.db > DEBUG [Thread-16] 2010-09-15 10:56:59,876 IncomingStreamReader.java (line > 65) Receiving stream: finished reading chunk, awaiting more > DEBUG [Thread-16] 2010-09-15 10:56:59,876 IncomingStreamReader.java (line > 87) Removing stream context > /data/cassandra/datadir/cassandradb/userdata/user_list_items-tmp-1-Filter.db:7489045 > DEBUG [Thread-16] 2010-09-15 10:56:59,876 StreamCompletionHandler.java > (line 73) Sending a streaming finished message with > org.apache.cassandra.streaming.completedfilesta...@7b41a32f to IP1 > TRACE [Thread-16] 2010-09-15 10:56:59,876 IncomingTcpConnection.java (line > 82) eof reading from socket; closing > java.io.EOFException > at java.io.DataInputStream.readInt(Unknown Source) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:59) > > /G > > On Tue, Sep 14, 2010 at 11:40 AM, Gurpreet Singh <gurpreet.si...@gmail.com > > wrote: > >> Hi Vineet, >> I have tracked the nodetool streams to completion each time. Below are the >> logs on the source and destination node. There are 3 sstables being >> transferred, and the transfer seems to be successful. However, after the >> streams finish, the source prints out messages about the dropped messages, >> which may point to the problem. ideas? I checked port 7000 is open for >> communication. 9160 is not up on the node being bootstrapped, but that comes >> up after the node is bootstrapped, is that right? >> >> Thanks a ton, >> /G >> >> *Logs on the source node (IP2):* >> * >> * >> INFO [STREAM-STAGE:1] 2010-09-14 09:54:07,900 StreamOut.java (line 79) >> Flushing memtables for userdata... >> INFO [STREAM-STAGE:1] 2010-09-14 09:54:07,900 StreamOut.java (line 95) >> Performing anticompaction ... >> INFO [COMPACTION-POOL:1] 2010-09-14 09:54:07,900 CompactionManager.java >> (line 339) AntiCompacting >> [org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user_list_items-5823-Data.db')] >> INFO [GC inspection] 2010-09-14 09:56:54,712 GCInspector.java (line 129) >> GC for ParNew: 212 ms, 29033016 reclaimed leaving 579419360 used; max is >> 4415946752 >> INFO [COMPACTION-POOL:1] 2010-09-14 10:18:06,508 CompactionManager.java >> (line 396) AntiCompacted to >> /data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Data.db. >> 49074138589/36770836242 bytes for 5990912 keys. Time: 1438607ms. >> INFO [COMPACTION-POOL:1] 2010-09-14 10:18:06,528 CompactionManager.java >> (line 339) AntiCompacting >> [org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user-22-Data.db')] >> INFO [COMPACTION-POOL:1] 2010-09-14 10:18:08,839 CompactionManager.java >> (line 396) AntiCompacted to >> /data/mysql/cassandrastorage/userdata/stream/user-24-Data.db. >> 28185244/21126422 bytes for 47722 keys. Time: 2310ms. >> INFO [COMPACTION-POOL:1] 2010-09-14 10:18:08,840 CompactionManager.java >> (line 339) AntiCompacting >> [org.apache.cassandra.io.SSTableReader(path='/data/cassandra/datadir/cassandradb/userdata/user_lists-502-Data.db')] >> INFO [COMPACTION-POOL:1] 2010-09-14 10:21:08,606 CompactionManager.java >> (line 396) AntiCompacted to >> /data/mysql/cassandrastorage/userdata/stream/user_lists-504-Data.db. >> 2927724285/2195768325 bytes for 3976118 keys. Time: 179766ms. >> INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,607 StreamOut.java (line 127) >> Stream context metadata >> /data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Index.db:522051369, >> 3 >> sstables./data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Filter.db:7489045, >> 3 >> sstables./data/cassandra/datadir/cassandradb/userdata/stream/user_list_items-5825-Data.db:36770836242, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Index.db:3373143, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Filter.db:59965, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user-24-Data.db:21126422, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Index.db:282956452, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Filter.db:4970125, >> 3 >> sstables./data/mysql/cassandrastorage/userdata/stream/user_lists-504-Data.db:2195768325 >> INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,608 StreamOut.java (line 132) >> Sending a stream initiate message to IP1... >> INFO [STREAM-STAGE:1] 2010-09-14 10:21:08,608 StreamOut.java (line 137) >> Waiting for transfer to IP1 to complete >> *WARN [DroppedMessagesLogger] 2010-09-14 10:28:00,592 >> MessagingService.java (line 501) Dropped 9 messages in the last 1000ms* >> * *INFO [STREAM-STAGE:1] 2010-09-14 10:28:00,605 StreamOut.java (line >> 141) Done with transfer to IP1 >> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:00,670 >> SSTableDeletingReference.java (line 104) Deleted >> /data/cassandra/datadir/cassandradb/system/LocationInfo-17-Data.db >> *WARN [DroppedMessagesLogger] 2010-09-14 10:28:01,602 >> MessagingService.java (line 501) Dropped 1 messages in the last 1000ms* >> * *INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,133 >> SSTableDeletingReference.java (line 104) Deleted >> /data/mysql/cassandrastorage/system/LocationInfo-19-Data.db >> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,134 >> SSTableDeletingReference.java (line 104) Deleted >> /data/cassandra/datadir/cassandradb/system/LocationInfo-18-Data.db >> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-14 10:28:06,134 >> SSTableDeletingReference.java (line 104) Deleted >> /data/cassandra/datadir/cassandradb/system/LocationInfo-20-Data.db >> >> >> >> *Logs on new node being bootstrapped (IP1):* >> >> INFO [main] 2010-09-14 09:53:37,788 BootStrapper.java (line 104) New token >> will be 149298847080838649048722691811739653739 to assume load from IP2 >> INFO [main] 2010-09-14 09:53:37,789 StorageService.java (line 388) >> Joining: sleeping 30000 ms for pending range setup >> INFO [main] 2010-09-14 09:54:07,792 StorageService.java (line 388) >> Bootstrapping >> INFO [Thread-17] 2010-09-14 10:26:40,699 SSTableReader.java (line 120) >> Sampling index for >> /data/mysql/cassandrastorage/userdata/user_list_items-2-Data.db >> INFO [Thread-17] 2010-09-14 10:26:51,453 StreamCompletionHandler.java >> (line 64) Streaming added >> /data/mysql/cassandrastorage/userdata/user_list_items-2-Data.db >> INFO [Thread-23] 2010-09-14 10:26:52,219 SSTableReader.java (line 120) >> Sampling index for >> /data/cassandra/datadir/cassandradb/userdata/user-2-Data.db >> INFO [Thread-23] 2010-09-14 10:26:52,311 StreamCompletionHandler.java >> (line 64) Streaming added >> /data/cassandra/datadir/cassandradb/userdata/user-2-Data.db >> INFO [Timer-0] 2010-09-14 10:27:13,676 Gossiper.java (line 180) >> InetAddress IP2 is now dead. >> INFO [Thread-26] 2010-09-14 10:27:18,858 SSTableReader.java (line 120) >> Sampling index for >> /data/cassandra/datadir/cassandradb/userdata/user_lists-2-Data.db >> INFO [Thread-26] 2010-09-14 10:27:24,645 StreamCompletionHandler.java >> (line 64) Streaming added >> /data/cassandra/datadir/cassandradb/userdata/user_lists-2-Data.db >> INFO [HINTED-HANDOFF-POOL:1] 2010-09-14 10:28:00,605 >> HintedHandOffManager.java (line 165) Started hinted handoff for endPoint IP2 >> INFO [GMFD:1] 2010-09-14 10:28:00,605 Gossiper.java (line 578) >> InetAddress IP2 is now UP >> INFO [HINTED-HANDOFF-POOL:1] 2010-09-14 10:28:00,606 >> HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to >> endpoint IP2 >> INFO [FLUSH-TIMER] 2010-09-14 10:57:52,381 ColumnFamilyStore.java (line >> 357) LocationInfo has reached its threshold; switching in a fresh Memtable >> at >> CommitLogContext(file='/data/cassandra/datadir/commitlog/CommitLog-1284483127473.log', >> position=1353) >> INFO [FLUSH-TIMER] 2010-09-14 10:57:52,382 ColumnFamilyStore.java (line >> 609) Enqueuing flush of memtable-locationi...@2057079871(210 bytes, 6 >> operations) >> INFO [FLUSH-WRITER-POOL:1] 2010-09-14 10:57:52,382 Memtable.java (line >> 148) Writing memtable-locationi...@2057079871(210 bytes, 6 operations) >> INFO [FLUSH-WRITER-POOL:1] 2010-09-14 10:57:52,464 Memtable.java (line >> 162) Completed flushing >> /data/cassandra/datadir/cassandradb/system/LocationInfo-10-Data.db >> >> >> On Tue, Sep 14, 2010 at 10:52 AM, vineet daniel >> <vineetdan...@gmail.com>wrote: >> >>> Hi Gurpreet >>> >>> What is the output of nodetool -h <hostname/IP> streams -->( to see >>> what is going on between the nodes) . If you dont see anything happening >>> try switching off firewall or iptables. >>> >>> >>> Regards >>> Vineet Daniel >>> Cell : +918106217121 >>> Websites : >>> Blog <http://vinetedaniel.blogspot.com> | >>> Linkedin<http://in.linkedin.com/in/vineetdaniel> >>> | Twitter <https://twitter.com/vineetdaniel> >>> >>> >>> >>> >>> >>> On Tue, Sep 14, 2010 at 11:11 PM, Gurpreet Singh < >>> gurpreet.si...@gmail.com> wrote: >>> >>>> I tried this again, it happenned yet again. >>>> This time while the transfer messages seemed tobe in order, i also >>>> noticed that the source logs talk about having 9 dropped messages in the >>>> last 1000 ms. The only activity on the whole cluster is this bootstrapping, >>>> there is no read/write traffic going on. >>>> >>>> /G >>>> >>>> On Tue, Sep 14, 2010 at 10:05 AM, Gurpreet Singh < >>>> gurpreet.si...@gmail.com> wrote: >>>> >>>>> I am using cassandra 0.6.5. >>>>> >>>>> >>>>> On Tue, Sep 14, 2010 at 9:16 AM, Gurpreet Singh < >>>>> gurpreet.si...@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> I have a cassandra cluster of 4 machines, and I am trying to bootstrap >>>>>> 2 more machines, one at a time. >>>>>> For both these machines, the bootstrapping stays stuck after the >>>>>> streaming is done. >>>>>> >>>>>> When the nodes come up for bootstrapping, I see all the relevant >>>>>> messages about getting a new token, assuming load from a particular >>>>>> host. I >>>>>> see a couple of nodes anticompacting data to send, and at a later point >>>>>> the >>>>>> node that is bootstrapping prints the right streaming mesgs. However, >>>>>> once >>>>>> the streaming is over, the node just doesnt do anything. Previously while >>>>>> bootstrapping, I have seen that after the streaming is done, the node >>>>>> restarts and becomes part of the ring by itself. I dont see this >>>>>> behaviour >>>>>> with both the nodes I tried today. >>>>>> I even restarted all the nodes in the cluster, and tried bootstrapping >>>>>> one of the nodes again, but it again was stuck after streaming. It seems >>>>>> to >>>>>> have copied the relevant load as well. >>>>>> Any ideas as to what could be going on here? >>>>>> >>>>>> /G >>>>>> >>>>> >>>>> >>>> >>> >> >