I can't seem to recover a failed node on a database where I made many
updates to the schema.
I have a small cluster with 2 nodes, around 1000 CFs (I know that's a
lot, but it can't be changed right now), and ReplicationFactor=2.
I shut down a node and cleaned its data entirely, then tried to bring it
back up. The node starts fetching schema updates from the live node, but
the operation fails halfway with an OOME.
After some investigation, here is what I found:
- I have a lot of schema updates (there are 2067 rows in the
system.Schema CF).
- The live node loads migrations 1-1000 and sends them to the
recovering node (via Migration.getLocalMigrations()).
- Soon afterwards, the live node checks the schema version on the
recovering node and finds it has advanced only slightly - say it has
applied the first 3 migrations. It then loads migrations 3-1003 and
sends them to the node.
- This process repeats very quickly (it sends migrations 6-1006, then
9-1009, and so on), as sketched below.
From the memory dump and the logs, it looks like each of these
1000-migration blocks is built into a single message and put on the
OutboundTcpConnection queue. However, since the schema is big, these
messages occupy a lot of space and are built faster than the connection
can send them, so they accumulate in OutboundTcpConnection.queue until
memory is completely exhausted.
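As a self-contained toy (again, not Cassandra code - the message size
and send rate are invented), the failure mode looks like this: a
producer filling an unbounded queue faster than a single sender thread
drains it, so retained heap grows until an OOME:

import java.util.concurrent.ConcurrentLinkedQueue;

public class UnboundedQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();

        // "Connection" thread: pretends to write one message every 10 ms.
        Thread sender = new Thread(() -> {
            while (true) {
                queue.poll();
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        sender.setDaemon(true);
        sender.start();

        // Producer: enqueues 1 MB "migration" messages much faster than they go out.
        long produced = 0;
        while (true) {
            queue.add(new byte[1024 * 1024]);
            produced++;
            if (produced % 100 == 0) {
                System.out.printf("produced=%d, still queued=%d (~%d MB retained)%n",
                        produced, queue.size(), queue.size());
            }
            Thread.sleep(1); // still ~10x the rate at which the sender drains
        }
    }
}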
Any suggestions? Can I change something to make this work, apart from
reducing the number of CFs?
Flavio