Hello, I'm loading data from HDFS into Cassandra using Spotify's hdfs2cass. The setup is a 4-node cluster running Cassandra 2.1.6 with RF=2 and STCS; the raw data size is about 1 TB before loading and about 3.8 TB after. The process works fine, but I do have a few questions.
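For context, here is a minimal sketch of the kind of schema involved -- the keyspace/table names, columns, and replication class are placeholders for illustration only; the real settings are RF=2 with STCS, as stated above:

    -- Placeholder schema: names, columns, and the replication class are illustrative only;
    -- the actual cluster uses RF=2 and SizeTieredCompactionStrategy as described above.
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

    CREATE TABLE IF NOT EXISTS my_keyspace.my_table (
      id    text PRIMARY KEY,
      value blob
    ) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};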
1. Some Hadoop jobs fail due to streaming timeouts. That's fine in itself, because subsequent attempts succeed, but why do I get the timeouts in the first place? Would this be something network-related, or does Cassandra have a limit on how much streaming it can handle at once?

2. The server logs show errors like the one quoted below ("malformed input around byte N"):

ERROR [STREAM-IN-/10.84.30.209] 2015-07-06 11:30:10,915 StreamSession.java:499 - [Stream #e1e4f470-23fb-11e5-9c95-9b249a189cad] Streaming error occurred
java.io.UTFDataFormatException: malformed input around byte 10
        at java.io.DataInputStream.readUTF(DataInputStream.java:656) ~[na:1.7.0_67]
        at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_67]
        at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:143) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:120) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:42) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

Is this a familiar issue? I'd expect the data to be identical across streaming attempts, so unlike the timeouts, which I can at least theorize about, I have no explanation for these. Any thoughts on what might be causing them? Is this normal?

3. About compaction: there's a RESTful service in front of Cassandra, and its average response time is positively correlated with the number of pending compactions (it drops as the pending count drops). Is there a way to stream the data such that the number of compactions left once streaming is done is minimal?

4. Also about compaction: I understand that STCS is write-optimized and tends to keep the number of SSTables down, while LCS is read-optimized but may increase it (along with the compaction work needed to get there). The aforementioned service needs read-only access to Cassandra. Loading with LCS resulted in an order of magnitude more pending compactions and dramatically higher server load. Given that I want minimal response times as soon as possible after loading, what approach should I be taking? Right now I load with STCS, wait for the compactions to finish, and then consider switching to LCS (see the sketch in the P.S. below). Does that make sense? Any thoughts on improving this process? Ideally, is there anything close to a one-shot process where compaction is barely required?

I'll gladly provide additional information if needed. I'll also be happy to hear about others' experience with similar scenarios.

Thanks,
Elad
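P.S. The strategy switch I mention in question 4 would just be the standard ALTER TABLE, roughly as sketched below (keyspace and table names are placeholders), run only once nodetool compactionstats reports the pending count at or near zero:

    -- Placeholder names; switch the table from STCS to LCS after the bulk load settles.
    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'LeveledCompactionStrategy'};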