Hello, I'm loading data from HDFS into Cassandra using Spotify's hdfs2cass. The setup is a 4-node cluster running Cassandra 2.1.6 with RF=2 and STCS; the raw data size is about 1 TB before loading and about 3.8 TB after. The process works fine, but I do have a few questions.
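For context, here is a minimal sketch of the kind of schema involved -- the keyspace/table names, columns, and replication class are placeholders for illustration only; the real settings are RF=2 with STCS, as stated above:

    -- Placeholder schema: names, columns, and the replication class are illustrative only;
    -- the actual cluster uses RF=2 and SizeTieredCompactionStrategy as described above.
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

    CREATE TABLE IF NOT EXISTS my_keyspace.my_table (
      id    text PRIMARY KEY,
      value blob
    ) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};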
1. Some Hadoop jobs fail due to streaming timeouts. That's fine in itself, because subsequent attempts succeed, but why do I get the timeouts in the first place? Would this be something network-related, or does Cassandra have a limit on how much streaming it can handle at once?

2. The server logs show errors like the one quoted below ("malformed input around byte N"):

ERROR [STREAM-IN-/10.84.30.209] 2015-07-06 11:30:10,915 StreamSession.java:499 - [Stream #e1e4f470-23fb-11e5-9c95-9b249a189cad] Streaming error occurred
java.io.UTFDataFormatException: malformed input around byte 10
        at java.io.DataInputStream.readUTF(DataInputStream.java:656) ~[na:1.7.0_67]
        at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_67]
        at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:143) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.FileMessageHeader$FileMessageHeaderSerializer.deserialize(FileMessageHeader.java:120) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:42) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]

Is this a familiar issue? I'd expect the data to be identical across streaming attempts, so unlike the timeouts, which I can at least theorize about, I have no explanation for these. Any thoughts on what might be causing them? Is this normal?

3. About compaction: there's a RESTful service in front of Cassandra, and its average response time is positively correlated with the number of pending compactions (it drops as the pending count drops). Is there a way to stream the data such that the number of compactions left once streaming is done is minimal?

4. Also about compaction: I understand that STCS is write-optimized and tends to keep the number of SSTables down, while LCS is read-optimized but may increase it (along with the compaction work needed to get there). The aforementioned service needs read-only access to Cassandra. Loading with LCS resulted in an order of magnitude more pending compactions and dramatically higher server load. Given that I want minimal response times as soon as possible after loading, what approach should I be taking? Right now I load with STCS, wait for the compactions to finish, and then consider switching to LCS (see the sketch in the P.S. below). Does that make sense? Any thoughts on improving this process? Ideally, is there anything close to a one-shot process where compaction is barely required?

I'll gladly provide additional information if needed. I'll also be happy to hear about others' experience with similar scenarios.

Thanks,
Elad
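P.S. The strategy switch I mention in question 4 would just be the standard ALTER TABLE, roughly as sketched below (keyspace and table names are placeholders), run only once nodetool compactionstats reports the pending count at or near zero:

    -- Placeholder names; switch the table from STCS to LCS after the bulk load settles.
    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'LeveledCompactionStrategy'};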