Hello Spark-users,
  I modified the org.apache.spark.streaming.examples.JavaNetworkWordCount that 
uses a Netcat server to instead read data from a SocketServer implementation. 
The SocketServer Java program accepts connections on port 9999; simulates 1 
million records and streams it to the socket client (in this case, the Spark 
Streaming driver program). The records simulated are String of format: 
[msisdn|ip_addr|start_time|end_time]

The Spark Streaming driver program connects to this SocketServer program on 
port 9999 and uses a batch interval of 3s to process this
data. I am able to run this program locally (i.e. using local[2]). The Spark 
Streaming program is able to establish a socket
connection with the SocketServer Java program (running on the same node i.e. 
localhost), all 1 million records are received via the
socket stream and processed in batch of 3s.

However, I am unable to get this to work when submitted to a Spark Cluster of 2 
nodes. The Spark job is submitted to the cluster successfully, the driver 
reports that socket connection is established with the SocketServer. The 
program goes inside the Spark streaming loop and keeps looking for data in the 
stream; however, no data is received on the stream. On the SocketServer, no 
client connection seems to have been established and it keeps blocking on the 
accept() call. Therefore, it doesn't get to simulate the 1 million records at 
all.

Spark Streaming driver program code snippet:

JavaDStream<String> lines = ssc.socketTextStream(config.hostname, 
Integer.parseInt(config.port));
System.out.println("Socket connection to read raw data established with server: 
" + config.hostname + ":" + config.port);

The above s.o.p is printed.

SocketServer Java program code snippet:

while (true) {
    Socket clientSock = serverSocket.accept();
    System.out.println("Connection accepted from Spark Streaming job: " + 
clientSock);

This s.o.p. is never printed when running the Spark Job on a cluster. The same 
works if cluster is: local[2]. I am stuck on how to debug this; any advice on 
why socketTextStream() doesn't work when using cluster?

Regards,
Vikram

Reply via email to