Hi
I just started a new Spark Streaming project. In this phase of the system,
all we want to do is save the data we receive to HDFS. After running for
a couple of days, it looks like I am missing a lot of data. I wonder if
saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I captured
in the previous window? I noticed that after running for a couple of days my
HDFS file system has 25 files. The names are something like "part-00006". I
used 'hadoop fs -dus' to check the total data captured. While the system was
running I would periodically call 'hadoop fs -dus', and I was surprised that
the total number of bytes sometimes actually dropped.
Is there a better way to write my data to disk?
Any suggestions would be appreciated.
Andy
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName(appName);
    JavaSparkContext jsc = new JavaSparkContext(conf);
    // 5-second batch interval
    JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(5 * 1000));

    [ deleted code ]

    data.foreachRDD(new Function<JavaRDD<String>, Void>() {
        private static final long serialVersionUID = -7957854392903581284L;

        @Override
        public Void call(JavaRDD<String> jsonStr) throws Exception {
            // /rawSteamingData is a directory; every batch is saved to this same path
            jsonStr.saveAsTextFile("hdfs:///rawSteamingData");
            return null;
        }
    });

    ssc.checkpoint(checkPointUri);
    ssc.start();
    ssc.awaitTermination();
}
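
If the overwrite theory is right, one option might be to give every batch its own output directory instead of reusing /rawSteamingData for all of them. Below is a minimal sketch of only the path-building part. The class name BatchPaths, the helper batchOutputDir, and the "batch-" prefix are hypothetical names made up for illustration. The wiring shown in the comment assumes a Spark 1.x Java API, where foreachRDD also accepts a Function2 that receives the batch Time, and it has not been run against a cluster.

```java
// Hypothetical helper: builds one output directory per streaming batch.
class BatchPaths {

    // baseDir is the parent directory (for example "hdfs:///rawSteamingData");
    // batchTimeMs is the batch time in milliseconds since the epoch.
    static String batchOutputDir(String baseDir, long batchTimeMs) {
        return baseDir + "/batch-" + batchTimeMs;
    }

    public static void main(String[] args) {
        // Assumed wiring inside the streaming job (Spark 1.x Java API, untested):
        //
        //   data.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        //       @Override
        //       public Void call(JavaRDD<String> jsonStr, Time time) throws Exception {
        //           jsonStr.saveAsTextFile(
        //               BatchPaths.batchOutputDir("hdfs:///rawSteamingData", time.milliseconds()));
        //           return null;
        //       }
        //   });
        System.out.println(batchOutputDir("hdfs:///rawSteamingData", 1400000000000L));
    }
}
```

With a 5-second batch interval this would create a new directory every 5 seconds, so many small files are to be expected. Whether that is acceptable depends on how the data is read back later.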