I doubt that's the problem. That's how HDFS lists a directory. Output of a few more commands is below.
$ hadoop fs -ls /tmp/
Found 7 items
drwxrwxrwx   - hdfs     supergroup          0 2016-03-10 11:09 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r--   3 ndsuser1 supergroup  447873024 2016-03-09 15:28 /tmp/cpcode_r.csv._COPYING_
-rw-r--r--   3 ndsuser1 supergroup         40 2016-03-09 15:58 /tmp/file.txt
drwxr-xr-x   - yarn     supergroup          0 2016-03-09 17:13 /tmp/hadoop-yarn
drwx-wx-wx   - hive     supergroup          0 2016-03-02 00:11 /tmp/hive
drwxrwxrwt   - mapred   hadoop              0 2016-03-04 14:38 /tmp/logs
drwxr-xr-x   - tomcat7  supergroup          0 2016-03-10 11:10 /tmp/swg

$ hadoop fs -copyFromLocal file.txt /tmp/swg/

$ hadoop fs -ls /tmp/
Found 7 items
drwxrwxrwx   - hdfs     supergroup          0 2016-03-10 11:10 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r--   3 ndsuser1 supergroup  447873024 2016-03-09 15:28 /tmp/cpcode_r.csv._COPYING_
-rw-r--r--   3 ndsuser1 supergroup         40 2016-03-09 15:58 /tmp/file.txt
drwxr-xr-x   - yarn     supergroup          0 2016-03-09 17:13 /tmp/hadoop-yarn
drwx-wx-wx   - hive     supergroup          0 2016-03-02 00:11 /tmp/hive
drwxrwxrwt   - mapred   hadoop              0 2016-03-04 14:38 /tmp/logs
drwxr-xr-x   - tomcat7  supergroup          0 2016-03-10 11:10 /tmp/swg

$ hadoop fs -ls /tmp/swg/
Found 1 items
-rw-r--r--   3 ndsuser1 supergroup         40 2016-03-10 11:10 /tmp/swg/file.txt

I found a similar problem on Stack Overflow:
http://stackoverflow.com/questions/33384299/spark-streaming-does-not-read-files-moved-from-hdfs-to-hdfs

Wondering if I am doing anything wrong.

-mugunthan

On Thu, Mar 10, 2016 at 4:29 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. drwxr-xr-x   - tomcat7 supergroup          0 2016-03-09 23:16 /tmp/swg
>
> If I read the above line correctly, the size of the file was 0.
>
> On Wed, Mar 9, 2016 at 10:00 AM, srimugunthan dhandapani
> <srimugunthan.dhandap...@gmail.com> wrote:
>
>> Hi all,
>> I am working on Cloudera CDH 5.6 and the version of Spark is 1.5.0-cdh5.6.0.
>>
>> I have a strange problem: Spark Streaming works on a directory in the
>> local filesystem but doesn't work for HDFS.
>>
>> My Spark Streaming program:
>>
>> package com.oreilly.learningsparkexamples.java;
>>
>> import java.util.concurrent.atomic.AtomicLong;
>> import org.apache.spark.api.java.JavaRDD;
>> import org.apache.spark.api.java.JavaSparkContext;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.Duration;
>>
>> public class StreamingLogInput {
>>
>>     public static void main(String[] args) throws Exception {
>>         String master = args[0];
>>         String target = args[1];
>>         JavaSparkContext sc = new JavaSparkContext(master, "StreamingLogInput");
>>         // Create a StreamingContext with a 1 second batch size
>>         JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(1000));
>>
>>         if (target.equals("localfs")) {
>>             JavaDStream<String> lines = jssc.textFileStream("file:///tmp/swg/");
>>             lines.print();
>>
>>             lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>                 @Override
>>                 public Void call(JavaRDD<String> rdd) throws Exception {
>>                     if (rdd.count() > 0) {
>>                         rdd.saveAsTextFile("hdfs://hadoop-host-1:8020/tmp/inputswg/file-"
>>                                 + System.nanoTime() + ".out");
>>                     }
>>                     return null;
>>                 }
>>             });
>>         } else if (target.equals("hdfs")) {
>>             JavaDStream<String> lines = jssc.textFileStream("hdfs://hadoop-host-1:8020/tmp/swg");
>>             // JavaDStream<String> lines = jssc.textFileStream("tempstream");
>>             lines.print();
>>
>>             lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>                 @Override
>>                 public Void call(JavaRDD<String> rdd) throws Exception {
>>                     if (rdd.count() > 0) {
>>                         rdd.saveAsTextFile("hdfs://hadoop-host-1:8020/tmp/inputswg/file-"
>>                                 + System.nanoTime() + ".out");
>>                     }
>>                     return null;
>>                 }
>>             });
>>         }
>>
>>         // Start the streaming context and wait for it to "finish".
>>         jssc.start();
>>         // Wait for 5 minutes, then exit. To run forever, call without a timeout.
>>         jssc.awaitTermination(5 * 60 * 1000);
>>         // Stop the streaming context
>>         jssc.stop();
>>     }
>> }
>>
>> Testing the above program against the local filesystem with
>>
>> spark-submit --class com.oreilly.learningsparkexamples.java.StreamingLogInput target/StreamingLogInput-1.0-SNAPSHOT.jar local[4] localfs
>>
>> creates the new files in HDFS (hdfs://hadoop-host-1:8020/tmp/inputswg),
>> but when tried with hdfs
>>
>> spark-submit --class com.oreilly.learningsparkexamples.java.StreamingLogInput target/StreamingLogInput-1.0-SNAPSHOT.jar local[4] hdfs
>>
>> it doesn't seem to detect the new file.
>>
>> I am creating the new file by a manual copy:
>> cp file.txt /tmp/swg                          (for localfs)
>> hadoop fs -copyFromLocal file.txt /tmp/swg    (for hdfs)
>>
>> I checked the permissions for /tmp/swg and it has read permission:
>> drwxr-xr-x   - tomcat7 supergroup          0 2016-03-09 23:16 /tmp/swg
>>
>> What's the problem?
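
For what it's worth, Spark Streaming's file sources expect files to appear atomically in the monitored directory, and (as I understand the 1.x FileInputDStream) a file is only treated as new if its modification time falls inside the current batch window. The usual workaround is to upload to a staging directory on the same HDFS and then rename the file into the watched directory. Below is a minimal sketch of that pattern using the Hadoop FileSystem API; the class name and the /tmp/staging path are placeholders of mine, and the NameNode URI is the one from the program above.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageAndRename {
    public static void main(String[] args) throws Exception {
        // Connect to the same NameNode that the streaming job watches.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://hadoop-host-1:8020"), new Configuration());

        // 1) Upload the local file into a staging directory that the stream
        //    does NOT monitor (/tmp/staging is only a placeholder).
        Path staged = new Path("/tmp/staging/file.txt");
        fs.copyFromLocalFile(new Path("file.txt"), staged);

        // 2) Rename it into the watched directory. HDFS renames are atomic,
        //    so the stream never sees a half-written file, and the file keeps
        //    the recent modification time set when the upload finished.
        boolean renamed = fs.rename(staged, new Path("/tmp/swg/file.txt"));
        System.out.println("renamed into /tmp/swg: " + renamed);

        fs.close();
    }
}

If the files being dropped in carry older timestamps (the moved-from-HDFS case in the Stack Overflow link above), StreamingContext also has a fileStream variant with a newFilesOnly flag that can be set to false so the stream considers files that predate its start.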