> On Feb 19, 2015, at 7:29 PM, Pavel Velikhov <pavel.velik...@icloud.com> wrote:
> 
> I have a simple Spark job that goes out to Cassandra, runs a pipe and stores 
> results:
> 
> val sc = new SparkContext(conf)
> val rdd = sc.cassandraTable("keyspace", "table")
>       .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
>       .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
>       .map(r => convertStr(r))
>       .coalesce(1, true)
>       .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
>       //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
> 
> When run on a single machine, everything is fine whether I save to an HDFS 
> file or to Cassandra.
> When run on a cluster, neither works:
> 
>  - When saving to a file, I get an exception: User class threw exception: 
> Output directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt 
> already exists
>  - When saving to Cassandra, only 4 rows are updated, all with empty data 
> (I test on a 4-machine Spark cluster)
> 
> Any hints on how to debug this and where the problem could be?
> 
> - I delete the HDFS output directory before each run
> - I would really like the output to HDFS to work first, so I can debug
> - After that it would be nice to save to Cassandra

- Btw, I *don't* want to keep using .coalesce(1, true) in the future either.
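One thing worth checking for the "already exists" error: `saveAsTextFile` refuses to overwrite an existing HDFS directory, and in cluster mode a delete done against the local filesystem won't touch HDFS. A minimal sketch of deleting the output path from the driver via the Hadoop FileSystem API, assuming the same `hadoop01:54310` namenode from the error message and an in-scope `sc` (this is untested against the poster's cluster, just an illustration):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Output path copied from the exception in the original post.
val outPath = new Path("hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt")

// Resolve the FileSystem for that URI using the driver's Hadoop configuration.
val fs = FileSystem.get(outPath.toUri, sc.hadoopConfiguration)

// Recursively remove the old output directory, if present, before saving.
if (fs.exists(outPath))
  fs.delete(outPath, true)
```

Running this in the driver just before the `saveAsTextFile` call ensures the directory is gone on HDFS itself, not merely on the submitting machine.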
