> On Feb 19, 2015, at 7:29 PM, Pavel Velikhov <pavel.velik...@icloud.com> wrote:
>
> I have a simple Spark job that goes out to Cassandra, runs a pipe and stores results:
>
>     val sc = new SparkContext(conf)
>     val rdd = sc.cassandraTable("keyspace", "table")
>       .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
>       .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
>       .map(r => convertStr(r))
>       .coalesce(1, true)
>       .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
>       //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
>
> When run on a single machine, everything is fine whether I save to an HDFS file or save to Cassandra.
> When run on a cluster, neither works:
>
> - When saving to a file, I get an exception: User class threw exception: Output directory hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt already exists
> - When saving to Cassandra, only 4 rows are updated, with empty data (I test on a 4-machine Spark cluster)
>
> Any hints on how to debug this and where the problem could be?
>
> - I delete the HDFS file before running
> - I would really like the output to HDFS to work, so I can debug
> - Then it would be nice to save to Cassandra
- Btw, I *don't* want to keep using .coalesce(1, true) in the future either.
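One thing worth trying (a sketch only, not a confirmed fix): instead of deleting the output directory by hand before each run, delete it from the driver immediately before the save, using the Hadoop FileSystem API. That rules out a stale or wrong-namespace manual delete as the cause of the "already exists" error. The path below is the one from the job above; this assumes `sc` is the same SparkContext and that the hadoop-client classes are already on the classpath (they are for a normal Spark deployment).

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = "/tmp/pavel/CassandraPipeTest.txt"

    // Resolve the cluster's default filesystem (hdfs://hadoop01:54310 here)
    // from the SparkContext's Hadoop configuration, so the delete hits the
    // same namespace that saveAsTextFile will write to.
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // recursive = true; delete() returns false (no error) if the path
    // does not exist, so this is safe on a fresh run.
    fs.delete(new Path(outputPath), true)

    rdd.saveAsTextFile(outputPath)

If the error still appears after this, it would suggest the exception is coming from somewhere other than a leftover directory (for example, the job being submitted or retried twice).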