Well, this will indeed hit the error if the next run produces the same year and month values, and the write would not be possible.
You can try working around this by introducing a runCount in the partitioning or in the output path. Something like:

/tmp/data/year/month/01
/tmp/data/year/month/02

Or:

/tmp/data/01/year/month
/tmp/data/02/year/month

This is a workaround; I am sure other, better approaches will follow. A rough sketch of the idea is at the end of this mail.

- Thanks, via mobile, excuse brevity.

On Dec 22, 2015 7:01 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:

> Hi Yash,
>
> the error is caused by the fact that the first run creates the base
> directory, i.e. "/tmp/data", and the second batch stumbles on the existing
> base directory. I understand that the existing base directory is a
> challenge, but I do not understand how to make this work in a streaming
> example where each batch would have to create a new distinct directory.
>
> Granularity has no impact. No matter how the data is partitioned, the
> second 'batch' always fails on the existing base dir.
>
> scala> df2.write.partitionBy("year").avro("/tmp/data")
> org.apache.spark.sql.AnalysisException: path hdfs://nameservice1/tmp/data already exists.;
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>   at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
>   at com.databricks.spark.avro.package$AvroDataFrameWriter$$anonfun$avro$1.apply(package.scala:37)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>
>
> On 22 Dec 2015, at 14:06, Yash Sharma <yash...@gmail.com> wrote:
>
> Hi Jan,
> Is the error because a past run of the job has already written to the
> location?
>
> In that case you can add more granularity with 'time' along with year and
> month. That should give you a distinct path for every run.
>
> Let us know if it helps or if I missed anything.
>
> Good luck
>
> - Thanks, via mobile, excuse brevity.
> On Dec 22, 2015 2:31 PM, "Jan Holmberg" <jan.holmb...@perigeum.fi> wrote:
>
>> Hi,
>> I'm stuck with writing partitioned data to HDFS. The example below ends
>> up with an 'already exists' error.
>>
>> I'm wondering how to handle the streaming use case.
>>
>> What is the intended way to write streaming data to HDFS? What am I
>> missing?
>>
>> cheers,
>> -jan
>>
>>
>> import com.databricks.spark.avro._
>> import org.apache.spark.sql.SQLContext
>>
>> val sqlContext = new SQLContext(sc)
>> import sqlContext.implicits._
>>
>> val df = Seq(
>>   (2012, 8, "Batman", 9.8),
>>   (2012, 8, "Hero", 8.7),
>>   (2012, 7, "Robot", 5.5),
>>   (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
>>
>> df.write.partitionBy("year", "month").avro("/tmp/data")
>>
>> val df2 = Seq(
>>   (2012, 10, "Batman", 9.8),
>>   (2012, 10, "Hero", 8.7),
>>   (2012, 9, "Robot", 5.5),
>>   (2011, 9, "Git", 2.0)).toDF("year", "month", "title", "rating")
>>
>> df2.write.partitionBy("year", "month").avro("/tmp/data")
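
PS - to make the runCount idea a bit more concrete, here is a rough, untested sketch. It reuses the df2 and the spark-avro import from your example; the run identifier is just a placeholder for whatever counter or timestamp you track per run:

  // Give every batch its own sub-directory under the base path, so a later
  // run never writes into a directory that already exists.
  val runId = System.currentTimeMillis()  // or an incrementing runCount you manage yourself
  df2.write.partitionBy("year", "month").avro(s"/tmp/data/$runId")

Whether you use a timestamp or a persisted counter mostly depends on how you want to list and clean up old runs later.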