Re: spark sql writing in avro

M. Dale Fri, 13 Mar 2015 13:59:22 -0700

I probably did not do a good enough job explaining the problem. If you used Maven with the default Maven repository, you have an old version of spark-avro that does not contain AvroSaver and does not have the saveAsAvro method implemented:


Assuming you use the default Maven repo location:
cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver

Comes up empty. The jar file does not contain this class because AvroSaver.scala wasn't added until Jan 21. The jar file is from 14 November.
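
The check above can be wrapped in a small helper for use in scripts. This is only a sketch: listing_has_class is a made-up name, it assumes a POSIX shell with grep -E, and it assumes the class name contains no regex metacharacters. It reads a jar listing (from jar tf or jar tvf) on stdin and succeeds if there is an entry for the class itself or for a Scala-generated variant such as Name$.class.

```shell
# Hypothetical helper: succeed if the jar listing on stdin contains
# <ClassName>.class or a Scala-generated variant such as <ClassName>$.class.
# $1 is the simple class name (assumed to contain no regex metacharacters).
listing_has_class() {
  # Match the name only directly after a "/" (or at line start) so that
  # e.g. MyAvroSaver.class does not count as AvroSaver.class.
  grep -Eq "(^|/)$1(\\\$[^/]*)?\.class\$"
}
```

For example: jar tf spark-avro_2.10-0.1.jar | listing_has_class AvroSaver || echo "AvroSaver missing"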


So:
git clone g...@github.com:databricks/spark-avro.git
cd spark-avro
sbt publish-m2

This publishes the latest master code (this includes AvroSaver etc.) to your local Maven repo, and Maven will pick up the latest version of spark-avro (for this machine).


Now you should be able to compile and run.

HTH,
Markus

On 03/12/2015 11:55 PM, Kevin Peng wrote:

Dale,

I basically have the same Maven dependency as above, but my code will not compile because it cannot reference AvroSaver, though the saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro compiles for me, it errors out when running the Spark job because it is not implemented (the job quits and says non-implemented method or something along those lines).

I will try going through the spark shell and passing in the jar built from GitHub, since I haven't tried that yet.

On Thu, Mar 12, 2015 at 6:44 PM, M. Dale <medal...@yahoo.com> wrote:


    Short answer: if you downloaded spark-avro from the
    repo.maven.apache.org repo you might be using an old version
    (pre-November 14, 2014) - see timestamps at
    http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
    Lots of changes at https://github.com/databricks/spark-avro since
    then.

    Databricks, thank you for sharing the Avro code!!!

    Could you please push out the latest version or update the version
    number and republish to repo.maven.apache.org (I have no idea how
    jars get there)? Or is there a different repository that users
    should point to for this artifact?

    Workaround: Download from https://github.com/databricks/spark-avro,
    build it with the latest functionality (still version 0.1), and add
    it to your local Maven or Ivy repo.

    Long version:
    I used a default Maven build and declared my dependency on:

            <dependency>
                <groupId>com.databricks</groupId>
                <artifactId>spark-avro_2.10</artifactId>
                <version>0.1</version>
            </dependency>

    Maven downloaded the 0.1 version from
    http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
    and included it in my app code jar.

    From spark-shell:

    import com.databricks.spark.avro._
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)

    // This schema includes LONG for time in millis
    // (https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl)
    val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
    java.lang.RuntimeException: Unsupported type LONG

    However, after checking out the spark-avro code from its GitHub repo
    and adding a test case against the MailRecord avro, everything ran fine.

    So I built the databricks spark-avro locally on my box and then put it
    in my local Maven repo - everything worked from spark-shell when adding
    that jar as a dependency.

    Hope this helps for the "save" case as well. On the pre-14NOV version,
    avro.scala says:
     // TODO: Implement me.
      implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
        def saveAsAvroFile(path: String): Unit = ???
      }

    Markus

    On 03/12/2015 07:05 PM, kpeng1 wrote:

        Hi All,

        I am currently trying to write out a schema RDD to avro.  I noticed
        that there is a databricks spark-avro library and I have included
        that in my dependencies, but it looks like I am not able to access
        the AvroSaver object.  On compilation of the job I get this:
        error: not found: value AvroSaver
        [ERROR]     AvroSaver.save(resultRDD, args(4))

        I also tried calling saveAsAvro on the resultRDD (the actual rdd
        with the results) and that passes compilation, but when I run the
        code I get an error that says saveAsAvro is not implemented.  I am
        using version 0.1 of spark-avro_2.10.




