Spark SQL using the Data Source API can also do this with much less code <https://twitter.com/michaelarmbrust/status/579346328636891136>.
https://github.com/databricks/spark-avro

On Thu, May 7, 2015 at 8:41 AM, Jonathan Coveney <[email protected]> wrote:

> A helpful example of how to convert:
> http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
>
> As far as performance goes, that depends on your data. If you have a lot of
> columns and use all of them, Parquet deserialization is expensive. If you
> have a lot of columns and only need a few (or have some filters you can
> push down), the savings can be huge.
>
> 2015-05-07 11:29 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) <[email protected]>:
>
>> 1) What is the best way to convert data from Avro to Parquet so that it
>> can be later read and processed?
>>
>> 2) Will the performance of processing (join, reduceByKey) be better if
>> both datasets are in Parquet format when compared to Avro + Sequence?
>>
>> --
>> Deepak
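
For reference, here is a rough sketch of the Data Source API route mentioned at the top of this reply, written for PySpark. It is not taken from the linked tweet or blog post. It assumes a Spark version that has the `read`/`write` DataFrame API and that the spark-avro package is on the classpath. The function names, the `avro_format` default, and the overwrite mode are my own choices for illustration. The second function shows the "only need a few columns" case from the quoted message: reading Parquet with a filter and a narrow projection, which is where column pruning and filter pushdown can pay off.

```python
def convert_avro_to_parquet(sql_context, avro_path, parquet_path,
                            avro_format="com.databricks.spark.avro"):
    """Load Avro files as a DataFrame and rewrite them as Parquet.

    sql_context is assumed to be a SQLContext (or SparkSession) exposing
    the DataFrameReader API; avro_format is the data source name that the
    spark-avro package registers (an assumption, adjust for your setup).
    """
    df = sql_context.read.format(avro_format).load(avro_path)
    # "overwrite" replaces any existing output; chosen for illustration only.
    df.write.mode("overwrite").parquet(parquet_path)
    return df


def read_projected(sql_context, parquet_path, columns, condition=None):
    """Read only the requested columns from Parquet, optionally filtered.

    Filtering before selecting lets the condition refer to columns that
    are not part of the final projection.
    """
    df = sql_context.read.parquet(parquet_path)
    if condition is not None:
        df = df.filter(condition)
    return df.select(*columns)
```

With a live context the calls would look like `convert_avro_to_parquet(sqlContext, "/data/in.avro", "/data/out.parquet")` followed by `read_projected(sqlContext, "/data/out.parquet", ["id", "ts"], "ts > 0")`. The paths and column names here are placeholders, not anything from the thread.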
