Hi Michael,

This is what I did. I was wondering whether there is a more efficient way to
accomplish this.

I was doing a very simple benchmark: converting LZO-compressed JSON files to
Parquet files using Spark SQL vs. Hadoop MR.

Spark SQL seems to require 2 stages to accomplish this task (rough sketch
below):
Stage 1: read the LZO files using newAPIHadoopFile with LzoTextInputFormat
and then convert them to a JsonRDD
Stage 2: saveAsParquetFile from the JsonRDD
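
For reference, this is roughly the code I run from the spark shell (a minimal
sketch with hypothetical paths, assuming the LzoTextInputFormat here is
hadoop-lzo's com.hadoop.mapreduce.LzoTextInputFormat and that it is on the
classpath; sc is the shell's SparkContext):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Stage 1: read the LZO-compressed JSON as an RDD[String] ...
val lines = sc.newAPIHadoopFile(
  "hdfs:///tmp/events-json-lzo",   // hypothetical input path
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString)

// ... then infer the schema with jsonRDD
val events = sqlContext.jsonRDD(lines)

// Stage 2: write the result out as Parquet
events.saveAsParquetFile("hdfs:///tmp/events-parquet")   // hypothetical output path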

In Hadoop, it takes 1 step: a map-only job that reads the data and then
writes the JSON out as a Parquet file (I'm using Elephant Bird's
LzoJsonLoader to load the files).

In some scenarios, Hadoop is faster because it saves one stage. Did I do
something wrong?

Best Regards,

Jerry


On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com>
wrote:
>
> You can create an RDD[String] using whatever method and pass that to
> jsonRDD.
>
> On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>> Hi Ted,
>>
>> Thanks for your help.
>> I'm able to read lzo files using sparkContext.newAPIHadoopFile but I
>> couldn't do the same for sqlContext because sqlContext.jsonFile does not
>> provide ways to configure the input file format. Do you know if there are
>> some APIs to do that?
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Wed, Dec 17, 2014 at 11:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> See this thread: http://search-hadoop.com/m/JW1q5HAuFv
>>> which references https://issues.apache.org/jira/browse/SPARK-2394
>>>
>>> Cheers
>>>
>>> On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>> Hi spark users,
>>>>
>>>> Do you know how to read json files using Spark SQL that are LZO
>>>> compressed?
>>>>
>>>> I'm looking into sqlContext.jsonFile but I don't know how to configure
>>>> it to read lzo files.
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
