Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
*To:* Thomas, Jordan <jordan.tho...@accenture.com>; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Could you please elaborate on

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
ian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:46 PM *To:* Thomas, Jordan ; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Probably parquet-tools and the following shell script helps: root="/path/to/your/data"

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
months. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, September 28, 2015 6:46 PM To: Thomas, Jordan ; mich...@databricks.com Cc: user@spark.apache.org Subject: Re: Performance when iterating over many parquet files Probably parquet-tools and the following shell script helps: root

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
nd re-transferred. Thanks, Jordan *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:15 PM *To:* Thomas, Jordan ; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Could you please elaborate

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
@spark.apache.org Subject: Re: Performance when iterating over many parquet files Could you please elaborate on what kind of errors those bad Parquet files are causing? In what ways are they miswritten? Cheng On 9/28/15 4:03 PM, jordan.tho...@accenture.com wrote: Ah,

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
[mailto:lian.cs@gmail.com] Sent: Monday, September 28, 2015 5:56 PM To: Thomas, Jordan ; mich...@databricks.com Cc: user@spark.apache.org Subject: Re: Performance when iterating over many parquet files Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
do correspond to logical entities and there are a number of use-case specific reasons to keep them separate. Thanks, Jordan *From:*Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Monday, September 28, 2015 4:02 PM *To:* Thomas, Jordan *Cc:* user *Subject:* Re: Performance wh

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
or so. I could coalesce, but they do correspond to logical entities and there are a number of use-case specific reasons to keep them separate. Thanks, Jordan *From:*Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Monday, September 28, 2015 4:02 PM *To:* Thomas, Jordan *Cc:* user *Subj

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
entities and there are a number of use-case specific reasons to keep them separate. Thanks, Jordan From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Monday, September 28, 2015 4:02 PM To: Thomas, Jordan Cc: user Subject: Re: Performance when iterating over many parquet files

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
Another note: for best performance you are going to want your Parquet files to be pretty big (100s of MB). You could coalesce them and write them out for more efficient repeat querying. On Mon, Sep 28, 2015 at 2:00 PM, Michael Armbrust wrote: > sqlContext.read.parquet >
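
To make the sizing advice above concrete, here is a rough sketch of how one might pick a partition count before coalescing. The 256 MB target, the object name, and the function name are assumptions for illustration, not something stated in the thread. On-disk Parquet size also depends on compression and encoding, so treat the result as a starting point only.

```scala
object CoalesceSizing {
  // Assumed target size per output file: 256 MB (the thread only says "100s of MB").
  val TargetFileBytes: Long = 256L * 1024 * 1024

  /** Number of output partitions needed so each file is at most about `targetBytes`. */
  def targetPartitions(totalBytes: Long, targetBytes: Long = TargetFileBytes): Int = {
    require(totalBytes >= 0 && targetBytes > 0, "sizes must be non-negative / positive")
    math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
  }
}

// Hypothetical usage with an existing DataFrame (requires Spark 1.4+, not run here):
//   val n = CoalesceSizing.targetPartitions(totalInputBytes)
//   dataFrame.coalesce(n).write.parquet("/path/to/compacted")
```

Here `totalInputBytes` and the output path are placeholders; the total would have to come from the filesystem listing of the input files.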

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
sqlContext.read.parquet takes lists of files.

val fileList = sc.textFile("file_list.txt").collect() // this works but using spark is possibly overkill
val dataFrame = sqlContext.re
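
Since the message notes that using Spark just to read the list is possibly overkill, here is a minimal sketch that reads the list on the driver with plain Scala instead. The object and function names are made up for illustration, and the file is assumed to hold one path per line. The Spark call is shown only as a comment and relies on `read.parquet` accepting varargs paths (Spark 1.4+).

```scala
import scala.io.Source

object ParquetFileList {
  /** Reads one path per line, trimming whitespace and skipping blank lines. */
  def readFileList(listPath: String): Seq[String] = {
    val source = Source.fromFile(listPath)
    try source.getLines().map(_.trim).filter(_.nonEmpty).toList
    finally source.close()
  }
}

// Hypothetical usage with the thread's existing sqlContext (not run here):
//   val paths = ParquetFileList.readFileList("file_list.txt")
//   val dataFrame = sqlContext.read.parquet(paths: _*)
```

Skipping blank lines matters here because an empty string passed as a path would otherwise make the read fail.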