Re: How to reflect dynamic registration udf?

2016-12-16 Thread Cheng Lian
Could you please provide more context about what you are trying to do here? On Thu, Dec 15, 2016 at 6:27 PM 李斌松 wrote: > How to reflect dynamic registration udf? > > java.lang.UnsupportedOperationException: Schema for type _$13 is not > supported > at > org.apache.spark.sql.catalyst.ScalaReflect

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Cheng Lian
On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs@gmail.com> wrote: What version of Spark are you using and how many output files does the job write out? By default, Spark versions before 1.6 (not includin

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-24 Thread Cheng Lian
ing along those lines? Exactly. On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <lian.cs@gmail.com> wrote: Efe - You probably hit this bug: https://issues.apache.org/jira/browse/SPARK-18058 On 10/21/16 2:03 AM, Agraj Mangal wrote: I have seen this error someti

Re: RDD groupBy() then random sort each group ?

2016-10-21 Thread Cheng Lian
I think it would be much easier to use the DataFrame API to do this, by doing a local sort using randn() as the key. For example, in Spark 2.0: val df = spark.range(100) val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42)) Replace df with a DataFrame wrapping your RDD, and $"id" % 10 wit
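
A slightly expanded, self-contained version of that suggestion (a sketch assuming Spark 2.0+; the grouping key id % 10 and the randn seed are just placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.randn

    val spark = SparkSession.builder().appName("group-shuffle").getOrCreate()
    import spark.implicits._

    val df = spark.range(100).toDF("id")

    // Co-locate rows of the same group in one partition, then sort each
    // partition by a random key so rows within a group come out in random order.
    val shuffled = df
      .repartition($"id" % 10)
      .sortWithinPartitions(randn(42))

    shuffled.show()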

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-21 Thread Cheng Lian
Efe - You probably hit this bug: https://issues.apache.org/jira/browse/SPARK-18058 On 10/21/16 2:03 AM, Agraj Mangal wrote: I have seen this error sometimes when the elements in the schema have different nullabilities. Could you print the schema for data and for someCode.thatReturnsADataset()

Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Cheng Lian
You may either use SQL function "array" and "named_struct" or define a case class with expected field names. Cheng On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote: My expectation is: root |-- tag: vector namely, I want to extract from: [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]| to: Vectors.

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Yea, confirmed. While analyzing unions, we treat StructTypes with different field nullabilities as incompatible types and throw this error. Opened https://issues.apache.org/jira/browse/SPARK-18058 to track this issue. Thanks for reporting! Cheng On 10/21/16 3:15 PM, Cheng Lian wrote: Hi

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Hi Muthu, What version of Spark are you using? This seems to be a bug in the analysis phase. Cheng On 10/21/16 12:50 PM, Muthu Jayakumar wrote: Sorry for the late response. Here is what I am seeing... Schema from parquet file. d1.printSchema() root |-- task_id: string (nullable =

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Cheng Lian
What version of Spark are you using and how many output files does the job write out? By default, Spark versions before 1.6 (not including) write Parquet summary files when committing the job. This process reads footers from all Parquet files in the destination directory and merges them toge

Re: Where condition on columns of Arrays does no longer work in spark 2

2016-10-21 Thread Cheng Lian
Thanks for reporting! It's a bug, just filed a ticket to track it: https://issues.apache.org/jira/browse/SPARK-18053 Cheng On 10/20/16 1:54 AM, filthysocks wrote: I have a Column in a DataFrame that contains Arrays and I wanna filter for equality. It does work fine in spark 1.6 but not in 2.0

Re: Consuming parquet files built with version 1.8.1

2016-10-17 Thread Cheng Lian
Hi Dinesh, Thanks for reporting. This is kinda weird and I can't reproduce it. Were you doing the experiments using a cleanly compiled Spark master branch? And I don't think you have to use parquet-mr 1.8.1 to read Parquet files generated using parquet-mr 1.8.1 unless you are using something not

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-12 Thread Cheng Lian
OK, I've merged this PR to master and branch-2.0. On 8/11/16 8:27 AM, Cheng Lian wrote: Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the iss

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-10 Thread Cheng Lian
Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https://github.com/apache/spark/pull/14585/files Cheng On 8/9/16 5:38 PM, immerrr again wrote: Anot

Re: Re: Bug about reading parquet files

2016-07-09 Thread Cheng Lian
According to our offline discussion, the target table consists of 1M+ small Parquet files (~12 MB on average). The OOM occurred on the driver side while listing input files. My theory is that the total size of all listed FileStatus objects is too large for the driver and caused the OOM. Suggestion

Re: Bug about reading parquet files

2016-07-08 Thread Cheng Lian
What's the Spark version? Could you please also attach the result of explain(extended = true)? On Fri, Jul 8, 2016 at 4:33 PM, Sea <261810...@qq.com> wrote: > I have a problem reading parquet files. > sql: > select count(1) from omega.dwd_native where year='2016' and month='07' > and day='05' and h

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian
Spark 1.6.1 is also using parquet-mr 1.7.0. Could you please share the schema of your Parquet file as well as the exact exception stack trace reported by Hive? Cheng On 6/13/16 12:56 AM, mayankshete wrote: Hello Team , I am facing an issue where output files generated by Spark 1.6.1 are not read by

Re: update mysql in spark

2016-06-15 Thread Cheng Lian
Spark SQL doesn't support update command yet. On Wed, Jun 15, 2016, 9:08 AM spR wrote: > hi, > > can we write a update query using sqlcontext? > > sqlContext.sql("update act1 set loc = round(loc,4)") > > what is wrong in this? I get the following error. > > Py4JJavaError: An error occurred while

Re: feedback on dataset api explode

2016-05-25 Thread Cheng Lian
Agree, since they can be easily replaced by .flatMap (to do explosion) and .select (to rename output columns) Cheng On 5/25/16 12:30 PM, Reynold Xin wrote: Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Cheng Lian
Parquet is a write-once format: files cannot be modified in place. So the only way to remove data from a written Parquet file is to write a new Parquet file without the unwanted rows. Cheng On 2/17/16 5:11 AM, SRK wrote: Hi, I am saving my records in the form of parquet files using dataframes in hdfs. How to delete the records usin
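
A minimal sketch of that rewrite approach (the paths and the id predicate below are made up; sqlContext is a spark-shell SQLContext):

    // Parquet files cannot be modified in place, so drop rows by rewriting
    // the dataset to a new location without the unwanted records.
    val df = sqlContext.read.parquet("hdfs:///data/records")
    val kept = df.filter("id != 42")   // keep everything except the unwanted rows
    kept.write.parquet("hdfs:///data/records_cleaned")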

Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian
The following snippet may help: sqlContext.read.parquet(path).withColumn("col_ts", $"col".cast(TimestampType)).drop("col") Cheng On 1/21/16 6:58 AM, Muthu Jayakumar wrote: DataFrame and udf. This may be more performant than doing an RDD transformation as you'll only transform just the colu
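
Spelled out a bit more as a sketch (the path and the column names col / col_ts are placeholders):

    import org.apache.spark.sql.types.TimestampType

    // Read the Parquet data, cast the string column to a timestamp column,
    // and drop the original string column.
    val df = sqlContext.read.parquet("/data/events")
    val withTs = df
      .withColumn("col_ts", df("col").cast(TimestampType))
      .drop("col")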

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition: df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day", "status").mode(S

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian
oblem. Best, Gavin On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs@gmail.com> wrote: Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially,

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian
Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially, you mentioned you saw "3000 jobs" failed. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-31 Thread Cheng Lian
Hey Lin, This is a good question. The root cause of this issue lies in the analyzer. Currently, Spark SQL can only resolve a name to a top level column. (Hive suffers the same issue.) Take the SQL query and struct you provided as an example, col_b.col_d.col_g is resolved as two nested GetStru

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Cheng Lian
"false") Thanks, -Matt On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian <l...@databricks.com> wrote: This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps:

Re: memory leak when saving Parquet files in Spark

2015-12-10 Thread Cheng Lian
This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps: df.write .format("parquet") .option("mergeSchema", "false") .partitionBy(partitionCols: _*) .mode(saveMode) .save(targetPath) I

Re: About the bottleneck of parquet file reading in Spark

2015-12-10 Thread Cheng Lian
Cc Spark user list since this information is generally useful. On Thu, Dec 10, 2015 at 3:31 PM, Lionheart <87249...@qq.com> wrote: > Dear, Cheng > I'm a user of Spark. Our current Spark version is 1.4.1 > In our project, I find there is a bottleneck when loading huge amount > of parquet

Re: parquet file doubts

2015-12-08 Thread Cheng Lian
eet <absi...@informatica.com> wrote: Yes, Parquet has min/max. *From:* Cheng Lian [mailto:l...@databricks.com] *Sent:* Monday, December 07, 2015 11:21 AM *To:* Ted Yu *Cc:* Shushant Arora; user@spark.apache.org

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
, Ted Yu wrote: Cheng: I only see user@spark in the CC. FYI On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian <l...@databricks.com> wrote: cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Ar

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Arora wrote: Hi I have a few doubts on the parquet file format. 1. Does parquet keep min/max statistics like ORC does? How can I see the parquet version (whether it's 1.1, 1.2 or 1.3) for pa

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try to set Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata). By default Parquet writes the summary files by collecting footers of all part-files in the dataset while committing the job. Spark also follows
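
A sketch of setting that Hadoop property from a Spark application (assuming sc is the SparkContext, df is the DataFrame being written, and the path and partition column are made up):

    // Disable Parquet summary files (_metadata / _common_metadata) for this job.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Subsequent Parquet writes in this application will skip the summary files.
    df.write.partitionBy("date").parquet("/output/path")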

Re: Parquet files not getting coalesced to smaller number of files

2015-11-29 Thread Cheng Lian
RDD.coalesce(n) returns a new RDD rather than modifying the original RDD. So what you need is: metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...) Cheng On 11/29/15 12:21 PM, SRK wrote: Hi, I have the following code that saves the parquet files in my hourly batch to hdfs. My idea

Re: DateTime Support - Hive Parquet

2015-11-29 Thread Cheng Lian
icit conversion for this case? Do you convert on insert or on RDD to DF conversion? Regards, Bryan Jeffrey Sent from Outlook Mail *From: *Cheng Lian *Sent: *Tuesday, November 24, 2015 6:49 AM *To: *Bryan;user *Subject: *Re: DateTime Support - Hive Parquet I see, then this is actually irrelevan

Re: DateTime Support - Hive Parquet

2015-11-24 Thread Cheng Lian
nanos, Timestamp, etc) prior to writing records to hive. Regards, Bryan Jeffrey Sent from Outlook Mail *From: *Cheng Lian *Sent: *Tuesday, November 24, 2015 1:42 AM *To: *Bryan Jeffrey;user *Subject: *Re: DateTime Support - Hive Parquet Hey Bryan, What do you mean by "DateTime prope

Re: DateTime Support - Hive Parquet

2015-11-23 Thread Cheng Lian
Hey Bryan, What do you mean by "DateTime properties"? Hive and Spark SQL both support DATE and TIMESTAMP types, but there's no DATETIME type. So I assume you are referring to Java class DateTime (possibly the one in joda)? Could you please provide a sample snippet that illustrates your requir

Re: doubts on parquet

2015-11-19 Thread Cheng Lian
t works fine. my requirement is now to handle writing in multiple folders at the same time. Basically the JavaPairRDD I want to write to multiple folders based on the final hive partitions where this rdd will land. Have you used multiple output formats in spark? On Fri, Nov 13, 2015 at 3:56 PM, Che

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
none of your responses are there either. I am definitely subscribed to the list though (I get daily digests). Any clue how to fix it? Sorry, no idea :-/ On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs@gmail.com> wrote: I'd expect writing Parquet files to be slower than

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files since Parquet involves more complicated encoders, but maybe not that slow. Would you mind trying to profile one Spark executor using tools like YJP to see what the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if thi

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my
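
A sketch of checking and setting that flag on a SQLContext (the fallback value passed to getConf is just for illustration):

    // Check the current setting, then re-enable conversion of metastore Parquet
    // tables to Spark SQL's native Parquet support, which the pruning relies on here.
    val current = sqlContext.getConf("spark.sql.hive.convertMetastoreParquet", "true")
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")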

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue https://issues.apac

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but missin

Re: Fixed writer version as version1 for Parquet as wring a Parquet file.

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM, Hyuk

Re: Parquet file size

2015-10-08 Thread Cheng Lian
l.com <younes.nag...@streamtheworld.com> *From:* odeach...@gmail.com on behalf of Deng Ching-Mallete [och...@apache.org] *Sent:* Wednesday, October 07, 2015 9:14 PM *To:* Younes Naguib *Cc:* Cheng

Re: Parquet file size

2015-10-07 Thread Cheng Lian
, without month and day). Cheng So you want to dump all data into a single large Parquet file? On 10/7/15 1:55 PM, Younes Naguib wrote: The original TSV files are 600GB and generated 40k files of 15-25MB. y *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October-07-15 3:18 PM *To

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day)..

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately this isn't supported at the moment https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for th

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
430-L431>, which reads the actual Parquet footers and probably takes most of the time). Cheng On 9/28/15 6:51 PM, Cheng Lian wrote: Oh I see, then probably this one, basically the parallel Spark version of my last script, using ParquetFileReader: import org.apache.parquet.hadoop.ParquetFile

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
g very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months. *From:*Cheng Lian [mailto:l

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
nd re-transferred. Thanks, Jordan *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:15 PM *To:* Thomas, Jordan ; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet files Could you please elaborate

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 5:56 PM *To:* Thomas, Jordan ; mich

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Cheng On 9/28/15 3:54 PM, Cheng Lian wrote: I guess you're probably

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
I guess you're probably using Spark 1.5? Spark SQL does support schema merging, but we disabled it by default since 1.5 because it introduces extra performance costs (it's turned on by default in 1.4 and 1.3). You may enable schema merging via either the Parquet data source specific option "me
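
For reference, the two ways to turn schema merging back on (a sketch; the path is made up):

    // Per read: enable Parquet schema merging via the data source option.
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/table")

    // Or globally, via the SQL configuration flag.
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")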

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the the SQL option

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug of parquet-hive. When generating the Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message hive
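
A sketch of how the option can be applied before reading such files (the path is made up):

    // Interpret Parquet BINARY columns without a UTF8 annotation as strings,
    // which is needed for Parquet files written by older Hive versions.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val df = sqlContext.read.parquet("/warehouse/hive_table")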

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-25 Thread Cheng Lian
uble are handled. Handling INT is all good but float and double are causing the exception. Thanks. Dominic Ricard Triton Digital -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, September 24, 2015 5:47 PM To: Dominic Ricard; user@spark.apache.org Subj

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-24 Thread Cheng Lian
On 9/24/15 11:34 AM, Dominic Ricard wrote: Hi, I stumbled on the following today. We have Parquet files that expose a column in a Map format. This is very convenient as we have data parts that can vary in time. Not knowing what the data will be, we simply split it in tuples and insert it as

Re: spark + parquet + schema name and metadata

2015-09-24 Thread Cheng Lian
I am planning to use "stable" metadata - so those will be same across all parquet files inside directory hierarchy... On Tue, 22 Sep 2015 at 18:54 Cheng Lian <lian.cs@gmail.com> wrote: Michael reminded me that although we don't support direct

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
mp/parquet/meta/part-r-0-77cb2237-e6a8-4cb6-a452-ae205ba7b660.gz.parquet creator: parquet-mr version 1.6.0 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long",&qu

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
"tag" them in some way (giving the schema an appropriate name or attaching some key/values) and then it is fairly easy to get basic metadata about parquet files when processing and discovering those later on. On Mon, 21 Sep 2015 at 18:17 Cheng Lian <lian.cs@gmail.com> w

Re: spark + parquet + schema name and metadata

2015-09-21 Thread Cheng Lian
Currently Spark SQL doesn't support customizing schema name and metadata. May I know why these two matter in your use case? Some Parquet data models, like parquet-avro, do support it, while some others don't (e.g. parquet-hive). Cheng On 9/21/15 7:13 AM, Borisa Zivkovic wrote: Hi, I am try

Re: parquet error

2015-09-18 Thread Cheng Lian
Not sure what's happening here, but I guess it's probably a dependency version issue. Could you please give vanilla Apache Spark a try to see whether it's a CDH-specific issue or not? Cheng On 9/17/15 11:44 PM, Chengi Liu wrote: Hi, I did some digging.. I believe the error is caused by jets3

Re: Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Cheng Lian
If you don't need to interact with Hive, you may compile Spark without using the -Phive flag to eliminate Hive dependencies. In this way, the sqlContext instance in Spark shell will be of type SQLContext instead of HiveContext. The reason behind the Hive metastore error is probably due to Hive

Re: How to read compressed parquet file

2015-09-09 Thread Cheng Lian
You need to use "har://" instead of "hdfs://" to read HAR files. Just tested against Spark 1.5, and it works as expected. Cheng On 9/9/15 3:29 PM, 李铖 wrote: I think too many parquet files may be affect reading capability,so I use hadoop archive to combine them,but sql_context.read.parquet(ou
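
For example (a sketch; the archive path is made up):

    // Read Parquet files packed into a Hadoop archive via the har:// scheme.
    val df = sqlContext.read.parquet("har:///user/me/archives/combined.har/output")
    df.count()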

Re: Split content into multiple Parquet files

2015-09-08 Thread Cheng Lian
In Spark 1.4 and 1.5, you can do something like this: df.write.partitionBy("key").parquet("/datasink/output-parquets") BTW, I'm curious about how you did it without partitionBy using saveAsHadoopFile? Cheng On 9/8/15 2:34 PM, Adrien Mogenet wrote: Hi there, We've spent several hours to

Re: Parquet Array Support Broken?

2015-09-08 Thread Cheng Lian
if the file is created in Spark On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote: Read response from Cheng Lian <lian.cs@gmail.com> on Aug/27th - it looks like the same problem. Workarounds 1. write that parquet file in Spark;

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
ue: double (valueContainsNull = false) |-- imp2: map (nullable = true) ||-- key: string ||-- value: double (valueContainsNull = false) |-- imp3: map (nullable = true) ||-- key: string ||-- value: double (valueContainsNull = false) On Thu, Sep 3, 2015 at 11:27 PM, Cheng Lian <m

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Cheng Lian
Could you please provide the full stack trace of the OOM exception? Another common case of Parquet OOM is super wide tables, say hundreds or thousands of columns. And in this case, the number of rows is mostly irrelevant. Cheng On 9/4/15 1:24 AM, Kohki Nishio wrote: let's say I have a data li

Re: Group by specific key and save as parquet

2015-09-01 Thread Cheng Lian
Starting from Spark 1.4, you can do this via dynamic partitioning: sqlContext.table("trade").write.partitionBy("date").parquet("/tmp/path") Cheng On 9/1/15 8:27 AM, gtinside wrote: Hi , I have a set of data, I need to group by specific key and then save as parquet. Refer to the code snippet b

Re: Schema From parquet file

2015-09-01 Thread Cheng Lian
What exactly do you mean by "get schema from a parquet file"? - If you are trying to inspect Parquet files, parquet-tools can be pretty neat: https://github.com/Parquet/parquet-mr/issues/321 - If you are trying to get the Parquet schema as a Parquet MessageType, you may resort to readFooterX() and re
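
If Spark itself is at hand, a quick way to inspect the schema is simply (a sketch; the path is made up):

    // Load the Parquet file(s) and print the Spark SQL schema inferred from the footer.
    val df = sqlContext.read.parquet("/data/some_table")
    df.printSchema()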

Re: reading multiple parquet file using spark sql

2015-09-01 Thread Cheng Lian
sqlContext.read.parquet(file1, file2, file3) On 9/1/15 7:31 PM, Hafiz Mujadid wrote: Hi I want to read multiple parquet files using spark sql load method. just like we can pass multiple comma separated path to sc.textfile method. Is ther anyway to do the same ? Thanks -- View this message

Re: Array column stored as “.bag” in parquet file instead of “REPEATED INT64"

2015-08-27 Thread Cheng Lian
Hi Jim, Unfortunately this is neither possible in Spark nor a standard practice for Parquet. In your case, repeated int64 c1 actually doesn't capture the full semantics, because it represents a *required* array of long values containing zero or more *non-null* elements. However, when inferring sche

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread Cheng Lian
Could you please show jstack result of the hanged process? Thanks! Cheng On 8/26/15 10:46 PM, cingram wrote: I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Cheng Lian
You can apply a filter first to filter out data of needed dates and then append them. Cheng On 8/20/15 4:59 PM, Hemant Bhanawat wrote: How can I overwrite only a given partition or manually remove a partition before writing? I don't know if (and I don't think) there is a way to do that usin
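
A rough sketch of that filter-then-append idea (newData, the date column, the literal, and the path are all made up):

    import org.apache.spark.sql.SaveMode

    // Keep only the rows belonging to the partition(s) being written,
    // then append them to the partitioned Parquet dataset.
    val toWrite = newData.filter("date = '2015-08-20'")
    toWrite.write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet("/data/events")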

Re: Spark 1.3 + Parquet: "Skipping data using statistics"

2015-08-13 Thread Cheng Lian
On 8/13/15 6:11 AM, YaoPau wrote: I've seen this function referenced in a couple places, first this forum post and this talk by Michael Armbrust during the 42nd minute.

Re: Parquet without hadoop: Possible?

2015-08-12 Thread Cheng Lian
One thing to note is that, it would be good to add explicit file system scheme to the output path (i.e. "file:///var/..." instead of "/var/..."), esp. when you do have HDFS running. Because in this case the data might be written to HDFS rather than your local file system if Spark found Hadoop c
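
For example (a sketch, assuming df is the DataFrame being saved and the path is made up):

    // The explicit scheme makes sure the output goes to the local file system,
    // even when Hadoop/HDFS configuration is present on the classpath.
    df.write.parquet("file:///var/data/output-parquet")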

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-12 Thread Cheng Lian
initely worth to try. And you can sort the record before writing out, and then you will get the parquet files without overlapping keys. Let us know if that helps. Hao *From:*Philip Weaver [mailto:philip.wea...@gmail.com] *Sent:* Wednesday, August 12, 2015 4:05 AM *To:* Cheng Lian *Cc:* user

Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Cheng Lian
The conflicting metadata values warning is a known issue https://issues.apache.org/jira/browse/PARQUET-194 The option "parquet.enable.summary-metadata" is a Hadoop option rather than a Spark option, so you need to either add it to your Hadoop configuration file(s) or add it via `sparkContext.h

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Cheng Lian
It doesn't seem to be Parquet 1.7.0 since the package name isn't under "org.apache.parquet" (1.7.0 is the first official Apache release of Parquet). The version you were using is probably Parquet 1.6.0rc3 according to the line number information: https://github.com/apache/parquet-mr/blob/parque

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
sm is mysteriously low... Cheng On 8/7/15 3:32 PM, Cheng Lian wrote: Hi Philip, Thanks for providing the log file. It seems that most of the time are spent on partition discovery. The code snippet you provided actually issues two jobs. The first one is for listing the input directories to find ou

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
I may try to do what he did to construct a DataFrame manually, and see if I can query it with Spark SQL with reasonable performance. - Philip On Thu, Aug 6, 2015 at 8:37 AM, Cheng Lian <lian.cs@gmail.com> wrote: Would you mind providing the driver log? On 8/6/15 3:5

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Cheng Lian
<philip.wea...@gmail.com> wrote: Absolutely, thanks! On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs@gmail.com> wrote: We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 Could you give it a shot to see

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Cheng Lian
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 Could you give it a shot to see whether it helps in your case? We've observed ~50x performance boost with schema merging turned on. Cheng On 8/6/15 8:26 AM, Philip Weaver wrote: I have a parquet directory that was produce

Re: Parquet SaveMode.Append Trouble.

2015-08-04 Thread Cheng Lian
You need to import org.apache.spark.sql.SaveMode Cheng On 7/31/15 6:26 AM, satyajit vegesna wrote: Hi, I am new to using Spark and Parquet files, Below is what i am trying to do, on Spark-shell, val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") Hav

Re: Safe to write to parquet at the same time?

2015-08-04 Thread Cheng Lian
It should be safe for Spark 1.4.1 and later versions. Now Spark SQL adds a job-wise UUID to output file names to distinguish files written by different write jobs. So those two write jobs you gave should play well with each other. And the job committed later will generate a summary file for al

Re: Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Cheng Lian
Hi Jerry, Thanks for the detailed report! I haven't investigated this issue in detail. But for the input size issue, I believe this is due to a limitation of the HDFS API. It seems that Hadoop FileSystem adds the size of a whole block to the metrics even if you only touch a fraction of that block.

Re: Parquet writing gets progressively slower

2015-07-26 Thread Cheng Lian
tput committer. Cheng On 7/25/15 3:58 PM, Michael Kelly wrote: Thanks for the suggestion Cheng, I will try that today. Are there any implications when reading the parquet data if there are no summary files present? Michael On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian wrote: The time is pr

Re: Parquet writing gets progressively slower

2015-07-24 Thread Cheng Lian
The time is probably spent by ParquetOutputFormat.commitJob. While committing a successful write job, Parquet writes a pair of summary files, containing metadata like schema, user defined key-value metadata, and Parquet row group information. To gather all the necessary information, Parquet sca

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
e real problem here is that Spark SQL can’t prevent pushing down a predicate over an ENUM field since it sees the field as a normal string field. Would you mind filing a JIRA ticket? Cheng On 7/24/15 2:14 PM, Cheng Lian wrote: Could you please provide the full stack trace of the exception? And

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-24 Thread Cheng Lian
I don’t think this is a bug either. For an empty JSON array [], there’s simply no way to infer its actual data type, and in this case Spark SQL just tries to fill in the “safest” type, which is StringType, because basically you can cast any data type to StringType. In general, schema inf

Re: Partition parquet data by ENUM column

2015-07-23 Thread Cheng Lian
r(PrimitiveType: BINARY, OriginalType: ENUM) Valid types for this column are: null Is it because Spark does not recognize ENUM type in parquet? Best Regards, Jerry On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <lian.cs@gmail.com> wrote: On 7/22/15 9:03 AM, Ankit wrote:

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
g.com On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <lian.cs@gmail.com> wrote: Since Hive doesn’t support schema evolution, you’ll have to update the schema stored in metastore somehow. For example, you can create a new external table with the merged schema. Say you ha

Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully? Cheng On 6/25/15 5:52 PM, Anders Arpteg wrote: Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause fai

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
table' from SparkSQLCLI I won't see the new column being added. I understand that this is because Hive doesn't support schema evolution. So what is the best way to support CLI queries in this situation? Do I need to manually alter the table everytime the underlying schema changes? Th

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
p at newParquet.scala:573 but that is the same even with non partitioned data. Do you mean how to verify whether partition pruning is effective? You should be able to see log lines like this: 15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned 66.6666

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian
Hey Jerrick, What do you mean by "schema evolution with Hive metastore tables"? Hive doesn't take schema evolution into account. Could you please give a concrete use case? Are you trying to write Parquet data with extra columns into an existing metastore Parquet table? Cheng On 7/21/15 1:04

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the master branch. https://github.com/apache/spark/pull/7048 ENUM types are actually not in the Parquet format spec, that's why we didn't have it in the first place. Basically, ENUMs are always treated as UTF8 strings in Spa

Re: what is : ParquetFileReader: reading summary file ?

2015-07-17 Thread Cheng Lian
Yeah, Spark SQL Parquet support needs to do some metadata discovery when first importing a folder containing Parquet files, and the discovered metadata is cached. Cheng On 7/17/15 1:56 PM, shsh...@tsmc.com wrote: Hi all, our scenario is to generate lots of folders containing parquet file and t
