Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, whether or not we keep the hive-1.2 profile. On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian wrote: > Just to summarize my points: > >1. Let's still keep the Hive 1.2 depen

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now I'm inclined to have Hive 2.3 as the default version. Dongjoon, apologies if I didn't make it clear before. What made me concerned initially was only the following part: > can we remove the usage of forked `hive` in Apach

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
the hadoop-3.2 > profile. > > What do you mean by "only meaningful under the hadoop-3.2 profile"? > > On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian wrote: > >> Hey Steve, >> >> In terms of Maven artifact, I don't think the default Hadoop version

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
ult Hadoop/Hive versions in Spark 3.0, I personally do not have a preference as long as the above two are met. On Wed, Nov 20, 2019 at 3:22 PM Cheng Lian wrote: > Dongjoon, I don't think we have any conflicts here. As stated in other > threads multiple times, as long as Hive 2.3 a

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
0 end-users who really don't want to interact with this > Hive 1.2 fork, they can always use Hive 2.3 at their own risks. > > Specifically, what about having a profile `hive-1.2` at `3.0.0` with the > default Hive 2.3 pom at least? > How do you think about that way, Cheng? >

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix, I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0. However, *"Hive" and "Hive integration in Spark" are two quite different things*, and I don't think anybody has ever mentioned "the fork

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
h Hive 2.3` pre-built distribution, how do > you think about this, Sean? > The preparation is already started in another email thread and I believe > that is a keystone to prove `Hive 2.3` version stability > (which Cheng/Hyukjin/you asked). > > Bests, > Dongjoon. > > > On

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kind of like the Scala version upgrades. Historically, we only remove support for an older Scala version once the newer version has proven stable after one or more Spark minor versions. On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian wrote: > Hmm, what exactly did you mean by "rem

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
branch cut. >> >> I know that we have been reluctant to (1) and (2) due to their burden. >> But, it's time to prepare. Without them, we are going to be insufficient >> again and again. >> >> Bests, >> Dongjoon.

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring b

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here... Sean,

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Cc Yuming, Steve, and Dongjoon On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian wrote: > Similar to Xiao, my major concern about making Hadoop 3.2 the default > Hadoop version is quality control. The current hadoop-3.2 profile covers > too many major component upgrades, i.e.: > >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Similar to Xiao, my major concern about making Hadoop 3.2 the default Hadoop version is quality control. The current hadoop-3.2 profile covers too many major component upgrades, i.e.:
- Hadoop 3.2
- Hive 2.3
- JDK 11
We have already found and fixed some feature and performance regression

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Cheng Lian
+1 (binding) Passed all the tests, looks good. Cheng On 2/23/18 15:00, Holden Karau wrote: +1 (binding) PySpark artifacts install in a fresh Py3 virtual env On Feb 23, 2018 7:55 AM, "Denny Lee" wrote: +1 (non-binding) On Fri, Feb 23, 2018 at 07:08 J

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-15 Thread Cheng Lian
+1 On 10/12/17 20:10, Liwei Lin wrote: +1 ! Cheers, Liwei On Thu, Oct 12, 2017 at 7:11 PM, vaquar khan wrote: +1 Regards, Vaquar khan On Oct 11, 2017 10:14 PM, "Weichen Xu" wrote: +1 On Th

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-23 Thread Cheng Lian
17733 ? Just from t

Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Cheng Lian
Sorry for being late, I'm building a Spark branch based on the most recent master to test out 1.8.2-rc1, will post my result here ASAP. Cheng On 1/23/17 11:43 AM, Julien Le Dem wrote: Hi Spark dev, Here is the voting thread for parquet 1.8.2 release. Cheng or someone else we would appreciate

Re: Parquet patch release

2017-01-09 Thread Cheng Lian
Finished reviewing the list and it LGTM now (left comments in the spreadsheet and Ryan already made corresponding changes). Ryan - Thanks a lot for pushing this and making it happen! Cheng On 1/6/17 3:46 PM, Ryan Blue wrote: Last month, there was interest in a Parquet patch release on PR #162

Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian
JIRA: https://issues.apache.org/jira/browse/SPARK-18403 PR: https://github.com/apache/spark/pull/15845 Will merge it as soon as Jenkins passes. Cheng On 11/10/16 11:30 AM, Dongjoon Hyun wrote: Great! Thank you so much, Cheng! Bests, Dongjoon. On 2016-11-10 11:21 (-0800), Cheng Lian wrote

Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Cheng Lian
Hey Dongjoon, Thanks for reporting. I'm looking into these OOM errors. Already reproduced them locally but haven't figured out the root cause yet. Gonna disable them temporarily for now. Sorry for the inconvenience! Cheng On 11/10/16 8:48 AM, Dongjoon Hyun wrote: Hi, All. Recently, I obs

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Cheng Lian
Congratulations!!! Cheng On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin wrote: > Hi all, > > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark > committer. Xiao has been a super active contributor to Spark SQL. Congrats > and welcome, Xiao! > > - Reynold > >

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
today. Should I take discussion to your PR? Pedro On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian wrote: Hey Pedro, SQL programming guide is being updated. Here's the PR, but not merged yet: https://github.com/apache/spark/pull/13592 Chen

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
Hey Pedro, SQL programming guide is being updated. Here's the PR, but not merged yet: https://github.com/apache/spark/pull/13592 Cheng On 6/17/16 9:13 PM, Pedro Rodriguez wrote: Hi All, At my workplace we are starting to use Datasets in 1.6.1 and even more with Spark 2.0 in place of Datafr

Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian
Awesome! Congrats and welcome!! Cheng On Tue, Feb 9, 2016 at 2:55 AM, Shixiong(Ryan) Zhu wrote: > Congrats!!! Herman and Wenchen!!! > > > On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende > wrote: > >> >> >> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia >> wrote: >> >>> Hi all, >>> >>> The PMC

Re: Welcoming two new committers

2016-02-17 Thread Cheng Lian
Awesome! Congrats and welcome!! On 2/9/16 2:55 AM, Shixiong(Ryan) Zhu wrote: Congrats!!! Herman and Wenchen!!! On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende wrote: On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia wrote:

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-26 Thread Cheng Lian
+1 On 12/23/15 12:39 PM, Yin Huai wrote: +1 On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee wrote: +1 On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson wrote: +1 On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen

Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-10 Thread Cheng Lian
Hi Shane, I found that Jenkins has been in the status of "Jenkins is going to shut down" for at least 4 hours (from ~23:30 Dec 9 to 3:45 Dec 10, PDT). Not sure whether this is part of the schedule or related? Cheng On Thu, Dec 10, 2015 at 3:56 AM, shane knapp wrote: > here's the security advis

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here, and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but missin

Re: [ compress in-memory column storage used in sparksql cache table ]

2015-09-02 Thread Cheng Lian
Yeah, two of the reasons why the built-in in-memory columnar storage doesn't achieve a compression ratio comparable to Parquet's are: 1. The in-memory columnar representation doesn't handle nested types, so array/map/struct values are not compressed. 2. Parquet may use more than one kind of compres

Re: possible issues with listing objects in the HadoopFSrelation

2015-08-12 Thread Cheng Lian
Hi Gil, Sorry for the late reply and thanks for raising this question. The file listing logic in HadoopFsRelation is intentionally made different from Hadoop FileInputFormat. Here are the reasons: 1. Efficiency: when computing RDD partitions, FileInputFormat.listStatus() is called on the dri

Deleted unreleased version 1.6.0 from JIRA by mistake

2015-07-22 Thread Cheng Lian
Hi all, The unreleased version 1.6.0 was removed from JIRA by my mistake. I've added it back, but JIRA tickets that once targeted 1.6.0 now have an empty target version/s. If you find tickets that should have targeted 1.6.0, please help mark the target version/s field back

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-12 Thread Cheng Lian
download from driver and set up classpath, right? But somehow, the first step fails. Even if I can make the first step work (use option 1), it seems that the classpath in the driver is not correctly set. Thanks Dong Lei

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
driver will not need to set up an HTTP file server for this scenario, and the worker will fetch the jars and files from HDFS? Thanks Dong Lei

Re: How to support dependency jars and files on HDFS in standalone cluster mode?

2015-06-10 Thread Cheng Lian
Since the jars are already on HDFS, you can access them directly in your Spark application without using --jars Cheng On 6/11/15 11:04 AM, Dong Lei wrote: Hi spark-dev: I can not use a hdfs location for the “--jars” or “--files” option when doing a spark-submit in a standalone cluster mode.
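The suggestion above can be sketched as follows. This is a hedged illustration only — the app name and paths are hypothetical, and it assumes a 1.x-era SparkContext:

```scala
// Sketch: instead of passing --jars to spark-submit, reference the HDFS
// copies directly from the application. All paths here are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("hdfs-jars-sketch")
val sc = new SparkContext(conf)

// addJar/addFile accept hdfs:// URIs, so executors fetch the files from
// HDFS instead of relying on the driver's HTTP file server.
sc.addJar("hdfs:///user/example/libs/my-dependency.jar")
sc.addFile("hdfs:///user/example/conf/extra.properties")
```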

Re: About akka used in spark

2015-06-10 Thread Cheng Lian
We only shaded protobuf dependencies because of compatibility issues. The source code is not modified. On 6/10/15 1:55 PM, wangtao (A) wrote: Hi guys, I see group id of akka used in spark is “org.spark-project.akka”. What is its difference with the typesafe one? What is its version? And whe

Re: Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

2015-04-14 Thread Cheng Lian
Would you mind opening a JIRA for this? I think your suspicion makes sense. Will have a look at this tomorrow. Thanks for reporting! Cheng On 4/13/15 7:13 PM, zhangxiongfei wrote: Hi experts I run the code below in the Spark Shell to access parquet files in Tachyon. 1. First, created a DataFrame by l

Re: Spark ThriftServer encounter java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-14 Thread Cheng Lian
Yeah, SQL is the right component. Thanks! Cheng On 4/14/15 12:47 AM, Andrew Lee wrote: Hi Cheng, I couldn't find the component for Spark ThriftServer, will that be the 'SQL' component? JIRA created. https://issues.apache.org/jira/browse/SPARK-6882

Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian
Thanks for reporting this! Would you mind opening JIRA tickets for both Spark and Parquet? I'm not sure whether Parquet declares somewhere that the user mustn't reuse byte arrays when using the binary type. If it does, then it's a Spark bug. Either way, this should be fixed. Cheng On 4/12/15 1:50 PM, Yi

Re: IntelliJ Runtime error

2015-04-04 Thread Cheng Lian
I found in general it's a pain to build/run Spark inside IntelliJ IDEA. I guess most people resort to this approach so that they can leverage the integrated debugger to debug and/or learn Spark internals. A more convenient way I'm using recently is resorting to the remote debugging feature. In

Re: [sql] How to uniquely identify Dataframe?

2015-03-30 Thread Cheng Lian
This is because unlike SchemaRDD, DataFrame itself is no longer an RDD now. In the meanwhile, DataFrame.rdd is a function, which always returns a new RDD. I think you may use DataFrame.queryExecution.logical (the logical plan) as an ID. Maybe we should make it a "lazy val" rather than a "def".
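A small sketch of the idea, assuming a spark-shell session with a Spark 1.3-era sqlContext in scope; the table name is hypothetical:

```scala
// DataFrame.rdd is a def that returns a fresh RDD on every call, so it
// cannot serve as an identity. The logical plan, by contrast, is stable.
def planKey(df: org.apache.spark.sql.DataFrame) = df.queryExecution.logical

val df = sqlContext.table("people")   // hypothetical table
val derived = df.filter("age > 21")

planKey(df) == planKey(df)        // same DataFrame, same plan
planKey(df) == planKey(derived)   // derived DataFrame, different plan
```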

Re: Lazy casting with Catalyst

2015-03-28 Thread Cheng Lian
op of that RDD it seemed to bring in the whole row. What version of Spark SQL are you using? Would you mind providing a brief snippet that can reproduce this issue? This might be a bug depending on your concrete usage. Thanks in advance! Thanks! -Pat On Sat, Mar 28, 2015 at 11:35 AM,

Re: Lazy casting with Catalyst

2015-03-28 Thread Cheng Lian
Hi Pat, I don't understand what "lazy casting" means here. Why do you think the current Catalyst casting is "eager"? Casting happens at runtime, and doesn't disable column pruning. Cheng On 3/28/15 11:26 PM, Patrick Woody wrote: Hi all, In my application, we take input from Parquet files where

Re: Support for Hive 0.14 in secure mode on hadoop 2.6.0

2015-03-27 Thread Cheng Lian
We're planning to replace the current Hive version profiles and shim layer with an adaption layer in Spark SQL in 1.4. This adaption layer allows Spark SQL to connect to arbitrary Hive version greater than or equal to 0.12.0 (or maybe 0.13.1, not decided yet). However, it's not a promise yet,

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Cheng Lian
jobs rely on the same ShuffledRDD. I think only shuffle writes, which generate shuffle files, have a chance to meet name conflicts; multiple shuffle reads are acceptable, as the code snippet shows. Thanks Jerry

Re: Understanding shuffle file name conflicts

2015-03-25 Thread Cheng Lian
Hi Jerry & Josh It has been a while since the last time I looked into Spark core shuffle code, maybe I’m wrong here. But the shuffle ID is created along with ShuffleDependency, which is part of the RDD DAG. So if we submit multiple jobs over the same RDD DAG, I think the shuffle IDs in these

Re: parquet support - some questions about code

2015-03-18 Thread Cheng Lian
Hey Gil, ParquetRelation2 is based on the external data sources API, which is a more modular and non-intrusive way to add external data sources to Spark SQL. We are planning to replace ParquetRelation with ParquetRelation2 entirely after the latter is more mature and stable. That's why you see

Re: Wrong version on the Spark documentation page

2015-03-16 Thread Cheng Lian
Cheng - what if you hold shift+refresh? For me the /latest link correctly points to 1.3.0 On Sun, Mar 15, 2015 at 10:40 AM, Cheng Lian wrote: > It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/ >

Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian
Oh sorry, I misread your question. I thought you were trying something like parquetFile("s3n://file1,hdfs://file2"). Yeah, it's a valid bug. Thanks for opening the JIRA ticket and the PR! Cheng On 3/16/15 6:39 PM, Cheng Lian wrote: Hi Pei-Lun, We intentionally disallowed passing

Re: SparkSQL 1.3.0 cannot read parquet files from different file system

2015-03-16 Thread Cheng Lian
Hi Pei-Lun, We intentionally disallowed passing multiple comma separated paths in 1.3.0. One of the reasons is that users reported that this fails when a file path contains an actual comma in it. In your case, you may do something like this: val s3nDF = parquetFile("s3n://...") val hdfsDF =
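The truncated snippet above can be sketched out as follows (hedged: the paths are placeholders, and it assumes a Spark 1.3 sqlContext where parquetFile and unionAll are available):

```scala
// Workaround for the removed comma-separated path support: load each
// filesystem's data separately, then union the two DataFrames.
val s3nDF  = sqlContext.parquetFile("s3n://bucket/events/")   // placeholder path
val hdfsDF = sqlContext.parquetFile("hdfs:///data/events/")   // placeholder path
val allDF  = s3nDF.unionAll(hdfsDF)  // schemas must line up for unionAll
```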

Wrong version on the Spark documentation page

2015-03-15 Thread Cheng Lian
It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/ But this page is updated (1.3.0) http://spark.apache.org/docs/latest/index.html Cheng

Re: Spark ThriftServer encounter java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-03-15 Thread Cheng Lian
Hey Andrew, Would you please create a JIRA ticket for this? To preserve compatibility with existing Hive JDBC/ODBC drivers, Spark SQL's HiveThriftServer intercepts some HiveServer2 components and injects Spark stuff into them. This makes the implementation details somewhat hacky (e.g. a bun

Re: number of partitions for hive schemaRDD

2015-02-26 Thread Cheng Lian
Hi Masaki, I guess what you saw is the partition number of the last stage, which must be 1 to perform the global phase of LIMIT. To tune partition number of normal shuffles like joins, you may resort to spark.sql.shuffle.partitions. Cheng On 2/26/15 5:31 PM, masaki rikitoku wrote: Hi all
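For illustration, a hedged sketch of the tuning knob mentioned above (table names are hypothetical, API is the 1.x-era SQLContext):

```scala
// spark.sql.shuffle.partitions controls the partition count of shuffles
// introduced by joins and aggregations; the global LIMIT stage still uses
// a single partition regardless of this setting.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
val joined = sqlContext.sql(
  "SELECT l.id, u.name FROM logs l JOIN users u ON l.uid = u.uid")  // hypothetical tables
```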

Re: Spark SQL, Hive & Parquet data types

2015-02-23 Thread Cheng Lian
Ah, sorry for not being clear enough. So now in Spark 1.3.0, we have two Parquet support implementations, the old one is tightly coupled with the Spark SQL framework, while the new one is based on data sources API. In both versions, we try to intercept operations over Parquet tables registered

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, I had once fixed all Hive 12 test failures in PR #4107, but didn't get time to get it merged. Considering the release is close, I can cherry-pick those Hive 12 fixes from #4107 and open a more surgical PR soon. Cheng On 2/24/15 4:18 AM, Michael Armbrust wrote: On Sun, Feb 22, 2015 at

Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using saveAsParquetFile. As for partition number, you can always control it with the spark.sql.shuffle.partitions property. Cheng On 2/23/15 1:38 PM, nitin wrote: I believe calling processedSchemaRdd.persist(DISK) and processedSchemaRdd.checkpoint() only persists data and I will lose

Re: Spark SQL, Hive & Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in Hive metastore. Only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to tr

Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table first before caching it? So that you only need to cache the result table after restarting your service without recomputing it. Somewhat like checkpointing. Cheng On 2/22/15 12:55 AM, nitin wrote: Hi All, I intend to build a long running spark ap
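A sketch of the suggested pattern, with hypothetical paths and table names, assuming Spark 1.2-era APIs (saveAsParquetFile, registerTempTable, cacheTable):

```scala
// First run: materialize the expensive result to Parquet once.
val result = sqlContext.sql("SELECT ...")   // expensive computation, elided in the thread
result.saveAsParquetFile("hdfs:///checkpoints/result.parquet")

// After a service restart: reload and cache without recomputing.
val restored = sqlContext.parquetFile("hdfs:///checkpoints/result.parquet")
restored.registerTempTable("result")
sqlContext.cacheTable("result")
```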

Re: Spark SQL, Hive & Parquet data types

2015-02-20 Thread Cheng Lian
For the second question, we do plan to support Hive 0.14, possibly in Spark 1.4.0. For the first question: 1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp type, so you can’t. 2. In Spark 1.3.0, timestamp support was added, also Spark SQL uses its own Parquet support

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble... Cheng On 2/2/15 2:03 PM, ankits wrote: Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guid

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually SchemaRDD.cache() behaves exactly the same as cacheTable since Spark 1.2.0. The reason why your web UI didn't show you the cached table is that both cacheTable and sql("SELECT ...") are lazy :-) Simply add a .collect() after the sql(...) call. Cheng On 2/2/15 12:23 PM, an
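The laziness point can be illustrated with a short hedged sketch (hypothetical table, 1.2-era API):

```scala
// Both cacheTable and sql(...) are lazy: nothing is scanned or cached yet.
sqlContext.cacheTable("people")                       // hypothetical table
val rows = sqlContext.sql("SELECT name FROM people")  // still lazy

// Only an action materializes the cache, after which the web UI shows it.
rows.collect()
```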

Re: Get size of rdd in memory

2015-01-30 Thread Cheng Lian
Here is a toy spark-shell session snippet that can show the memory consumption difference:
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
setConf("spark.sql.shuffle.partitions", "1")
case class KV(key: Int, value: String) p

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar. Cheng On 1/29/15 1:24 PM, Koert Kuipers wrote: to me the word DataFrame does come with certain expectations. one of them is

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here <https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala>. On 1/29/15 1:59 PM, Cheng Lian wrote: Yes, when a DataFrame is cached in

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Cheng Lian
Hi Aniket, In general the schema of all rows in a single table must be same. This is a basic assumption made by Spark SQL. Schema union does make sense, and we're planning to support this for Parquet. But as you've mentioned, it doesn't help if types of different versions of a column differ fr

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: [SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Cheng Lian
Talked with Yi offline. Personally I think this feature is pretty useful, the design makes sense, and he's already got a running prototype. Yi, would you mind opening a PR for this? Thanks! Cheng On 1/6/15 5:25 PM, Yi Tian wrote: Hi, all I have created a JIRA ticket about adding a monito

Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
Should emphasize that this is still a quick and rough conclusion; we will investigate this in more detail after the 1.2.0 release. Anyway, we'd really like to make Hive support in Spark SQL as smooth and clean as possible for both developers and end users. On 11/22/14 11:05 PM, Cheng Lian wrote

Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
Hey Zhan, This is a great question. We are also seeking for a stable API/protocol that works with multiple Hive versions (esp. 0.12+). SPARK-4114 was opened for this. Did some research into HCatalog recently, but I must confess that I’m not a

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-14 Thread Cheng Lian
+1 Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues are fixed. Hive version inspection works as expected. On 11/15/14 8:25 AM, Zach Fry wrote: +0 I expect to start testing on Monday but won't have enough results to change my vote from +0 until Monday night or Tuesday mo

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote: Currently there

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there's no way to cache the compressed sequence file directly. Spark SQL uses an in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting spark.sql.inMemo
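A hedged sketch of enabling that setting (the table name is hypothetical; the property is the 1.x-era spark.sql.inMemoryColumnarStorage.compressed):

```scala
// Enable compression for Spark SQL's in-memory columnar cache, then cache
// a table; its rows are converted into (now compressed) columnar batches.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("my_table")                          // hypothetical table
sqlContext.sql("SELECT COUNT(*) FROM my_table").collect()  // materializes the cache
```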

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Cheng Lian
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414084656759_0142 On Mon, Nov 10, 2014 at 9:59 PM, Cheng Lian wrote:

Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian
Hey Sadhan, I really don't think this is a Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of HiveServer

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng Lian
+1 since this is already the de facto model we are using. On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wrote: > +1 > > Sent from my iPhone > > > On Nov 5, 2014, at 20:06, "Denny Lee" wrote: > > > > +1 great idea. > >> On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng wrote: > >> > >> +1 (binding) > >> > >> On Wed, Nov

Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Cheng Lian
I often see this when I first build the whole Spark project with SBT, then modify some code and try to build and debug within IDEA, or vice versa. A clean rebuild always solves this. On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell wrote: > Does this happen if you clean and recompile? I've

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
e-0.13.1" profiles. I was able to run spark core tests from within IntelliJ. Didn't try anything beyond that, but FWIW this worked. - Patrick On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian wrote: You may first open the root pom.xml file in IDEA, and then go for menu View / Tool Windows /

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
2014 at 9:42 PM, Stephen Boesch wrote: I am interested specifically in how to build (and hopefully run/debug..) under Intellij. Your posts sound like command line maven - which has always been working already. Do you have instructions for building in IJ? 2014-10-28 21:38 GMT-07:00 Cheng Lian : Ye

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
nd like command line maven - which has always been working already. Do you have instructions for building in IJ? 2014-10-28 21:38 GMT-07:00 Cheng Lian : Yes, these two combinations work for me. On 10/29/14 12:32 PM, Zhan Zhang wrote:

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Cheng Lian
Yes, these two combinations work for me. On 10/29/14 12:32 PM, Zhan Zhang wrote: -Phive is to enable hive-0.13.1 and "-Phive -Phive-0.12.0" is to enable hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13, but is expected to go upstream soon (SPARK-3720). Thanks. Zhan

Re: best IDE for scala + spark development?

2014-10-28 Thread Cheng Lian
My two cents for Mac Vim/Emacs users. Fixed a Scala ctags Mac compatibility bug months ago, and you may want to use the most recent version here https://github.com/scala/scala-dist/blob/master/tool-support/src/emacs/contrib/dot-ctags On Tue, Oct 28, 2014 at 4:26 PM, Duy Huynh wrote: > thanks e

Re: HiveContext bug?

2014-10-28 Thread Cheng Lian
Hi Marcelo, yes this is a known Spark SQL bug and we've got PRs to fix it (2887 & 2967). Not merged yet because newly merged Hive 0.13.1 support causes some conflicts. Thanks for reporting this :) On Tue, Oct 28, 2014 at 6:41 AM, Marcelo Vanzin wrote: > Well, looks like a huge coincidence, but t

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
It's a new pull request builder written by Josh, integrated into our state-of-the-art PR dashboard :) On 10/21/14 9:33 PM, Nan Zhu wrote: just curious…what is this “NewSparkPullRequestBuilder”? Best, -- Nan Zhu On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote: Hm, seems that

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Cheng Lian
Hm, seems that 7u71 comes back again. Observed similar Kinesis compilation error just now: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. However, /usr/bin/javac -version says:

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Cheng Lian
The foreign data source API PR also matters here: https://www.github.com/apache/spark/pull/2475 Foreign data sources like ORC can be added more easily and systematically after this PR is merged. On 10/9/14 8:22 AM, James Yu wrote: Thanks Mark! I will keep an eye on it. @Evan, I saw people use bo

Re: HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Cheng Lian
Cache table works with partitioned tables. I guess you're experimenting with a default local metastore and the metastore_db directory doesn't exist in the first place. In this case, all metastore tables/views don't exist at first and will throw the error message you saw when the PARTITIONS me

Re: Extending Scala style checks

2014-10-01 Thread Cheng Lian
Since we can easily obtain the list of all changed files in a PR, I think we can start by adding the no-trailing-space check for newly changed files only? On 10/2/14 9:24 AM, Nicholas Chammas wrote: Yeah, I remember that hell when I added PEP 8 to the build checks and fixed all the outstandin

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together with the query you tried? The stacktrace suggests that the query was trying to cast a map into something else, which is not supported in Spark SQL. And I doubt whether Hive supports casting a complex type to some other type.

Re: Question about SparkSQL and Hive-on-Spark

2014-09-24 Thread Cheng Lian
I don’t think so. For example, we’ve already added extended syntax like CACHE TABLE. ​ On Wed, Sep 24, 2014 at 3:27 PM, Yi Tian wrote: > Hi Reynold! > > Will sparkSQL strictly obey the HQL syntax ? > > For example, the cube function. > > In other words, the hiveContext of sparkSQL should only im
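For context, CACHE TABLE is a Spark SQL-specific statement rather than standard HiveQL, which is the extended-syntax point being made above. A minimal usage sketch, with a hypothetical table name:

```sql
-- Spark SQL extension (not HiveQL); `logs` is a hypothetical table name.
CACHE TABLE logs;
-- ... run queries against the cached table ...
UNCACHE TABLE logs;
```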

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Cheng Lian
+1. Tested locally on OS X 10.9, built with Hadoop 2.4.1:
- Checked Datanucleus jar files
- Tested Spark SQL Thrift server and CLI under local mode and a standalone cluster against a MySQL-backed metastore
On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen wrote: > +1. Tested on Windows and EC2. Confir

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Cheng Lian
+1
- Tested Thrift server and SQL CLI locally on OS X 10.9.
- Checked datanucleus dependencies in the distribution tarball built by make-distribution.sh without SPARK_HIVE defined.
On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wrote: > +1 > > Tested Scala/MLlib apps on Fedora 20 (OpenJDK

Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Cheng Lian
Welcome Shane! Glad to see that finally a hero jumping out to tame Jenkins :) On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra wrote: > Welcome Shane =) > > > - Henry > > On Tue, Sep 2, 2014 at 10:35 AM, shane knapp wrote: > > so, i had a meeting w/the databricks guys on friday and they recommen

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) > wrote: > > Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) > > Maybe we should add a "developer notes" page to document all these useful > black magic. > > > On Tue, Sep 2, 2

Re: about spark assembly jar

2014-09-02 Thread Cheng Lian
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a "developer notes" page to document all these useful black magic. On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin wrote: > Having a SSD help tremendously with assembly time. > > Without that, you can do the following

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Cheng Lian
Just noticed one thing: although --with-hive is deprecated by -Phive, make-distribution.sh still relies on $SPARK_HIVE (which was controlled by --with-hive) to determine whether to include datanucleus jar files. This means we have to do something like SPARK_HIVE=true ./make-distribution.sh ... to e

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-28 Thread Cheng Lian
+1. Tested Spark SQL Thrift server and CLI against a single-node standalone cluster. On Thu, Aug 28, 2014 at 9:27 PM, Timothy Chen wrote: > +1 Make-distribution works, and also tested simple spark jobs on Spark > on Mesos on 8 node Mesos cluster. > > Tim > > On Thu, Aug 28, 2014 at 8:53 PM, Bura

Re: Jira tickets for starter tasks

2014-08-28 Thread Cheng Lian
You can just start the work :) On Thu, Aug 28, 2014 at 3:52 PM, Bill Bejeck wrote: > Hi, > > How do I get a starter task jira ticket assigned to myself? Or do I just do > the work and issue a pull request with the associated jira number? > > Thanks, > Bill >

Re: deleted: sql/hive/src/test/resources/golden/case sensitivity on windows

2014-08-28 Thread Cheng Lian
A colon is not allowed in a Windows file name, and I think Git just cannot create this file while cloning. Remove the colon in the name string of this test case.
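A small illustration of the constraint described above: stripping the characters Windows forbids in file names. The character set is an assumption based on the Windows naming rules, and the sanitizer is a sketch, not code from Spark:

```python
import re

# Characters Windows disallows in file names (assumed from the Windows
# naming rules; the colon is the one that broke this golden test file).
WINDOWS_FORBIDDEN = r'[<>:"/\\|?*]'

def sanitize_for_windows(name: str) -> str:
    return re.sub(WINDOWS_FORBIDDEN, "", name)

print(sanitize_for_windows("case sensitivity: Hive table"))
# → case sensitivity Hive table
```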

Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-27 Thread Cheng Lian
I believe in your case, the “magic” happens in TableReader.fillObject. Here we unwrap the field value according to the object inspector of that f
