Re: wait time between start master and start slaves

2015-04-14 Thread Nicholas Chammas
For the record, this is what I came up with (ignoring the configurable port for now): spark/sbin/start-master.sh master_ui_response_code=0 while [ "$master_ui_response_code" -ne 200 ]; do sleep 1 master_ui_response_code="$( curl --head --silent --output /dev/null \ --
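
For illustration only, a minimal Scala sketch of the same poll-until-200 idea (the master UI address and sleep interval are placeholder values, not taken from the script above):

    import java.net.{HttpURLConnection, URL}

    // Block until the master web UI answers with HTTP 200, then return.
    def waitForMasterUi(uiUrl: String = "http://localhost:8080"): Unit = {
      var code = 0
      while (code != 200) {
        code = try {
          val conn = new URL(uiUrl).openConnection().asInstanceOf[HttpURLConnection]
          conn.setRequestMethod("HEAD")
          conn.setConnectTimeout(1000)
          val c = conn.getResponseCode
          conn.disconnect()
          c
        } catch {
          case _: java.io.IOException => 0 // master not up yet
        }
        if (code != 200) Thread.sleep(1000)
      }
    }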

saveAsTextFile and tmp files generations in tasks

2015-04-14 Thread Gil Vernik
Hi, I ran a very simple operation via ./spark-shell (version 1.3.0): val data = Array(1, 2, 3, 4) val distd = sc.parallelize(data) distd.saveAsTextFile(.. ) When I executed it, I saw that 4 tasks were created in Spark. Each task created 2 temp files at different stages; there was a 1st tmp file (
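
For reference, the operation described above in full (the output path below is a placeholder). The per-task temp files are typically Hadoop's output committer staging each task's part file under a _temporary directory before moving it into place on commit:

    // spark-shell, Spark 1.3.x; "/tmp/distd-out" is a hypothetical output path
    val data = Array(1, 2, 3, 4)
    val distd = sc.parallelize(data, 4)    // 4 partitions -> 4 tasks, matching the report above
    distd.saveAsTextFile("/tmp/distd-out") // each task stages its part file under _temporary, then commits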

Question regarding some of the changes in [SPARK-3477]

2015-04-14 Thread Chester Chen
While working on upgrading to Spark 1.3.x, I noticed that the Client and ClientArguments classes in the yarn module are now defined as private[spark]. I know this code is mostly used by the spark-submit code, but we call the YARN Client directly (without going through spark-submit) in our Spark integration
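
For readers unfamiliar with the modifier in question: private[spark] limits access to code compiled under the org.apache.spark package, so direct external callers no longer compile. A small self-contained illustration of the scoping rule (generic names, not the actual Spark classes):

    package org.apache.spark.example {
      // Visible to anything under org.apache.spark, invisible outside it.
      private[spark] class InternalClient {
        def submit(): Unit = println("submitted")
      }
    }

    package com.mycompany.integration {
      object Launcher {
        def main(args: Array[String]): Unit = {
          // new org.apache.spark.example.InternalClient  // does not compile: not under org.apache.spark
          println("external code must go through a public API (or a shim compiled under org.apache.spark)")
        }
      }
    }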

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-14 Thread Patrick Wendell
I'd like to close this vote to coincide with the 1.3.1 release; however, it would be great to have more people test this release first. I'll leave it open for a bit longer and see if others can give a +1. On Tue, Apr 14, 2015 at 9:55 PM, Patrick Wendell wrote: > +1 from me as well. > > On Tue, A

[RESULT] [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-14 Thread Patrick Wendell
This vote passes with 10 +1 votes (5 binding) and no 0 or -1 votes. +1: Sean Owen* Reynold Xin* Krishna Sankar Denny Lee Mark Hamstra* Sean McNamara* Sree V Marcelo Vanzin GuoQiang Li Patrick Wendell* 0: -1: I will work on packaging this release in the next 48 hours. - Patrick ---

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-14 Thread Patrick Wendell
+1 from me as well. On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen wrote: > I think that's close enough for a +1: > > Signatures and hashes are good. > LICENSE, NOTICE still check out. > Compiles for a Hadoop 2.6 + YARN + Hive profile. > > JIRAs with target version = 1.2.x look legitimate; no blocker

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-14 Thread Patrick Wendell
+1 from myself as well. On Mon, Apr 13, 2015 at 8:35 PM, GuoQiang Li wrote: > +1 (non-binding) > > > -- Original -- > From: "Patrick Wendell"; > Date: Sat, Apr 11, 2015 02:05 PM > To: "dev@spark.apache.org"; > Subject: [VOTE] Release Apache Spark 1.3.1 (RC3) >

Re:Re: Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

2015-04-14 Thread zhangxiongfei
JIRA opened: https://issues.apache.org/jira/browse/SPARK-6921 At 2015-04-15 00:57:24, "Cheng Lian" wrote: >Would you mind opening a JIRA for this? > >I think your suspicion makes sense. Will have a look at this tomorrow. >Thanks for reporting! > >Cheng > >On 4/13/15 7:13 PM, zhangxiongfei w

Re: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Imran Rashid
These are great questions -- I dunno the answer to most of them, but I'll try to at least give my take on "What should be rejected and why?" For new features, I'm often really confused by our guidelines on what to include and what to exclude. Maybe we should ask that all new features make it clea

Re: DDL parser class parsing DDL in spark-sql cli

2015-04-14 Thread JaeSung Jun
Thanks Michael, I was wondering how HiveContext.sql() is hooked up to HiveQL. I'll have a look at it. Much appreciated. Thanks, Jason On 15 April 2015 at 04:15, Michael Armbrust wrote: > HiveQL >

Re: Catching executor exception from executor in driver

2015-04-14 Thread Imran Rashid
(+dev) Hi Justin, short answer: no, there is no way to do that. I'm just guessing here, but I imagine this was done to eliminate serialization problems (e.g., what if we got an error trying to serialize the user exception to send from the executors back to the driver?). Though, actually that isn'
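
To make the "short answer" concrete, the pattern that is available today is catching the driver-side SparkException and inspecting its message, which carries the executor-side stack trace only as text (minimal sketch; the failing job is made up):

    import org.apache.spark.SparkException

    try {
      sc.parallelize(1 to 10).map { i =>
        if (i == 5) throw new IllegalStateException("boom") // fails inside an executor
        i
      }.count()
    } catch {
      case e: SparkException =>
        // The original exception type is not re-thrown on the driver;
        // its stack trace is only available as part of the message string.
        println("Job failed: " + e.getMessage)
    }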

Re: start-slaves.sh uses local path from master on remote slave nodes

2015-04-14 Thread Sree V
https://issues.apache.org/jira/browse/SPARK-967 Hi Team, the reporter hasn't replied to the suggested change for this issue, and a workaround has also been suggested. Given the change or the workaround, can we close this issue? Thanking you. With Regards, Sree

Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-14 Thread shane knapp
yep, everything is installed (and i just checked again). the path for python 3.4 is /home/anaconda/bin/envs/py3k/bin, which you can pick up by either manually prepending it to the PATH variable or running 'source activate py3k' in the test. On Tue, Apr 14, 2015 at 11:41 AM, Davies Liu wrote: > Hey

Re: Regularization in MLlib

2015-04-14 Thread DB Tsai
Hi Theodore, I'm currently working on elastic-net regression in the ML framework, and I decided not to add any extra layer of abstraction for now but to focus on accuracy and performance. We may come up with a proper solution later. Any idea is welcome. Sincerely, DB Tsai --
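
For context, the way regularization is configured in the existing MLlib API today is through the optimizer's updater, roughly as below (a hedged sketch; parameter values are arbitrary):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.optimization.L1Updater

    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setUpdater(new L1Updater)    // or SquaredL2Updater for L2 regularization
      .setRegParam(0.1)
      .setNumIterations(100)
    // val model = lr.run(trainingData)  // trainingData: RDD[LabeledPoint], not shown here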

Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-14 Thread Davies Liu
Hey Shane, Have you updated all the jenkins slaves? There is a run with old configurations (no Python 3, with 130 minutes timeout), see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/666/consoleFull Davies On Thu, Apr 9, 2015 at 10:18 AM, shane knapp wrote: > ok, we're

Re: Eliminate partition filters in execution.Filter after filter pruning

2015-04-14 Thread Michael Armbrust
The contract of the DataSources API is that filters are advisory and you are allowed to ignore them. This is why we always evaluate them ourselves. Have you benchmarked your chan
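
For readers following along, the contract in question is the PrunedFilteredScan interface in org.apache.spark.sql.sources; a skeletal sketch (class name and schema are made up) of how the advisory filters arrive at a data source:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    class ExampleRelation(@transient val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("id", StringType), StructField("event", StringType)))

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // `filters` is advisory: the source may apply some, all, or none of them.
        // Rows that slip through are filtered again by Spark SQL's own Filter operator.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }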

Re: DDL parser class parsing DDL in spark-sql cli

2015-04-14 Thread Michael Armbrust
HiveQL. On Tue, Apr 14, 2015 at 7:13 AM, JaeSung Jun wrote: > Hi, > > While I've been walking through spark-sql source code, I typed the following > HiveQL: > > CREATE EXTERNAL TABLE user (

Re: Spark Sql reading hive partitioned tables?

2015-04-14 Thread Michael Armbrust
We can try to add this as part of some hive refactoring we are doing for 1.4. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-6910 On Tue, Apr 14, 2015 at 9:58 AM, Cheolsoo Park wrote: > Is there a plan to fix this? I also ran into this issue with a *"select * > from tbl where

Re: Using memory mapped file for shuffle

2015-04-14 Thread Kannan Rajah
Sandy, Can you clarify how it won't cause OOM? Is it in any way related to memory being allocated outside the heap, i.e. native space? The reason I ask is that I have a use case to store shuffle data in HDFS. Since there is no notion of memory-mapped files there, I need to store it as a byte buffer. I want to

Re: Spark Sql reading hive partitioned tables?

2015-04-14 Thread Cheolsoo Park
Is there a plan to fix this? I also ran into this issue with a *"select * from tbl where ... limit 10"* query. Spark SQL is 100x slower than Presto in the worst case (a table with 1.6M partitions). This is a serious blocker for us since we have many tables with near (and over) 1M partitions, and any query aga

Re: Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

2015-04-14 Thread Cheng Lian
Would you mind opening a JIRA for this? I think your suspicion makes sense. Will have a look at this tomorrow. Thanks for reporting! Cheng On 4/13/15 7:13 PM, zhangxiongfei wrote: Hi experts, I ran the code below in the Spark shell to access Parquet files in Tachyon. 1. First, created a DataFrame by l
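
A minimal sketch of the kind of repro being described (the path, the Tachyon URI, and the row-group size are placeholders; parquet.block.size is the Parquet row-group setting whose interaction with the underlying block size is in question):

    // spark-shell, Spark 1.3.x; paths and sizes are illustrative only
    sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
    val df = sqlContext.jsonFile("/tmp/sample.json")                  // any DataFrame will do
    df.saveAsParquetFile("tachyon://master:19998/tmp/sample.parquet") // observed block size may differ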

Re: Spark ThriftServer encounter java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-14 Thread Cheng Lian
Yeah, SQL is the right component. Thanks! Cheng On 4/14/15 12:47 AM, Andrew Lee wrote: Hi Cheng, I couldn't find the component for Spark ThriftServer, will that be 'SQL' component? JIRA created. https://issues.apache.org/jira/browse/SPARK-6882 > Date: Sun, 15 Mar 2015 21:03:34 +0800 > Fro

Re: Fwd: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Nicholas Chammas
That's a great point, Burak. I think we are still figuring out when and how to redirect people when their work doesn't fit the main Spark repo, and Spark Packages should be a good destination in some cases. Btw, my comment on JIRA (which Sean referenced) regarding rejecting more patches is here

Re: Using memory mapped file for shuffle

2015-04-14 Thread Sandy Ryza
Hi Kannan, Both in MapReduce and Spark, the amount of shuffle data a task produces can exceed the task's memory without risk of OOM. -Sandy On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid wrote: > That limit doesn't have anything to do with the amount of available > memory. It's just a tuning par

Re: Fwd: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Burak Yavuz
Hi Sean and fellow devs, I also wanted to chime in and remind people of Spark Packages. When someone's work doesn't fit into the broader scope of Spark itself, they should be encouraged to showcase their hard work in Spark Packages. We have been working hard to make it easier for devs to share their work

Fwd: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Sean Owen
Bringing a discussion to dev@. I think the general questions on the table are:
- Should more changes be rejected? What are the pros/cons of that?
- If not, how do you think about the very large backlog of PRs and JIRAs?
- What should be rejected and why?
- How much support is there for proactively

Re: Eliminate partition filters in execution.Filter after filter pruning

2015-04-14 Thread Yijie Shen
I’ve opened a PR on this: https://github.com/apache/spark/pull/5509 On April 14, 2015 at 11:57:34 AM, Yijie Shen (henry.yijies...@gmail.com) wrote: Hi, Suppose I have a table t(id: String, event: String) saved as a parquet file, with the directory hierarchy: hdfs://path/to/data/root/dt=2015-01-

DDL parser class parsing DDL in spark-sql cli

2015-04-14 Thread JaeSung Jun
Hi, While I've been walking through the spark-sql source code, I typed the following HiveQL: CREATE EXTERNAL TABLE user (uid STRING, age INT, gender STRING, job STRING, ts STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/hive/user'; and I finally came across ddl.scala after analysin
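
Per Michael Armbrust's reply in this thread, that statement is handled by HiveQL when it goes through HiveContext. For reference, the same DDL issued programmatically from the shell (assumes an existing /hive/user location):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("""
      CREATE EXTERNAL TABLE user (uid STRING, age INT, gender STRING, job STRING, ts STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/hive/user'
    """)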

RE: Regularization in MLlib

2015-04-14 Thread Theodore Vasiloudis
Hello DB, could you elaborate a bit on how you are currently fixing this for the new ML pipeline framework? Are there any JIRAs/PRs we could follow? Regards, Theodore

Re: Using memory mapped file for shuffle

2015-04-14 Thread Imran Rashid
That limit doesn't have anything to do with the amount of available memory. It's just a tuning parameter: one version is more efficient for smaller files, the other is better for bigger files. I suppose the comment is a little better in FileSegmentManagedBuffer: https://github.com/apache/spark
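
For readers unfamiliar with the trade-off being tuned: below the threshold, a plain stream read into a heap buffer avoids the fixed cost of an mmap; above it, memory-mapping wins and the mapped pages live outside the JVM heap. A small self-contained sketch of the two paths (the helper and its default threshold are hypothetical, not Spark's code):

    import java.io.{File, FileInputStream, IOException}
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel

    // Hypothetical helper mirroring the small-vs-large segment split discussed above.
    def readSegment(file: File, offset: Long, length: Long,
                    mapThreshold: Long = 2L * 1024 * 1024): ByteBuffer = {
      val channel = new FileInputStream(file).getChannel
      try {
        if (length < mapThreshold) {
          // Small segment: copy it into a heap buffer with a regular read.
          val buf = ByteBuffer.allocate(length.toInt)
          channel.position(offset)
          while (buf.remaining() > 0) {
            if (channel.read(buf) == -1) throw new IOException("unexpected EOF")
          }
          buf.flip()
          buf
        } else {
          // Large segment: memory-map it; pages are managed by the OS, outside the heap.
          channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
        }
      } finally {
        channel.close()
      }
    }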

SQL: Type mismatch when using codegen

2015-04-14 Thread A.M.Chan
When I run tests in DataFrameSuite with codegen on, a type mismatch error occurs. test("average") { checkAnswer( decimalData.agg(avg('a)), Row(new java.math.BigDecimal(2.0))) } type mismatch; found : Int(0) required: org.apache.spark.sql.types.DecimalType#JvmType JIRA: https://issues.a
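
A hedged, self-contained repro sketch of the same shape outside the test suite (the data is made up; spark.sql.codegen switches on the code-generation path where the DecimalType mismatch was reported):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.avg

    sqlContext.setConf("spark.sql.codegen", "true")  // enable the codegen path
    val decimalDF = Seq(
      (BigDecimal(1.0), 1),
      (BigDecimal(2.0), 2),
      (BigDecimal(3.0), 3)
    ).toDF("a", "b")
    decimalDF.agg(avg("a")).show()  // averaging the DecimalType column is where the mismatch shows up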