RE: MIMA Compatibility Checks

2014-07-10 Thread Liu, Raymond
So how do we run the check locally? On the master tree, sbt mimaReportBinaryIssues seems to report a lot of errors. Do we need to modify SparkBuilder.scala etc. to run it locally? I could not figure out from the console output how Jenkins runs the check. Best Regards, Raymond Liu -Original M

Re: MIMA Compatibility Checks

2014-07-10 Thread Reynold Xin
You can take a look at https://github.com/apache/spark/blob/master/dev/run-tests dev/mima On Thu, Jul 10, 2014 at 12:21 AM, Liu, Raymond wrote: > so how to run the check locally? > > On master tree, sbt mimaReportBinaryIssues Seems to lead to a lot of > errors reported. Do we need to modify S

When inserting data into a table on Tachyon, how can I control the data position?

2014-07-10 Thread qingyang li
When inserting data into a table that is on Tachyon (the data is small, so it will not be partitioned automatically), how can I control the data position? That is, how can I specify which machine the data should reside on? If we cannot control this, what is the data assignment strategy of Tachyon or Spark?

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs. JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429 JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430 Before submitting a PR for the standardized API, I want to implement a few clustering algorith

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and Mahout to get some broad ideas. On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling wrote: > I went ahead and created JIRAs. > JIRA for Hierarchical Clustering: > https://issues.apache.org/jira/browse/SPARK-2429 > JIRA for Standardized Cluste

Feature selection interface

2014-07-10 Thread Ulanov, Alexander
Hi, I've implemented a class that does Chi-squared feature selection for RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, and other methods like mutual information or information gain can be easily implemented. I would like to make a pull request. However, MLlib mas

Changes to sbt build have been merged

2014-07-10 Thread Patrick Wendell
Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note here for developers: going forward, the sbt build will use the same configuration style as the Maven build (-D for options and

Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell wrote: > Just a heads up, we merged Prashant's work on having the sbt build read all > dependencies from Maven. Please report any issues you find on the dev list > or on JIRA. > > One note here for developers, going forward the sbt build w

Re: Changes to sbt build have been merged

2014-07-10 Thread yao
Cool~ On Thu, Jul 10, 2014 at 1:29 PM, Sandy Ryza wrote: > Woot! > > > On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell > wrote: > > > Just a heads up, we merged Prashant's work on having the sbt build read > all > > dependencies from Maven. Please report any issues you find on the dev > list

EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nicholas Chammas
Hi devs! Right now it takes a non-trivial amount of time to launch EC2 clusters. Part of this time is spent starting the EC2 instances, which is out of our control. Another part of this time is spent installing stuff on and configuring the instances. This, we can control. I’d like to explore appr

sparkSQL thread safe?

2014-07-10 Thread Ian O'Connell
Had a few quick questions... Just wondering if Spark SQL is currently expected to be thread safe on master? Doing a simple hadoop file -> RDD -> schema RDD -> write parquet will fail in reflection code if I run these in a thread pool. The SparkSqlSerializer seems to create a new Kryo instance

Re: sparkSQL thread safe?

2014-07-10 Thread Michael Armbrust
Hey Ian, Thanks for bringing these up! Responses in-line: Just wondering if right now spark sql is expected to be thread safe on > master? > doing a simple hadoop file -> RDD -> schema RDD -> write parquet > will fail in reflection code if i run these in a thread pool. > You are probably hittin

RE: EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nate D'Amico
You are partially correct. It's not terribly complex, but also not easy to accomplish. Sounds like you want to manage some partially/fully baked AMIs with the core Spark libs and dependencies already on the image. Main issues that crop up are: 1) image sprawl, as libs/config/defaults/etc cha

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
-1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version built against 2.3.0-cdh5.0.2 on Mesos 0.18.2. All of our jobs with data above a few gigabytes hun

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
Just realized the deadline was Monday, my apologies. The issue nevertheless stands. On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf wrote: > -1 I honestly do not know the voting rules for the Spark community, so > please excuse me if I am out of line or if Mesos compatibility is not a > concern a

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Patrick Wendell
Hey Gary, The vote technically doesn't close until I send the vote summary e-mail, but I was planning to close and package this tonight. It's too bad if there is a regression; it might be worth holding the release, but it really requires narrowing down the issue to get more information about the sc

Re: PySpark Driver from Jython

2014-07-10 Thread davies
The function that runs on the worker is serialized in the driver, so the driver and the worker should run the same Python interpreter. If you do not need C extension support, then Jython will be better than CPython, because the cost of serialization is much lower. Davies
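A minimal sketch of the serialization point above, using only the standard library. It is not PySpark's actual serializer: marshal is used here as a stand-in, and add_one, payload and rebuilt are hypothetical names chosen for illustration. The marshal format is documented as not guaranteed to be compatible between Python versions, which is one way to see why the side that serializes a function and the side that rebuilds it should use the same interpreter.

```python
import marshal
import types


def add_one(x):
    """Hypothetical user function that a driver would ship to workers."""
    return x + 1


# "Driver" side (assumption: a stand-in for real function serialization):
# dump the function's compiled code object to bytes.
payload = marshal.dumps(add_one.__code__)

# "Worker" side: load the bytes back into a code object and wrap it in a
# new function. This only works if the loading interpreter understands
# the bytecode and marshal format produced by the dumping interpreter.
code = marshal.loads(payload)
rebuilt = types.FunctionType(code, globals(), "add_one")

print(rebuilt(41))
```

If the bytes in payload were loaded by a different interpreter or version, marshal.loads or the rebuilt function could fail, which matches the advice in the message above.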

what is the difference between org.spark-project.hive and org.apache.hadoop.hive

2014-07-10 Thread kingfly

Re: what is the difference between org.spark-project.hive and org.apache.hadoop.hive

2014-07-10 Thread Patrick Wendell
There are two differences: 1. We publish Hive with a shaded protobuf dependency to avoid conflicts with some Hadoop versions. 2. We publish a proper hive-exec jar that only includes Hive packages. The upstream version of hive-exec bundles a bunch of other random dependencies in it, which makes it r