Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Xiangrui Meng
+1 On Wed, Aug 21, 2024, 10:24 AM Mridul Muralidharan wrote: > +1 > > > Regards, > Mridul > > > On Wed, Aug 21, 2024 at 11:46 AM Reynold Xin > wrote: > >> +1 >> >> On Wed, Aug 21, 2024 at 6:42 PM Shivaram Venkataraman < >> shivaram.venkatara...@gmail.com> wrote: >> >>> Hi all >>> >>> Based on t

Re: [DISCUSS] Deprecating SparkR

2024-08-13 Thread Xiangrui Meng
+1 On Tue, Aug 13, 2024, 2:43 PM Jungtaek Lim wrote: > +1 > > Looks to be sufficient to VOTE? > > On Wed, Aug 14, 2024 at 1:10 AM, Wenchen Fan wrote: > >> +1 >> >> On Tue, Aug 13, 2024 at 10:50 PM L. C. Hsieh wrote: >> >>> +1 >>> >>> On Tue, Aug 13, 2024 at 2:54 AM Dongjoon Hyun >>> wrote: >>> > >>>

Re: [VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Xiangrui Meng
+1 On Wed, Sep 21, 2022 at 6:53 PM Kent Yao wrote: > +1 > > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface for large-scale data processing and analytic

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-28 Thread Xiangrui Meng
+1. And we should start testing 3.7 and maybe 3.8 in Jenkins. On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun wrote: > Thank you for starting the thread. > > In addition to that, we currently are testing Python 3.6 only in Apache > Spark Jenkins environment. > > Given that Python 3.8 is already ou

Re: SparkGraph review process

2019-10-04 Thread Xiangrui Meng
lo dear Spark community > > We are the developers behind the SparkGraph SPIP, which is a project > created out of our work on openCypher Morpheus ( > https://github.com/opencypher/morpheus). During this year we have > collaborated with mainly Xiangrui Meng of Databricks to define and deve

[ANNOUNCEMENT] Plan for dropping Python 2 support

2019-06-03 Thread Xiangrui Meng
Hi all, Today we announced the plan for dropping Python 2 support [1] in Apache Spark: As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2 suppor

Re: Should python-2 be supported in Spark 3.0?

2019-06-03 Thread Xiangrui Meng
-- > *From:* shane knapp > *Sent:* Friday, May 31, 2019 7:38:10 PM > *To:* Denny Lee > *Cc:* Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark > Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; > dev; user > *Subj

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
, I'm going to upload it to Spark website and announce it here. Let me know if you think we should do a VOTE instead. On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng wrote: > I created https://issues.apache.org/jira/browse/SPARK-27884 to track the > work. > > On Thu, May 30, 2019

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
From:* Reynold Xin > *Sent:* Thursday, May 30, 2019 12:59:14 AM > *To:* shane knapp > *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen > Fen; Xiangrui Meng; dev; user > *Subject:* Re: Should python-2 be supported in Spark 3.0? > > +1 on Xiangrui’s plan. &

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all, I want to revive this old thread since no action has been taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead of time. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by 2

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Xiangrui Meng
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel strongly about it. I would still suggest doing the following: 1. Link the POC mentioned in Q4. So people can verify the POC result. 2. List public APIs we plan to expose in Appendix A. I did a quick check. Beside ColumnarB

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Xiangrui Meng
s that are read in a > different way). It’s a bit harder for a Java API, but maybe Spark could > just expose byte arrays directly and work on those if the API is not > guaranteed to stay stable (that is, we’d still use our own classes to > manipulate the data internally, and end users coul

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Xiangrui Meng
I posted my comment in the JIRA. Main concerns here: 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 rele

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
e users much happier. Tom and Andy from NVIDIA are certainly more calibrated on the usefulness of the current proposal. > > On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng wrote: > >> There are certainly use cases where different stages require different >> number of CPUs o

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
rrier mode scheduling -- barrier mode stages having > a need for an inter-task channel resource, gpu-ified stages needing gpu > resources, etc. Have I mentioned that I'm not a fan of the current barrier > mode API, Xiangrui? :) Yes, I know: "Show me something better." &

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
docs, but just wanted to highlight one thing. In >> page 5 of the SPIP, when we talk about DRA, I see: >> >> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each >> task requires 1 CPU and 1GPU, then we shall throw an error on application >> start bec

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-20 Thread Xiangrui Meng
SCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#> >> and stories >> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>, >> I hope it now contains clear scope of the changes and enough details for >> SPIP vote. >> Please review the updated docs, thanks!

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Xiangrui Meng
gt; possible. That seems like a fine position. > > On Mon, Mar 18, 2019 at 1:56 PM Xingbo Jiang > wrote: > > > > Hi all, > > > > I updated the SPIP doc and stories, I hope it now contains clear scope > of the changes and enough details for SPIP vote. > >

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-05 Thread Xiangrui Meng
or take the discussion of what a SPIP is to a different thread >> and then come back to this, thoughts? >> >> Note there is a high level design for at least the core piece, which is >> what people seem concerned with, already so including it in the SPIP should >>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra wrote: > :) Sorry, that was ambiguous. I was seconding Imran's comment. > Could you also help review Xingbo's design sketch and help evaluate the cost? > > On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng wrote: > >> &g

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra wrote: > +1 > Mark, just to be clear, are you +1 on the SPIP or Imran's point? > > On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid wrote: > >> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote: >> >>> On

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 8:23 AM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 7:24 AM Sean Owen wrote: > >> To be clear, those goals sound fine to me. I don't think voting on >> those two broad points is meaningful, but, does no harm per se. If you >>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
e changes. * How to implement it? There is a sketch in the companion doc. Yinan mentioned three options to expose the inferences to users. We need to finalize the design and discuss which option is the best to go. You see that such discussions can be done in parallel. It is not efficient if we bloc

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
> greatly concerning, like “oh scheduler is allocating GPU, but how does it > affect memory” and many more, and so I think finer “high level” goals > should be defined. > > > > > -- > *From:* Sean Owen > *Sent:* Sunday, March 3, 2019 5:24 PM > *T

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Xiangrui Meng
Hi Felix, Just to clarify, we are voting on the SPIP, not the companion scoping doc. What is proposed and what we are voting on is to make Spark accelerator-aware. The companion scoping doc and the design sketch are to help demonstrate what features could be implemented based on the use cases

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xiangrui Meng
+1 Btw, as Ryan pointed out last time, +0 doesn't mean "Don't really care." Official definitions here: https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions - +0: 'I don't feel strongly about it, but I'm okay with this.' - -0: 'I won't get in the way, b

Re: SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xiangrui Meng
In case there are issues visiting the Google doc, I attached PDF files to the JIRA. On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang wrote: > Hi all, > > I want to send a revised SPIP on implementing Accelerator(GPU)-aware > Scheduling. It improves Spark by making it aware of GPUs exposed by cluster > mana

[VOTE] [RESULT] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
Hi all, The vote passed with the following +1s (* = binding) and no 0s/-1s: * Denny Lee * Jules Damji * Xiao Li* * Dongjoon Hyun * Mingjie Tang * Yanbo Liang* * Marco Gaido * Joseph Bradley* * Xiangrui Meng* Please watch SPARK-25994 and join future discussions there. Thanks! Best, Xiangrui

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
+1 from myself. The vote passed with the following +1s and no -1s: * Denny Lee * Jules Damji * Xiao Li* * Dongjoon Hyun * Mingjie Tang * Yanbo Liang* * Marco Gaido * Joseph Bradley* * Xiangrui Meng* I will send a result email soon. Please watch SPARK-25994 for future discussions. Thanks! Best

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-30 Thread Xiangrui Meng
lso changed the access permissions for the SPIP and design sketch docs > so that anyone can comment. > > Best, > > Martin > On 29.01.19 18:59, Dongjoon Hyun wrote: > > Hi, Xiangrui Meng. > > +1 for the proposal. > > However, please update the following se

[VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-29 Thread Xiangrui Meng
Hi all, I want to call for a vote on SPARK-25994. It introduces a new DataFrame-based component to Spark, which supports property graph construction, Cypher queries, and graph algorithms. The proposal

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
(don't know why your email ends with ".invalid") On Wed, Dec 19, 2018 at 9:13 AM Xiangrui Meng wrote: > > > On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach > wrote: > > > > [Note: I sent this earlier but it looks like the email was blocked > becaus

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach wrote: > > [Note: I sent this earlier but it looks like the email was blocked because I had another email group on the CC line] > > Hi Spark Dev, > > I would like to use the new barrier execution mode introduced in spark 2.4 with LightGBM in the spark p

Re: SPIP: Property Graphs, Cypher Queries, and Algorithms

2018-11-13 Thread Xiangrui Meng
jira/browse/SPARK-26028 > Google Doc: > https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing > > Thanks, > > Martin (on behalf of the Neo4j Cypher for Apache Spark team) > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://d

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-11-01 Thread Xiangrui Meng
> Chitral Verma > Dilip Biswal > Denny Lee > Felix Cheung (binding) > Dongjoon Hyun > > +0: > DB Tsai (binding) > > -1: None > > Thanks, everyone! > > On Thu, Nov 1, 2018 at 1:26 PM Dongjoon Hyun > wrote: > >> +1 >> >> Cheers, &

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Xiangrui Meng
>>>> > an existing Spark workload and running on this release candidate, >>>>> then >>>>> > reporting any regressions. >>>>> > >>>>> > If you're working in PySpark you can set up a virtual env and install >

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Xiangrui Meng
treaming.ReceiverSuite."receiver_life_cycle" > >> SPARK-22809 pyspark is sensitive to imports with dots > >> SPARK-22739 Additional Expression Support for Objects > >> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested > >> list of structures > >> SPARK-21030 extend hint syntax to support any expression for Python and > R > >> SPARK-22386 Data Source V2 improvements > >> SPARK-15117 Generate code that get a value in each compressed column > >> from CachedBatch when DataFrame.cache() is called > >> > >> - > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >> > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
ation >>>>>> >> Most of it is done, but date/timestamp support is still missing. >>>>>> Great to have in 2.4. >>>>>> >> >>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect >>>>>> answers >>>>>> >> This is a long-standing correctness bug, great to have in 2.4. >>>>>> >> >>>>>> >> There are some other important features like the adaptive >>>>>> execution, streaming SQL, etc., not in the list, since I think we are not >>>>>> able to finish them before 2.4. >>>>>> >> >>>>>> >> Feel free to add more things if you think they are important to >>>>>> Spark 2.4 by replying to this email. >>>>>> >> >>>>>> >> Thanks, >>>>>> >> Wenchen >>>>>> >> >>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen >>>>>> wrote: >>>>>> >> >>>>>> >> In theory releases happen on a time-based cadence, so it's >>>>>> pretty much wrap up what's ready by the code freeze and ship it. In >>>>>> practice, the cadence slips frequently, and it's very much a negotiation >>>>>> about what features should push the >>>>>> >> code freeze out a few weeks every time. So, kind of a hybrid >>>>>> approach here that works OK. >>>>>> >> >>>>>> >> Certainly speak up if you think there's something that really >>>>>> needs to get into 2.4. This is that discuss thread. >>>>>> >> >>>>>> >> (BTW I updated the page you mention just yesterday, to reflect >>>>>> the plan suggested in this thread.) >>>>>> >> >>>>>> >> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves >>>>>> wrote: >>>>>> >> >>>>>> >> Shouldn't this be a discuss thread? >>>>>> >> >>>>>> >> I'm also happy to see more release managers and agree the time >>>>>> is getting close, but we should see what features are in progress and see >>>>>> how close things are and propose a date based on that. Cutting a branch >>>>>> to >>>>>> soon just creates >>>>>> >> more work for committers to push to more branches. >>>>>> >> >>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the >>>>>> code freeze and release branch cut mid-august. >>>>>> >> >>>>>> >> Tom >>>>>> > >>>>>> > >>>>>> - >>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>> > >>>>>> >>>>>> >>>> >> -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

[SPARK-24579] SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-06-18 Thread Xiangrui Meng
a look and let me know your thoughts in JIRA comments. Thanks! Best, Xiangrui -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng
+1 from myself. The vote passed with the following +1s: * Susham kumar reddy Yerabolu * Xingbo Jiang * Xiao Li* * Weichen Xu * Joseph Bradley* * Henry Robinson * Xiangrui Meng* * Wenchen Fan* Henry, you can find a design sketch at https://issues.apache.org/jira/browse/SPARK-24375. To help

Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Xiangrui Meng
lowing reasons. >> > >> > Thanks, >> > --Hossein >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> > > > -- > > Joseph Bradley > > Software Engineer - Machine Learning > > Databricks, Inc. > > [image: http://databricks.com] <http://databricks.com/> > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

[VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xiangrui Meng
-1: I don't think this is a good idea because of the following technical reasons. Best, Xiangrui -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Integrating ML/DL frameworks with Spark

2018-05-23 Thread Xiangrui Meng
, May 20, 2018 at 8:19 PM Felix Cheung wrote: > Very cool. We would be very interested in this. > > What is the plan forward to make progress in each of the three areas? > > > -- > *From:* Bryan Cutler > *Sent:* Monday, May 14, 2018 11:37:20 P

Re: Integrating ML/DL frameworks with Spark

2018-05-09 Thread Xiangrui Meng
t also useful for scaling MLlib algorithms. One of my earliest attempts >>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485 >>>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up >>>>>&

Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Xiangrui Meng
ar your feedback and past efforts along those directions if they were not fully captured by our JIRA. > Xiangrui - please also chime in if I didn’t capture everything. > > > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Welcoming Yanbo Liang as a committer

2016-06-07 Thread Xiangrui Meng
Congrats!! On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali wrote: > Congratulations Yanbo Liang! Well deserved. > > > On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Congrats, Yanbo! >> >> On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin wrote: >> >>> Congratul

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
nt: > OutputCommitCoordinator stopped! > 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully > stopped SparkContext > 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown > hook called > 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHo

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiangrui Meng
+1 On Thu, May 19, 2016 at 9:18 AM Joseph Bradley wrote: > +1 > > On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > >> Hi Ovidiu-Cristian , >> >> The best source of truth is change the filter with target version to >> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as w

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
Is it on 1.6.x? On Wed, May 18, 2016, 6:57 PM Sun Rui wrote: > I saw it, but I can’t see the complete error message on it. > I mean the part after “error in invokingJava(…)” > > On May 19, 2016, at 08:37, Gayathri Murali > wrote: > > There was a screenshot attached to my original email. If you

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
Not exactly the same as the one you suggested, but you can chain it with flatMap to get what you want, if each file is not huge. On Thu, May 19, 2016, 8:41 AM Xiangrui Meng wrote: > This was implemented as sc.wholeTextFiles. > > On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote: > &g
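A minimal PySpark sketch of that chaining, assuming an existing SparkContext `sc`, line-oriented text, and a hypothetical input directory (an illustration, not code from the thread):
~~~
# Read each file whole as a (path, content) pair, then split into lines.
# Reasonable when individual files fit comfortably in memory.
rdd = sc.wholeTextFiles("hdfs:///data/many-small-files")
lines = rdd.flatMap(lambda path_content: path_content[1].splitlines())
~~~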

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
This was implemented as sc.wholeTextFiles. On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote: > Users would be able to run this already with the 3 lines of code you > supplied right? In general there are a lot of methods already on > SparkContext and we lean towards the more conservative side in i

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
amespace in the 2.x series ? > > Thanks > Shivaram > > On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen wrote: > > FWIW, all of that sounds like a good plan to me. Developing one API is > > certainly better than two. > > > > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui M

Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all, More than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package. While

Re: Various forks

2016-03-19 Thread Xiangrui Meng
We made that fork to hide package private classes/members in the generated Java API doc. Otherwise, the Java API doc is very messy. The patch is to map all private[*] to the default scope in the generated Java code. However, this might not be the expected behavior for other packages. So it didn't g

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Xiangrui Meng
+1. Checked user guide and API doc, and ran some MLlib and SparkR examples. -Xiangrui On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote: > I'm going to +1 this myself. Tested on my laptop. > > > > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: >> >> I forked a new thread for this. Please

Re: Are These Issues Suitable for our Senior Project?

2015-07-09 Thread Xiangrui Meng
Hi Emrehan, Thanks for asking! There are actually many TODOs for MLlib. I would recommend starting with small tasks before picking a topic for your senior project. Please check https://issues.apache.org/jira/browse/SPARK-8445 for the 1.5 roadmap and see whether there are ones you are interested in

Re: [mllib] Refactoring some spark.mllib model classes in Python not inheriting JavaModelWrapper

2015-06-18 Thread Xiangrui Meng
Hi Yu, Reducing the code complexity on the Python side is certainly what we want to see:) We didn't call Java directly in Python models because Java methods don't work inside RDD closures, e.g., rdd.map(lambda x: model.predict(x[1])) But I agree that for model save/load the implementation should
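A sketch of the constraint described here, assuming a model whose methods are forwarded to the JVM through a py4j gateway (names are illustrative):
~~~
# Fine: called on the driver, where the py4j gateway to the JVM lives.
predictions = model.predict(test_rdd)

# Breaks: the lambda is pickled and shipped to executors, but a py4j
# proxy object cannot be serialized or used away from the driver's JVM.
# predictions = test_rdd.map(lambda x: model.predict(x[1]))
~~~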

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-17 Thread Xiangrui Meng
Hi Eron, Please register your Spark Package on http://spark-packages.org, which helps users find your work. Do you have some performance benchmark to share? Thanks! Best, Xiangrui On Wed, Jun 10, 2015 at 10:48 PM, Nick Pentreath wrote: > Looks very interesting, thanks for sharing this. > > I ha

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
mn. Maybe we can inspect the frame in which `df.name` gets called and warn users in `df.select(df.name)` but not in `name = df.name`. This could be tricky to implement. -Xiangrui > > Thanks > Shivaram > > > On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng wrote: >> >> Hi all

Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all, In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 on compatibility issues with the __getattr__ approach, and I want to collect more inputs on this. Basically, if in the future we introduce a new m
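The trade-off in miniature, assuming an available `sqlContext` and a hypothetical column name: any future DataFrame attribute or method with the same name would shadow the `__getattr__` form, while `__getitem__` stays unambiguous.
~~~
df = sqlContext.createDataFrame([(1, "Alice")], ["id", "name"])

by_item = df["name"]  # __getitem__: always resolves to the column
by_attr = df.name     # __getattr__: the column today, but a newly added
                      # DataFrame attribute called `name` would win instead
~~~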

Re: [discuss] ending support for Java 6?

2015-05-05 Thread Xiangrui Meng
+1. One issue with dropping Java 6: if we use Java 7 to build the assembly jar, it will use zip64. Would Python 2.x (or even 3.x) be able to load zip64 files on PYTHONPATH? -Xiangrui On Tue, May 5, 2015 at 3:25 PM, Reynold Xin wrote: > OK I sent an email. > > > On Tue, May 5, 2015 at 2:47 PM, sha

Re: OOM error with GMMs on 4GB dataset

2015-05-05 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni wrote: > Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). > The spark (1.3.1) job is allocated 120 executors with 6GB each and the > driver also has 6GB. > Spark C

Re: Pickling error when attempting to add a method in pyspark

2015-05-05 Thread Xiangrui Meng
Hi Stephen, I think it would be easier to see what you implemented by showing the branch diff link on github. There are a couple of utility classes that make Rating work between Scala and Python: 1. serializer: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/py

Re: Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng
This is being discussed in https://issues.apache.org/jira/browse/SPARK-6407. Let's move the discussion there. Thanks for providing references! -Xiangrui On Sun, Apr 5, 2015 at 11:48 PM, Chunnan Yao wrote: > On-line Collaborative Filtering(CF) has been widely used and studied. To > re-train a CF m

Re: Stochastic gradient descent performance

2015-04-06 Thread Xiangrui Meng
The gap sampling is triggered when the sampling probability is small and the underlying storage directly supports constant-time lookups, in particular ArrayBuffer. This is a very strict requirement. If the rdd is cached in memory, we use ArrayBuffer to store its elements and rdd.sample will trigger gap sam

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Xiangrui Meng
+1 Verified some MLlib bug fixes on OS X. -Xiangrui On Sun, Apr 5, 2015 at 1:24 AM, Sean Owen wrote: > Signatures and hashes are good. > LICENSE, NOTICE still check out. > Compiles for a Hadoop 2.6 + YARN + Hive profile. > > I still see the UISeleniumSuite test failure observed in 1.3.0, which >

Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Xiangrui Meng
; Best regards, Alexander > -Original Message- > From: Xiangrui Meng [mailto:men...@gmail.com] > Sent: Monday, March 30, 2015 2:43 PM > To: Sean Owen > Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; > jfcanny > Subject: Re: Using CUDA within Spark / boosti

Re: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Xiangrui Meng
Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you can send the instructions to netlib-java as part of the README. Hopefully we don't need to modify netlib-java code to use nvblas. Best, Xiangrui On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen wrote: > Th

Re: mllib.recommendation Design

2015-03-30 Thread Xiangrui Meng
he features for modest >> ranks where gram matrices can be made... >> >> For large ranks I am still working on the code >> >> On Tue, Feb 17, 2015 at 3:19 PM, Xiangrui Meng wrote: >>> >>> The current ALS implementation allow pluggable solvers for >&

Re: enum-like types in Spark

2015-03-17 Thread Xiangrui Meng
nation for rejecting is basically that >>>> (a) enums aren't important enough for introducing some new special >>>> feature, >>>> scala's got bigger things to work on and (b) if you really need a good >>>> enum, just use java's enum. &

Re: enum-like types in Spark

2015-03-16 Thread Xiangrui Meng
; Java enums actually have some very real advantages over the other >> >> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap. >> There >> >> > has >> >> > > been endless debate in the Scala community about the problems with

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Xiangrui Meng
Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|A x - b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least squares formulations. It means you could reproduce the same result by multiplying the step size by 2.
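Written out, with step size \(\gamma\) (a restatement of the point above, not text from the thread):
~~~
f_{\text{old}}(x) = \tfrac{1}{n}\lVert Ax - b\rVert_2^2
  \;\Rightarrow\; \nabla f_{\text{old}}(x) = \tfrac{2}{n} A^\top (Ax - b)

f_{\text{new}}(x) = \tfrac{1}{2n}\lVert Ax - b\rVert_2^2
  \;\Rightarrow\; \nabla f_{\text{new}}(x) = \tfrac{1}{n} A^\top (Ax - b)
  = \tfrac{1}{2}\,\nabla f_{\text{old}}(x)

x - (2\gamma)\,\nabla f_{\text{new}}(x) = x - \gamma\,\nabla f_{\text{old}}(x)
~~~
So the new objective's gradient is exactly half the old one, and doubling the step size recovers the old iterates.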

Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
ndard way for Mllib models to be > serialized? > > Btw. The example I pasted below works if one implements a TestSuite with > MLlibTestSparkContext. > > -Original Message- > From: Xiangrui Meng [mailto:men...@gmail.com] > Sent: Monday, March 09, 2015 12:10 PM > To

Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
Could you try `sc.objectFile` instead? sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel = sc.objectFile[NaiveBayesModel]("path").first() -Xiangrui On Mon, Mar 9, 2015 at 11:52 AM, Ulanov, Alexander wrote: > Just tried, the same happens if I use the internal Spark serializer:

Re: enum-like types in Spark

2015-03-05 Thread Xiangrui Meng
bigger things to work on and (b) if you really need a good > enum, just use java's enum. > > I doubt it really matters that much for Spark internals, which is why I > think #4 is fine. But I figured I'd give my spiel, because every developer > loves language wars :) > &g

Re: enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
ming-conventions.html >>> > > >>> > > Constants, Values, Variable and Methods >>> > > >>> > > Constant names should be in upper camel case. That is, if the member is >>> > > final, immutable and it belongs to a package object or an obje

enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all, There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an “official” approach for enum-like types in Spark. 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc) * All types sho

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-03 Thread Xiangrui Meng
On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar wrote: > +1 (non-binding, of course) > > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min > mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 > 2. Tested pyspark, mlib -

Re: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Xiangrui Meng
which is probably has and now I just need a > reason to look at it) > > On 27 Feb 2015 20:26, "Xiangrui Meng" wrote: >> >> Hey Sam, >> >> The running times are not "big O" estimates: >> >> > The CPU version finished in 12 seconds. >&g

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
> link me to the source code for DGEMM? > > I show all of this in my talk, with explanations, I can't stress enough how > much I recommend that you watch it if you want to understand high > performance hardware acceleration for linear algebra :-) > > On 27 Feb 2015

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
BIDMat-cuda was faster than netlib-cuda >> and the most reasonable explanation is that it holds the result in GPU >> memory, as Sam suggested. At the same time, it is OK because you can copy >> the result back from GPU only when needed. However, to be sure, I am going >> to

Re: Google Summer of Code - ideas

2015-02-26 Thread Xiangrui Meng
There are a couple of things in the Scala/Java API that are missing in the Python API: 1. model import/export 2. evaluation metrics 3. distributed linear algebra 4. streaming algorithms If you are interested, we can list/create target JIRAs and hunt them down one by one. Best, Xiangrui On Wed, Feb 25, 2015 at 7:37 P

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley wrote: > Better documen

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Gave 3 votes to each of the talks. Looking forward to seeing them at the Hadoop Summit :) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin wrote: > Hi all, > > The Hadoop Summit uses community choice voting to decide which talks to > feature. It would be great if the community could help vote for S

Re: Google Summer of Code - ideas

2015-02-24 Thread Xiangrui Meng
Would you be interested in working on MLlib's Python API during the summer? We want everything we implemented in Scala to be usable in both Java and Python, but we are not there yet. It would be great if someone is willing to help. -Xiangrui On Sat, Feb 21, 2015 at 11:24 AM, Manoj Kumar wrote: > H

Re: Batch prediciton for ALS

2015-02-18 Thread Xiangrui Meng
nd try update with > the new ALS... > > On Tue, Feb 17, 2015 at 3:22 PM, Xiangrui Meng wrote: >> >> It may be too late to merge it into 1.3. I'm going to make another >> pass on your PR today. -Xiangrui >> >> On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das

Re: Batch prediciton for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das wrote: > Hi, > > Will it be possible to merge this PR to 1.3 ? > > https://github.com/apache/spark/pull/3098 > > The batch prediction API in ALS will b

Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for NormalEquation, where we put the CholeskySolver and NNLS solver. Please check the current implementation and let us know how your constraint solver would fit. For a general matrix factorization package, let's make a JIRA and move our discussio

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are three folds. For simplicity, we didn't split the dataset into three parts and reuse them, but redo the split for each fold. Then we need to cache 3 × 3 = 9 times. Note that the pipeline API is not yet optimized for performance. It would be nice

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
as.numeric(data$V1) weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0)) */ ~~~ So people can copy & paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng wrote: > I like the `/* .. */` style more. Because i

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
I like the `/* .. */` style more. Because it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin wrote: > We should update the style doc to reflect wh

Re: IDF for ml pipeline

2015-02-03 Thread Xiangrui Meng
Yes, we need a wrapper under spark.ml. Feel free to create a JIRA for it. -Xiangrui On Mon, Feb 2, 2015 at 8:56 PM, masaki rikitoku wrote: > Hi all > > I am trying the ml pipeline for text classification now. > > recently, I succeeded in executing the pipeline processing in the ml package, > which consis

Re: Maximum size of vector that reduce can handle

2015-01-27 Thread Xiangrui Meng
A 60M-element vector costs 480MB of memory. You have 12 of them to be reduced at the driver, so you need ~6GB of memory, not counting the temp vectors generated from '_+_'. You need to increase driver memory to make it work. That being said, ~10^7 hits the limit of the current impl of glm. -Xiangrui On Jan 23, 201
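The arithmetic behind those numbers, assuming double-precision entries:
~~~
6 \times 10^7 \ \text{doubles} \times 8\ \text{B} = 480\ \text{MB per vector}
12 \times 480\ \text{MB} \approx 5.8\ \text{GB gathered at the driver}
~~~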

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Xiangrui Meng
I would call it Scaler. You might want to add it to the spark.ml pipeline API. Please check the spark.ml.HashingTF implementation. Note that this should handle sparse vectors efficiently. Hadamard and FFTs are quite useful. If you are interested, make sure that we call an FFT library that is licen

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. wrote: > Hi all, > > Please help me to find out best
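One classical member of the LSH family referenced here is random-hyperplane hashing for cosine similarity; a self-contained sketch under assumed NumPy data (an illustration, not code from the thread):
~~~
import numpy as np

def lsh_signatures(X, n_planes=16, seed=0):
    # Hash each row of X by the sign pattern of random-hyperplane
    # projections; nearby points (in cosine distance) collide with
    # high probability.
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_planes)
    return X @ planes > 0  # boolean signature matrix, one row per point

# k-nearest-neighbor candidates are then searched only within the
# bucket(s) sharing the query's signature, instead of over all points.
~~~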

Re: Spectral clustering

2015-01-20 Thread Xiangrui Meng
Fan and Stephen (cc'ed) are working on this feature. They will update the JIRA page and report progress soon. -Xiangrui On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman wrote: > Hi, thinking of picking up this Jira ticket: > https://issues.apache.org/jira/browse/SPARK-4259 > > Anyone done any w

Re: DBSCAN for MLlib

2015-01-14 Thread Xiangrui Meng
Please find my comments on the JIRA page. -Xiangrui On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby wrote: > I have to say, I have created a Jira task for it: > [SPARK-5226] Add DBSCAN Clustering Algorithm to MLlib - ASF JIRA > > [SPARK-5226] Add DBSCAN Clus

Re: Re-use scaling means and variances from StandardScalerModel

2015-01-09 Thread Xiangrui Meng
Feel free to create a JIRA for this issue. We might need to discuss what to put in the public constructors. In the meanwhile, you can use Java serialization to save/load the model: sc.parallelize(Seq(model), 1).saveAsObjectFile("/tmp/model") val model = sc.objectFile[StandardScalerModel]("/tmp/mod

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install pac
