Re: Spark data source resiliency

2018-07-02 Thread assaf.mendelson
That is what I expected; however, I did a very simple test (using println just to see when the exception is triggered in the iterator) using local master, and I saw it fail once and cause the entire operation to fail. Is this something which may be unique to local master (or some default configur
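
One relevant detail: under a plain local master, Spark allows only a single task attempt, so one reader exception does fail the whole job; the local[threads, maxFailures] form of the master string raises that limit. A minimal sketch (names and values illustrative):

    import org.apache.spark.sql.SparkSession

    // local[4, 3]: 4 worker threads, and each task may fail up to 3 times
    // before the job is aborted (plain local[N] allows only one attempt).
    val spark = SparkSession.builder()
      .master("local[4, 3]")
      .appName("reader-resiliency-test")
      .getOrCreate()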

Re: Spark data source resiliency

2018-07-02 Thread Wenchen Fan
A failure in the data reader results in a task failure, and Spark will retry the task for you (IIRC it retries 3 times before failing the job). Can you check your Spark log and see if the task fails consistently? On Tue, Jul 3, 2018 at 2:17 PM assaf.mendelson wrote: > Hi All, > > I have implemented a d
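
For reference, the retry budget in cluster mode comes from spark.task.maxFailures, which counts total attempts per task (the default of 4 matches the "retries 3 times" behavior described above). A sketch of raising it:

    import org.apache.spark.sql.SparkSession

    // spark.task.maxFailures counts total attempts per task (default 4,
    // i.e. one run plus three retries before the job is failed).
    val spark = SparkSession.builder()
      .appName("retry-config")
      .config("spark.task.maxFailures", "8") // tolerate more transient reader errors
      .getOrCreate()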

Spark data source resiliency

2018-07-02 Thread assaf.mendelson
Hi All, I have implemented a data source V2 which integrates with an internal system, and I need to make it resilient to errors in the internal data source. The issue is that currently, if there is an exception in the data reader, the exception seems to fail the entire task. I would prefer instead t
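
One way to keep transient errors from surfacing as task failures is to retry inside the reader itself. A sketch against Spark 2.3's DataSourceV2 DataReader interface; the wrapped reader and retry policy are assumptions, and a real implementation would also need to reopen or re-seek the underlying source on retry:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.reader.DataReader

    // Hypothetical wrapper: retries a flaky underlying reader inside next()
    // so transient errors do not escape as task failures.
    class ResilientReader(underlying: DataReader[Row], maxAttempts: Int = 3)
        extends DataReader[Row] {

      private var current: Row = _

      override def next(): Boolean = {
        var attempt = 0
        var result: Option[Boolean] = None
        while (result.isEmpty) {
          try {
            val hasNext = underlying.next()
            if (hasNext) current = underlying.get()
            result = Some(hasNext)
          } catch {
            // swallow transient errors up to maxAttempts; the final failure
            // still propagates and fails the task as a last resort
            case _: Exception if attempt < maxAttempts - 1 => attempt += 1
          }
        }
        result.get
      }

      override def get(): Row = current
      override def close(): Unit = underlying.close()
    }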

Re: Beam's recent community development work

2018-07-02 Thread Matei Zaharia
I think telling people that they’re being considered as committers early on is a good idea, but AFAIK we’ve always had individual committers do that with contributors who were doing great work in various areas. We don’t have a centralized process for it though — it’s up to whoever wants to work

Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high-quality contributors. But I also feel the two projects have very different backgrounds and maturity phases. There are 1300+ contributors to Spark, and only 300 to Beam, with the vast majority of contributions coming from a single company for Beam (based on my

Re: Beam's recent community development work

2018-07-02 Thread Holden Karau
As someone who floats a bit between both projects (as a contributor), I'd love to see us adopt some of these techniques to be proactive about growing our committership (I think perhaps we could do this by also moving some of the newer committers into the PMC faster so there are more eyes out looki

Fwd: Beam's recent community development work

2018-07-02 Thread Sean Owen
Worth, I think, a read and consideration from Spark folks. I'd be interested in comments; I have a few reactions too. ---------- Forwarded message --------- From: Kenneth Knowles Date: Sat, Jun 30, 2018 at 1:15 AM Subject: Beam's recent community development work To: , , Griselda Cuevas < g...@ap

[RESULT] [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
The vote passes. Thanks to all who helped with the release! I'll start publishing everything tomorrow, and an announcement will be sent when artifacts have propagated to the mirrors (probably early next week). +1 (* = binding): - Marcelo Vanzin * - Sean Owen * - Tom Graves * - Holden Karau * - Do

Re: [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
I forgot to post it; I'm +1. Tom On Monday, July 2, 2018, 12:19:08 AM CDT, Holden Karau wrote: Leaving documents aside (I think we should maybe have a thread on how we want to handle doc changes to existing releases on dev@), I'm +1. PySpark venv checks out. On Sun, Jul 1, 2018 at 9:40

Retraining with (each document as separate file) creates OOME

2018-07-02 Thread Jatin Puri
Maybe this is a bug. The source can be found at: https://github.com/purijatin/spark-retrain-bug *Issue:* The program takes as input a set of documents, where each document is in a separate file. The Spark program computes tf-idf of the terms (Tokenizer -> Stopword remover -> stemming -> tf -> tfidf). Once
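
For reference, a sketch of the pipeline described above using Spark ML stages (stemming omitted, since Spark ML ships no built-in stemmer, so that stage is presumably custom; docsDF is a hypothetical DataFrame with one row per document file):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

    // Tokenizer -> StopWordsRemover -> term frequencies -> tf-idf
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val tf        = new HashingTF().setInputCol("filtered").setOutputCol("tf")
    val idf       = new IDF().setInputCol("tf").setOutputCol("tfidf")

    val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
    // val model = pipeline.fit(docsDF)  // refitting per retrain, as the report describes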