Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Shivaram Venkataraman
Thanks Fred for the detailed reply. The stability points are especially interesting as a goal for the streaming component in Spark. In terms of next steps, one approach that might be helpful is trying to create benchmarks or situations that mimic real-life workloads, and then we can work on isolatin

Re: Regularized Logistic regression

2016-10-13 Thread Seth Hendrickson
Spark MLlib provides a cross-validation toolkit for selecting hyperparameters. I think you'll find the documentation quite helpful: http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation There is actually a python example for logistic regression there. If
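
For readers who land on this thread later: a minimal sketch of what the linked example boils down to, assuming a DataFrame data_train_df with the standard label/features columns:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(maxIter=100)
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .build())
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    # picks the regParam with the best average metric across the folds
    best_model = cv.fit(data_train_df).bestModel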

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-13 Thread Michael Armbrust
There is a lot of confusion around nullable

RE: Regularized Logistic regression

2016-10-13 Thread aditya1702
Ok, so I tried setting the regParam and tried lowering it. How do I evaluate which regParam is best? Do I have to do it by trial and error? I am currently calculating the log_loss for the model. Is that a good way to find the best regParam value? Here is my code: from math import exp,log #from pyspark.s
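
In case it helps, the trial-and-error loop can be written out directly; a rough sketch, assuming log_loss is the function already defined in the code above:

    from pyspark.ml.classification import LogisticRegression

    best_reg, best_loss = None, float("inf")
    for reg in [0.001, 0.01, 0.1, 1.0, 10.0]:
        lr = LogisticRegression(regParam=reg, maxIter=100)
        model = lr.fit(data_train_df)
        loss = log_loss(model.transform(data_test_df))  # user-defined metric
        if loss < best_loss:
            best_reg, best_loss = reg, loss
    # keep the regParam with the lowest held-out log loss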

RE: Regularized Logistic regression

2016-10-13 Thread aditya1702
Thank you Anurag Verma for replying. I tried increasing the iterations. However, I still get underfitted results. I am checking the model's predictions by seeing how many pairs of labels and predictions it gets right: data_predict_with_model=best_model.transform(data_test_df) final_pred_df=data_predi
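
The label/prediction check being described reduces to something like this sketch (the column names are the pyspark.ml defaults, which is an assumption here):

    data_predict_with_model = best_model.transform(data_test_df)
    correct = data_predict_with_model.filter("label = prediction").count()
    total = data_predict_with_model.count()
    print("accuracy: %f" % (float(correct) / total))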

DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-13 Thread Aleksander Eskilson
Hi there. Working in the space of custom Encoders/ExpressionEncoders, I've noticed that the StructType schema set when creating an object of the ExpressionEncoder[T] class [1] is not the schema actually used to set types for the columns of a Dataset created by using the .as(encoder) method
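
A small reproduction sketch of the nullability side of this, written in Python for brevity (the original report concerns the Scala Encoder API; the file path and field name here are illustrative, and spark is the SparkSession):

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("name", StringType(), nullable=False)])
    df = spark.read.schema(schema).json("people.json")
    # the field can come back nullable = true despite nullable=False above
    df.printSchema()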

Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Mark Hamstra
There were at least a couple of ideas behind the original thinking on using both of those Maps: 1) We needed the ability to efficiently get from jobId to both ActiveJob objects and to sets of associated Stages, and using both Maps here was an opportunity to do a little sanity checking to make sure
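
Not the actual Spark code, but the dual-map pattern being described looks roughly like this sketch: one map answers "which stages belong to this job?", the other "is this job still active?", and disagreement between them signals internal inconsistency.

    import logging
    log = logging.getLogger("scheduler-sketch")

    job_id_to_stage_ids = {}   # jobId -> set of stage ids
    job_id_to_active_job = {}  # jobId -> ActiveJob

    def handle_job_cancellation(job_id):
        if job_id not in job_id_to_stage_ids:
            log.debug("Trying to cancel unregistered job %d", job_id)
            return
        # sanity check: a job with registered stages should also be active
        active_job = job_id_to_active_job.get(job_id)
        assert active_job is not None, "maps out of sync for job %d" % job_id
        # ... fail the job and its independent stages here ...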

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful, and I'd also completely forgotten about the lack of built-in UDAF support, which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python, which will hopefully help a bit there too. I'm getting
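
Until something like that PR lands, the usual workaround is to wrap the JVM function by hand over py4j; a sketch, where com.example.MyUdf and its udf() accessor are hypothetical names on the Scala side, and _to_java_column/_to_seq are private PySpark helpers rather than a stable API:

    from pyspark.sql.column import Column, _to_java_column, _to_seq

    def my_jvm_udf(col):
        sc = spark.sparkContext
        judf = sc._jvm.com.example.MyUdf.udf()  # hypothetical Scala object
        return Column(judf.apply(_to_seq(sc, [col], _to_java_column)))

    df.withColumn("out", my_jvm_udf(df["in"]))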

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Holden Karau
This is a thing I often have people ask me about, and then I do my best to dissuade them from using Spark in the "hot path", and it's normally something which most people eventually accept. Fred might have more information for people for whom this is a hard requirement though. On Thursday, October 13,

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Cody Koeninger
I've always been confused as to why it would ever be a good idea to put any streaming query system on the critical path for synchronous < 100 msec requests. It seems to make a lot more sense to have a streaming system do async updates of a store that has better latency and quality of service char
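
In code, the asynchronous pattern being described looks something like this sketch, where kv_store stands in for whatever low-latency store fronts the synchronous requests (a hypothetical client library, not a real package):

    import kv_store  # hypothetical low-latency key-value store client

    def update_store(rdd):
        def save_partition(records):
            client = kv_store.connect("store-host:6379")
            for key, value in records:
                client.set(key, value)
            client.close()
        rdd.foreachPartition(save_partition)

    # the streaming job refreshes the store; requests never touch Spark
    scored_stream.foreachRDD(update_store)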

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Fred Reiss
On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote: > > > > Could you expand a little bit more on stability? Is it just bursty > workloads in terms of peak vs. average throughput? Also what level of > latencies do you find users care about? Is it on the o

Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Jacek Laskowski
Thanks Imran! Not only did the response come so promptly, but it's also something I could work on (and have another Spark contributor badge unlocked)! Thanks. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follo

Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Imran Rashid
Hi Jacek, doesn't look like there is any good reason -- Mark Hamstra might know this best. Feel free to open a JIRA & PR for it; you can ping Mark, Kay Ousterhout, and me (@squito) for review. Imran On Thu, Oct 13, 2016 at 7:56 AM, Jacek Laskowski wrote: > Hi, > > Is there a reason why DAGSch

DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Jacek Laskowski
Hi, Is there a reason why DAGScheduler.handleJobCancellation checks the active job id in jobIdToStageIds [1] while looking the job up in jobIdToActiveJob [2]? Perhaps synchronized earlier yet still inconsistent. [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark

RE: Regularized Logistic regression

2016-10-13 Thread Anurag Verma
Probably your regularization parameter is set too high. Try regParam=0.1 / 0.2. Also, you should probably increase the number of iterations to something like 500. Additionally you can specify elasticNetParam (between 0 and 1). -Original Message- From: aditya1702 [mailto:adityavya...@gmail.com
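
Spelled out as code (the 0.5 for elasticNetParam is just an illustrative middle ground between pure L2 at 0 and pure L1 at 1, and still needs validation on held-out data):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(regParam=0.1,        # far weaker than regParam=10.0
                            elasticNetParam=0.5,
                            maxIter=500)
    model = lr.fit(data_train_df)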

Regularized Logistic regression

2016-10-13 Thread aditya1702
Hello, I am trying to solve a problem using regularized logistic regression in Spark. I am using the model created by LogisticRegression(): lr=LogisticRegression(regParam=10.0,maxIter=10,standardization=True) model=lr.fit(data_train_df) data_predict_with_model=model.transform(data_test_df) Howeve

RE: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread assaf.mendelson
Hi, We are actually using pyspark heavily. I agree with all of your points; for me the following are the main hurdles: 1. Pyspark does not have support for UDAFs. We have had multiple needs for UDAFs and needed to go to Java/Scala to support these. Having Python UDAF would have made l
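
For reference, the usual stopgap while PySpark lacks UDAFs is collect_list plus an ordinary UDF, which buffers each group in memory (exactly why it is only a stopgap); a sketch with illustrative column names:

    from pyspark.sql.functions import collect_list, udf
    from pyspark.sql.types import DoubleType

    def my_agg(values):  # stand-in for the custom aggregation logic
        return float(sum(values)) / len(values)

    my_agg_udf = udf(my_agg, DoubleType())
    result = (df.groupBy("key")
                .agg(collect_list("value").alias("values"))
                .withColumn("agg", my_agg_udf("values")))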

RE: Official Stance on Not Using Spark Submit

2016-10-13 Thread assaf.mendelson
I actually do not use spark-submit for several use cases; all of them currently revolve around running Spark directly with Python. One of the most important ones is developing in PyCharm. Basically, I am using PyCharm and configuring it with a remote interpreter which runs on the server while my pych
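
The setup being described boils down to pointing a plain Python interpreter (e.g. the one PyCharm runs remotely) at the Spark installation and building the context yourself; a sketch with illustrative paths:

    import os, sys, glob

    os.environ["SPARK_HOME"] = "/opt/spark"
    sys.path.insert(0, "/opt/spark/python")
    sys.path.insert(0, glob.glob("/opt/spark/python/lib/py4j-*-src.zip")[0])

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("ide-dev"))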