Re: Apache Spark - MLLib challenges

2017-09-23 Thread Aseem Bansal
This is something I wrote specifically about the challenges that we faced when taking spark ml models to production: http://www.tothenew.com/blog/when-you-take-your-machine-learning-models-to-production-for-real-time-predictions/ On Sat, Sep 23, 2017 at 1:33 PM, Jörn Franke wrote: > As far as I kno

NullPointer when collecting a dataset grouped by a column

2017-07-24 Thread Aseem Bansal
I was doing this: dataset.groupBy("column").collectAsList() It worked for a small dataset, but for a bigger dataset I got a NullPointerException whose stack trace went down into Spark's code. Is this known behaviour? Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$
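
For what it's worth, a sketch of the pattern that sidesteps collecting raw grouped data — aggregate per group first so the driver-side result stays small ("column" and "value" are placeholder names, and dataset is assumed to exist):

    import static org.apache.spark.sql.functions.*;

    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Aggregate within each group, then collect the much smaller result.
    Dataset<Row> grouped = dataset
        .groupBy(col("column"))
        .agg(collect_list(col("value")).alias("values"));
    List<Row> rows = grouped.collectAsList();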

Re: Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
>* @group agg_funcs >* @since 1.4.0 >*/ > def mean(e: Column): Column = avg(e) > > > That's the same when the argument is the column name. > > So no difference between mean and avg functions. > > > -- > *De :* Aseem Bansa

Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
If I want to compute the mean and subtract it from my column, I can do any of the following in the Spark 2.1.0 Java API. Is there any difference between these? I couldn't find anything from reading the docs. dataset.select(mean("mycol")) dataset.agg(mean("mycol")) dataset.select(avg("mycol")) dataset.agg(
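
As the reply above quotes from Spark's own source, mean is defined as an alias of avg (def mean(e: Column): Column = avg(e)), so all four forms should produce the same plan; select and agg differ only in that agg is the usual entry point after a groupBy. A quick way to verify, assuming a dataset with a numeric column mycol:

    import static org.apache.spark.sql.functions.*;

    // All four should produce an identical physical plan; compare with explain().
    dataset.select(mean("mycol")).explain();
    dataset.agg(mean("mycol")).explain();
    dataset.select(avg("mycol")).explain();
    dataset.agg(avg("mycol")).explain();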

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Aseem Bansal
ent attempt is to start with simple linear regression, as > here: https://issues.apache.org/jira/browse/SPARK-21386 > > > On Thu, 20 Jul 2017 at 08:36 Aseem Bansal wrote: > >> We were able to set initial weights on https://spark.apache.org/ >> docs/2.1.0/api/scala/inde

Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-19 Thread Aseem Bansal
We were able to set initial weights on https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS How can we set the initial weights on https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.Logist
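
For reference, this is the mllib route in Java (the loader and feature count are hypothetical); I don't see an equivalent public hook on ml.classification.LogisticRegression in 2.2:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.classification.LogisticRegressionModel;
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    JavaRDD<LabeledPoint> training = loadTrainingData(); // hypothetical loader
    int numFeatures = 10;                                // assumed feature count

    // run(), inherited from GeneralizedLinearAlgorithm, accepts initial weights.
    LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .run(training.rdd(), Vectors.zeros(numFeatures));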

Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Aseem Bansal
Hi I was reading the API docs of Spark 2.2.0 and noticed a change compared to 2.1.0. Compared to https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression, the 2.2.0 docs at https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.

Spark 2.1 - Inferring schema of dataframe after reading json files, not during

2017-06-02 Thread Aseem Bansal
When we read files in spark, it infers the schema. We also have the option to not infer the schema. Is there a way to ask spark to infer the schema again later, just like when reading json? The reason we want this is that we have a problem in our data files. We have a json file containing this
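
One approach that should work here, assuming the file can first be read without inference: read the raw lines, repair them, and feed them back through the JSON reader, which triggers schema inference again (the path is a placeholder):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("reinfer").getOrCreate();

    // Read the file as plain text first; fix up the problematic records here...
    Dataset<String> raw = spark.read().textFile("data/input.json");

    // ...then ask Spark to infer the schema from the (possibly repaired) strings.
    // In 2.1 the JavaRDD overload is the available one.
    Dataset<Row> inferred = spark.read().json(raw.toJavaRDD());
    inferred.printSchema();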

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
o submit your job via spark-submit? > > On Fri, 7 Apr 2017 at 13:12 Aseem Bansal wrote: > >> When using spark ml's LogisticRegression, RandomForest, CrossValidator >> etc. do we need to give any consideration while coding in making it scale >> with more CPUs or doe

Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
When using spark ml's LogisticRegression, RandomForest, CrossValidator etc., do we need to take any special care in our code to make it scale with more CPUs, or does it scale automatically? I am reading some data from S3 and using a pipeline to train a model. I am running the job on a spark cluster
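
As far as I know the ml algorithms are data-parallel over the input partitions, so the main levers are partition count and the resources granted at submit time — a hypothetical example on YARN (all names and numbers are placeholders):

    spark-submit \
      --master yarn \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 8g \
      --class com.example.TrainJob \
      train-job.jar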

Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Aseem Bansal
I was reading http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest and found that this needs to be done in sklearn. Is that required in spark?
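
As far as I understand, Spark's trees don't need one-hot encoding: if the feature column carries categorical metadata (produced by StringIndexer or VectorIndexer), the forest treats it as categorical directly. A sketch with placeholder column names:

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.RandomForestClassifier;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.VectorAssembler;

    // Index the string column; the output carries categorical metadata,
    // so the trees can split on it natively without one-hot encoding.
    StringIndexer indexer = new StringIndexer()
        .setInputCol("color").setOutputCol("colorIndex");
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"colorIndex", "height"})
        .setOutputCol("features");
    RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("label").setFeaturesCol("features");
    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{indexer, assembler, rf});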

Re: spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Does anyone have any idea what I could enable to find out what it is trying to connect to? On Thu, Mar 2, 2017 at 5:34 PM, Aseem Bansal wrote: > Is there a way to find out what it is trying to connect to? I am running > my spark client from within a docker container so I opened up various

spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Is there a way to find out what it is trying to connect to? I am running my spark client from within a docker container, so I opened up various ports as per http://stackoverflow.com/questions/27729010/how-to-configure-apache-spark-random-worker-ports-for-tight-firewalls after adding all the properti
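
For the record, these are the properties I believe pin Spark's otherwise-random ports so they can be opened in a firewall/docker setup (the port numbers and hostname are arbitrary examples):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.driver.port", "51000")        // driver RPC
        .set("spark.blockManager.port", "51001")  // block manager
        .set("spark.ui.port", "4040");            // web UI
    // When the driver runs inside a container, also make sure the
    // advertised hostname is reachable from the executors:
    conf.set("spark.driver.host", "driver.example.com"); // placeholder host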

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
from json1 and b from json2" then run > explain to give you a hint to how to do it in code > > Regards > Sam > On Tue, 14 Feb 2017 at 14:30, Aseem Bansal wrote: > >> Say I have two files containing single rows >> >> json1.json >> >> {"a":

Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Say I have two files each containing a single row json1.json {"a": 1} json2.json {"b": 2} I read each json file into a dataframe using spark's API, one at a time. So I have Dataset json1DF and Dataset json2DF. If I run "select a, b from __THIS__" in a SQLTransformer then I will get an exception a
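
One workaround sketch, assuming the expected columns are known up front: add any missing column as a typed null so "select a, b" always resolves (the helper and the bigint type are hypothetical choices):

    import static org.apache.spark.sql.functions.lit;

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Hypothetical helper: ensure a column exists, filling with typed nulls.
    static Dataset<Row> withColumnIfMissing(Dataset<Row> df, String name, String type) {
        if (Arrays.asList(df.columns()).contains(name)) {
            return df;
        }
        return df.withColumn(name, lit(null).cast(type));
    }

    Dataset<Row> json1Fixed = withColumnIfMissing(json1DF, "b", "bigint");
    Dataset<Row> json2Fixed = withColumnIfMissing(json2DF, "a", "bigint");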

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Aseem Bansal
fka/msg > queues...for such cases raw access to ML model is essential similar to > mllib model access... > > Thanks. > Deb > On Feb 4, 2017 9:58 PM, "Aseem Bansal" wrote: > >> @Debasish >> >> I see that the spark version being used in the project that yo

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
> the model out of PipelineModel so that predict can be called on it... there > is no dependency of spark context in ml model... On Feb 4, 2017 9:11 AM, "Aseem Bansal" wrote: > >> >>- In Spark 2.0 there is a class called PipelineModel. I know that the >>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
om the > store of choice within ms, it can be used on incoming features to score > through spark.ml.Model predict API...I am not clear on 2200x speedup...why > are we using dataframe and not the ML model directly from API ? On Feb 4, 2017 7:52 AM, "Aseem Bansal" wrote: > &

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
2 input features, and by > the time all the processing was done, we had somewhere around 1000 features > or so going into the linear regression after one hot encoding and > everything else. > > Hope this helps, > Hollin > > On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal wrote: &g

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Does this support Java 7? On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal wrote: > Is computational time for predictions on the order of a few milliseconds (< > 10 ms) like the old mllib library? > > On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins wrote: > >> Hey everyone,

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Is computational time for predictions on the order of a few milliseconds (< 10 ms) like the old mllib library? On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins wrote: > Hey everyone, > > > Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about > MLeap and how you can use it to b

Re: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0)

2017-02-01 Thread Aseem Bansal
Can an admin of the mailing list please remove this email? I get this email every time I send an email to the mailing list. On Wed, Feb 1, 2017 at 5:12 PM, Yahoo! No Reply wrote: > > This is an automatically generated message. > > tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc. > > Your mess

Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Aseem Bansal
*What I want to do* I have trained an ml.classification.LogisticRegressionModel using the spark ml package. It has 3 features and 3 classes. So the generated model has coefficients in a (3, 3) matrix and intercepts in a Vector of length 3, as expected. Now, I want to take these coefficients and convert
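
For reference, extracting those pieces from the 2.1 model looks like this ('model' is the trained instance, assumed to exist):

    import org.apache.spark.ml.classification.LogisticRegressionModel;
    import org.apache.spark.ml.linalg.Matrix;
    import org.apache.spark.ml.linalg.Vector;

    // 'model' is the trained ml.classification.LogisticRegressionModel
    // (3 classes, 3 features in this thread's example).
    Matrix coefficients = model.coefficientMatrix(); // 3 x 3
    Vector intercepts = model.interceptVector();     // length 3

    double w01 = coefficients.apply(0, 1); // coefficient of feature 1 for class 0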

Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
If you want to predict using a dataset then transform is the way to go. If you want to predict on vectors then you will have to wait for this issue to be completed: https://issues.apache.org/jira/browse/SPARK-10413 On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau wrote: > You most likely want the trans
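
A minimal illustration of the transform route (trainingDF and newDataDF are assumed to have a vector "features" column):

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    KMeansModel model = new KMeans().setK(3).setFeaturesCol("features").fit(trainingDF);
    // transform() appends a "prediction" column with the assigned cluster.
    Dataset<Row> predicted = model.transform(newDataDF);
    predicted.select("features", "prediction").show();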

Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Aseem Bansal

Re: Is Spark launcher's listener API considered production ready?

2016-11-04 Thread Aseem Bansal
Does anyone have any idea about this? On Thu, Nov 3, 2016 at 12:52 PM, Aseem Bansal wrote: > While using Spark launcher's listener we came across a few cases where the > failures were not being reported correctly. > > >- https://issues.apache.org/jira/browse/SPAR

Is Spark launcher's listener API considered production ready?

2016-11-03 Thread Aseem Bansal
While using Spark launcher's listener we came across a few cases where failures were not being reported correctly. - https://issues.apache.org/jira/browse/SPARK-17742 - https://issues.apache.org/jira/browse/SPARK-18241 So I just wanted to confirm whether this API is considered production ready
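
For context, this is roughly how we hook the listener up (paths, class names, and master URL are placeholders):

    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // placeholder
        .setMainClass("com.example.MainJob")  // placeholder
        .setMaster("spark://master:7077")     // placeholder
        .startApplication(new SparkAppHandle.Listener() {
            @Override
            public void stateChanged(SparkAppHandle h) {
                System.out.println("state: " + h.getState()); // e.g. FINISHED, FAILED
            }
            @Override
            public void infoChanged(SparkAppHandle h) { }
        });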

Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
PM, Aseem Bansal wrote: > Hi > > We are trying to use some of our artifacts as dependencies while > submitting spark jobs. To specify the remote artifactory URL we are using > the following syntax > > https://USERNAME:passw...@artifactory.companyname.com/ > artifactory/

[SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
Hi We are trying to use some of our artifacts as dependencies while submitting spark jobs. To specify the remote artifactory URL we are using the following syntax https://USERNAME:passw...@artifactory.companyname.com/artifactory/COMPANYNAME-libs But the resolution fails. Although the URL which i
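
For reference, the shape of what we are running (credentials and artifact coordinates are dummies):

    spark-submit \
      --repositories https://USERNAME:PASSWORD@artifactory.companyname.com/artifactory/COMPANYNAME-libs \
      --packages com.companyname:our-artifact:1.0.0 \
      --class com.companyname.Main \
      job.jar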

Fwd: Need help with SVM

2016-10-26 Thread Aseem Bansal
He replied to me. Forwarding to the mailing list. -- Forwarded message -- From: Aditya Vyas Date: Tue, Oct 25, 2016 at 8:16 PM Subject: Re: Need help with SVM To: Aseem Bansal Hello, Here is the public gist: https://gist.github.com/aditya1702/760cd5c95a6adf2447347e0b087bc318

What syntax can be used to specify the latest version of JAR found while using spark submit

2016-10-26 Thread Aseem Bansal
Hi Can someone please share their thoughts on http://stackoverflow.com/questions/40259022/what-syntax-can-be-used-to-specify-the-latest-version-of-jar-found-while-using-s

Can application JAR name contain + for dependency resolution to latest version?

2016-10-26 Thread Aseem Bansal
Hi While using spark-submit to submit spark jobs, is the exact name of the JAR file necessary? Or is there a way to use something like `1.0.+` to denote the latest version found?

Re: Need help with SVM

2016-10-25 Thread Aseem Bansal
Is there any labeled point with label 0 in your dataset? On Tue, Oct 25, 2016 at 2:13 AM, aditya1702 wrote: > Hello, > I am using linear SVM to train my model and generate a line through my > data. > However my model always predicts 1 for all the feature examples. Here is my > code: > > print da
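
In Java API terms the check amounts to this (the DataFrame and its "label" column are assumed):

    // Count examples per label; if label 0 never appears, the model
    // can trivially predict 1 everywhere.
    dataset.groupBy("label").count().show();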

Re: mllib model in production web API

2016-10-18 Thread Aseem Bansal
; vincent.gromakow...@gmail.com> wrote: > Hi > Did you try applying the model with akka instead of spark ? > https://spark-summit.org/eu-2015/events/real-time-anomaly- > detection-with-spark-ml-and-akka/ > > Le 18 oct. 2016 5:58 AM, "Aseem Bansal" a écrit : >

Re: mllib model in production web API

2016-10-17 Thread Aseem Bansal
on a bit more? I'm not sure I understand > it. At the moment we load our models from S3 ( > RandomForestClassificationModel.load(..) ) and then store that in an > object property so that it persists across requests - this is in Scala. Is > this essentially what you mean? > >

Re: mllib model in production web API

2016-10-12 Thread Aseem Bansal
Hi We faced a similar issue. Our solution was to load the model, convert it to its mllib counterpart, cache that, and then use it instead of the ml model. On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen wrote: > I don't believe it will ever scale to spin up a whole distributed job to > serve one reque
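
The caching part of it, sketched (class name and path are hypothetical; the mllib conversion step varies by model type, so it's omitted here): load once at startup, keep it in a field, reuse across requests.

    import org.apache.spark.ml.classification.RandomForestClassificationModel;

    public class ModelHolder {
        // Loaded once at startup (this still needs an active SparkSession),
        // then reused for every incoming request.
        private static final RandomForestClassificationModel MODEL =
            RandomForestClassificationModel.load("s3://bucket/model"); // placeholder path

        public static RandomForestClassificationModel get() {
            return MODEL;
        }
    }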

Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Aseem Bansal
Hi I want to read CSV from one bucket, do some processing and write to a different bucket. I know the way to set S3 credentials using jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY) jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY) But the prob
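
If both buckets use the same credentials this just works; for different credentials per bucket, newer Hadoop s3a clients support per-bucket settings — a sketch, assuming a Hadoop version with per-bucket configuration (bucket names and key variables are placeholders):

    import org.apache.hadoop.conf.Configuration;

    Configuration hc = jssc.hadoopConfiguration();
    // Per-bucket credentials via s3a (available in newer Hadoop releases, 2.8+):
    hc.set("fs.s3a.bucket.input-bucket.access.key", READ_ACCESS_KEY);
    hc.set("fs.s3a.bucket.input-bucket.secret.key", READ_SECRET_KEY);
    hc.set("fs.s3a.bucket.output-bucket.access.key", WRITE_ACCESS_KEY);
    hc.set("fs.s3a.bucket.output-bucket.secret.key", WRITE_SECRET_KEY);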

When will the next version of spark be released?

2016-10-04 Thread Aseem Bansal
Hi I looked at Maven Central releases and guessed that spark has something like a two-month release cycle, or sometimes even monthly releases. But the release of Spark 2.0.0 was in July, so maybe that is wrong. When will the next version be released, or is it more on an ad-hoc basis? Asking as there are some fi

Re: spark listener does not get fail status

2016-09-29 Thread Aseem Bansal
Hi In case my previous email was lacking in details, here are some more details. - using Spark 2.0.0 - launching the job using org.apache.spark.launcher.SparkLauncher.startApplication(myListener) - checking state in the listener's stateChanged method On Thu, Sep 29, 2016 at 5:24 PM,

spark listener does not get fail status

2016-09-29 Thread Aseem Bansal
Hi I am submitting a job via the spark api but I never get a fail status, even when the job throws an exception or exits via System.exit(-1). How do I indicate, via the SparkListener API, that my job failed?

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-02 Thread Aseem Bansal
Hi Thanks for all the details. I was able to convert from ml.NaiveBayesModel to mllib.NaiveBayesModel and get it done. It is fast for our use case. Just one question: before mllib is removed, can the ml package be expected to reach feature parity with mllib? On Thu, Sep 1, 2016 at 7:12 PM, Sean Owen

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
elated to .ml vs .mllib APIs. > > On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal wrote: > > I understand your point. > > > > Is there something like a bridge? Is it possible to convert the model > > trained using Dataset (i.e. the distributed one) to the one which >

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
l for scoring a single example (but, > pretty fine for high-er latency, high throughput batch operations) > > However if you're scoring a Vector locally I can't imagine it's that > slow. It does some linear algebra but it's not that complicated. Even > something unop

Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
Hi I am currently trying to use NaiveBayes to make predictions, but facing the issue that making the predictions takes on the order of a few seconds. I tried other model examples shipped with Spark but they also took a minimum of 500 ms when I used the Scala API. Has anyone used spark ML to do predictions fo

Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
rent. >> Both are using the JVM-based APIs directly. Here and there there's a >> tiny bit of overhead in using the Java APIs because something is >> translated from a Java-style object to a Scala-style object, but this >> is generally trivial. >> >>

Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
Hi Would there be any significant performance difference when using Java vs. Scala API?

spark 2.0.0 - code generation inputadapter_value is not rvalue

2016-09-01 Thread Aseem Bansal
Hi Does spark do some code generation? I am trying to use map on a Java RDD and getting a huge generated file with 17406 lines in my terminal, and then a stacktrace: 16/09/01 13:57:36 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 16/09/01 13:57:36 INFO DefaultWriterConta
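
Yes — Spark generates Java code for whole query stages (whole-stage codegen). To check whether codegen itself is the problem, it can be disabled at some performance cost; a one-line sketch, assuming an existing SparkSession:

    // Disable whole-stage code generation to see if the failure goes away.
    spark.conf().set("spark.sql.codegen.wholeStage", "false");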

Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-29 Thread Aseem Bansal
Hi What all access is needed to save a model to S3? Initially I thought it should be only write. Then I found it also needs delete to delete temporary files. Now that they have given me DELETE access, I am getting the error: Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.serv

spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Aseem Bansal
Hi When Spark saves anything to S3 it creates temporary files. Why? Asking because this requires the access credentials to be given delete permissions along with write permissions.

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying. On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal wrote: > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ > and it mentioned that spark streaming actually mini-batch not actual > streaming. > > I have not used streaming an

Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and it mentioned that spark streaming is actually mini-batch, not true streaming. I have not used streaming and I am not sure what the difference between the two terms is, hence I could not make a judgement myself.

Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-11 Thread Aseem Bansal
Hi I have a Dataset. I will change a String to a String, so there will be no schema changes. Is there a way I can run a map on it? I have seen the function at https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html#map(org.apache.spark.api.java.function.MapFunction,%20org.apac
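
If the transformation can be expressed with built-in functions or a UDF, withColumn avoids the map-with-encoder route entirely — a sketch with an assumed column name "mycol" and an assumed String-typed Dataset ds:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // Built-in functions keep the String -> String shape; no encoder needed.
    Dataset<Row> updated = ds.withColumn("mycol", upper(col("mycol")));

    // For arbitrary String -> String logic, a registered UDF works the same way.
    spark.udf().register("normalize",
        (UDF1<String, String>) s -> s == null ? null : s.trim(),
        DataTypes.StringType);
    Dataset<Row> updated2 = ds.withColumn("mycol", callUDF("normalize", col("mycol")));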

Re: na.fill doesn't work

2016-08-11 Thread Aseem Bansal
Check the schema of the data frame. It may be that your columns are Strings while you are trying to supply a default for numerical data. On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote: > Hi everybody, > > I have a data frame after many transformation, my final task is fill na's > with zeros, but I run
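
A sketch of the check-and-fix, with a placeholder column name "amount":

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    df.printSchema(); // check whether the column is actually numeric

    // If "amount" came through as a string, cast it before filling:
    Dataset<Row> fixed = df
        .withColumn("amount", col("amount").cast("double"))
        .na().fill(0.0);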

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-10 Thread Aseem Bansal
rry > > > > The simplest way to work around it would be to read the csv as a text file > using sparkContext textFile, split each row based on a comma, then convert > it to a dataset afterwards. > > > > *From:* Aseem Bansal [mailto:asmbans...@gmail.com] > *Sent:* 08 August

Re: Multiple Sources Found for Parquet

2016-08-08 Thread Aseem Bansal
This seems to be a common issue with Spark 2.0.0. I faced something similar with CSV, and saw someone facing it with JSON. https://issues.apache.org/jira/browse/SPARK-16893 On Mon, Aug 8, 2016 at 4:08 PM, Ted Yu wrote: > Can you examine classpath to see where *DefaultSource comes from ?* > > *Thanks* > >

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Aseem Bansal
cala/index. > html#org.apache.spark.api.java.JavaSparkContext the classtag doesn't need > to be specified (instead it uses a "fake" class tag automatically for you). > Where are you seeing the different API? > > On Sun, Aug 7, 2016 at 11:32 PM, Aseem Bansal > wrote:

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
c", "xyz"); Dataset ds > = context.createDataset(data, Encoders.STRING()); > > I think you should be calling > > .as((Encoders.STRING(), Encoders.STRING())) > > or similar > > Ewan > > On 8 Aug 2016 06:10, Aseem Bansal wrote: > > Hi All > >

Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Aseem Bansal
Earlier, for broadcasting, we just needed to use sparkcontext.broadcast(objectToBroadcast). But now it is sparkcontext.broadcast(objectToBroadcast, classTag). What is the classTag here?
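
From Java, if you are on the Scala SparkContext, a ClassTag can be conjured like this (MyLookup and lookupTable are hypothetical); alternatively the Java wrapper supplies the tag for you:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import scala.reflect.ClassTag;
    import scala.reflect.ClassTag$;

    // A ClassTag carries the runtime class for Scala's generics.
    ClassTag<MyLookup> tag = ClassTag$.MODULE$.apply(MyLookup.class);
    Broadcast<MyLookup> bc = sparkContext.broadcast(lookupTable, tag);

    // Simpler: the Java wrapper creates the tag internally.
    Broadcast<MyLookup> bc2 = new JavaSparkContext(sparkContext).broadcast(lookupTable);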

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi All Has anyone done this with Java API? On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal wrote: > I need to use few columns out of a csv. But as there is no option to read > few columns out of csv so > 1. I am reading the whole CSV using SparkSession.csv() > 2. selecting few of

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
Pairs.printSchema > root > |-- name: string (nullable = true) > |-- city: string (nullable = true) > > Is this what you're after? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 http://bit.ly/mastering-a

Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
I need to use a few columns out of a csv, but as there is no option to read only a few columns out of a csv: 1. I am reading the whole CSV using SparkSession.csv() 2. selecting a few of the columns using DataFrame.select() 3. applying a schema using the .as() function of Dataset. I tried to extend org.apac

What is "Developer API " in spark documentation?

2016-08-05 Thread Aseem Bansal
Hi Many parts of the spark documentation say "Developer API". What does that mean?

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Aseem Bansal
Hi Depending on how you are reading the data in the first place, can you simply use the header as a header instead of as a row? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq) See the header option On Wed, Aug 3, 2016 at 10:14 PM, Car
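
For completeness, the option in question (the path is a placeholder):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = spark.read()
        .option("header", "true")      // first line becomes the column names
        .option("inferSchema", "true") // optional: also infer column types
        .csv("data/file.csv");         // placeholder path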

Spark 2.0 - Case sensitive column names while reading csv

2016-08-03 Thread Aseem Bansal
While reading csv via DataFrameReader, how can I make column names case sensitive? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html None of the options specified mention case sensitivity http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFr
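
As far as I know this is governed by a SQL conf rather than a reader option — a sketch, assuming an existing SparkSession and a placeholder path:

    // The analyzer is case-insensitive by default; this conf flips it.
    spark.conf().set("spark.sql.caseSensitive", "true");
    Dataset<Row> df = spark.read().option("header", "true").csv("data/file.csv");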