[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
, Apache MXNet, PyTorch/Torch, XGBoost, Apache Livy, Apache Zeppelin, Jupyter, etc. Please consider submitting an abstract at https://dataworkssummit.com/san-jose-2018/ Thanks Yanbo

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 2018. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark for SQL/streaming processing, machine learning and data science. Information on submitting an abstract is at https

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
You are right, native Spark MLlib CrossValidation can't run *different* algorithms in parallel. Thanks Yanbo On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem wrote: > Hi Yanboo, > > Thank You, I very much appreciate your help. > > For the current use case, the data can fit in

Re: sparkR 3rd library

2017-09-05 Thread Yanbo Liang
de of SparkR UDF, please refer to this test case: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_context.R#L171 Thanks Yanbo On Tue, Sep 5, 2017 at 6:42 AM, Felix Cheung wrote: > Can you include the code you call spark.lapply? > > > -

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
If yes, you can also try spark-sklearn, which can distribute the training of multiple models (single-node training with sklearn) across a cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn Thanks Yanbo On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy wrote: >

Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea, Could you let us know which ML algorithm you use? What are the number of instances and the dimension of your dataset? AFAIK, Spark MLlib can train models with several million features if you configure it correctly. Thanks Yanbo On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet wrote: >

Re: [BlockMatrix] multiply is an action or a transformation ?

2017-08-20 Thread Yanbo Liang
hanks Yanbo On Sun, Aug 13, 2017 at 10:30 PM, Jose Francisco Saray Villamizar < jsa...@gmail.com> wrote: > Hi Everyone, > > Sorry if the question can be simple, or confusing, but I have not see > anywhere in documentation > the anwser: > > Is multiply method in BlockMatr

Re: Huber regression in PySpark?

2017-08-20 Thread Yanbo Liang
merged into LinearRegression. I will update this PR ASAP, and I'm looking forward to your reviews and comments. After the Scala implementation is merged, it will be very easy to add the corresponding PySpark API; then you can use it to train a Huber regression model in a distributed environment. Thanks

Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone, Would you mind sharing minimal code to reproduce this issue? Yanbo On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti wrote: > Hello, I have this problem and Google is not helping. Instead, it looks > like an unreported bug and there are no hints to possible worka

Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
file system. Could you write a Spark DataFrame to this file system and check whether it works well? Thanks Yanbo On Tue, Jun 27, 2017 at 8:47 PM, John Omernik wrote: > Hello all, I am running PySpark 2.1.1 as a user, jomernik. I am working > through some documentation here: >

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which makes it unsuitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote: > Hi, >

Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function computes the QR decomposition of a tall and skinny RowMatrix, the output R always has a small rank. [image: Inline image 1] On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote: > hi > > *def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, > Matr

Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here: http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson wrote: > Hi, > > I have seen that databricks have higher order functions ( > https://docs.databrick

Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
gfortran runtime library is still required for Spark 2.1 for better performance. If it's not present on your nodes, you will see a warning message and a pure JVM implementation will be used instead, but you will not get the best performance. Thanks Yanbo On Wed, Jun 21, 2017 at 5:30 PM, Sa

Re: BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-12 Thread Yanbo Liang
Yeah, for binary data, you can also use MulticlassClassificationEvaluator to evaluate other metrics which BinaryClassificationEvaluator doesn't cover, such as accuracy, f1, weightedPrecision and weightedRecall. Thanks Yanbo On Thu, May 11, 2017 at 10:31 PM, Lan Jiang wrote: > I reali

[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year, September 20-21. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark. Information on submitting an abstract is at https://dataworkssummit.com/sydney-2017/abstracts/submit-abstract

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim, The Spark ML API doesn't currently support setting an initial model for GMM. I hope we can get this feature into Spark 2.3. Thanks Yanbo On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith wrote: > Hi, > > I am trying to figure out the API to initialize a gaussian mixture model > using

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
StreamingContext is an old API; if you want to process streaming data, you can use SparkSession directly. FYI: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Thanks Yanbo On Fri, Apr 28, 2017 at 12:12 AM, kant kodali wrote: > Actually one more question al

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way? val spark = SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, b.jar").getOrCreate() Thanks Yanbo On Thu, Apr 27, 2017 at 9:21 AM, kant kodali wrote: > I am using Spark 2.1 BTW. > > On We

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about joining your table with a mapping table? On Thu, Apr 27, 2017 at 9:58 PM, Nishanth wrote: > I am facing a major issue on replacement of Synonyms in my DataSet. > > I am trying to replace the synonym of the Brand names to its equivalent > names. > > I have tried 2 methods to solve this issue.
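
The join suggestion above can be sketched in plain Python as the equivalent of a left join against a mapping table. The brand names and the `normalize_brands` helper are hypothetical and only illustrate the idea; in Spark this would presumably be a DataFrame join on the brand column rather than a dictionary lookup.

```python
# Plain-Python stand-in for a left join against a synonym mapping table.
# All names and sample data here are hypothetical.

def normalize_brands(rows, synonym_map):
    """Replace each row's brand with its canonical name when a mapping exists.

    Rows without a match keep their original brand, which mirrors a left
    join followed by a fallback to the original value.
    """
    return [
        {**row, "brand": synonym_map.get(row["brand"], row["brand"])}
        for row in rows
    ]


rows = [
    {"id": 1, "brand": "HP"},
    {"id": 2, "brand": "Hewlett Packard"},
    {"id": 3, "brand": "Dell"},
]
synonym_map = {"HP": "Hewlett-Packard", "Hewlett Packard": "Hewlett-Packard"}

for row in normalize_brands(rows, synonym_map):
    print(row["id"], row["brand"])
```

Keeping the synonyms in a table rather than inside a UDF means the mapping can grow without touching the transformation logic.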

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
"split_value", split_func("value")).show() Thanks Yanbo On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman wrote: > documentDF = spark.createDataFrame([ > > ("Hi I heard about Spark".split(" "), ), > > ("I wish Java could use c

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
be a sparse vector (or matrix for multinomial case) if it's sparse enough. Thanks Yanbo On Sun, Mar 19, 2017 at 5:02 AM, Dhanesh Padmanabhan wrote: > It shouldn't be difficult to convert the coefficients to a sparse vector. > Not sure if that is what you are looking for > >

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-17 Thread Yanbo Liang
Hi Adrian, Did you try SQLTransformer? Your preprocessing steps are SQL operations and can be handled by SQLTransformer in MLlib pipeline scope. Thanks Yanbo On Thu, Mar 9, 2017 at 11:02 AM, aATv wrote: > I want to start using PySpark Mllib pipelines, but I don't understand >

Re: ML PIC

2016-12-21 Thread Yanbo Liang
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the progress. On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath wrote: > It is part of the general feature parity roadmap. I can't recall offhand > any blocker reasons it's just resources > On Wed, 21 Dec 2016 at 17:05, Robert Ham

Re: Usage of mllib api in ml

2016-11-20 Thread Yanbo Liang
You can refer to this example (http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation), which uses BinaryClassificationEvaluator; it should be very straightforward to switch to MulticlassClassificationEvaluator. Thanks Yanbo On Sat, Nov 19, 2016 at 9:03 AM

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-19 Thread Yanbo Liang
ix(oldRDD, nRows, nCols) mat.columnSimilarities() Please feel free to let me know whether it can satisfy your requirements. Thanks Yanbo On Wed, Nov 16, 2016 at 9:26 AM, Russell Jurney wrote: > Asher, can you cast like that? Does that casting work? That is my > confusion: I don't know what a DataFram

Re: VectorUDT and ml.Vector

2016-11-19 Thread Yanbo Liang
dataframe (Vector or Matrix)? I think it's ml.linalg.Vector, so you should use *MLUtils.convertVectorColumnsFromML*. Thanks Yanbo On Mon, Nov 7, 2016 at 5:25 AM, Ganesh wrote: > I am trying to run a SVD on a dataframe and I have used ml TF-IDF which > has created a dataframe. >

Re: why is method predict protected in PredictionModel

2016-11-19 Thread Yanbo Liang
This function is currently used internally; we will expose it as public to support making predictions on a single instance. See discussion at https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo On Thu, Nov 17, 2016 at 1:24 AM, wobu wrote: > Hi, > > we were using Spark 1.3.1 f

Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-16 Thread Yanbo Liang
requirements. BTW, I'm the author of Spark AFTSurvivalRegression. Any more questions, please feel free to let me know. http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression http://spark.apache.org/docs/latest/api/R/index.html Thanks Yanbo On Tue, Nov 15, 20

Re: HashingTF for TF.IDF computation

2016-10-23 Thread Yanbo Liang
generated by HashingTF or CountVectorizer. FYI http://spark.apache.org/docs/latest/ml-features.html#tf-idf Thanks Yanbo On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu wrote: > Hello everyone, > > I'm having a usage issue with HashingTF class from Spark MLLIB. > > I'm com

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" of your RandomForestClassifier or RandomForestRegressor. It's a warning that will not affect the result but may make your training slower. Thanks Yanbo On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) wrote: > Hi Xi Shen > > Th

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
#L551 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L588 Thanks Yanbo On Mon, Oct 10, 2016 at 7:27 AM, Sean Owen wrote: > (BTW I think it means "when no standardization is applied", which

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
; Date: Saturday, October 8, 2016, 12:21 AM > To: Yanbo Liang > > Cc: "d...@spark.apache.org" , "user@spark.apache.org" > > Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB? > > Thanks for replying. > When could you send out the PR? > > From:

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question and I had a similar requirement in my work. I'm currently copying the implementation from mllib to ml and then exposing the maximum log likelihood. I will send this PR soon. Thanks. Yanbo On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote: > > Hi, > > Do y

Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so the results from both Spark and Matlab are right. Thanks On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote: > > just looking at a comparision between Matlab and Spark for svd with an > input matrix N > > > this is matlab code - yes very sm
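
To illustrate why differing signs are harmless, here is a minimal plain-Python sketch using a made-up rank-1 factorization (the values of `sigma`, `u` and `v` are assumptions chosen for the example, not taken from the thread). Flipping the sign of a left singular vector together with its matching right singular vector leaves the reconstructed matrix unchanged.

```python
# Sign ambiguity in SVD: (u, v) and (-u, -v) give the same sigma * u * v^T.
# The numbers below are illustrative only.

def outer_scaled(sigma, u, v):
    """Return sigma * u * v^T as a list of rows."""
    return [[sigma * ui * vj for vj in v] for ui in u]


sigma = 5.0
u = [0.6, 0.8]  # left singular vector (unit length)
v = [1.0, 0.0]  # right singular vector (unit length)

original = outer_scaled(sigma, u, v)
flipped = outer_scaled(sigma, [-x for x in u], [-x for x in v])

print(original == flipped)  # True
```

When comparing two libraries, it is therefore safer to compare reconstructed matrices or absolute values than raw vector entries.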

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
maintenance mode. So do all your work under the same APIs. Thanks Yanbo 2016-08-17 1:30 GMT-07:00 : > Hello guys: > I have a problem in loading recommend model. I have 2 models, one is > good(able to get recommend result) and another is not working. I checked > these 2 mode

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-17 Thread Yanbo Liang
If you want to tie them to other data, I think the best way is to use a DataFrame join, provided that they share an identity column. Thanks Yanbo 2016-08-16 20:39 GMT-07:00 ayan guha : > Hi > > Thank you for your reply. Yes, I can get prediction and original features >

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Yanbo Liang
mode, so we strongly recommend users to use the DataFrame-based spark.ml API. Thanks Yanbo 2016-08-17 11:46 GMT-07:00 Michał Zieliński : > I'm using Spark 1.6.2 for Vector-based UDAF and this works: > > def inputSchema: StructType = new StructType().add("input", new >

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and currently cannot be accessed from outside Spark. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys : > I

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction value and the original features from the output DataFrame of model.transform. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha : > Hi > >

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your logistic regression runs? Does your program output the same model across different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen : > I'm using pyspark ML's logistic regression implementation to do some > classificati

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not currently support box constraints on model coefficients. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv : > Hi all, > > Is there any approach to add constrain for weights in linear regression? > What I need is least squares regression with non-negative constrai

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
. Thanks Yanbo 2016-08-08 11:06 GMT-07:00 Vadla, Karthik : > Hello all, > > > > I'm trying to load set of medical images(dicom) into spark SQL dataframe. > Here each image is loaded into matrix column of dataframe. I see spark > recently added MatrixUDT to support this kind of

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.na.drop() or DataFrame.na.fill() to drop or substitute NULL values. Thanks Yanbo 2016-08-07 19:51 GMT-07:00

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame that will be fed into an estimator such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar : > > I have a data frame wit

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
compute term frequency divided by the length of the document, you should write your own function based on transformers provided by MLlib. Thanks Yanbo 2016-08-01 15:29 GMT-07:00 Hao Ren : > When computing term frequency, we can use either HashTF or CountVectorizer > feature extractors. > Howe
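
As a rough sketch of the "write your own function" suggestion above, the plain-Python helper below divides each term count by the document length. The function name and sample document are hypothetical; this is not the MLlib API, only the arithmetic one would wrap in a custom transformer.

```python
from collections import Counter


def normalized_term_frequency(tokens):
    """Return term count divided by document length for each term.

    An empty document yields an empty mapping rather than dividing by zero.
    """
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical nine-token document; "spark" appears twice.
doc = "spark makes big data simple and spark is fast".split()
tf = normalized_term_frequency(doc)
print(tf["spark"])  # 2 / 9
```

Raw counts favor long documents, so the length-normalized form is often what people expect when they read "term frequency" in a textbook.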

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib KMeansModel provides a "computeCost" function, which returns the sum of squared distances of points to their nearest center as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty : > Hi, > > I was trying to evaluate
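
The cost described above can be written out in a few lines of plain Python. This is only a sketch of the definition (sum of squared Euclidean distances to the nearest center), with made-up points and centers; it is not the Spark implementation.

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""

    def squared_distance(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Illustrative data: two tight clusters and their centers.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (9.5, 9.5)]

print(kmeans_cost(points, centers))  # 0.25 + 0.25 + 0.5 + 0.5 = 1.5
```

A lower cost means tighter clusters, but the cost always decreases as k grows, so it is usually compared across several values of k rather than read in isolation.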

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501) for porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty : > Is there any implementation of FPGrowth and Association rules in Spark > Dataframes ? > We have in RD

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley : > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implementation. > Links at th

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; the one you should refer to is jpmml-sparkml (https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang : > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-spark (https://github.com

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third party jpmml-spark (https://github.com/jpmml/jpmml-spark) package which supports a part of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale : > Just found Google dataproc has a preview of spark

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
, MatrixEntry l = [(1, 1, 10), (2, 2, 20), (3, 3, 30)] df = sqlContext.createDataFrame(l, ['row', 'column', 'value']) rdd = df.select('row', 'column', 'value').rdd.map(lambda row: MatrixEntry(*row)) mat = CoordinateMatrix(rdd) mat.entries.

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
Hi Tobi, Thanks for clarifying the question. It's very straightforward to convert the filtered RDD to a DataFrame; you can refer to the following code snippet: from pyspark.sql import Row rdd2 = filteredRDD.map(lambda v: Row(features=v)) df = rdd2.toDF() Thanks Yanbo 2016-07-16 14:51 GMT-

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
elCol="indexed", seed=42) model = rf.fit(td) model.featureImportances Then you can get the feature importances which is a Vector. Thanks Yanbo 2016-07-12 10:30 GMT-07:00 pseudo oduesp : > Hi, > i use pyspark 1.5.0 > can i ask you how i can get feature imprtance for a randomforest

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private to the ml.clustering package. But I think we should plan to expose these APIs as we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : > Hi Spark,Mlib expe

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
orm(df2) df3.show() // Decode to get the original categories. val group = AttributeGroup.fromStructField(df3.schema("encodedName")) val categories = group.attributes.get.map(_.name.get) println(categories.mkString(",")) // Output: b,a,c Thanks Yanbo 2016-07-14 6:46

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
= sc.parallelize(data) model = ChiSqSelector(1).fit(rdd) filteredRDD = model.transform(rdd.map(lambda lp: lp.features)) filteredRDD.collect() However, we strongly recommend migrating to the DataFrame-based API, since the RDD-based API has switched to maintenance mode. Thanks Yanbo 2016-07-14 13:23 GMT

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If this issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.com>: >

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
"prediction").rdd.map { case Row(pred) => pred }.collect() assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18)) Thanks Yanbo 2016-07-11 6:14 GMT-07:00 Fridtjof Sander : > Hi Swaroop, > > from my understanding, Isotonic Regression is currently limited to data >

Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop, Would you mind sharing your code so that others can help you figure out what caused this error? I can run the isotonic regression examples without problems. Thanks Yanbo 2016-07-08 13:38 GMT-07:00 dsp : > Hi I am trying to perform Isotonic Regression on a data set with 9 features >

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35 GMT

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train model by MultilayerPerceptronClassifier. > > It works on sample data from > data/mllib/sample_multiclass_classification_data.txt with 4 feat

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar to launch PySpark with graphframes enabled. You should set the "--py-files" and "--jars" options to the location where you saved graphframes.jar. Thanks Yanbo 2016-07-03 15:48 GMT-07:00 Arun Patel : >

Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick, Please see my inline reply. Thanks Yanbo 2016-06-12 3:08 GMT-07:00 XapaJIaMnu : > Hey, > > I have some additional Spark ML algorithms implemented in scala that I > would > like to make available in pyspark. For a reference I am looking at the > available l

Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares currently cannot solve some ill-conditioned problems; community members have put some effort into resolving this (SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which will train the LinearRegressionModel with the L-BFGS optimization method. 2016-06-09 7

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
ble, label: Double) => (rawPrediction, label) } val metrics = new BinaryClassificationMetrics(scoreAndLabels) metrics.roc() Thanks Yanbo 2016-06-15 7:13 GMT-07:00 matd : > Hi ml folks ! > > I'm using a Random Forest for a binary classification. > I'm interested in gett

Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
"/tmp/lr-model") val data = newDataset val prediction = model.transform(data) However, we usually save/load a PipelineModel, which includes the necessary feature transformers and the model training process, rather than the single model; the operations are similar. Thanks Yanbo 2016-06-23 10:54 GMT-0

Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support optimizer as a plugin, since the optimizer interface is private. Thanks Yanbo 2016-06-23 16:56 GMT-07:00 Stephen Boesch : > My team has a custom optimization routine that we would have wanted to > plug in as a replacement for the default LBFGS / OWLQN for

Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
solution for the compatibility issue has been figured out, we will add it back in 2.1. Thanks Yanbo 2016-06-27 11:57 GMT-07:00 Mehdi Meziane : > Hi all, > > We have some problems while implementing custom Transformers in JAVA > (SPARK 1.6.1). > We do override the method copy, but

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? Better still, paste your code and the exception here if possible so that other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 Al

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Yes, you are right. 2016-05-30 2:34 GMT-07:00 Abhishek Anand : > > Thanks Yanbo. > > So, you mean that if I have a variable which is of type double but I want > to treat it like String in my model I just have to cast those columns into > string and simply run the glm model. S

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, categorical features (columns of type string) will be one-hot encoded automatically. So pre-processing like `as.factor` is not necessary; you can feed your data directly to the model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand : > Hi , > > I wa

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to throw an exception when you create a Vector with an element of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when > dealing with a vector w

Re: Reg:Reading a csv file with String label into labelepoint

2016-03-16 Thread Yanbo Liang
featureCol and labelCol. Thanks Yanbo 2016-03-16 13:41 GMT+08:00 Dharmin Siddesh J : > Hi > > I am trying to read a csv with few double attributes and String Label . > How can i convert it to labelpoint RDD so that i can run it with spark > mllib classification algorithms. >

Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
track the progress of https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo 2016-02-27 8:52 GMT+08:00 Eugene Morozov : > Hi everyone. > > I have a requirement to run prediction for random forest model locally on > a web-service without touching spark at all in some spe

Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
("parquet").mode("overwrite").save(output) > val data = sqlContext.read.format("parquet").load(output) Thanks Yanbo 2016-02-27 2:01 GMT+08:00 Raj Kumar : > Thanks for the response Yanbo. Here is the source (it uses the > sample_libsvm_data.txt file

Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
mples/ml/AFTSurvivalRegressionExample.scala#L48> . Maybe we can add this feature later. Thanks Yanbo 2016-02-26 14:35 GMT+08:00 Stuti Awasthi : > Hi All, > > I wanted to apply Survival Analysis using Spark AFT algorithm > implementation. Now I perform the same in R using coxph model

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually Spark SQL `groupBy` with `count` can give you the frequency in each bin. You can also try DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML. > > Best
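
The "assign a bin, then group and count" idea above can be sketched in plain Python as follows. The `bin_frequencies` helper, the split points and the sample values are all hypothetical; bins here are half-open intervals, which may differ from how a particular Spark transformer treats the last bin.

```python
from bisect import bisect_right
from collections import Counter


def bin_frequencies(values, splits):
    """Count values falling in each half-open bin [splits[i], splits[i + 1]).

    Values outside [splits[0], splits[-1]) are ignored in this sketch.
    """
    counts = Counter()
    for value in values:
        if splits[0] <= value < splits[-1]:
            # bisect_right returns the index of the first split greater
            # than value, so subtracting one gives the bin index.
            counts[bisect_right(splits, value) - 1] += 1
    return [counts[i] for i in range(len(splits) - 1)]


values = [1, 2, 2, 5, 7, 9, 10]
splits = [0, 3, 6, 11]

print(bin_frequencies(values, splits))  # [3, 1, 3]
```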

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code which can help others to diagnose this issue? Which version did you use? I can not reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization tools to c

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
= standardScaler.fit(ovarian2) val ovarian3 = ssModel.transform(ovarian2) val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features") val model = aft.fit(ovarian3) val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }
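
The last step in the snippet above divides each coefficient by the matching standard deviation. A minimal plain-Python check of why that works, assuming the features were scaled by their standard deviation only (no centering) and using made-up numbers: a coefficient w on x / std contributes the same amount as w / std on the unscaled x.

```python
# Illustrative values only; none of them come from the thread.

def rescale_coefficients(coefficients, stds):
    """Map coefficients fitted on x / std back to the original feature scale."""
    return [w / s for w, s in zip(coefficients, stds)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


std_coefficients = [2.0, -4.0]  # fitted on standardized features
stds = [0.5, 2.0]               # per-feature standard deviations
x = [3.0, 8.0]                  # one instance on the original scale

x_scaled = [xi / s for xi, s in zip(x, stds)]
original_scale = rescale_coefficients(std_coefficients, stds)

# Both linear predictors agree.
print(dot(std_coefficients, x_scaled), dot(original_scale, x))  # -4.0 -4.0
```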

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti, This is a bug in AFTSurvivalRegression; we did not handle "lossSum == infinity" properly. I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting this issue. Yanbo 2016-02-12 15:03 GMT+08:00 Stuti Awasthi :

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it is not always correct for a pipeline model: some transformers in the pipeline, such as OneHotEncoder, change the features. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this : > > Given a dataframe com

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Yanbo 2016-01-20 1:15 GMT+08:00 Vinayak Agrawal : > Yes, you can use Rformula library. Please see > > https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html > > On Tue, Jan 19, 2016 at 10:34 AM, Devesh Raj Singh > wrote: >

Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy, I will take a look at your code after you share it. Thanks! Yanbo 2016-01-23 0:18 GMT+08:00 Andy Davidson : > Hi Yanbo > > I recently code up the trivial example from > http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html > I > d

Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting p-values and t-values from the Linear Regression model; other models such as logistic regression are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226 Thanks Yanbo 2016-01-19 7:05 GMT+08:00 Andy Davidson : > Hi Yanbo > > I am using 1.6.0. I am having a hard of time trying to figure out what the > exact equation is. I do not know Scala. > >

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
-classification-regression.html#random-forest-classifier . Thanks Yanbo 2016-01-16 0:16 GMT+08:00 Robin East : > re 1. > The pull requests reference the JIRA ticket in this case > https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was > rel

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
pache/spark/ml/feature/IDF.scala#L121 I found that the documentation of IDF is not very clear; we need to update it. Thanks Yanbo 2016-01-16 6:10 GMT+08:00 Andy Davidson : > I wonder if I am missing something? TF-IDF is very popular. Spark ML has a > lot of transformers how ever it TF_IDF is no

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Linear Regression does not currently support outputting the AIC value. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai : > Hi > > Is it possible to get AIC value in Linear Regression using ml

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of the Matrix theta is the number of classes, and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The java Doc for Matrix says that it is column major > formate > > I have tra
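To make the reshape concrete, here is a small plain-Python sketch. It assumes theta has been flattened into a column-major array (the layout the quoted Javadoc describes) with numClasses rows and numFeatures columns; the helper names are hypothetical, and the actual storage of a given model's matrix may differ, so treat the layout as an assumption to verify.

```python
def theta_at(values, num_rows, row, col):
    """Element (row, col) of a matrix stored column-major in a flat list.

    In column-major order each column is contiguous, so column `col`
    starts at offset col * num_rows.
    """
    return values[col * num_rows + row]


def theta_rows(values, num_rows, num_cols):
    """Reshape the flat column-major list into one list per class (row)."""
    return [
        [theta_at(values, num_rows, r, c) for c in range(num_cols)]
        for r in range(num_rows)
    ]


# Hypothetical model with 2 classes and 3 features, flattened column-major.
flat = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
per_class = theta_rows(flat, num_rows=2, num_cols=3)
```

Here per_class[0] holds the three feature values for the first class and per_class[1] those for the second.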

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying a model? Using the model to make predictions from R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma : > Hi All, > > Does any one over here has deployed a model produced in SparkR or atleast > help me with the steps fo

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea : > Hi, > > In my app, I have a Params scala object that keeps all the specific > (hyper)para

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
into StandardScaler. Thanks Yanbo 2016-01-10 8:10 GMT+08:00 Kristina Rogale Plazonic : > Hi, > > The code below gives me an unexpected result. I expected that > StandardScaler (in ml, not mllib) will take a specified column of an input > dataframe and subtract the mean of the c

Re: Predictive Modelling in sparkR

2016-01-07 Thread Yanbo Liang
Hi Chandan, Do you mean to run your own LR algorithm based on SparkR? Actually, SparkR provides the ability to run the distributed Spark MLlib LR, and the interface is similar to R's GLM. For your reference: https://spark.apache.org/docs/latest/sparkr.html#binomial-glm-model 2016-01-07 2:45 GMT+08:

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
input into the features which can be fed into the model trainer. OneHotEncoder and VectorAssembler are feature transformers provided by Spark ML; you can refer to https://spark.apache.org/docs/latest/ml-features.html Thanks Yanbo 2016-01-08 7:52 GMT+08:00 Annabel Melongo : > Or he can also transform
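As an illustration of the idea only, the sketch below turns a timestamp into a numeric feature vector in plain Python: the day of week is one-hot encoded and the hour is appended as a number. It does not use Spark; the function names and the choice of features are assumptions for illustration, and the real OneHotEncoder and VectorAssembler transformers have their own options and output format.

```python
from datetime import datetime


def one_hot(index, size):
    """Vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def datetime_features(ts):
    """One-hot day of week (Monday = 0) followed by the hour of day.

    Concatenating the pieces plays the role that an assembler step
    would play in a pipeline.
    """
    return one_hot(ts.weekday(), 7) + [float(ts.hour)]


# 2016-01-07 was a Thursday, so index 3 of the one-hot part is set.
features = datetime_features(datetime(2016, 1, 7, 14, 30))
```

Encoding the day of week this way avoids implying an ordering between days, which a single integer column would.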

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext: sc <- sparkR.init(); sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem . > > Error in writeJobj

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe. > > val sc = new SparkCont
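One point worth spelling out is the difference between a distinct count over several columns together and a distinct count for each column separately. The plain-Python sketch below illustrates both on rows represented as dictionaries; it does not call Spark, the helper names are hypothetical, and null handling is deliberately ignored.

```python
def distinct_count_combined(rows, columns):
    """Number of distinct value combinations across the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


def distinct_count_per_column(rows, columns):
    """Number of distinct values in each column, counted independently."""
    return {c: len({row[c] for row in rows}) for c in columns}


# Hypothetical data set with two variables.
rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 2, "b": "x"},
    {"a": 1, "b": "x"},
]
combined = distinct_count_combined(rows, ["a", "b"])
per_column = distinct_count_per_column(rows, ["a", "b"])
```

If the goal is one distinct count per variable, the per-column form is the one to reproduce, for example by applying a distinct count to each column on its own.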

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander, That's cool! Thanks for the clarification. Yanbo 2016-01-05 5:06 GMT+08:00 Ulanov, Alexander : > Hi Yanbo, > > > > As long as two models fit into memory of a single machine, there should be > no problems, so even 16GB machines can handle large models. (ma

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
like the following code snippet: gmmModel.predictSoft(rdd); then you will get a new RDD which is the soft prediction result. All the models in the ML package follow this rule. Yanbo 2016-01-04 22:16 GMT+08:00 Tomasz Fruboes : > Hi Yanbo, > > thanks for info. Is it likely to change
