Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Yanbo Liang
If you run Spark on YARN, the simplest way is to replace the $SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run your application. The spark-submit script will upload this jar to the YARN cluster automatically, and then you can run your application as usual. It does not care about w…

Re: Can it works in load the MatrixFactorizationModel and predict product with Spark Streaming?

2015-06-17 Thread Yanbo Liang
The logs have told you what caused the error: you cannot invoke RDD transformations and actions inside other transformations. You did not do this explicitly, but the implementation of MatrixFactorizationModel.recommendProducts does; you can refer to https://github.com/apache/spark/blob/master/mlli…

Re: Retrieving Spark Configuration properties

2015-07-16 Thread Yanbo Liang
This is because you did not set the parameter "spark.sql.hive.metastore.version". You can check the other parameters that you have set; they will work well. Or you can first set this parameter and then get it. 2015-07-17 11:53 GMT+08:00 RajG : > I am using this version of Spark : *spark-1.4.0-bi…

Re: Reg:Reading a csv file with String label into labelepoint

2016-03-16 Thread Yanbo Liang
Actually it's unnecessary to convert a csv row to LabeledPoint, because we use DataFrame as the standard data format when training a model with Spark ML. What you should do is convert the double attributes to a Vector column named "feature". Then you can train the ML model by specifying the featureCol and labelC…

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to throw an exception when you create a Vector with an element of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when > dealing with a vector w…

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, category features (columns of string type) will be one-hot encoded automatically, so pre-processing like `as.factor` is not necessary; you can directly feed your data to the model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand : > Hi , > > I want to run…

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
…tring columns will be directly > one-hot encoded by the glm provided by sparkR ? > > Just wanted to clarify, as in R we need to apply as.factor for categorical > variables. > > val dfNew = df.withColumn("C0",df.col("C0").cast("String")) > > > Ab…

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? Or better, paste your code and the exception here if applicable, so other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 AlexModestov : > Hel…

Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
Hi Mehdi, Could you share your code so that we can help you figure out the problem? Actually JavaTestParams works well, but there is a compatibility issue with JavaDeveloperApiExample. We have removed JavaDeveloperApiExample temporarily in Spark 2.0 in order not to confuse users. Since the s…

Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support the optimizer as a plugin, since the optimizer interface is private. Thanks Yanbo 2016-06-23 16:56 GMT-07:00 Stephen Boesch : > My team has a custom optimization routine that we would have wanted to > plug in as a replacement for the default LBFGS / OWLQN for use by so…

Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
Let's suppose you have trained a LogisticRegressionModel and saved it at "/tmp/lr-model". You can copy the directory to the production environment and use it to make predictions on users' new data. You can refer to the following code snippet: val model = LogisticRegressionModel.load("/tmp/lr-model") val…
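A minimal sketch of that load-and-score flow, assuming the Spark 2.x ml API, that the model was saved with model.save("/tmp/lr-model") at training time, and a hypothetical Parquet path for the new data:

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ServeLRModel").getOrCreate()

// Load the persisted model directory copied to the production environment.
val model = LogisticRegressionModel.load("/tmp/lr-model")

// The new data must carry the same "features" vector column used during training.
val newData = spark.read.parquet("/tmp/new-users") // hypothetical path
val scored = model.transform(newData) // appends rawPrediction, probability, prediction
scored.select("features", "probability", "prediction").show(5)
```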

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
Hi Mathieu, Using the new ml package to train a RandomForestClassificationModel, you can get the feature importances. Then you can convert the prediction result to an RDD and feed it into BinaryClassificationEvaluator for the ROC curve. You can refer to the following code snippet: val rf = new RandomForestClass…
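A sketch of both steps, assuming Spark 2.x, hypothetical train/test DataFrames with "features" and "label" columns, and using the RDD-based BinaryClassificationMetrics (which exposes the full ROC curve in addition to the AUC):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
val model = rf.fit(train) // `train` is a hypothetical DataFrame
println(model.featureImportances) // vector of per-feature importance

// Build (probability of positive class, label) pairs from the scored test set.
val scoreAndLabel = model.transform(test)
  .select("probability", "label")
  .rdd
  .map(row => (row.getAs[Vector](0)(1), row.getDouble(1)))

val metrics = new BinaryClassificationMetrics(scoreAndLabel)
metrics.roc().collect().foreach(println) // (FPR, TPR) points of the ROC curve
println(s"AUC = ${metrics.areaUnderROC()}")
```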

Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems currently; community members have put in some effort to resolve it (SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which will train the LinearRegressionModel by the L-BFGS optimization method. 2016-06-09 7…

Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick, Please see my inline reply. Thanks Yanbo 2016-06-12 3:08 GMT-07:00 XapaJIaMnu : > Hey, > > I have some additional Spark ML algorithms implemented in scala that I > would > like to make available in pyspark. For a reference I am looking at the > available logistic regression implementat

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Hi Arun, The command bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 will automatically load the required graphframes jar file from the Maven repository; it is not affected by the location where the jar file was placed. Your example works well on my laptop. Or you can try with…

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train a model by MultilayerPerceptronClassifier. > > It works on sample data from > data/mllib/sample_multiclass_classification_data.txt with 4 feat…

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually, the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35 GMT-0…

Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop, Would you mind sharing your code so that others can help you figure out what caused this error? I can run the isotonic regression examples well. Thanks Yanbo 2016-07-08 13:38 GMT-07:00 dsp : > Hi I am trying to perform Isotonic Regression on a data set with 9 features > and a label…

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle a feature column of vector type. It will extract a certain index (controlled by the param "featureIndex") of this feature vector and feed it into model training. It performs the pool adjacent violators algorithm on each partition, so it's distributed and the data is not…

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If the issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.com>: > In the fo…

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
Hi Tobi, The MLlib RDD-based API does support applying the transformation to both a Vector and an RDD, but you did not use it in the appropriate way. Suppose you have an RDD with a LabeledPoint in each line; you can refer to the following code snippet to train a ChiSqSelectorModel and do the transformation:
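A minimal sketch of that flow, assuming a hypothetical RDD[LabeledPoint] named `data` and the RDD-based spark.mllib API:

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the 50 features most correlated with the label (Chi-squared test).
val selector = new ChiSqSelector(50)
val selectorModel = selector.fit(data) // `data` is a hypothetical RDD[LabeledPoint]

// The same model transforms a single Vector, or every row when mapped over the RDD.
val filtered = data.map { lp =>
  LabeledPoint(lp.label, selectorModel.transform(lp.features))
}
```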

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
Since you used two steps (StringIndexer and OneHotEncoder) to encode the categories to a Vector, I guess you want to decode the resulting vector back into the original categories. Suppose you have a DataFrame with only one column named "name" and three categories: "b", "a", "c" (ranked by frequency). Y…
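One way back to the original strings, sketched under the assumption that the fitted StringIndexerModel is still at hand; IndexToString reverses the index column using the labels the indexer learned:

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Fit the indexer as in the original pipeline ("b" -> 0.0, "a" -> 1.0, "c" -> 2.0 by frequency).
val indexerModel = new StringIndexer()
  .setInputCol("name")
  .setOutputCol("nameIndex")
  .fit(df) // `df` is a hypothetical input DataFrame

// indexerModel.labels is the decode table; IndexToString applies it to the index column.
val decoded = new IndexToString()
  .setInputCol("nameIndex")
  .setOutputCol("nameDecoded")
  .setLabels(indexerModel.labels)
  .transform(indexerModel.transform(df))
```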

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private to the ml.clustering package scope. But I think we should make a plan to expose these APIs, like what we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : > Hi Spark,Mlib experts,…

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
Spark 1.5 only supports getting feature importances for RandomForestClassificationModel and RandomForestRegressionModel in Scala. We support this feature in PySpark from 2.0.0 onwards. It's very straightforward with a few lines of code. rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexe…

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
…ot available to me in the python spark 1.4 api. > > Regards, > Tobi > > On Jul 16, 2016 4:53 AM, "Yanbo Liang" wrote: > >> Hi Tobi, >> >> The MLlib RDD-based API does support to apply transformation on both >> Vector and RDD, but you did not use th…

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
Hi Gourav, I cannot reproduce your problem. The following code snippet works well on my local machine; you can try to verify it in your environment. Or could you provide more information so that others can reproduce your problem? from pyspark.mllib.linalg.distributed import CoordinateMatrix, Ma…

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package, which supports a subset of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale : > Just found Google dataproc has a preview of spark 2.0.…

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml ( https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang : > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-spark (https://github.com…

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley : > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implementation. > Links at the end of the READ…

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501), which tracks porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty : > Is there any implementation of FPGrowth and Association rules in Spark > Dataframes ? > We have in RDD but any pointer…

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib's KMeansModel provides a "computeCost" function, which returns the sum of squared distances of points to their nearest centers as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty : > Hi, > > I was trying to evaluate k-means clustering predictio…
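A minimal sketch, assuming a hypothetical RDD[Vector] named `points` and the RDD-based spark.mllib API:

```scala
import org.apache.spark.mllib.clustering.KMeans

// `points` is a hypothetical RDD[Vector] of the data to cluster.
val model = KMeans.train(points, 4, 20) // k = 4, maxIterations = 20

// Within-set sum of squared errors (WSSSE): lower means tighter clusters.
val wssse = model.computeCost(points)
println(s"WSSSE = $wssse")
```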

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the features to determine their column index. It takes no account of the term frequency or the length of the document when assigning indices. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced memory…

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame that will be fed into the estimator, such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar : > > I have a data frame with four colum…

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.drop() or DataFrame.replace() to drop/substitute the NULL values. Thanks Yanbo 2016-08-07 19:51 GMT-07:00 Javie…
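A minimal sketch of the cleanup step, assuming hypothetical input columns "age" and "income"; in the Scala API the null-handling helpers live on DataFrame.na:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Drop rows where any assembler input is NULL, or substitute defaults instead.
val cleaned = df.na.drop(Seq("age", "income")) // `df` and the columns are hypothetical
// val cleaned = df.na.fill(Map("age" -> 0.0, "income" -> 0.0))

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
val assembled = assembler.transform(cleaned)
```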

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
A good way is to implement your own data source to load data in matrix format. You can refer to the LibSVM data source ( https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm), which produces one column of vector type and is very similar to the matrix case. Than…

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not support box constraints on model coefficients currently. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv : > Hi all, > > Is there any approach to add constraints for weights in linear regression? > What I need is least squares regression with non-negative constraints on > the…

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR run performs? Does your program output the same model across different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen : > I'm using pyspark ML's logistic regression implementation to do some > classification on an AWS EMR Yarn…

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction value and the original features from the output DataFrame of model.transform. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha : > Hi > > I have a datase…
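A small sketch of that behavior, assuming a fitted model and a hypothetical input DataFrame that carries an "id" column alongside "features":

```scala
// Columns not used for training (e.g. "id") pass through transform untouched,
// so predictions join back to the original records trivially.
val scored = model.transform(inputDF) // `model` and `inputDF` are hypothetical
scored.select("id", "features", "prediction").show(5)
```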

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark currently. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys : > I am writi…

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Yanbo Liang
…apache.spark.ml to be >> able to access private classes. >> >> Thanks, >> Alexey >> >> On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang wrote: >> >>> It seams that VectorUDT is private and can not be accessed out of Spark >>> currently. It should…

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-17 Thread Yanbo Liang
…together. My question is how to tie them back to other parts of the data, > which was not in LP. > > For example, I have a bunch of other dimensions which are not part of > features or label. > > Sorry if this is a stupid question. > > On Wed, Aug 17, 2016 at 12:57 PM, Yanb…

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
It looks like you mixed the use of ALS from the spark.ml and spark.mllib packages. You can train the model with either one, but you should use the corresponding save/load functions. You cannot train/save the model with spark.mllib ALS and then use spark.ml ALS to load the model; it will throw exceptions. I…
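A sketch of keeping train, save, and load within the same package, assuming Spark 2.x spark.ml and a hypothetical ratings DataFrame with userId/movieId/rating columns:

```scala
import org.apache.spark.ml.recommendation.{ALS, ALSModel}

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(ratings) // `ratings` is a hypothetical DataFrame

// Save and load through the SAME package's API; mixing spark.mllib save with
// spark.ml load (or vice versa) throws, as described above.
model.save("/tmp/als-model")
val restored = ALSModel.load("/tmp/als-model")
```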

Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. It's a feature under development; please refer to SPARK-10413. 2015-12-10 4:37 GMT+08:00 Eugene Morozov : > Hello, > > I'm using RandomForest pipeline (ml package). Everything is working fine > (learning models, prediction, etc),…

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread Yanbo Liang
Hi Satish, You can refer to the following code snippet: df.select(concat(col("String_Column"), lit("00:00:000"))) Yanbo 2015-12-12 16:01 GMT+08:00 satish chandra j : > HI, > I am trying to update a column value in DataFrame; incrementing a column > of integer data type, then the below code works >…

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-13 Thread Yanbo Liang
Sorry, it was added in 1.5.0. 2015-12-13 2:07 GMT+08:00 Satish : > Hi, > Will the below mentioned snippet work for Spark 1.4.0 > > Thanks for your inputs > > Regards, > Satish > ------ > From: Yanbo Liang > Sent: 12-12-2015 20:54 >…

Re: How to save Multilayer Perceptron Classifier model.

2015-12-13 Thread Yanbo Liang
Hi Vadim, Save/load is not supported for the Multilayer Perceptron model currently; you can track the issue at SPARK-11871. Yanbo 2015-12-14 2:31 GMT+08:00 Vadim Gribanov : > Hey everyone! I'm new with spark and scala. I looked at examples in…

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
Hi Asim, The "featureImportances" is only exposed in ML, not MLlib. You need to update your code to use RandomForestClassifier from ML to train and get a RandomForestClassificationModel. Then you can call RandomForestClassificationModel.featureImportances…

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-17 Thread Yanbo Liang
Spark 1.5 officially uses Parquet 1.7.0, but Spark 1.3 uses Parquet 1.6.0. It's better to check which version of Parquet is used in your environment. 2015-12-17 10:26 GMT+08:00 Joseph Bradley : > This method is tested in the Spark 1.5 unit tests, so I'd guess it's a > problem with the Parquet depen…

Re: Are there some solution to complete the transform category variables into dummy variable in scala or spark ?

2015-12-17 Thread Yanbo Liang
Hi Minglei, Spark ML provides a transformer named "OneHotEncoder" to map a column of category indices to a column of binary vectors. It's similar to pandas.get_dummies and sklearn's OneHotEncoder, but the output will be a single column of vector type rather than multiple columns. You can refer to the offi…

Re: Need clarifications in Regression

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, There are two implementations of LinearRegression, one under the ml package and another one…

Re: Linear Regression with OLS

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, You can refer to the official examples of LinearRegression under the ML package ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala ). If you want to train this LinearRegressionModel with OLS, you o…

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
…a JIRA for exposing featureImportances on > org.apache.spark.mllib.tree.RandomForest?, or could you create one? I am > unable to create an issue on JIRA against Spark. > > Thanks. > > Asim > > On Thu, Dec 17, 2015 at 12:07 AM, Yanbo Liang wrote: > >> Hi Asim, >>…

Re: Creating vectors from a dataframe

2015-12-20 Thread Yanbo Liang
Hi Arunkumar, If you want to create a vector from multiple columns of a DataFrame, Spark ML provides VectorAssembler to help with this. Yanbo 2015-12-21 13:44 GMT+08:00 Arunkumar Pillai : > Hi > > > I'm trying to use Linear Regression from the ml library > > but the problem is the independent variable shoul…
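A minimal sketch, assuming hypothetical numeric columns "x1", "x2", "x3" that should become the regression features:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Pack several numeric columns into the single vector column LinearRegression expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3")) // hypothetical independent variables
  .setOutputCol("features")
val assembled = assembler.transform(df) // `df` is a hypothetical DataFrame

val lr = new LinearRegression().setLabelCol("y").setFeaturesCol("features")
val lrModel = lr.fit(assembled)
```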

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Yanbo Liang
Hi Arunkumar, Could you tell me why you need SSerr, SStot and SSreg? Traditionally we use explainedVariance, meanAbsoluteError, meanSquaredError, rootMeanSquaredError and r2 as the metrics of LinearRegression, although you can actually derive SSerr, SStot and SSreg from a composition of the above met…

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Yanbo Liang
If you want to use the CSV format, please refer to the spark-csv project and its examples. https://github.com/databricks/spark-csv 2015-12-24 4:40 GMT+08:00 Zhan Zhang : > Now json, parquet, orc(in hivecontext), text are natively supported. If > you use avro or others, you have to include the package,…

Re: How to ignore case in dataframe groupby?

2015-12-24 Thread Yanbo Liang
You can use DF.groupBy(upper(col("a"))).agg(sum(col("b"))). DataFrame provides the function "upper" to convert a column to uppercase. 2015-12-24 20:47 GMT+08:00 Eran Witkon : > Use DF.withColumn("upper-code",df("countrycode).toUpper)) > or just run a map function that does the same > > On Thu, Dec 24, 20…

Re: How to handle categorical variables in Spark MLlib?

2015-12-25 Thread Yanbo Liang
Hi Hokam, You can use OneHotEncoder to encode category variables to a feature vector; Spark ML provides this transformer. To weight individual categories there is no existing method, but you can implement a UDF which multiplies a factor into a specified element of a vector. Yanbo 2015-12-23…
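A sketch of such a UDF, assuming the one-hot vector sits in a hypothetical column "catVec" and the category at index 2 should be up-weighted; the Vector classes are the spark.mllib ones that ML used in Spark 1.x:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val idx = 2        // hypothetical position of the category to weight
val factor = 3.0   // hypothetical weight

// Multiply one element of the one-hot vector, leaving the rest unchanged.
val weightCategory = udf { v: Vector =>
  val arr = v.toArray.clone()
  arr(idx) *= factor
  Vectors.dense(arr)
}

val weighted = df.withColumn("weightedVec", weightCategory(col("catVec")))
```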

Re: Retrieving the PCA parameters in pyspark

2015-12-25 Thread Yanbo Liang
Hi Rohit, This is a known bug, but you can get these parameters if you use the Scala version. Yanbo 2015-12-03 0:36 GMT+08:00 Rohit Girdhar : > Hi > > I'm using PCA through the python interface for spark, as per the > instructions on this page: > https://spark.apache.org/docs/1.5.1/ml-features.html…

Re: SparkML algos limitations question.

2015-12-27 Thread Yanbo Liang
Hi Eugene, AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems when the model is very large (such as >10M parameters), although I think that limit already covers many use cases. Yanbo 2015-12-16 6:00 GMT+08:00 Joseph Bradley : > Hi Eugene, > > The maxDept…

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
Load the csv file: df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true") Calculate the covariance: cov <- cov(df, "col1", "col2") Cheers Yanbo 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>: > hi all, > I want to use sparkR or spark MLlib to load a csv fi…

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread Yanbo Liang
This problem has been discussed before, but I think there is no straightforward way to read only col_g. 2015-12-30 17:48 GMT+08:00 lin : > Hi all, > > We are trying to read from nested parquet data. The SQL is "select > col_b.col_d.col_g from some_table" and the data schema for some_table is: >…

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
Hi Jia, You can try inputRDD.persist(MEMORY_AND_DISK) and verify whether it produces stable performance. The MEMORY_AND_DISK storage level stores on disk the partitions that don't fit in memory and reads them from there when they are needed. Actually, it's not necessary to set such a large driv…

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali, The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It has k elements, where k is the number of clusters, and each element is the center of the corresponding cluster. Yanbo 2015-12-31 12:52 GMT+08:00 : > Hi, > > I am trying to use kmeans for clustering in spark using…
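A minimal sketch of reading the centers off a fitted model, assuming the RDD-based spark.mllib API and a hypothetical RDD[Vector] named `points`:

```scala
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(points, 3, 20) // `points` is hypothetical; k = 3, maxIterations = 20
model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"cluster $i center: $center") // one Vector per cluster, k in total
}
```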

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
…s.map(new ParsePoint()); > > points.persist(StorageLevel.MEMORY_AND_DISK()); > > KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, > KMeans.K_MEANS_PARALLEL()); > > > Thank you very much! > > Best Regards, > Jia > > On Wed, Dec 30, 201…

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF: val hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(n) 2015-10-16 0:17 GMT+08:00 Nick Pentreath : > Setting the numfeatures higher than vocab size will tend t…

Re: does HashingTF maintain a inverse index?

2016-01-01 Thread Yanbo Liang
Hi Andy, Spark ML/MLlib does not currently provide a transformer to map HashingTF-generated features back to words. 2016-01-01 8:37 GMT+08:00 Hayri Volkan Agun : > Hi, > > If you are using the pipeline api, you do not need to map features back to > documents. > Your input (which is the document text)…

Re: NotSerializableException exception while using TypeTag in Scala 2.10

2016-01-01 Thread Yanbo Liang
I also hit this bug. Have you resolved the issue, or could you give some suggestions? 2014-07-28 18:33 GMT+08:00 Aniket Bhatnagar : > I am trying to serialize objects contained in RDDs using runtime > reflection via TypeTag. However, the Spark job keeps > failing with java.io.NotSerializableException…

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
Hi Tomasz, The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel, etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called in…

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share your code snippet so that others can help diagnose your problem? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I'm running into a stackOverflow > exception whenever there are too many combinations to deal with and/or too > many…

Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next release (Spark 2.0). 2016-01-03 23:02 GMT+08:00 : > keyStoneML could be an alternative. > > Ardo. > > On 03 Jan 2016, at 15:50, Arunkumar Pillai > wrote: > > Is there any road map for glm in pipeline? >…

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
…in (near :) ) future? Ability to > call this function only on local data (ie not in rdd) seems to be rather > serious limitation. > > cheers, > Tomasz > > On 02.01.2016 09:45, Yanbo Liang wrote: > >> Hi Tomasz, >> >> The GMM is bind with the peer Java GMM objec…

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
…ster should > have more memory because it runs LBFGS) In my experiments, I've trained the > models with 12M and 32M parameters without issues. > > > > Best regards, Alexander > > > > *From:* Yanbo Liang [mailto:yblia...@gmail.com] > *Sent:* Sunday, December 27, 2015 2:23 A…

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)), or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any function to find the distinct count of all the variables in a > dataframe. > > val sc = new SparkCont…

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave the below error. I am going to > try the option suggested by Prem . > > Error in writeJobj…

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First, extract year, month, day, and time from the datetime. Then decide which variables can be treated as category features, such as year/month/day, and encode them to boolean form using OneHotEncoder. Finally, use VectorAssembler to assemble the encoded output vector and the other raw inp…
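A sketch of that chain, assuming a hypothetical timestamp column "ts", Spark's built-in date functions, and the pre-2.3 OneHotEncoder transformer; the month is used as the example category feature:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

// 1. Extract calendar parts from the timestamp column.
val withParts = df // `df` and "ts" are hypothetical
  .withColumn("year", year(col("ts")).cast("double"))
  .withColumn("month", month(col("ts")).cast("double"))
  .withColumn("day", dayofmonth(col("ts")).cast("double"))

// 2. One-hot encode a category-like part (month 1..12 -> sparse boolean vector).
val encoded = new OneHotEncoder()
  .setInputCol("month")
  .setOutputCol("monthVec")
  .transform(withParts)

// 3. Assemble the encoded vector with the remaining raw inputs into one feature vector.
val features = new VectorAssembler()
  .setInputCols(Array("monthVec", "year", "day"))
  .setOutputCol("features")
  .transform(encoded)
```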

Re: Predictive Modelling in sparkR

2016-01-07 Thread Yanbo Liang
Hi Chandan, Do you mean to run your own LR algorithm on top of SparkR? Actually, SparkR provides the ability to run the distributed Spark MLlib LR, and the interface is similar to R's GLM. For your reference: https://spark.apache.org/docs/latest/sparkr.html#binomial-glm-model 2016-01-07 2:45 GMT+08:…

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
Hi Kristina, The input column of StandardScaler must be of vector type, because it's usually used for feature scaling before model training, and the type of the feature column is vector in most cases. If you only want to standardize a numeric column, you can wrap it as a vector and feed it into Standar…
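A minimal sketch of the wrapping step, assuming a hypothetical numeric column "amount"; VectorAssembler does the single-column wrap:

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Wrap the lone numeric column into a 1-element vector column.
val wrapped = new VectorAssembler()
  .setInputCols(Array("amount")) // hypothetical numeric column
  .setOutputCol("amountVec")
  .transform(df) // `df` is a hypothetical DataFrame

val scaler = new StandardScaler()
  .setInputCol("amountVec")
  .setOutputCol("amountScaled")
  .setWithMean(true)
  .setWithStd(true)
val scaled = scaler.fit(wrapped).transform(wrapped)
```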

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea : > Hi, > > In my app, I have a Params scala object that keeps all the specific > (hyper)parameters of my pr…

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying the model? Using the model to make predictions from R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma : > Hi All, > > Has anyone over here deployed a model produced in SparkR, or could you at least > help me with the steps for deployment? > > Reg…

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yes, the number of rows of the Matrix theta is the number of classes and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The Java doc for Matrix says that it is in column-major > format > > I have tra…
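A tiny sketch of reading theta with that shape in mind, assuming a fitted NaiveBayesModel named `model`; Matrix(i, j) indexes (class, feature) regardless of the column-major storage:

```scala
// `model` is a hypothetical fitted org.apache.spark.ml.classification.NaiveBayesModel.
val theta = model.theta // numClasses rows x numFeatures columns
for (c <- 0 until theta.numRows) {
  // Log-probabilities of each feature given class c.
  val row = (0 until theta.numCols).map(f => theta(c, f))
  println(s"class $c: ${row.take(5).mkString(", ")} ...")
}
```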

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Outputting the AIC value for Linear Regression is not supported currently. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai : > Hi > > Is it possible to get the AIC value in Linear Regression using the ml pipeline ? >…

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
Hi Andy, Actually, the output of the ML IDF model is the TF-IDF vector of each instance rather than an IDF vector. So it's unnecessary to do member-wise multiplication to calculate the TF-IDF values. You can refer to the code at https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/sp…
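A short sketch of the point, assuming a hypothetical DataFrame `docs` with a tokenized "words" column: IDF.fit learns the document frequencies, and IDFModel.transform already emits the final TF-IDF vectors, so no extra member-wise multiplication is needed:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF}

val tf = new HashingTF()
  .setInputCol("words") // hypothetical tokenized column
  .setOutputCol("rawFeatures")
val featurized = tf.transform(docs) // `docs` is a hypothetical DataFrame

val idfModel = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("tfidf")
  .fit(featurized)
val tfidf = idfModel.transform(featurized) // each row holds that instance's TF-IDF vector
```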

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
Hi Robin, #1 This feature is available from Spark 1.5.0. #2 You should use the new ML package rather than the old MLlib package to train the Random Forest model and get featureImportances, because it is only exposed in the ML package. You can refer to the documents: https://spark.apache.org/docs/latest/ml-class…

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
…> int indexOfAnother = tf.indexOf("another"); > > System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother); > > > for (Vector v: localTfIdfs) { > > System.err.println("AEDWIP: V.toString() " + v.toString()); > >…

Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting the p-values and t-values from the Linear Regression model; other models such as the Logistic Regression model are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma : >…

Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
…o not get the same results. I'll put my code up on github over the weekend > if anyone is interested > > Andy > > From: Yanbo Liang > Date: Tuesday, January 19, 2016 at 1:11 AM > > To: Andrew Davidson > Cc: "user @spark" > Subject: Re: has any one impl…

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Hi Devesh, RFormula will encode category variables (columns of string type) as dummy variables automatically. You do not need to do the dummy transform explicitly if you want to train a machine learning model using SparkR, although SparkR only supports a limited set of ML algorithms (GLM) currently. Thanks Yanbo…

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it is not always correct for a pipeline model; some transformers in the pipeline will change the features, such as OneHotEncoder. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this : > > Given a dataframe com…

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti, This is a bug in AFTSurvivalRegression; we did not handle "lossSum == infinity" properly. I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting this issue. Yanbo 2016-02-12 15:03 GMT+08:00 Stuti Awasthi : > Hi All, >…

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
…= standardScaler.fit(ovarian2) val ovarian3 = ssModel.transform(ovarian2) val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features") val model = aft.fit(ovarian3) val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }…

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code so that others can diagnose this issue? Which version did you use? I cannot reproduce the problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization tools to create the vectorize…

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually, Spark SQL's `groupBy` with `count` can get the frequency in each bin. You can also try DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML. > > Best, > Burak…
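A sketch of combining the two suggestions, assuming a hypothetical numeric column "value" and hand-picked bin edges; Bucketizer assigns bin indices, then groupBy/count yields the per-bin frequencies:

```scala
import org.apache.spark.ml.feature.Bucketizer

// Bin edges are hypothetical; infinities give open-ended first/last bins.
val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bin")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, 100.0, Double.PositiveInfinity))

val histogram = bucketizer.transform(df) // `df` is a hypothetical DataFrame
  .groupBy("bin")
  .count()
  .orderBy("bin")
histogram.show()
```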

Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
Hi Stuti, AFTSurvivalRegression does not support computing the predicted survival functions/curves currently. I don't know whether the quantile predictions can help you; you can refer to the example…

Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
…ow; df.printSchema > > df.write.format("json").mode("overwrite").save( OutputDir ) > val data = sqlc.read.format("json").load( OutputDir ) > data.show; data.printSchema > > def main( args: Array[String]):Unit = {} > } > > > --…

Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
Hi Jean, A DataFrame is connected to a SQLContext, which is connected to a SparkContext, so I think it's impossible to run `model.transform` without touching Spark. I think what you need is for models to support prediction on a single instance; then you can make predictions without Spark. You can track t…

Re: Stopping criteria for gradient descent

2015-09-29 Thread Yanbo Liang
Hi Nishanth, The diff of the solution vectors is compared to a relative or absolute tolerance; you can set convergenceTol, which affects the convergence criteria of SGD. 2015-09-17 8:31 GMT+08:00 Nishanth P S : > Hi, > > I am running LogisticRegressionWithSGD in spark 1.4.1 and it always t…
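A minimal sketch of where that knob lives, assuming Spark >= 1.5 (where GradientDescent exposes setConvergenceTol) and a hypothetical RDD[LabeledPoint] named `training`:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val lr = new LogisticRegressionWithSGD()
// Stop early once the relative change in the solution vector drops below 1e-4.
lr.optimizer
  .setNumIterations(200)
  .setConvergenceTol(1e-4)
val model = lr.run(training) // `training` is a hypothetical RDD[LabeledPoint]
```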

Re: RandomForestClassifer does not recognize number of classes, nor can number of classes be set

2015-09-30 Thread Yanbo Liang
Hi Kristina, Currently StringIndexer is a required step before training DecisionTree, RandomForest, and GBT related models. Though it is not necessary for other models such as LogisticRegression and NaiveBayes, it is still strongly recommended as a preprocessing step; otherwise it may lead to inc…

Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Yanbo Liang
Spark ML/MLlib provides featureImportances to estimate the importance of each feature. 2015-10-28 18:29 GMT+08:00 Eugen Cepoi : > Hey, > > Is there some kind…

Re: Getting info from DecisionTreeClassificationModel

2015-10-28 Thread Yanbo Liang
AFAIK, you cannot traverse the tree from the rootNode of DecisionTreeClassificationModel, because the type Node does not expose information about its children. The type InternalNode has children information, but it's private, so users cannot access it. I think the best way to get the probability of each predicti…
