Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Yanbo
First use groupByKey(), which gives you a tuple RDD of (key: K, value: ArrayBuffer[V]). Then use map() on this RDD with a function that performs different operations depending on the key, which acts as a parameter of the function. > On Nov 18, 2014, at 8:59 PM, jelgh wrote: > > Hello everyone, > > I'm new to Spark and I ha
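A minimal sketch of this pattern (assuming an rdd of type RDD[(String, Int)]; the key names and per-key aggregations are illustrative):

val grouped = rdd.groupByKey()                 // RDD[(String, Iterable[Int])]
val result = grouped.map { case (key, values) =>
  key match {
    case "clicks"    => (key, values.sum)      // a different operation per key
    case "latencies" => (key, values.max)
    case _           => (key, values.size)
  }
}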

Re: Converting a column to a map

2014-11-24 Thread Yanbo
jsonFiles in your code is a SchemaRDD rather than an RDD[Array]. If it is a column in the SchemaRDD, you can first use a Spark SQL query to get that column. Alternatively, SchemaRDD supports SQL-like operations such as select / where, which can also extract a specific column. > On Nov 24, 2014, at 4:01 AM, Daniel Haviv wrote: > > H
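A sketch of the SQL route (the path, table name, and column name are illustrative):

val jsonFiles = sqlContext.jsonFile("/path/to/json")        // a SchemaRDD
jsonFiles.registerTempTable("records")
val oneColumn = sqlContext.sql("SELECT name FROM records")  // just the "name" column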

Re: MLLib: LinearRegressionWithSGD performance

2014-11-24 Thread Yanbo
The metrics page reveals that only two executors work in parallel in each iteration. You need to increase the degree of parallelism. Some tips that may be helpful: increase "spark.default.parallelism"; use repartition() or coalesce() to increase the number of partitions. > On Nov 22, 2014, at 3:18 AM, Sameer Ti
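A sketch of the second tip (assumes raw is an RDD[LabeledPoint]; the partition count 64 and iteration count 100 are illustrative):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// spark-submit --conf spark.default.parallelism=64 ...  raises the default instead
val training = raw.repartition(64).cache()                // more partitions -> more parallel tasks
val model = LinearRegressionWithSGD.train(training, 100)  // 100 iterations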

Re: How to keep a local variable in each cluster?

2014-11-24 Thread Yanbo
Sent from my iPad > On Nov 24, 2014, at 9:41 AM, zh8788 <78343...@qq.com> wrote: > > Hi, > > I am new to spark. This is the first time I am posting here. Currently, I > try to implement ADMM optimization algorithms for Lasso/SVM > Then I come across a problem: > > Since the training data(label, feature) is larg

Re: Setting network variables in spark-shell

2014-11-30 Thread Yanbo
Try to use spark-shell --conf spark.akka.frameSize=1 > On Dec 1, 2014, at 12:25 AM, Brian Dolan wrote: > > Howdy Folks, > > What is the correct syntax in 1.0.0 to set networking variables in spark > shell? Specifically, I'd like to set the spark.akka.frameSize > > I'm attempting this: > spark-shel

Re: Serialization issue when using HBase with Spark

2014-12-14 Thread Yanbo
In #1, class HTable is not serializable. You also need to check your self-defined function getUserActions and make sure it is a member function of a class that implements the Serializable interface. Sent from my iPad > On Dec 12, 2014, at 4:35 PM, yangliuyu wrote: > > The scenario is using HTable instance to scan

Re: JSON Input files

2014-12-14 Thread Yanbo
Pay attention to your JSON file; try to change it like the following, so that each record is represented as a single-line JSON string. {"NAME" : "Device 1", "GROUP" : "1", "SITE" : "qqq", "DIRECTION" : "East"} {"NAME" : "Device 2", "GROUP" : "2", "SITE" : "sss", "DIRECTION" : "
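Once each record is on its own line, the file can be loaded directly (a sketch; the path is illustrative):

val devices = sqlContext.jsonFile("/path/to/devices.json")  // one JSON object per line
devices.printSchema()
devices.registerTempTable("devices")
sqlContext.sql("SELECT NAME, DIRECTION FROM devices").collect().foreach(println)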

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Yanbo Liang
If you run Spark on YARN, the simplest way is to replace $SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run your application. The spark-submit script will upload this jar to the YARN cluster automatically, and then you can run your application as usual. It does not care about w

Re: Can it work to load the MatrixFactorizationModel and predict products with Spark Streaming?

2015-06-17 Thread Yanbo Liang
The logs tell you the cause of the error: you cannot invoke RDD transformations and actions inside other transformations. You did not do this explicitly, but the implementation of MatrixFactorizationModel.recommendProducts does; you can refer to https://github.com/apache/spark/blob/master/mlli
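One way around this is to make the model calls on the driver, e.g. inside foreachRDD rather than inside map() (a sketch; userIdStream is a hypothetical DStream[Int] and model a loaded MatrixFactorizationModel):

userIdStream.foreachRDD { rdd =>
  rdd.collect().foreach { userId =>
    val recs = model.recommendProducts(userId, 10)  // executes on the driver
    // ... write recs to your sink
  }
}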

Re: Retrieving Spark Configuration properties

2015-07-16 Thread Yanbo Liang
This is because you did not set the parameter "spark.sql.hive.metastore.version". If you check the other parameters that you have set, retrieving them will work well. Alternatively, you can first set this parameter and then get it. 2015-07-17 11:53 GMT+08:00 RajG : > I am using this version of Spark : *spark-1.4.0-bi
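A sketch of set-then-get (the version value is illustrative):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.hive.metastore.version", "0.13.1")
sqlContext.getConf("spark.sql.hive.metastore.version")  // now returns the value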

Re: Reg: Reading a csv file with String label into labeledpoint

2016-03-16 Thread Yanbo Liang
featureCol and labelCol. Thanks Yanbo 2016-03-16 13:41 GMT+08:00 Dharmin Siddesh J : > Hi > > I am trying to read a csv with few double attributes and String Label . > How can i convert it to labelpoint RDD so that i can run it with spark > mllib classification algorithms. >

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it is reasonable to throw an exception when you create a Vector with an element of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when > dealing with a vector w

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, categorical features (columns of type string) will be one-hot encoded automatically. So pre-processing like `as.factor` is not necessary; you can feed your data directly to the model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand : > Hi , > > I wa

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Yes, you are right. 2016-05-30 2:34 GMT-07:00 Abhishek Anand : > > Thanks Yanbo. > > So, you mean that if I have a variable which is of type double but I want > to treat it like String in my model I just have to cast those columns into > string and simply run the glm model. S

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? Better yet, paste your code and the exception here if applicable, so that other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 Al

Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
solution for the compatibility issue has been figured out, we will add it back in 2.1. Thanks Yanbo 2016-06-27 11:57 GMT-07:00 Mehdi Meziane : > Hi all, > > We have some problems while implementing custom Transformers in JAVA > (SPARK 1.6.1). > We do override the method copy, but

Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support plugging in a custom optimizer, since the optimizer interface is private. Thanks Yanbo 2016-06-23 16:56 GMT-07:00 Stephen Boesch : > My team has a custom optimization routine that we would have wanted to > plug in as a replacement for the default LBFGS / OWLQN for

Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
;/tmp/lr-model") val data = newDataset val prediction = model.transform(data) However, we usually save/load a PipelineModel, which includes the necessary feature transformers and the model training process, rather than the single model; the operations are similar. Thanks Yanbo 2016-06-23 10:54 GMT-0
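A minimal save/load sketch for a fitted pipeline (paths are illustrative; assumes the ML persistence API of Spark 1.6+):

import org.apache.spark.ml.PipelineModel

fitted.write.overwrite().save("/tmp/lr-pipeline")   // fitted: a PipelineModel
val model = PipelineModel.load("/tmp/lr-pipeline")
val predictions = model.transform(newData)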

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
ble, label: Double) => (rawPrediction, label) } val metrics = new BinaryClassificationMetrics(scoreAndLabels) metrics.roc() Thanks Yanbo 2016-06-15 7:13 GMT-07:00 matd : > Hi ml folks ! > > I'm using a Random Forest for a binary classification. > I'm interested in gett
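For reference, a fuller sketch of both pieces (assuming a fitted Spark 2.0 ml RandomForestClassificationModel and a test DataFrame with "probability" and "label" columns):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.Row

println(model.featureImportances)   // per-feature importance vector

val scoreAndLabels = model.transform(test)
  .select("probability", "label")
  .rdd.map { case Row(p: Vector, label: Double) => (p(1), label) }  // P(positive class)
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.roc()            // RDD of (false positive rate, true positive rate) points
metrics.areaUnderROC()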

Re: Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems currently; community members have put some effort into resolving it (SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which will train the model with the L-BFGS optimization method. 2016-06-09 7

Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick, Please see my inline reply. Thanks Yanbo 2016-06-12 3:08 GMT-07:00 XapaJIaMnu : > Hey, > > I have some additional Spark ML algorithms implemented in scala that I > would > like to make available in pyspark. For a reference I am looking at the > available l

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Use bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar to launch PySpark with graphframes enabled. You should set the "--py-files" and "--jars" options to the path where you saved graphframes.jar. Thanks Yanbo 2016-07-03 15:48 GMT-07:00 Arun Patel : &g

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train model by MultilayerPerceptronClassifier. > > It works on sample data from > data/mllib/sample_multiclass_classification_data.txt with 4 feat

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually, the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35 GMT

Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop, Would you mind sharing your code so that others can help you figure out what caused this error? I can run the isotonic regression examples without problems. Thanks Yanbo 2016-07-08 13:38 GMT-07:00 dsp : > Hi I am trying to perform Isotonic Regression on a data set with 9 features >

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
t;prediction").rdd.map { case Row(pred) => pred }.collect() assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18)) Thanks Yanbo 2016-07-11 6:14 GMT-07:00 Fridtjof Sander : > Hi Swaroop, > > from my understanding, Isotonic Regression is currently limited to data &g

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If this issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.com>: &g

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
= sc.parallelize(data) model = ChiSqSelector(1).fit(rdd) filteredRDD = model.transform(rdd.map(lambda lp: lp.features)) filteredRDD.collect() However, we strongly recommend that you migrate to the DataFrame-based API, since the RDD-based API has been switched to maintenance mode. Thanks Yanbo 2016-07-14 13:23 GMT

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
orm(df2) df3.show() // Decode to get the original categories. val group = AttributeGroup.fromStructField(df3.schema("encodedName")) val categories = group.attributes.get.map(_.name.get) println(categories.mkString(",")) // Output: b,a,c Thanks Yanbo 2016-07-14 6:46

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose APIs to get the Bisecting KMeans tree structure; they are private to the ml.clustering package scope. But I think we should make a plan to expose these APIs, as we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : > Hi Spark,Mlib expe

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
elCol="indexed", seed=42) model = rf.fit(td) model.featureImportances Then you can get the feature importances which is a Vector. Thanks Yanbo 2016-07-12 10:30 GMT-07:00 pseudo oduesp : > Hi, > i use pyspark 1.5.0 > can i ask you how i can get feature imprtance for a randomforest

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
Hi Tobi, Thanks for clarifying the question. It's straightforward to convert the filtered RDD to a DataFrame; you can refer to the following code snippet: from pyspark.sql import Row rdd2 = filteredRDD.map(lambda v: Row(features=v)) df = rdd2.toDF() Thanks Yanbo 2016-07-16 14:51 GMT-

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
, MatrixEntry l = [(1, 1, 10), (2, 2, 20), (3, 3, 30)] df = sqlContext.createDataFrame(l, ['row', 'column', 'value']) rdd = df.select('row', 'column', 'value').rdd.map(lambda row: MatrixEntry(*row)) mat = CoordinateMatrix(rdd) mat.entries.

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package, which supports a subset of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale : > Just found Google dataproc has a preview of spark

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml ( https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang : > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-spark (https://github.com

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley : > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implementation. > Links at th

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501), which tracks porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty : > Is there any implementation of FPGrowth and Association rules in Spark > Dataframes ? > We have in RD

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib KMeansModel provides a "computeCost" function, which returns the sum of squared distances from points to their nearest centers, as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty : > Hi, > > I was trying to evaluate
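A short sketch (assumes a Spark 2.0 DataFrame with a "features" vector column; k and the seed are illustrative):

import org.apache.spark.ml.clustering.KMeans

val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
val wssse = model.computeCost(dataset)  // sum of squared distances to nearest centers
println(s"Within-cluster sum of squared errors = $wssse")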

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
compute term frequency divided by the length of the document, you should write your own function based on transformers provided by MLlib. Thanks Yanbo 2016-08-01 15:29 GMT-07:00 Hao Ren : > When computing term frequency, we can use either HashTF or CountVectorizer > feature extractors. > Howe

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame that will be fed into the estimator, such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar : > > I have a data frame wit

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.na.drop() to remove rows with NULL values, or DataFrame.na.fill() to substitute them. Thanks Yanbo 2016-08-07 19:51 GMT-07:00
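A sketch of cleaning before assembly (column names are illustrative):

import org.apache.spark.ml.feature.VectorAssembler

val cleaned = df.na.drop(Seq("age", "income"))   // or: df.na.fill(0.0, Seq("age", "income"))
val assembled = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
  .transform(cleaned)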

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
. Thanks Yanbo 2016-08-08 11:06 GMT-07:00 Vadla, Karthik : > Hello all, > > > > I'm trying to load set of medical images(dicom) into spark SQL dataframe. > Here each image is loaded into matrix column of dataframe. I see spark > recently added MatrixUDT to support this kind of

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not currently support box constraints on model coefficients. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv : > Hi all, > > Is there any approach to add constrain for weights in linear regression? > What I need is least squares regression with non-negative constrai

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does your program output the same model across different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen : > I'm using pyspark ML's logistic regression implementation to do some > classificati

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction values and the original features from the output DataFrame of model.transform, as sketched below. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha : > Hi > >
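A quick way to see this (model and column names are illustrative of a typical classification pipeline):

val predictions = model.transform(testData)
predictions.printSchema()  // original columns plus e.g. rawPrediction, probability, prediction
predictions.select("features", "label", "prediction").show(5)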

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark currently. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys : > I

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Yanbo Liang
mode, so we strongly recommend users to use the DataFrame-based spark.ml API. Thanks Yanbo 2016-08-17 11:46 GMT-07:00 Michał Zieliński : > I'm using Spark 1.6.2 for Vector-based UDAF and this works: > > def inputSchema: StructType = new StructType().add("input", new >

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-17 Thread Yanbo Liang
If you want to tie them to other data, I think the best way is to use a DataFrame join operation, on the condition that they share an identity column. Thanks Yanbo 2016-08-16 20:39 GMT-07:00 ayan guha : > Hi > > Thank you for your reply. Yes, I can get prediction and original features >
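A sketch of that pattern (names are illustrative; monotonically_increasing_id just manufactures an identity column when the data lacks one):

import org.apache.spark.sql.functions.monotonically_increasing_id

val withId = raw.withColumn("id", monotonically_increasing_id())
val predictions = model.transform(withId)     // transform carries "id" through
val tied = predictions.join(otherData, "id")  // bring back the rest of the record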

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
maintenance mode. So do all your work under the same APIs. Thanks Yanbo 2016-08-17 1:30 GMT-07:00 : > Hello guys: > I have a problem in loading recommend model. I have 2 models, one is > good(able to get recommend result) and another is not working. I checked > these 2 mode

Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. This feature is under development; please refer to SPARK-10413. 2015-12-10 4:37 GMT+08:00 Eugene Morozov : > Hello, > > I'm using RandomForest pipeline (ml package). Everything is working fine > (learning models, prediction, etc),

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-12 Thread Yanbo Liang
Hi Satish, You can refer to the following code snippet: df.select(concat(col("String_Column"), lit("00:00:000"))) Yanbo 2015-12-12 16:01 GMT+08:00 satish chandra j : > HI, > I am trying to update a column value in DataFrame, incrementing a column > of integer data t
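(Note: this snippet assumes the SQL functions are in scope, i.e. import org.apache.spark.sql.functions.{col, concat, lit}.)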

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-13 Thread Yanbo Liang
Sorry, it was added in 1.5.0. 2015-12-13 2:07 GMT+08:00 Satish : > Hi, > Will the below mentioned snippet work for Spark 1.4.0 > > Thanks for your inputs > > Regards, > Satish > ------ > From: Yanbo Liang > Sent: 12-12-2015 20:54 >

Re: How to save Multilayer Perceptron Classifier model.

2015-12-13 Thread Yanbo Liang
Hi Vadim, Save/load is not supported for the Multilayer Perceptron model currently; you can track the issue at SPARK-11871 <https://issues.apache.org/jira/browse/SPARK-11871>. Yanbo 2015-12-14 2:31 GMT+08:00 Vadim Gribanov : > Hey everyone! I’m new with spark and scala. I looked at ex

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
e/spark/examples/ml/RandomForestClassifierExample.scala> . Yanbo 2015-12-17 13:41 GMT+08:00 Asim Jalis : > I wanted to use get feature importances related to a Random Forest as > described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133 > > However, I don’t see how to

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-17 Thread Yanbo Liang
Spark 1.5 officially uses Parquet 1.7.0, but Spark 1.3 uses Parquet 1.6.0. It's better to check which version of Parquet is used in your environment. 2015-12-17 10:26 GMT+08:00 Joseph Bradley : > This method is tested in the Spark 1.5 unit tests, so I'd guess it's a > problem with the Parquet depen

Re: Are there some solutions to transform category variables into dummy variables in scala or spark?

2015-12-17 Thread Yanbo Liang
u can refer to the official example <https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala> . Yanbo 2015-12-17 16:00 GMT+08:00 zml张明磊 : > Hi , > > > > I am a new to scala and spark. Recently, I n

Re: Need clarifications in Regression

2015-12-17 Thread Yanbo Liang
is one to test. If you still get different results, please file a JIRA to track it. Yanbo 2015-12-16 14:35 GMT+08:00 Arunkumar Pillai : > Hi > > The Regression algorithm in the MLlib is using Loss function to calculate > the regression estimates and R is using matrix method to calcul

Re: Linear Regression with OLS

2015-12-17 Thread Yanbo Liang
only need to set solver to "normal". val lr = new LinearRegression() .setMaxIter(10) .setRegParam(0.3) .setElasticNetParam(0.8) .setSolver("normal") Yanbo

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
Hi Asim, I think it's not necessary to backport featureImportances to mllib.tree.RandomForest. You can use ml.RandomForestClassifier and ml.RandomForestRegressor directly. Yanbo 2015-12-17 19:39 GMT+08:00 Asim Jalis : > Yanbo, > > Thanks for the reply. > > Is there

Re: Creating vectors from a dataframe

2015-12-20 Thread Yanbo Liang
Hi Arunkumar, If you want to create a vector from multiple columns of a DataFrame, Spark ML provides VectorAssembler for exactly this purpose, as sketched below. Yanbo 2015-12-21 13:44 GMT+08:00 Arunkumar Pillai : > Hi > > > I'm trying to use Linear Regression from ml library > > but the problem is t
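A minimal sketch (column names are illustrative):

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
val training = assembler.transform(df)   // adds the "features" vector column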

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Yanbo Liang
metrics. Yanbo 2015-12-22 12:23 GMT+08:00 Arunkumar Pillai : > Hi > > I'm using Linear Regression using ml package > > I'm able to see SSerr SStot and SSreg from > > > val model = lr.fit(dat1) > > > model.summary.metric > > But this metric is not

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Yanbo Liang
If you want to use the CSV format, please refer to the spark-csv project and its examples. https://github.com/databricks/spark-csv 2015-12-24 4:40 GMT+08:00 Zhan Zhang : > Now json, parquet, orc(in hivecontext), text are natively supported. If > you use avro or others, you have to include the package,

Re: How to ignore case in dataframe groupby?

2015-12-24 Thread Yanbo Liang
You can use DF.groupBy(upper(col("a"))).agg(sum(col("b"))). DataFrame provides the function "upper" to convert a column to uppercase. 2015-12-24 20:47 GMT+08:00 Eran Witkon : > Use DF.withColumn("upper-code",df("countrycode).toUpper)) > or just run a map function that does the same > > On Thu, Dec 24, 20

Re: How to handle categorical variables in Spark MLlib?

2015-12-25 Thread Yanbo Liang
Hi Hokam, You can use OneHotEncoder to encode categorical variables into a feature vector; Spark ML provides this transformer. To weight an individual category, there is no existing method, but you can implement a UDF which multiplies a factor into a specified slot of a vector, as sketched below. Yanbo 2015-12
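A sketch of such a UDF (the slot index, the factor 3.0, and the column names are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val emphasize = udf { v: Vector =>
  val arr = v.toArray
  arr(2) = arr(2) * 3.0          // up-weight the slot of one one-hot category
  Vectors.dense(arr)
}
val weighted = encoded.withColumn("weightedVec", emphasize(col("categoryVec")))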

Re: Retrieving the PCA parameters in pyspark

2015-12-25 Thread Yanbo Liang
Hi Rohit, This is a known bug, but you can get these parameters if you use the Scala version. Yanbo 2015-12-03 0:36 GMT+08:00 Rohit Girdhar : > Hi > > I'm using PCA through the python interface for spark, as per the > instructions on this page: > https://spark.apac

Re: SparkML algos limitations question.

2015-12-27 Thread Yanbo Liang
Hi Eugene, AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M), although I think this limitation already covers many use cases. Yanbo 2015-12-16 6:00 GMT+08:00 Joseph Bradley : > Hi Eugene, &

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
Load csv file: df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true") Calculate covariance: cov <- cov(df, "col1", "col2") Cheers Yanbo 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>: &

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread Yanbo Liang
This problem has been discussed before, but I think there is no straightforward way to read only col_g. 2015-12-30 17:48 GMT+08:00 lin : > Hi all, > > We are trying to read from nested parquet data. SQL is "select > col_b.col_d.col_g from some_table" and the data schema for some_table is: >

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
t so large driver memory in your case, because KMeans uses little driver memory if your k is not very large. Cheers Yanbo 2015-12-30 22:20 GMT+08:00 Jia Zou : > I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU > cores and 30GB memory. Executor memory is set to 15GB, and

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali, The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It has k elements, where k is the number of clusters, and each element is the center of one cluster; see the sketch below. Yanbo 2015-12-31 12:52 GMT+08:00 : > Hi, > > I am trying to use kmeans for clustering
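A short sketch of inspecting the centers (assumes points: RDD[Vector]; k = 3 and 20 iterations are illustrative):

import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(points, 3, 20)
model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"cluster $i center: $center")
}
val assignment = model.predict(points)   // nearest-center index for every point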

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
pared with the dataset at one executor. Cheers Yanbo 2015-12-31 22:31 GMT+08:00 Jia Zou : > Thanks, Yanbo. > The results become much more reasonable, after I set driver memory to 5GB > and increase worker memory to 25GB. > > So, my question is for following code snippet extracted f

Re: How to specify the numFeatures in HashingTF

2016-01-01 Thread Yanbo Liang
You can refer to the following code snippet to set numFeatures for HashingTF: val hashingTF = new HashingTF() .setInputCol("words") .setOutputCol("features") .setNumFeatures(n) 2015-10-16 0:17 GMT+08:00 Nick Pentreath : > Setting the numfeatures higher than vocab size will tend t

Re: does HashingTF maintain a inverse index?

2016-01-01 Thread Yanbo Liang
Hi Andy, Spark ML/MLlib does not currently provide a transformer to map HashingTF-generated features back to words. 2016-01-01 8:37 GMT+08:00 Hayri Volkan Agun : > Hi, > > If you are using pipeline api, you do not need to map features back to > documents. > Your input (which is the document text)

Re: NotSerializableException exception while using TypeTag in Scala 2.10

2016-01-01 Thread Yanbo Liang
I also hit this bug; have you resolved the issue, or could you give some suggestions? 2014-07-28 18:33 GMT+08:00 Aniket Bhatnagar : > I am trying to serialize objects contained in RDDs using runtime > relfection via TypeTag. However, the Spark job keeps > failing java.io.NotSerializableException

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
in map(). Cheers Yanbo 2016-01-01 4:12 GMT+08:00 Tomasz Fruboes : > Dear All, > > I'm trying to implement a procedure that iteratively updates a rdd using > results from GaussianMixtureModel.predictSoft. In order to avoid problems > with local variable (the obtained GMM) b

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share your code snippet so that others can help diagnose your problem? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I’m running into stackOverflow > exception whenever there are too many combinations to deal with and/or too > many

Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next release (Spark 2.0). 2016-01-03 23:02 GMT+08:00 : > keyStoneML could be an alternative. > > Ardo. > > On 03 Jan 2016, at 15:50, Arunkumar Pillai > wrote: > > Is there any road map for glm in pipeline? > >

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
like the following code snippet: gmmModel.predictSoft(rdd) Then you will get a new RDD containing the soft prediction results. All the models in the ML package follow this rule. Yanbo 2016-01-04 22:16 GMT+08:00 Tomasz Fruboes : > Hi Yanbo, > > thanks for info. Is it likely to change

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander, That's cool! Thanks for the clarification. Yanbo 2016-01-05 5:06 GMT+08:00 Ulanov, Alexander : > Hi Yanbo, > > > > As long as two models fit into memory of a single machine, there should be > no problems, so even 16GB machines can handle large models. (ma

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)), or approxCountDistinct for an approximate result; a short sketch follows. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe. > > val sc = new SparkCont
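A short sketch (column names and the 5% relative error are illustrative):

import org.apache.spark.sql.functions.{approxCountDistinct, countDistinct}

datasetDF.select(countDistinct(datasetDF("col1")), countDistinct(datasetDF("col2"))).show()
datasetDF.select(approxCountDistinct(datasetDF("col1"), 0.05)).show()  // rsd = 5%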

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem . > > Error in writeJobj

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
input into the features which can be fed into the model trainer. OneHotEncoder and VectorAssembler are feature transformers provided by Spark ML; you can refer to https://spark.apache.org/docs/latest/ml-features.html Thanks Yanbo 2016-01-08 7:52 GMT+08:00 Annabel Melongo : > Or he can also transform

Re: Predictive Modelling in sparkR

2016-01-07 Thread Yanbo Liang
Hi Chandan, Do you mean to run your own LR algorithm on SparkR? Actually, SparkR provides the ability to run the distributed Spark MLlib LR, and the interface is similar to R's GLM. For your reference: https://spark.apache.org/docs/latest/sparkr.html#binomial-glm-model 2016-01-07 2:45 GMT+08:

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
into StandardScaler. Thanks Yanbo 2016-01-10 8:10 GMT+08:00 Kristina Rogale Plazonic : > Hi, > > The code below gives me an unexpected result. I expected that > StandardScaler (in ml, not mllib) will take a specified column of an input > dataframe and subtract the mean of the c

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side, as sketched below. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea : > Hi, > > In my app, I have a Params scala object that keeps all the specific > (hyper)para
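A minimal sketch of that loop (contribute, merge, and update are hypothetical driver-side functions; initialParams and maxIter are illustrative):

var params = initialParams
var bc = sc.broadcast(params)
for (iter <- 1 to maxIter) {
  val stats = data.map(x => contribute(x, bc.value)).reduce(merge)
  params = update(params, stats)   // new values computed on the driver
  bc.unpersist()                   // drop the stale copy on the executors
  bc = sc.broadcast(params)        // re-broadcast the updated parameters
}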

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying the model? Using the model to make predictions from R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma : > Hi All, > > Does any one over here has deployed a model produced in SparkR or atleast > help me with the steps fo

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of Matrix theta is the number of classes, and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The java Doc for Matrix says that it is column major > formate > > I have tra

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Outputting the AIC value for Linear Regression is not supported currently. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai : > Hi > > Is it possible to get AIC value in Linear Regression using ml

Re: has anyone implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
pache/spark/ml/feature/IDF.scala#L121 I found the documentation of IDF is not very clear; we need to update it. Thanks Yanbo 2016-01-16 6:10 GMT+08:00 Andy Davidson : > I wonder if I am missing something? TF-IDF is very popular. Spark ML has a > lot of transformers how ever it TF_IDF is no

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
-classification-regression.html#random-forest-classifier . Thanks Yanbo 2016-01-16 0:16 GMT+08:00 Robin East : > re 1. > The pull requests reference the JIRA ticket in this case > https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was > rel

Re: has anyone implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226 Thanks Yanbo 2016-01-19 7:05 GMT+08:00 Andy Davidson : > Hi Yanbo > > I am using 1.6.0. I am having a hard of time trying to figure out what the > exact equation is. I do not know Scala. > &g

Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting p-values and t-values from the Linear Regression model; other models such as the Logistic Regression model are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma

Re: has anyone implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy, I will take a look at your code after you share it. Thanks! Yanbo 2016-01-23 0:18 GMT+08:00 Andy Davidson : > Hi Yanbo > > I recently code up the trivial example from > http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html > I > d

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Yanbo 2016-01-20 1:15 GMT+08:00 Vinayak Agrawal : > Yes, you can use Rformula library. Please see > > https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html > > On Tue, Jan 19, 2016 at 10:34 AM, Devesh Raj Singh > wrote: >

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it is not always correct for a pipeline model; some transformers in the pipeline, such as OneHotEncoder, will change the features. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this : > > Given a dataframe com

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti, This is a bug in AFTSurvivalRegression; we did not handle "lossSum == infinity" properly. I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting it. Yanbo 2016-02-12 15:03 GMT+08:00 Stuti Awasthi :

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
= standardScaler.fit(ovarian2) val ovarian3 = ssModel.transform(ovarian2) val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features") val model = aft.fit(ovarian3) val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code to help others diagnose this issue? Which version did you use? I cannot reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization tools to c

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually, Spark SQL's `groupBy` with `count` can get the frequency in each bin, as sketched below. You can also try DataFrameStatFunctions.freqItems() to get the frequent items of columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML. > > Best
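A fixed-width-bin sketch (the bin width and column name are illustrative):

import org.apache.spark.sql.functions.floor

val width = 10.0
val histogram = df.withColumn("bin", floor(df("value") / width))  // bin i covers [i*width, (i+1)*width)
  .groupBy("bin").count().orderBy("bin")
histogram.show()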
