How to specify the numFeatures in HashingTF

2015-10-15 Thread Jianguo Li
Hi, There is a parameter in HashingTF called "numFeatures". I was wondering what the best way is to set the value of this parameter. In the use case of text categorization, do you need to know in advance the number of words in your vocabulary, or do you set it to a large value, greater than the size of your vocabulary?
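
A minimal sketch with the RDD-based HashingTF (the 2^20 bucket count is purely illustrative; the usual advice is a power of two comfortably above the expected vocabulary size, so hash collisions stay rare):

    import org.apache.spark.mllib.feature.HashingTF

    // Illustrative: no need to know the exact vocabulary size, only a
    // rough upper bound; a collision just merges a few term counts.
    val tf = new HashingTF(numFeatures = 1 << 20)
    val doc = Seq("how", "to", "set", "numFeatures")
    val vector = tf.transform(doc)  // sparse term-frequency vector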

Re: workaround for groupByKey

2015-06-23 Thread Jianguo Li
Then use a mapPartitions perhaps? > From: Jianguo Li > Date: Monday, June 22, 2015 at 6:21 PM > To: Silvio Fiorito > Cc: "user@spark.apache.org" > Subject: Re: workaround for groupByKey > Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey

Re: workaround for groupByKey

2015-06-22 Thread Jianguo Li
You can use aggregateByKey as one option: > val input: RDD[(Int, String)] = ... > val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b) > From: Jianguo Li > Date: Monday, June 22, 2015 at 5:12 PM > To: "user@spark.apache.org"
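
A self-contained version of that suggestion (note the pair-RDD tuple type, and that the sequence operator may mutate and return the buffer); the sample data and app name are illustrative:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("aggregateByKey"))
    val input = sc.parallelize(Seq((1, "a.com"), (1, "b.com"), (2, "c.com")))
    val grouped = input.aggregateByKey(ListBuffer.empty[String])(
      (buf, url) => buf += url,   // fold a value into a partition-local buffer
      (b1, b2)   => b1 ++= b2     // merge buffers across partitions
    )
    grouped.collect().foreach(println)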

workaround for groupByKey

2015-06-22 Thread Jianguo Li
Hi, I am processing an RDD of key-value pairs. The key is a user_id, and the value is a website URL the user has visited. Since I need to know all the URLs each user has visited, I am tempted to call groupByKey on this RDD. However, since there could be millions of users and URLs, the grouped values may be too large to hold in memory

spark ml model info

2015-04-14 Thread Jianguo Li
Hi, I am training a model using the logistic regression algorithm in ML. I was wondering if there is any API to access the weight vector (i.e., the coefficients for each feature). I need those coefficients for real-time predictions. Thanks, Jianguo
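
In the RDD-based API the trained model exposes the coefficients directly; a minimal sketch (the training RDD is assumed to be prepared elsewhere):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val training: RDD[LabeledPoint] = ???  // assumed: labeled feature vectors
    val model = new LogisticRegressionWithLBFGS().run(training)
    val coefficients = model.weights   // one entry per feature
    val intercept = model.intercept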

feature scaling in GeneralizedLinearAlgorithm.scala

2015-04-13 Thread Jianguo Li
Hi, In GeneralizedLinearAlgorithm, which Logistic Regression relies on, it says: "if useFeatureScaling is enabled, we will standardize the training features, and train the model in the scaled space. Then we transform the coefficients from the scaled space to the original space ...". My
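
The transform back is easy to state for a model without an intercept: if training standardized each feature as x_j / sigma_j, a coefficient learned in the scaled space divides by sigma_j to return to the original space. A toy sketch of just that arithmetic (not the library's actual code; the numbers are made up):

    // per-feature stddevs, and coefficients learned on standardized features
    val sigma         = Array(2.0, 0.5, 1.0)
    val weightsScaled = Array(0.8, -1.2, 0.3)
    val weightsOriginal = weightsScaled.zip(sigma).map {
      case (w, s) => if (s != 0.0) w / s else 0.0
    }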

Spark ML pipeline

2015-02-11 Thread Jianguo Li
Hi, I really like the pipeline in spark.ml in the Spark 1.2 release. Will there be more machine learning algorithms implemented for the pipeline framework in the next major release? Any idea when the next major release comes out? Thanks, Jianguo
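
For reference, the shape of the 1.2-era pipeline API; the column names and the choice of stages here are illustrative:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    // val model = pipeline.fit(trainingData)  // trainingData assumed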

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote: > Hi, > I am using the utility function kFold provided in Spark for doing k-fold cross validation using logistic regression. However, each time I run the experiment, I get a different result. Since everything else stays constant

Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Hi, I am using the utility function kFold provided in Spark for doing k-fold cross validation using logistic regression. However, each time I run the experiment, I get a different result. Since everything else stays constant, I was wondering if this is due to the kFold function I used. Does kFold always give you the same split?
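
kFold takes an explicit seed, so with the same RDD, fold count, and seed the split is reproducible; a sketch (the data RDD is assumed):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    val data: RDD[LabeledPoint] = ???  // assumed: the full dataset
    // same data + numFolds + seed => same (training, validation) splits
    val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)
    folds.foreach { case (training, validation) =>
      // train on 'training', evaluate on 'validation'
    }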

unit tests with "java.io.IOException: Could not create FileClient"

2015-01-19 Thread Jianguo Li
Hi, I created some unit tests to test some of the functions in my project which use Spark. However, when I built it with sbt and then ran "sbt test", I ran into "java.io.IOException: Could not create FileClient": 2015-01-19 08:50:38,1894 ERROR Client fs/client/fileclient/cc/client

Re: component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

2015-01-14 Thread Jianguo Li
I solved the issue. In case anyone else is looking for an answer: by default, ScalaTest executes all the tests in parallel. To disable this, just put the following line in your build.sbt: parallelExecution in Test := false. Thanks. On Wed, Jan 14, 2015 at 2:30 PM, Jianguo Li wrote: > Hi,
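
In build.sbt that setting sits at the top level; forking the test JVM is a related knob if the suites also need isolation from sbt itself (sbt 0.13-era syntax):

    // run suites one at a time so each SparkContext has the UI port to itself
    parallelExecution in Test := false

    // optionally run tests in a JVM separate from sbt
    fork in Test := true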

component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

2015-01-14 Thread Jianguo Li
Hi, I am using the sbt tool to build and run the Scala tests related to Spark. In my /src/test/scala directory, there are two test classes (TestA, TestB), both of which use the class in Spark for creating a SparkContext, something like: trait LocalTestSparkContext extends BeforeAndAfterAll { self: Suite =>
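
A minimal sketch of that shared-context trait; combined with serial test execution (see the reply above), it keeps two suites from binding port 4040 at once. Disabling the UI in tests, shown as a comment, is another option:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, Suite}

    trait LocalTestSparkContext extends BeforeAndAfterAll { self: Suite =>
      @transient var sc: SparkContext = _

      override def beforeAll(): Unit = {
        super.beforeAll()
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("test")
          // .set("spark.ui.enabled", "false")  // sidesteps the port clash
        sc = new SparkContext(conf)
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
        super.afterAll()
      }
    }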

java.lang.NoClassDefFoundError: io/netty/util/TimerTask Error when running sbt test

2015-01-14 Thread Jianguo Li
I am using Spark-1.1.1. When I ran "sbt test", I ran into the following exceptions. Any idea how to solve it? Thanks! I think somebody posted this question before, but no one seems to have answered it. Could it be the version of "io.netty" I put in my build.sbt? I included a dependency: libraryDependencies
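
One usual remedy, sketched under the assumption that a second, older netty on the classpath is shadowing the netty 4.x that provides io.netty.util.TimerTask: pin a single version via an sbt override (the version shown is illustrative; match it to what your Spark artifact's POM actually pulls in):

    // build.sbt
    dependencyOverrides += "io.netty" % "netty-all" % "4.0.23.Final"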

including the spark-mllib in build.sbt

2015-01-12 Thread Jianguo Li
Hi, I am trying to build my own Scala project using sbt. The project depends on both spark-core and spark-mllib. I included the following two dependencies in my build.sbt file: libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1" and libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
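
A minimal build.sbt pairing the two artifacts; the Scala version is illustrative and must line up with the %% convention, which appends the Scala binary version to the artifact name:

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.1.1",
      "org.apache.spark" %% "spark-mllib" % "1.1.1"
    )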

confidence/probability for prediction in MLlib

2015-01-06 Thread Jianguo Li
Hi, A while ago, somebody asked about getting a confidence value for a prediction with MLlib's implementation of Naive Bayes classification. I was wondering if there is any plan in the near future for the predict function to return both a label and a confidence/probability? Or could the private members of the model be made accessible?
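
In the meantime, a confidence can be recomputed from the model's own parameters. A sketch, assuming a multinomial NaiveBayesModel whose labels, pi (log class priors), and theta (log conditional probabilities) fields are accessible, as in the 1.x API; it returns (label, posterior) pairs using the log-sum-exp trick:

    import org.apache.spark.mllib.classification.NaiveBayesModel
    import org.apache.spark.mllib.linalg.Vector

    def posteriors(model: NaiveBayesModel, x: Vector): Array[(Double, Double)] = {
      val xs = x.toArray
      // log P(class i) + log P(x | class i), up to a shared constant
      val logJoint = model.labels.indices.map { i =>
        model.pi(i) + xs.zip(model.theta(i)).map { case (xi, t) => xi * t }.sum
      }
      val m = logJoint.max                 // stabilize the exponentials
      val unnorm = logJoint.map(l => math.exp(l - m))
      val z = unnorm.sum
      model.labels.zip(unnorm.map(_ / z))  // posteriors sum to 1
    }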