Hi,
There is a parameter in HashingTF called "numFeatures". I was wondering
what the best way is to set this parameter. In the use case of
text categorization, do you need to know the number of words in
your vocabulary in advance? Or do you set it to be a large value, greater than
then use a mapPartitions perhaps?
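For what it's worth, a minimal sketch of one common approach (spark.mllib API; the bucket count below is only illustrative): rather than counting the vocabulary first, numFeatures is often set to a power of two comfortably larger than the expected number of distinct terms, trading a few hash collisions for a fixed, vocabulary-independent dimensionality.

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 1 << 18)    // 262,144 buckets, an assumed value
val doc = Seq("spark", "hashing", "tf", "example")
val vector = tf.transform(doc)                   // sparse term-frequency vector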
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 6:21 PM
> To: Silvio Fiorito
> Cc: "user@spark.apache.org"
> Subject: Re: workaround for groupByKey
>
> Thanks for your suggestion. I guess aggregateByKey is similar to
> combi
You can use aggregateByKey as one option:
>
> val input: RDD[(Int, String)] = ...
>
> val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a +=
> b, (a, b) => a ++ b)
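> // assumes: import scala.collection.mutable.ListBuffer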
>
> From: Jianguo Li
> Date: Monday, June 22, 2015 at 5:12 PM
> To: "u
Hi,
I am processing an RDD of key-value pairs. The key is a user_id, and the
value is a website URL the user has visited.
Since I need to know all the URLs each user has visited, I am tempted to
call groupByKey on this RDD. However, since there could be millions of
users and URLs, the
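A hedged sketch of one alternative (type and parameter names assumed): if only the set of distinct URLs per user is needed, aggregateByKey with a Set combines values on the map side and keeps each user's collection deduplicated, avoiding a full groupByKey shuffle of raw values.

import org.apache.spark.rdd.RDD

def urlsPerUser(visits: RDD[(Long, String)]): RDD[(Long, Set[String])] =
  visits.aggregateByKey(Set.empty[String])(
    (urls, url) => urls + url,   // add one URL to a user's running set
    (a, b) => a ++ b             // merge sets from different partitions
  )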
Hi,
I am training a model using the logistic regression algorithm in ML. I was
wondering if there is any API to access the weight vector (a.k.a. the
coefficients for each feature). I need those coefficients for real-time
predictions.
Thanks,
Jianguo
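A minimal sketch of pulling the coefficients out of a trained model (shown with the spark.mllib LogisticRegressionWithLBFGS API; the spark.ml model exposes its weights and intercept similarly), plus a purely illustrative local scoring helper:

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def extractCoefficients(training: RDD[LabeledPoint]): (Array[Double], Double) = {
  val model: LogisticRegressionModel = new LogisticRegressionWithLBFGS().run(training)
  (model.weights.toArray, model.intercept)
}

// Score a single example locally with the extracted coefficients.
def score(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  1.0 / (1.0 + math.exp(-margin))
}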
Hi,
In GeneralizedLinearAlgorithm, which logistic regression relies on, it
says "if useFeatureScaling is enabled, we will standardize the training
features, and train the model in the scaled space. Then we transform
the coefficients from the scaled space to the original space ...".
My
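An illustrative sketch of that back-transformation (not the actual GeneralizedLinearAlgorithm code; it assumes features were scaled by 1/stddev with no mean shift): a coefficient learned in the scaled space maps back to the original space by dividing it by the same standard deviation.

val featuresStd   = Array(2.0, 0.5, 4.0)    // per-feature std dev from training data (made-up values)
val scaledWeights = Array(0.8, -1.2, 0.3)   // coefficients learned on scaled features (made-up values)
val originalWeights = scaledWeights.zip(featuresStd).map { case (w, s) =>
  if (s != 0.0) w / s else 0.0
}
// originalWeights: Array(0.4, -2.4, 0.075)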
Hi,
I really like the pipeline in spark.ml in the Spark 1.2 release. Will there
be more machine learning algorithms implemented for the pipeline framework
in the next major release? Any idea when the next major release will come out?
Thanks,
Jianguo
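For reference, a minimal sketch of the kind of spark.ml pipeline being referred to (the standard text-classification example; column names and parameter values are assumed):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// val model = pipeline.fit(trainingData)   // trainingData: a DataFrame with "text" and "label" columns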
On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote:
> > Hi,
> >
> > I am using the utility function kFold provided in Spark for doing k-fold
> > cross validation using logistic regression. However, each time I run the
> > experiment, I get a different result. S
Hi,
I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used. Doe
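A minimal sketch (assuming the Spark 1.x MLUtils.kFold signature, which takes an explicit seed): the folds are produced by random sampling, so fixing the seed makes the splits reproducible across runs; any remaining variation would then come from the training algorithm itself.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

def reproducibleFolds(data: RDD[LabeledPoint]): Array[(RDD[LabeledPoint], RDD[LabeledPoint])] =
  MLUtils.kFold(data, numFolds = 10, seed = 42)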
Hi,
I created some unit tests for some of the functions in my project that use
Spark. However, when I used sbt to build the project and then ran
"sbt test", I ran into "java.io.IOException: Could not create FileClient":
2015-01-19 08:50:38,1894 ERROR Client fs/client/fileclient/cc/client
I solved the issue. In case anyone else is looking for an answer, by
default, scalatest executes all the tests in parallel. To disable this,
just put the following line in your build.sbt
parallelExecution in Test := false
Thanks
On Wed, Jan 14, 2015 at 2:30 PM, Jianguo Li wrote:
> Hi,
>
Hi,
I am using sbt to build and run the Scala tests related to Spark.
In my /src/test/scala directory, there are two test classes (TestA, TestB),
both of which use a shared trait for creating a SparkContext, something
like
trait LocalTestSparkContext extends BeforeAndAfterAll { self: S
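A minimal sketch of such a shared-context trait (names and details assumed, not the exact code from the message): one SparkContext per suite, created before the tests run and stopped afterwards.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait LocalTestSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-tests")
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    sc = null
    super.afterAll()
  }
}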
I am using Spark-1.1.1. When I used "sbt test", I ran into the
following exceptions. Any idea how to solve them? Thanks! I think
somebody posted this question before, but no one seemed to have
answered it. Could it be the version of "io.netty" I put in my
build.sbt? I included a dependency "libraryD
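One way to test that hypothesis (purely a sketch, not a confirmed fix from this thread; the exact artifact name depends on which netty version is actually in conflict) is to drop the explicit io.netty dependency and exclude the transitive one from the Spark artifact in build.sbt:

libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.1.1")
  .exclude("io.netty", "netty")   // artifact name assumed; verify against the dependency report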
Hi,
I am trying to build my own Scala project using sbt. The project depends on
both spark-core and spark-mllib. I included the following two
dependencies in my build.sbt file:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"
libraryDependencies += "org.apache.spark" %% "
Hi,
A while ago, somebody asked about getting a confidence value for a
prediction with MLlib's implementation of Naive Bayes classification.
I was wondering if there is any plan in the near future for the predict
function to return both a label and a confidence/probability? Or could the
private
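In the meantime, a rough sketch (not an official MLlib API) of recovering class probabilities from a trained multinomial NaiveBayesModel, assuming its pi (log class priors) and theta (log feature likelihoods) members are accessible:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

def classProbabilities(model: NaiveBayesModel, features: Vector): Array[(Double, Double)] = {
  val x = features.toArray
  // log p(c | x) up to a constant: log prior + sum_j theta(c)(j) * x(j)
  val logScores = model.labels.indices.map { c =>
    model.pi(c) + model.theta(c).zip(x).map { case (t, v) => t * v }.sum
  }
  // softmax, shifted by the max score for numerical stability
  val maxScore = logScores.max
  val expScores = logScores.map(s => math.exp(s - maxScore))
  val total = expScores.sum
  model.labels.zip(expScores.map(_ / total))   // (label, probability) pairs
}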