K-Means And Class Tags

2015-01-08 Thread Devl Devel
Hi All,

I'm trying a simple K-Means example as per the website:

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

but I'm trying to write a Java-based validation method first so that
missing values are omitted or replaced with 0.

public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
    JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
        public Iterable<Vector> call(String s) {
            String[] split = s.split(",");
            ArrayList<Vector> add = new ArrayList<Vector>();
            if (split.length != 2) {
                // malformed line: replace with a zero vector
                add.add(Vectors.dense(0, 0));
            } else {
                add.add(Vectors.dense(Double.parseDouble(split[0]),
                        Double.parseDouble(split[1])));
            }
            return add;
        }
    });
    return words.rdd();
}

When I then call it from Scala:

val parsedData=dc.prepareKMeans(data);
val p=parsedData.collect();

I get Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to
[Lorg.apache.spark.mllib.linalg.Vector;

Why is the class tag Object rather than Vector?

1) How do I get this working correctly using the Java validation example
above? Or
2) How can I modify val parsedData = data.map(s =>
Vectors.dense(s.split(',').map(_.toDouble))) so that lines where
s.split(',').length < 2 are ignored (a rough sketch of what I mean is below)? Or
3) Is there a better way to do input validation first?
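
For (2), this is roughly what I have in mind (an untested sketch; it just
drops lines that don't split into at least two fields):

val parsedData = data.map(_.split(','))
  .filter(_.length >= 2)                              // ignore malformed lines
  .map(parts => Vectors.dense(parts.map(_.toDouble)))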

Using Spark and MLlib:
libraryDependencies += "org.apache.spark" % "spark-core_2.10" %  "1.2.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"

Many thanks in advance
Dev


Re: K-Means And Class Tags

2015-01-08 Thread Devl Devel
Thanks for the suggestion. Can anyone offer any advice on the
ClassCastException when going from Java to Scala? Why does JavaRDD.rdd()
followed by a collect() result in this exception?

On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska 
wrote:

> How about
>
> data.map(s => s.split(",")).filter(_.length > 1)
>   .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>
> (full disclosure, I didn't actually run this). But after the first map you
> should have an RDD[Array[String]], then you'd discard everything shorter
> than 2 and convert the rest to dense vectors. In fact, if you're expecting
> length exactly 2 you might want to filter == 2...


Re: K-Means And Class Tags

2015-01-09 Thread Devl Devel
Hi Joseph

Thanks for the suggestion. However, retag is a private method, and when I
call it in Scala:

val retaggedInput = parsedData.retag(classOf[Vector])

I get:

Symbol retag is inaccessible from this place

However, I can do this from Java, and it then works in Scala:

return words.rdd().retag(Vector.class);
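
For reference, here is a rough, untested pure-Scala sketch of the same
validation; because the RDD is built on the Scala side, the compiler supplies
the ClassTag[Vector] and no retag is needed (this assumes data is an
RDD[String]):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def prepareKMeans(data: RDD[String]): RDD[Vector] =
  data.map { s =>
    val split = s.split(",")
    if (split.length == 2)
      Vectors.dense(split(0).toDouble, split(1).toDouble)
    else
      Vectors.dense(0, 0)   // missing values replaced with a zero vector
  }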

Dev



On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley 
wrote:

> I believe you're running into an erasure issue which we found in
> DecisionTree too.  Check out:
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>
> That retags RDDs which were created from Java to prevent the exception
> you're running into.
>
> Hope this helps!
> Joseph
>


Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
Thanks, that helps a bit, at least with the NaN, but the MSE is still very
high even with that step size and 10k iterations:

training Mean Squared Error = 3.3322561285919316E7

Does this method need, say, 100k iterations?

On Thu, Jan 15, 2015 at 5:42 PM, Robin East  wrote:

> -dev, +user
>
> You’ll need to set the gradient descent step size to something small - a
> bit of trial and error shows that 0.0001 works.
>
> You’ll need to create a LinearRegressionWithSGD instance and set the step
> size explicitly:
>
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> val model = lr.run(parsedData)
>
> On 15 Jan 2015, at 16:46, devl.development 
> wrote:
>
> From what I gather, you use LinearRegressionWithSGD to predict y or the
> response variable given a feature vector x.
>
> In a simple example I used a perfectly linear dataset such that x=y
> y,x
> 1,1
> 2,2
> ...
>
> 1,1
>
> Using the out-of-box example from the website (with and without scaling):
>
> val data = sc.textFile(file)
>
> val parsedData = data.map { line =>
>   val parts = line.split(',')
>   LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
> }
>
> val scaler = new StandardScaler(withMean = true, withStd = true)
>   .fit(parsedData.map(x => x.features))
> val scaledData = parsedData.map(x =>
>   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>
> // Building the model
> val numIterations = 100
> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>
> // Evaluate model on training examples and compute training error
> // (tried using both scaledData and parsedData)
> val valuesAndPreds = scaledData.map { point =>
>   val prediction = model.predict(point.features)
>   (point.label, prediction)
> }
> val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
> println("training Mean Squared Error = " + MSE)
>
> Both scaled and unscaled attempts give:
>
> training Mean Squared Error = NaN
>
> I've even tried x, y + (sample noise from a normal with mean 0 and stddev
> 1); it still comes up with the same thing.
>
> Is this not supposed to work for x and y, i.e. 2-dimensional plots? Is there
> something I'm missing or wrong in the code above? Or is there a limitation
> in the method?
>
> Thanks for any advice.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>
>


Re: LinearRegressionWithSGD accuracy

2015-01-15 Thread Devl Devel
It was a bug in the code; however, adding the step parameter got the results
to work: Mean Squared Error = 2.610379825794694E-5

I've also opened a JIRA to put the step parameter in the examples so that
people new to MLlib have a way to improve the MSE.

https://issues.apache.org/jira/browse/SPARK-5273
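
For anyone following along, this is roughly what the working version looks
like (untested as written here): train and evaluate on the same scaled data,
with an explicit step size.

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001)
lr.optimizer.setNumIterations(100)
val model = lr.run(scaledData)   // train on the same scaled RDD used below

val valuesAndPreds = scaledData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)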

On Thu, Jan 15, 2015 at 8:23 PM, Joseph Bradley 
wrote:

> It looks like you're training on the non-scaled data but testing on the
> scaled data.  Have you tried this training & testing on only the scaled
> data?
>


Some praise and comments on Spark

2015-02-25 Thread Devl Devel
Hi Spark Developers,

First, apologies if this doesn't belong on this list, but the comments/praise
are relevant to all developers. This is just a small note about what we
really like about Spark; we don't mean to start a long discussion thread in
this forum, just to share our positive experiences with Spark so far.

To start, as you can tell, we think that the Spark project is amazing and
we love it! Having put nearly half a decade's worth of sweat and tears into
production Hadoop and MapReduce clusters and application development, it's
so refreshing to see something arguably simpler and more elegant
supersede it.

These are the things we love about Spark and hope these principles continue:

-the one-command build, make-distribution.sh: simple, clean and ideal for
deployment, devops and rebuilding on different environments and nodes.
-not having too much runtime and deploy config; as admins and developers we
are sick of setting props like io.sort and mapred.job.shuffle.merge.percent
and dfs file locations and temp directories over and over again, every time
we deploy a job, a new cluster or environment, or even change company.
-a fully built-in stack, one global project for SQL, DataFrames, MLlib etc.,
so there is no need to bolt on extra projects such as Hive, Hue, HBase etc.
This keeps everything in one place.
-single (global) user-based operation - no creation of hdfs or mapred unix
users, which makes life much simpler.
-single quick-start daemons, master and slaves. Not having to worry about
JT, NN, DN, TT, RM, HBase master ... and doing netstat and jps on hundreds
of clusters makes life much easier.
-proper code versioning, feature releases and release management.
-good & well-organised documentation with good examples.

In addition to the comments above, this is where we hope Spark never ends
up:

-tonnes of configuration properties and "go faster" type flags. For example,
Hadoop and HBase users will know that there is a whole catalogue of
properties for regions, caches, network properties, block sizes, etc. Please
don't end up here, for example:
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html; it is painful
having to configure all of this, then create a set of properties for each
environment and then tie this into CI and deployment tools.
-no more daemons and processes to have to monitor, manipulate, restart and
watch crash.
-a project that penalises developers (who will ultimately help promote Spark
to their managers and budget holders) with expensive training,
certification, books and accreditation. Ideally this open source should stay
free: free training = more users = more commercial uptake.

Anyway, those are our thoughts, for what they are worth; keep up the good
work, we just had to mention it. Again, sorry if this is not the right place
or if there is another forum for this stuff.

Cheers


Re: Integrating Spark with Ignite File System

2015-04-11 Thread Devl Devel
Hi Dmitriy,

Thanks for the input. As per my previous email, I think it would be good to
have a bridge project that, for example, creates an IgniteFS RDD, similar to
the JDBC or HDFS ones, from which we can extract blocks and populate RDD
partitions. I'll post this proposal on your list.
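
To make the idea concrete, a very rough skeleton of such an RDD is below;
the IgniteFS calls are placeholders only (not a real Ignite API), the point
is just that each Spark partition would map to one IgniteFS block:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class IgniteFsPartition(index: Int) extends Partition

class IgniteFsRDD(sc: SparkContext, numBlocks: Int) extends RDD[String](sc, Nil) {

  // one Spark partition per IgniteFS block
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numBlocks)(i => IgniteFsPartition(i))

  // a real implementation would open the IgniteFS block for split.index
  // here and iterate over its records; this placeholder returns nothing
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty
}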

Thanks
Devl



On Sat, Apr 11, 2015 at 9:28 AM, Reynold Xin  wrote:

> Welcome, Dmitriy, to the Spark dev list!
>
>
> On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan 
> wrote:
>
> > Hello Everyone,
> >
> > I am one of the committers to Apache Ignite and have noticed some talks
> on
> > this dev list about integrating Ignite In-Memory File System (IgniteFS)
> > with Spark. We definitely like the idea. If you have any questions about
> > Apache Ignite at all, feel free to forward them to the Ignite dev list.
> We
> > are going to be monitoring this list as well.
> >
> > Ignite mailing list: dev-subscr...@ignite.incubator.apache.org
> >
> > Regards,
> > Dmitriy
> >
>


Stop Master and Slaves without SSH

2015-06-03 Thread Devl Devel
Hey All,

start-slaves.sh and stop-slaves.sh make use of SSH to connect to remote
clusters. Are there alternative methods to do this without SSH?

For example using:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

is fine but there is no way to kill the Worker without using
stop-slave(s).sh or using ps -ef and then kill.

Are there alternatives available such as Hadoop's: hadoop-daemon.sh
start|stop xyz?

I noticed spark-daemon.sh exists but maybe we need to increase the
documentation around it, for instance:

 Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|status)
   <spark-command> <spark-instance-number> <args...>

What are the valid spark-commands? Can this be used to start and stop
workers on the current node?

Many thanks
Devl


Recreating JIRA SPARK-8142

2015-06-09 Thread Devl Devel
Hi All

We are having some trouble with:

sparkConf.set("spark.driver.userClassPathFirst","true");
sparkConf.set("spark.executor.userClassPathFirst","true");

and would appreciate some independent verification. The issue comes down to
this:

Spark 1.3.1 with Hadoop 2.6 is deployed on the cluster. In my application
code I use Maven to bring in:

hadoop-common 2.6.0 - provided
hadoop-client 2.6.0 - provided
hadoop-hdfs 2.6.0 - provided
spark-sql_2.10 - provided
spark-core_2.10 - provided
hbase-client 1.1.0 - included/packaged
hbase-protocol 1.1.0 - included/packaged
hbase-server 1.1.0 - included/packaged

When I set the userClassPathFirst* options to true I get a
ClassCastException. Full details are in:

https://issues.apache.org/jira/browse/SPARK-8142

Can someone help verify this? I.e. if you have Spark 1.3.1, Hadoop and
HBase, can you create a simple Spark job, say to read an HBase table into an
RDD (a sketch is below)? Then set the Spark and Hadoop dependencies above as
"provided" and set:

sparkConf.set("spark.driver.userClassPathFirst","true");
sparkConf.set("spark.executor.userClassPathFirst","true");

and repeat the job.
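
For reference, this is the kind of minimal job I mean (a sketch only; the
table name and ZooKeeper quorum are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("HBaseReadCheck")
sparkConf.set("spark.driver.userClassPathFirst", "true")
sparkConf.set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(sparkConf)

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk-host")       // placeholder
hbaseConf.set(TableInputFormat.INPUT_TABLE, "testtable") // placeholder

val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println("rows: " + rdd.count())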

Do you get the same exception as in the JIRA, or missing classes, or even run
into this https://issues.apache.org/jira/browse/SPARK-1867?

Please comment on the JIRA; it would be useful to have a second
verification of this. I know the userClassPathFirst* options are
experimental, but it's good to know what's going on.

Cheers
Devl


SparkR installation not working

2015-09-19 Thread Devl Devel
Hi All,

I've built Spark 1.5.0 with Hadoop 2.6 from a fresh download:

build/mvn  -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

When I try to run SparkR, it launches the normal R without the Spark add-ons:

./bin/sparkR --master local[*]
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

With no "Welcome to SparkR"

also

> sc <- sparkR.init()
Error: could not find function "sparkR.init"
> sqlContext <- sparkRSQL.init(sc)
Error: could not find function "sparkRSQL.init"
>

Spark-shell and other components are fine. I'm using Scala 2.10.6, Java
1.8.0_45 and Ubuntu 15.04. Can anyone give me any pointers? Is there a Spark
Maven profile I need to enable?

Thanks
Devl


Spark Avro Generation

2014-08-11 Thread Devl Devel
Hi

So far I've been managing to build Spark from source, but since a change in
spark-streaming-flume I have no idea how to generate the classes (e.g.
SparkFlumeProtocol) from the Avro schema.

I have used sbt to run avro:generate (from the top level spark dir) but it
produces nothing - it just says:

> avro:generate
[success] Total time: 0 s, completed Aug 11, 2014 12:26:49 PM.

Please can someone send me their build.sbt or just tell me how to build
Spark so that all the Avro classes get generated as well?

Sorry for the noob question but I really have tried my best on this one!
Cheers


Re: Spark Avro Generation

2014-08-12 Thread Devl Devel
Thanks very much, that helps; it saves having to build the entire project.


On Mon, Aug 11, 2014 at 6:09 PM, Ron Gonzalez  wrote:

> If you don't want to build the entire thing, you can also do
>
> mvn generate-sources in externals/flume-sink
>
> Thanks,
> Ron
>
> Sent from my iPhone
>
> > On Aug 11, 2014, at 8:32 AM, Hari Shreedharan 
> wrote:
> >
> > Jay running sbt compile or assembly should generate the sources.
> >
>


Compile error with XML elements

2014-08-12 Thread Devl Devel
When compiling the master checkout of Spark, the IntelliJ build fails
with:

Error:(45, 8) not found: value $scope
  
   ^
which is caused by HTML elements in classes like HistoryPage.scala:

val content =
  
...

How can I compile these classes that have HTML node elements in them?

Thanks in advance.