setosa, virginica would be created with 0 and 1 as values
On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel
wrote:
> Maybe not ideal, but since read.df is inferring all columns of the csv
> containing "NA" as strings, one could filter them rather than using
> dropna()
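A rough sketch of that filter-based workaround, written here in Scala against
the DataFrame layer that backs read.df; the file path and column name are
hypothetical, and a spark-shell with the spark-csv package and an existing
sqlContext is assumed:

// Every column was inferred as strings, so "NA" is just a literal value;
// filter those rows out instead of calling dropna()
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("iris_with_na.csv")

val cleaned = df.filter(df("Sepal_Width") !== "NA")
cleaned.count()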
I think the problem is with the reading of csv files: read.df is not
considering NAs in the CSV file.
>
> So what would be a workable solution for dealing with NAs in csv files?
>
>
>
> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel
> wrote:
>
>> Hi Devesh,
>>
Hi,
Can PCA be implemented in a SparkR-MLlib integration? There are perhaps 2
separate issues:
1) Having the methods in SparkRWrapper and RFormula which will send the
right input types through the pipeline.
MLlib PCA operates either on a RowMatrix or on the feature vector of an
RDD[LabeledPoint]. The label
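For reference, a minimal Scala sketch of the RowMatrix path mentioned above,
assuming a spark-shell SparkContext named sc; the sample feature vectors are
made up:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from an RDD of feature vectors (no labels involved)
val rows = sc.parallelize(Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(4.9, 3.0, 1.4, 0.2),
  Vectors.dense(6.3, 3.3, 6.0, 2.5)))
val mat = new RowMatrix(rows)

// Top 2 principal components, then project the rows onto them
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)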
in-hadoop2.4/bin/spark-submit` exists? The
> error message seems to indicate it is trying to pick up Spark from
> that location and can't seem to find Spark installed there.
>
> Thanks
> Shivaram
>
> On Thu, Aug 20, 2015 at 3:30 PM, Deborah Siegel
> wrote:
> > He
Hello,
I have previously successfully run SparkR in RStudio, with:
>Sys.setenv(SPARK_HOME="~/software/spark-1.4.1-bin-hadoop2.4")
>.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>library(SparkR)
>sc <- sparkR.init(master="local[2]",appName="SparkR-example")
Then I tr
I think I just answered my own question. The privatization of the RDD API
might have resulted in my error, because this worked:
> randomMatBr <- SparkR:::broadcast(sc, randomMat)
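For comparison, a minimal self-contained sketch of the public broadcast API on
the Scala side (the app name and lookup data here are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("broadcast-example"))

// Ship a read-only lookup table to the executors once, rather than with every task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()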
On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel
wrote:
> Hello,
>
> In looking at the SparkR codeba
Hello,
In looking at the SparkR codebase, it seems as if broadcast variables ought
to be working based on the tests.
I have tried the following in the sparkR shell, and similar code in RStudio,
but in both cases got the same message:
> randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
> randomMa
Hi,
I selected a "starter task" in JIRA, and made changes to my github fork of
the current code.
I assumed I would be able to build and test.
% mvn clean compile was fine
but
% mvn package failed
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test
Harika,
I think you can modify an existing Spark-on-EC2 cluster to run YARN MapReduce;
not sure if this is what you are looking for.
To try:
1) logon to master
2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/
and add this property to mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Hello,
I'm new to EC2. I've set up a Spark cluster on EC2 and am using
persistent-hdfs with the data nodes mounting EBS. I launched my cluster
using spot instances:
./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z
us-east-1c --spark-version=1.2.0 --spot-price=.0321
--hadoop-maj
Hello,
I am running through the examples given at
http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html
The section "Map Reduce Triplets Transition Guide (Legacy)" indicates that
one can run the following aggregateMessages code:
val graph: Graph[Int, Float] = ...
def msgFun(triplet: Edg
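A minimal self-contained Scala sketch in the same spirit as that guide example
(counting in-degrees with aggregateMessages; the toy graph is made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("aggregateMessages-example"))

val vertices = sc.parallelize(Seq((1L, 1), (2L, 1), (3L, 1)))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0f), Edge(2L, 3L, 1.0f), Edge(1L, 3L, 1.0f)))
val graph: Graph[Int, Float] = Graph(vertices, edges)

// Each triplet sends 1 to its destination vertex; messages are summed per vertex
val inDegrees = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),   // message function, runs once per edge
  (a, b) => a + b)                   // reduce function, combines messages per vertex
inDegrees.collect().foreach(println)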
Hi,
Someone else will have a better answer. I think that for standalone mode,
executors will grab whatever cores they can, based on either configurations
on the worker or application-specific configurations. Could be wrong, but I
believe Mesos is similar to this, and that YARN is alone in the abil
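As a rough illustration of the application-level knob in standalone mode (a
sketch only; the values are made up):

import org.apache.spark.{SparkConf, SparkContext}

// spark.cores.max caps the total cores the application will claim across the
// cluster in standalone (and coarse-grained Mesos) mode
val conf = new SparkConf()
  .setAppName("core-limit-example")
  .set("spark.cores.max", "8")
val sc = new SparkContext(conf)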
Hi Michael,
Would you help me understand the apparent difference here?
The Spark 1.2.1 programming guide indicates:
"Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will *not* be cached using the in-memory
columnar format, and therefore sqlContext.cacheTa
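A minimal Scala sketch of the two caching paths that guide text distinguishes,
assuming an existing sqlContext in a 1.2.x spark-shell; the file and table
names are hypothetical:

// 1.2.x-era API: load a SchemaRDD and register it as a table
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Per the quoted guide text, this does not use the in-memory columnar format
people.cache()

// Per the guide, this caches using the in-memory columnar format
sqlContext.cacheTable("people")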
Hi Abe,
I'm new to Spark as well, so someone else could answer better. A few
thoughts which may or may not be the right line of thinking:
1) Spark properties can be set on the SparkConf and with flags to
spark-submit, but settings on SparkConf take precedence. I think your jars
flag for spark-su
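A minimal sketch of that precedence point (the app name and value are made
up): a property set explicitly on the SparkConf wins over the corresponding
spark-submit flag.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("precedence-example")
  // takes precedence over --executor-memory passed to spark-submit
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)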
Hi Yong,
Have you tried increasing your level of parallelism? How many tasks are you
getting in the failing stage? 2-3 tasks per CPU core is recommended, though
maybe you need more for your shuffle operation. You can configure
spark.default.parallelism, or pass in a level of parallelism as the second par
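A minimal Scala sketch of both options (the numbers are made up): set
spark.default.parallelism globally, or pass the number of partitions directly
to the shuffle operation.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "48")   // default partition count for shuffles
val sc = new SparkContext(conf)

val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, 1))

// or pass the level of parallelism as the second argument to the operation
val counts = pairs.reduceByKey(_ + _, 48)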
15 matches