Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Rok Roskar
thanks for the quick answer! > numpy array only can support basic types, so we can not use it during > collect() > by default. > sure, but if you knew that a numpy array went in on one end, you could safely use it on the other end, no? Perhaps it would require an extension of the RDD class an

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Zongheng Yang
Hi Pranay, If this data format is to be assumed, then I believe the issue starts at lines <- textFile(sc,"/sparkdev/datafiles/covariance.txt") totals <- lapply(lines, function(lines) After the first line, `lines` becomes an RDD of strings, each of which is a line of the form "1,1". Th

Re: Naive Bayes parameters

2014-08-07 Thread SK
I followed the example in examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala. In this file, Params is defined as follows: case class Params ( input: String = null, minPartitions: Int = 0, numFeatures: Int = -1, lambda: Double = 1.0) In the main functio

Re: Spark SQL

2014-08-07 Thread Cheng Lian
To use Spark SQL you need at least spark-sql_2.10 and spark-catalyst_2.10. If you want Hive support, spark-hive_2.10 must also be included. On Aug 7, 2014, at 2:33 PM, vdiwakar.malladi wrote: > Hi, > > I'm new to Spark. I configured spark in standalone mode on my lap. Right now > i'm in the p
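For reference, a minimal build.sbt sketch of the dependencies Cheng lists (the 1.0.1 version number is an assumption; include spark-hive only if you need Hive support):

  // build.sbt dependency sketch, Scala 2.10 / Spark 1.0.1 assumed
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"     % "1.0.1",
    "org.apache.spark" %% "spark-catalyst" % "1.0.1",
    "org.apache.spark" %% "spark-sql"      % "1.0.1",
    "org.apache.spark" %% "spark-hive"     % "1.0.1"  // only for HiveContext support
  )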

Re: [GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
OK, I think I've figured it out. It seems to be a bug which has been reported at: https://issues.apache.org/jira/browse/SPARK-2823 and https://github.com/apache/spark/pull/1763. As it says: "If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition n

Re: Regularization parameters

2014-08-07 Thread SK
Hi, I am following the code in examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala For setting the parameters and parsing the command line options, I am just reusing that code.Params is defined as follows. case class Params( input: String = null, numIt

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Which Spark version are you using? -Xiangrui On Thu, Aug 7, 2014 at 1:12 AM, SK wrote: > Hi, > > I am following the code in > examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala > For setting the parameters and parsing the command line options, I am just > reusing t

Re: Naive Bayes parameters

2014-08-07 Thread Xiangrui Meng
It is used in data loading: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76 On Thu, Aug 7, 2014 at 12:47 AM, SK wrote: > I followed the example in > examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveB

Spark with HBase

2014-08-07 Thread Deepa Jayaveer
Hi, I read your white paper about " " . We wanted to do a Proof of Concept on Spark with HBase. Not many documents are available for setting up a Spark cluster in a Hadoop 2 environment. If you have any, can you please give us some reference URLs? Also, some sample program to connect to HBase using Sp

Re: Regularization parameters

2014-08-07 Thread SK
Spark 1.0.1 thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601p11631.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - T

Re: Naive Bayes parameters

2014-08-07 Thread SK
Ok, thanks for clarifying. So it looks like numFeatures is only relevant for the libSVM format. I am using LabeledPoint, so if the data is not sparse, perhaps numFeatures is not required. I thought that the Params class defines all the parameters passed to the ML algorithm. But it looks like it also include
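A minimal sketch of what that boils down to: with dense LabeledPoints, lambda is the only parameter NaiveBayes.train needs (the input path and line format here are assumptions):

  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Parse lines of the form "label,f1 f2 f3" into dense LabeledPoints
  val data = sc.textFile("data.txt").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }

  // No numFeatures needed; the smoothing parameter lambda is the only algorithm argument
  val model = NaiveBayes.train(data, lambda = 1.0)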

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Khanderao Kand
I hope this has been resolved. Were you connected to the right ZooKeeper? Did Kafka and HBase share the same ZooKeeper and port? If not, did you set the right config for the HBase job? -- Khanderao On Wed, Jul 2, 2014 at 4:12 PM, JiajiaJing wrote: > Hi, > > I am trying to write a program that take input from

Re: Spark with HBase

2014-08-07 Thread Akhil Das
You can download and compile spark against your existing hadoop version. Here's a quick start https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types You can also read a bit here http://docs.sigmoidanalytics.com/index.php/Installing_Spark_andSetting_Up_Your_Cluster ( the

Where do my partitions go?

2014-08-07 Thread losmi83
Hi guys, the latest Spark version 1.0.2 exhibits a very strange behavior when it comes to deciding on which node a given partition should reside. The following example was tested in the standalone Spark mode. val partitioner = new HashPartitioner(10) val dummyJob1 = sc.parallelize

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-07 Thread Sameer Sayyed
hello, Code: ZkState zkState = new ZkState(kafkaConfig); DynamicBrokersReader kafkaBrokerReader = new DynamicBrokersReader(kafkaConfig, zkState); int partionCount = kafkaBrokerReader.getNumPartitions(); SparkConf _sparkConf = new SparkConf().setAppName("KafkaReceiver"); final JavaStreamingContex

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Rohit Rai
Alan/TD, We are facing the problem in a project going to production. Was there any progress on this? Are we able to confirm that this is a bug/limitation in the current streaming code? Or there is anything wrong in user scope? Regards, Rohit *Founder & CEO, **Tuplejump, Inc.* __

Got error “"java.lang.IllegalAccessError" when using HiveContext in Spark shell on AWS

2014-08-07 Thread Zhun Shen
Hi, When I try to use HiveContext in Spark shell on AWS, I got the error "java.lang.IllegalAccessError: tried to access method com.google.common.collect.MapMaker.makeComputingMap(Lcom/google/common/base/Function;)Ljava/util/concurrent/ConcurrentMap". I follow the steps below to compile and instal

Re: Got error “"java.lang.IllegalAccessError" when using HiveContext in Spark shell on AWS

2014-08-07 Thread Cheng Lian
Hey Zhun, Thanks for the detailed problem description. Please see my comments inlined below. On Thu, Aug 7, 2014 at 6:18 PM, Zhun Shen wrote: Caused by: java.lang.IllegalAccessError: tried to access method > com.google.common.collect.MapMaker.makeComputingMap(Lcom/google/common/base/Function;)L

Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-07 Thread Nick Pentreath
Ryan, did you come right with this? I've just run into the same problem on a new 1.0.0 cluster I spun up. The issue was that my app was not running against the Spark master, but in local mode (a default setting in my app that was a throwback from 0.9.1 and was overriding the spark defaults on the

Re: Bad Digest error while doing aws s3 put

2014-08-07 Thread lmk
This was a completely misleading error message.. The problem was due to a log message getting dumped to the stdout. This was getting accumulated in the workers and hence there was no space left on device after some time. When I re-tested with spark-0.9.1, the saveAsTextFile api threw "no space le

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Matei Zaharia wrote > If you use s3n:// for both, you should be able to pass the exact same file > to load as you did to save. I'm trying to write a file to s3n in a Spark app and to read it in another one using the same file name, but without luck. Writing data to s3n as val data = Array(1.0, 1

Re: Spark SQL

2014-08-07 Thread vdiwakar.malladi
Thanks for your response. I was able to compile my code now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp11618p11644.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there, I'm interested in whether it is possible to get the same behavior as the reduce function from the MR framework. I mean, for each key K, get a list of the associated values. There is a function reduceByKey that works only with separate V values from the list. Does there exist any way to get the list? Because I have to sort

How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread yaochunnan
Our lab needs to do some simulations on online social networks. We need to handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue and the corresponding eigenvector. Matlab can be used but it is time-consuming. Is Spark effective in linear algebra calculations and transformations? Late

Re: Spark with HBase

2014-08-07 Thread chutium
These two posts should be good for setting up a Spark+HBase environment and using the results of an HBase table scan as an RDD. Settings: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html some samples: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html -- View this
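The core of the pattern those posts describe, an HBase table scan exposed as an RDD via TableInputFormat, looks roughly like this (the table name is a placeholder; the HBase client jars must be on the classpath):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

  // Each element is (row key, full scan Result for that row)
  val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  println(hbaseRDD.count())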

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
You may use groupByKey in this case. On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev wrote: > Hi there, > > I'm interested if it is possible to get the same behavior as for reduce > function from MR framework. I mean for each key K get list of associated > values List. > > There is funct
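A minimal sketch of the groupByKey approach, assuming a pair RDD such as RDD[(String, Int)]:

  // groupByKey gathers every value for a key into one collection,
  // like the value list handed to a reducer in MapReduce
  val rdd = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2)))
  val grouped = rdd.groupByKey()                        // (key, all values for that key)
  val sortedPerKey = grouped.mapValues(_.toList.sorted) // e.g. sort the values per key
  sortedPerKey.collect().foreach(println)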

Low Performance of Shark over Spark.

2014-08-07 Thread vinay . kashyap
Dear all, I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. 6 worker nodes each with memory 96GB and 32 cores. I am using Shark Shell to execute queries on Spark. I have a raw_table ( of size 3TB with replication 3 ) which is partitioned by year, month and day. I am running

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-07 Thread Nick Pentreath
I'm also getting this - Ryan, we both seem to be running into this issue with elasticsearch-hadoop :) I tried spark.files.userClassPathFirst true on the command line and that doesn't work. If I put that line in spark/conf/spark-defaults it works, but now I'm getting: java.lang.NoClassDefFoundError: o

Re: Spark Hbase job taking long time

2014-08-07 Thread Ted Yu
Forgot to include user@ Another email from Amit indicated that there is 1 region in his table. This wouldn't give you the benefit TableInputFormat is expected to deliver. Please split your table into multiple regions. See http://hbase.apache.org/book.html#d3593e6847 and related links. Cheers

Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
a long time ago, in Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that, reduceByKey will be more efficient than groupByKey... he mentioned groupByKey "copies all data over network".

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote > I'm using Spark 1.0.0. The same works when - Using Spark 0.9.1. - Saving to and reading from local file system (Spark 1.0.0) - Saving to and reading from HDFS (Spark 1.0.0) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a

KMeans Input Format

2014-08-07 Thread AlexanderRiggers
I want to perform a K-Means task but fail while training the model and get kicked out of Spark's Scala shell before I get my result metrics. I am not sure if the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400MB) looks like this: 86252 3711 15.4 4.18 86252 3504

Re: Save an RDD to a SQL Database

2014-08-07 Thread 诺铁
I haven't seen people write directly to a SQL database, mainly because it's difficult to deal with failure: what if the network breaks halfway through the process? Should we drop all data in the database and restart from the beginning? If the process is "appending" data to the database, then things become even more complex

[Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread linkpatrickliu
Hi, Following the "" document: # Cloudera CDH 4.2.0 mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package I compile Spark 1.0.2 with this cmd: mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package However, I got two errors: [INFO] Compiling 14 Scala so

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
The point is that in many cases the operation passed to reduceByKey aggregates data into much smaller size, say + and * for integer. String concatenation doesn’t actually “shrink” data, thus in your case, rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer similar performance issue. In general, do

Re: Save an RDD to a SQL Database

2014-08-07 Thread Thomas Nieborowski
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala On Thu, Aug 7, 2014 at 8:08 AM, 诺铁 wrote: > I haven't seen people write directly to sql database, > mainly because it's difficult to deal with failure, > what if network broken in half of the pro
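JdbcRDD covers the read side. For the write side there is no built-in helper in this Spark version; a common hand-rolled pattern is plain JDBC inside foreachPartition, sketched here for an assumed RDD[(String, Int)], a hypothetical results(key, value) table and a placeholder connection URL:

  import java.sql.DriverManager

  rdd.foreachPartition { partition =>
    // One connection per partition, opened on the executor
    val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO results (key, value) VALUES (?, ?)")
    partition.foreach { case (k, v) =>
      stmt.setString(1, k)
      stmt.setInt(2, v)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }

As the thread points out, this is at-least-once at best: a retried task can leave duplicate or partial rows unless the insert is made idempotent.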

Re: Save an RDD to a SQL Database

2014-08-07 Thread Cheng Lian
Maybe a little off topic, but would you mind to share your motivation of saving the RDD into an SQL DB? If you’re just trying to do further transformations/queries with SQL for convenience, then you may just use Spark SQL directly within your Spark application without saving them into DB: va
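A minimal sketch of that approach with the Spark 1.0-era SQL API (the case class, an assumed RDD[(String, Int)] named rdd, and the query are placeholders):

  import org.apache.spark.sql.SQLContext

  case class Record(key: String, value: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD          // implicit RDD[Record] -> SchemaRDD

  val records = rdd.map { case (k, v) => Record(k, v) }
  records.registerAsTable("records")         // renamed registerTempTable in later releases

  val summed = sqlContext.sql("SELECT key, SUM(value) FROM records GROUP BY key")
  summed.collect().foreach(println)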

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 wrote: > what if network broken in half of the process? should we drop all data in > database and restart from beginning? The best way to deal with this -- which, unfortunately, is not commonly supported -- is with a two-phase commit that can span connection

spark streaming multiple file output paths

2014-08-07 Thread Chen Song
In Spark Streaming, is there a way to write output to different paths based on the partition key? The saveAsTextFiles method will write output in the same directory. For example, if the partition key has a hour/day column and I want to separate DStream output into different directories by hour/day

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian wrote: > Maybe a little off topic, but would you mind to share your motivation of > saving the RDD into an SQL DB? Many possible reasons (Vida, please chime in with yours!): - You have an existing database you want to load new data into so ever

JVM Error while building spark

2014-08-07 Thread Rasika Pohankar
Hello, I am trying to build Apache Spark version 1.0.1 on Ubuntu 12.04 LTS. After unzipping the file and running sbt/sbt assembly I get the following error : rasika@rasikap:~/spark-1.0.1$ sbt/sbt package Error occurred during initialization of VM Could not reserve enough space for object heap Err

Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Hello all, I am not sure what is going on – I am getting a NotSerializedException and initially I thought it was due to not registering one of my classes with Kryo but that doesn’t seem to be the case. I am essentially eliminating duplicates in a spark streaming application by using a “window”

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi, Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: "AlexanderRiggers" To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40 AM Subject: KMeans Input F

Initial job has not accepted any resources

2014-08-07 Thread arnaudbriche
Hi, I'm trying a simple thing: create an RDD from a text file (~3GB) located in GlusterFS, which is mounted by all Spark cluster machines, and calling rdd.count(); but Spark never managed to complete the job, giving message like the following: WARN TaskSchedulerImpl: Initial job has not accepted

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a "combiner" in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu

Re: How to read a multipart s3 file?

2014-08-07 Thread Ashish Rangole
Specify a folder instead of a file name for input and output code, as in: Output: s3n://your-bucket-name/your-data-folder Input: (when consuming the above output) s3n://your-bucket-name/your-data-folder/* On May 6, 2014 5:19 PM, "kamatsuoka" wrote: > I have a Spark app that writes out a file,
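In code the suggestion amounts to something like this (bucket and folder names are placeholders):

  // Write: Spark produces part-* files under this folder-style key prefix
  rdd.saveAsTextFile("s3n://your-bucket-name/your-data-folder")

  // Read back: point at the folder (or folder/*) so all part files are picked up
  val reloaded = sc.textFile("s3n://your-bucket-name/your-data-folder")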

Re: Initial job has not accepted any resources

2014-08-07 Thread Marcelo Vanzin
There are two problems that might be happening: - You're requesting more resources than the master has available, so your executors are not starting. Given your explanation this doesn't seem to be the case. - The executors are starting, but are having problems connecting back to the driver. In th

Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark is more likely to act as an OLAP system; I believe no one will use Spark as an OLTP system, so there is always some question about how to share data between these two platforms efficiently. More important, most enterprise BI tools rely on an RDBMS, or at least a JDBC/ODBC interface

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
darkjh wrote > But in my experience, when reading directly from > s3n, spark create only 1 input partition per file, regardless of the file > size. This may lead to some performance problem if you have big files. This is actually not true, Spark uses the underlying hadoop input formats to read the

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
It could be because of the variable "enableOpStat". Since it's defined outside foreachRDD, referring to it inside the rdd.foreach is probably causing the whole streaming context to be included in the closure. Scala funkiness. Try this, see if it works. msgCount.join(ddCount).foreachRDD((rdd: RDD[(I
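The usual fix for this kind of accidental capture is to copy the field into a local val before the closure, so the closure serializes only that local rather than the enclosing class (and with it the StreamingContext). A sketch using the names from the thread; element types and the loop body are placeholders:

  // Local copy: closures capture this val, not this.enableOpStat (i.e. not `this`)
  val localEnableOpStat = enableOpStat

  msgCount.join(ddCount).foreachRDD { rdd =>
    rdd.foreach { element =>
      if (localEnableOpStat) {
        // record the operational stat for this element (placeholder)
      }
    }
  }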

Re: spark streaming multiple file output paths

2014-08-07 Thread Tathagata Das
The problem boils down to how to write an RDD in that way. You could use the HDFS Filesystem API to write each partition directly. pairRDD.groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => // Open an output stream to destination file /key/ // Write valu
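A slightly fuller sketch of that pattern, assuming the stream is a DStream of (String, String) pairs named pairDStream and that an HDFS path derived from the key is acceptable (the path layout is an assumption):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  pairDStream.foreachRDD { (rdd, time) =>
    rdd.groupByKey().foreachPartition { partition =>
      // Runs on the executors: one FileSystem handle per partition,
      // one output file per key per batch
      val fs = FileSystem.get(new Configuration())
      partition.foreach { case (key, values) =>
        val out = fs.create(new Path(s"/output/$key/part-${time.milliseconds}"))
        values.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
        out.close()
      }
    }
  }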

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Tathagata Das
For future reference in this thread, a better set of examples than the MetricAggregatorHBase on the JIRA to look at are here https://github.com/tmalaska/SparkOnHBase On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand wrote: > I hope this has been resolved, were u connected to right zookeeper? di

Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
I wanted to post for validation to understand if there is more efficient way to achieve my goal. I'm currently performing this flow for two distinct calculations executing in parallel: 1) Sum key/value pair, by using a simple witnessed count(apply 1 to a mapToPair() and then groupByKey() 2) Su

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Tathagata Das
Another possible reason behind this maybe that there are two versions of Akka present in the classpath, which are interfering with each other. This could happen through many scenarios. 1. Launching Spark application with Scala brings in Akka from Scala, which interferes with Spark's Akka 2. Multip

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Sean Owen
(-incubator, +user) If your matrix is symmetric (and real I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition. The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors. There are a few SVDs in the

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Thanks TD but unfortunately that did not work. From: Tathagata Das <tathagata.das1...@gmail.com> Date: Thursday, August 7, 2014 at 10:55 AM To: Mahesh Padmanabhan <mahesh.padmanab...@twc-contractor.com> Cc: "user@spark.apache.org" <user@spark.ap

Re: KMeans Input Format

2014-08-07 Thread Sean Owen
It's not running out of memory on the driver though, right? The executors may need more memory, or use more executors. --executor-memory would let you increase from the default of 512MB. On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz wrote: > Hi, > > Could you try running spark-shell with the flag

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
As a follow up, I commented out that entire code and I am still getting the exception. It may be related to what you are suggesting so are there any best practices so that I can audit other parts of the code? Thanks, Mahesh From: , Mahesh Padmanabhan mailto:mahesh.padmanab...@twc-contractor.co

Re: JVM Error while building spark

2014-08-07 Thread Sean Owen
(-incubator, +user) It's not Spark running out of memory, but SBT, so those env variables have no effect. They're options to Spark at runtime anyway, not compile time, and you're intending to compile I take it. SBT is a memory hog, and Spark is a big build. You will probably need to give it more

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote > Specify a folder instead of a file name for input and output code, as in: > > Output: > s3n://your-bucket-name/your-data-folder > > Input: (when consuming the above output) > > s3n://your-bucket-name/your-data-folder/* Unfortunately no luck: Exception in thread "main" o

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Can you enable the java flag -Dsun.io.serialization.extendedDebugInfo=true for driver in your driver startup-script? That should give an indication of the sequence of object references that lead to the StremaingContext being included in the closure. TD On Thu, Aug 7, 2014 at 10:23 AM, Padmanabha

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
Can you try with "-Pyarn" instead of "-Pyarn-alpha"? I'm pretty sure CDH4 ships with the newer Yarn API. On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu wrote: > Hi, > > Following the "" document: > > # Cloudera CDH 4.2.0 > mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean packag

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset that we can use to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK wrote: > Spark 1.0.1 > > thanks > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameter

How to Start a JOB programatically from an EC2 machine?

2014-08-07 Thread SankarS
Hello All, If I execute a Hive SQL query from an EC2 machine, I often get a Broken pipe exception. I want to execute the job programmatically. Is there any way available? Please guide me. Thanks and regards, SankarS -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are a couple of ways of caching a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, wrote: > Dear all, > > I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. > > 6 worker nodes each with memory 96GB

Re: How to read a multipart s3 file?

2014-08-07 Thread Sean Owen
That won't be it, since you can see from the directory listing that there are no data files under test -- only "_" files and dirs. The output looks like it was written, or partially written at least, but didn't finish, in that the part-* files were never moved to the target dir. I don't know why, b

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
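A minimal sketch of the MLlib call Evan mentions, for the 5000x5000 symmetric case from the original question (for a symmetric matrix the singular values are the absolute eigenvalues, and the singular vectors match the eigenvectors up to sign). The input RDD name and layout are assumptions:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // adjacency: RDD[Array[Double]], one 5000-element row per record (assumed)
  val mat = new RowMatrix(adjacency.map(r => Vectors.dense(r)))

  val svd = mat.computeSVD(k = 1, computeU = false)
  val topSingularValue = svd.s(0)
  // First column of V (column-major layout) is the corresponding singular/eigen-vector
  val topVector = svd.V.toArray.take(mat.numCols().toInt)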

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. So in other words, having a Spark backend in Java that writes to a SQL database and then having a Rails front-end that can display the data nicely. On Thu, Aug 7, 20

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in mllib is partially distributed. Matrix-vector multiplication is computed among all workers, but the right singular vectors are all stored in the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you will need to fit n x k double

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Tathagata Das
I am not sure if it is a typo-error or not, but how are you using groupByKey to get the summed_values? Assuming you meant reduceByKey(), these workflows seems pretty efficient. TD On Thu, Aug 7, 2014 at 10:18 AM, Dan H. wrote: > I wanted to post for validation to understand if there is more ef

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Hi All, I'm having a bit of trouble with nested data structures in pyspark with saveAsParquetFile. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802. *# these all work* > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquet

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
Vida, What kind of database are you trying to write to? For example, I found that for loading into Redshift, by far the easiest thing to do was to save my output from Spark as a CSV to S3, and then load it from there into Redshift. This is not a slow as you think, because Spark can write the outp
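The write side of that approach is small; a sketch assuming rows is an RDD of Seq[String] field values and a placeholder staging path:

  // Render each record as a CSV line and let Spark write part files straight to S3;
  // Redshift's COPY can then load the whole prefix in parallel.
  // Note: no quoting/escaping here, so this is only safe if fields contain no commas.
  val csv = rows.map(_.mkString(","))
  csv.saveAsTextFile("s3n://your-bucket/redshift-staging/my-table")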

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Shivaram Venkataraman
If you just want to find the top eigenvalue / eigenvector you can do something like the Lanczos method. There is a description of a MapReduce based algorithm in Section 4.2 of [1] [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf On Thu, Aug 7, 2014 at 10:54 AM, Li Pu wrote: > @Miles

Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
Thanks for your answers. The dataset is only 400MB, so I shouldn't run out of memory. I restructured my code now, because I forgot to cache my dataset, and reduced the number of iterations to 2, but I still get kicked out of Spark. Did I cache the data wrong (sorry, not an expert): scala> import org.apac

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Sean Owen
Yep, this command given in the Spark docs is correct: mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package and while I also would hope that this works, it doesn't compile: mvn -Pyarn -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package I believe later 4.x includes eff

Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM, "Nicholas Chammas" wrote: > Vida, > > What kind of database are you trying to write to? > > For example, I found that for loading into Redshift, by far the ea

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Does this help? I can’t figure out anything new from this extra information. Thanks, Mahesh 2014-08-07 12:27:00,170 [spark-akka.actor.default-dispatcher-4] ERROR akka.actor.OneForOneStrategy - org.apache.spark.streaming.StreamingContext - field (class "com.twc.needle.ep.EventPersister$$

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
I think Cloudera only started adding Spark to CDH4 starting with 4.6, so maybe that's the minimum if you want to try out Spark on CDH4. On Thu, Aug 7, 2014 at 11:22 AM, Sean Owen wrote: > Yep, this command given in the Spark docs is correct: > > mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -D

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread amit
There is one more configuration option called spark.closure.serializer that can be used to specify the serializer for closures. Maybe in the class you have the StreamingContext as a field, so when Spark tries to serialize the whole class it uses the spark.closure.serializer to serialize even the stre

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
That's a good idea - to write to files first and then load. Thanks. On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier wrote: > Isn't sqoop export meant for that? > > > http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 > On Aug 7, 2014 7:59 PM, "Nicholas Chammas"

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
From the extended info, I see that you have a function called createStreamingContext() in your code. Somehow that is getting referenced in the foreach function. Is the whole foreachRDD code inside the createStreamingContext() function? Did you try marking the ssc field as transient? Here is a

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
Yes, thanks, I did in fact mean reduceByKey(), thus allowing the convenience method process the summation by key. Thanks for your feedback! DH -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Workflow-Validation-tp11677p11706.html Sent from

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Thanks TD, Amit. I think I figured out where the problem is through the process of commenting out individual lines of code one at a time :( Can either of you help me find the right solution? I tried creating the SparkContext outside the foreachRDD but that didn’t help. I have an object (let’s

questions about MLLib recommendation models

2014-08-07 Thread Jay Hutfles
I have a few questions regarding a collaborative filtering model, and was hoping for some recommendations (no pun intended...) *Setup* I have a csv file with user/movie/ratings named unimaginatively 'movies.csv'. Here are the contents: 0,0,5 0,1,5 0,2,0 0,3,0 1,0,5 1,3,0 2,1,4 2,2,0 3,0,0 3,1,0

Re: questions about MLLib recommendation models

2014-08-07 Thread Sean Owen
On Thu, Aug 7, 2014 at 9:06 PM, Jay Hutfles wrote: > 0,0,5 > 0,1,5 > 0,2,0 > 0,3,0 > 1,0,5 > 1,3,0 > 2,1,4 > 2,2,0 > 3,0,0 > 3,1,0 > 3,2,5 > 3,3,4 > 4,0,0 > 4,1,0 > 4,2,5 > val rank = 10 This is likely the problem: your rank is actually larger than the number of users or items. The error could p
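With the toy matrix from the question (5 users x 4 movies), a small rank is the safer starting point; a sketch, with the rank value itself an assumption:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.textFile("movies.csv").map { line =>
    val Array(user, movie, rating) = line.split(',')
    Rating(user.toInt, movie.toInt, rating.toDouble)
  }

  // Keep rank well below the number of users/items
  val rank = 2
  val numIterations = 20
  val model = ALS.train(ratings, rank, numIterations, 0.01)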

Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using: ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r
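The batch form Burak is presumably steering toward looks roughly like this: predict over an RDD of (user, product) pairs rather than calling model.predict inside a transformation (the model itself wraps RDDs, so it cannot be used from within another RDD's closure):

  // Predict for every observed (user, movie) pair in one call
  val userProducts = ratings.map { case Rating(u, m, _) => (u, m) }
  val predictions = model.predict(userProducts)                    // RDD[Rating]

  // Join predictions back onto the observed ratings, e.g. to compute MSE
  val predsByKey = predictions.map { case Rating(u, m, p) => ((u, m), p) }
  val obsByKey   = ratings.map { case Rating(u, m, r) => ((u, m), r) }
  val mse = obsByKey.join(predsByKey).values
    .map { case (r, p) => (r - p) * (r - p) }
    .mean()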

Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using .5G of RAM, try bumping this up -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-issue-on-standalone-master-tp11610p11711.html Sent from the Apache Spark User List mailing list ar

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller wrote: > Hi All, > > I'm having a bit of trouble with nested data structures in pyspark with > saveAsParquetFile.

RE: Save an RDD to a SQL Database

2014-08-07 Thread Jim Donahue
Depending on what you mean by "save," you might be able to use the Twitter Storehaus package to do this. There was a nice talk about this at a Spark meetup -- "Stores, Monoids and Dependency Injection - Abstractions for Spark Streaming Jobs." Video here: https://www.youtube.com/watch?v=C7gWtx

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai wrote: > Hi Brad, > > It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 > to track it. It will be fixed soon. > > Thanks, > > Yin > > > On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller > wrote: > >> Hi All,

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Actually, the issue is if values of a field are always null (or this field is missing), we cannot figure out the data type. So, we use NullType (it is an internal data type). Right now, we have a step to convert the data type from NullType to StringType. This logic in the master has a bug. We will

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
The PR is https://github.com/apache/spark/pull/1840. On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai wrote: > Actually, the issue is if values of a field are always null (or this field > is missing), we cannot figure out the data type. So, we use NullType (it is > an internal data type). Right now, we

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Well, I don't see the rdd in the foreachRDD being passed into A.func1(), so I am not sure what the purpose of the function is. Assuming that you do want to pass that RDD into that function, and also want to have access to the SparkContext, you can pass on just the RDD and then access the associated
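In code, that suggestion looks roughly like this (A.func1 is the caller's function from the thread; its exact signature is not shown, so treat it as a placeholder):

  msgCount.join(ddCount).foreachRDD { rdd =>
    // Take the SparkContext from the RDD itself instead of capturing an outer field,
    // so the closure no longer drags in the enclosing class or the StreamingContext
    val sc = rdd.context
    A.func1(rdd, sc)
  }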

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Slap my head moment – using rdd.context solved it! Thanks TD, Mahesh From: Tathagata Das <tathagata.das1...@gmail.com> Date: Thursday, August 7, 2014 at 3:06 PM To: Mahesh Padmanabhan <mahesh.padmanab...@twc-contractor.com> Cc: amit <amit.codenam...@gmail.com>, "u...@spar

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
LOL! Glad it solved it. TD On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor) < mahesh.padmanab...@twc-contractor.com> wrote: > Slap my head moment – using rdd.context solved it! > > Thanks TD, > > Mahesh > > From: Tathagata Das > Date: Thursday, August 7, 2014 at 3:06 PM > To: M

Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, exchange val train = parsedData.cache() wit
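That is, something along these lines (the partition count is an assumption to tune):

  // Spread the 400MB input over more, smaller partitions before caching,
  // so no single partition has to fit in one executor's heap at once
  val train = parsedData.repartition(100).cache()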

Lost executors

2014-08-07 Thread rpandya
I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose & fix it would be appreciated. There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per machin

Re: PySpark + executor lost

2014-08-07 Thread Davies Liu
What is the environment ? YARN or Mesos or Standalone? It will be more helpful if you could show more loggings. On Wed, Aug 6, 2014 at 7:25 PM, Avishek Saha wrote: > Hi, > > I get a lot of executor lost error for "saveAsTextFile" with PySpark > and Hadoop 2.4. > > For small datasets this error o

Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-07 Thread ajatix
Hi I wish to migrate from shark to the spark-sql shell, where I am facing some difficulties in setting up. I cloned the "branch-1.0-jdbc" to test out the spark-sql shell, but I am unable to run it after building the source. I've tried two methods for building (with Hadoop 1.0.4) - sbt/sbt assem
