Re: cache changes precision

2014-07-24 Thread Ron Gonzalez
Cool I'll take a look and give it a try! Thanks, Ron Sent from my iPad > On Jul 24, 2014, at 10:35 PM, Andrew Ash wrote: > > Hi Ron, > > I think you're encountering the issue where caching data from Hadoop ends up > with many duplicate values instead of what you expect. Try adding a .clone

Re: rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
I ported the same code to scala. No problems. But in pyspark, this fails consistently: ctx = SQLContext(sc) pf = ctx.parquetFile("...") rdd = pf.map(lambda x: x) crdd = ctx.inferSchema(rdd) crdd.saveAsParquetFile("...") If I do rdd = sc.parallelize(["hello", "world"]) rdd.saveAsTextFile(...) It

Re: rdd.saveAsTextFile blows up

2014-07-24 Thread Akhil Das
Most likely you are closing the connection with HDFS. Can you paste the piece of code that you are executing? We were having similar problem when we closed the FileSystem object in our code. Thanks Best Regards On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman wrote: > I'm trying to run a simpl

Re: cache changes precision

2014-07-24 Thread Andrew Ash
Hi Ron, I think you're encountering the issue where caching data from Hadoop ends up with many duplicate values instead of what you expect. Try adding a .clone() to the datum() call. The issue is that Hadoop returns the same object many times but with its contents changed. This is an optimizat
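For illustration, a rough sketch of the same object-reuse pitfall with plain Writables (paths and types are placeholders, not Ron's Avro job): Hadoop hands back the same key/value instances on every record, so copy the contents out before caching.

  import org.apache.hadoop.io.{IntWritable, Text}
  val raw = sc.sequenceFile("hdfs://namenode:9000/data/input", classOf[Text], classOf[IntWritable])
  // Copy out of the reused Writable objects before caching; otherwise the cached
  // partition ends up holding many references to the same (last-seen) object.
  val safe = raw.map { case (k, v) => (k.toString, v.get) }.cache()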

actor serialization error

2014-07-24 Thread Alan Ngai
Hi, I’m running into a new problem trying to get streaming going. I have a test class that sets up my pipeline and runs it fine. The actual production implementation sets up the pipeline from within an actor. At first, I ran into a bunch of issues relating to the serialization of closures fr

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-24 Thread Alan Ngai
Bump. Any ideas? On Jul 24, 2014, at 3:09 AM, Alan Ngai wrote: > it looks like when you configure SparkConf to use the KryoSerializer in > combination with using an ActorReceiver, bad things happen. I modified the > ActorWordCount example program from > > val sparkConf = new SparkConf(

Re: GraphX Pregel implementation

2014-07-24 Thread Ankur Dave
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar wrote: > While using the Pregel API for iterations, how do I figure out which superstep > the iteration is currently in? The Pregel API doesn't currently expose this, but it's very straightforward to modify Pregel.scala
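For illustration, one hedged workaround that avoids modifying Pregel.scala is to carry a counter in the vertex attribute and bump it inside vprog (assuming graph: Graph[Double, Int]; the counter only advances for vertices that actually receive a message, so it is an approximation of the superstep):

  // Vertex attribute becomes (value, stepsSeen)
  val withStep = graph.mapVertices((id, v) => (v, 0))
  val result = withStep.pregel(Double.MaxValue)(
    (id, attr, msg) => (math.min(attr._1, msg), attr._2 + 1),   // vprog: bump the counter
    t => Iterator((t.dstId, t.srcAttr._1)),                     // sendMsg
    (a, b) => math.min(a, b)                                    // mergeMsg
  )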

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-24 Thread Jianshi Huang
I can successfully run my code in local mode using spark-submit (--master local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode. Any hints what the problem is? Is it a closure serialization problem? How can I debug it? Your answers would be very helpful. 14/07/25 11:48:14 WA

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
You can refer to this topic http://www.mapr.com/developercentral/code/loading-hbase-tables-spark 2014-07-24 22:32 GMT+08:00 Yosi Botzer : > In my case I want to reach HBase. For every record with userId I want to > get some extra information about the user and add it to the result record for > further

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread Matei Zaharia
The Pair ones return a JavaPairRDD, which has additional operations on key-value pairs. Take a look at http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for details. Matei On Jul 24, 2014, at 3:41 PM, abhiguruvayya wrote: > Can any one help me understand
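For comparison, in the Scala API the same distinction is just map vs flatMap returning tuples (a rough sketch, assuming lines: RDD[String]):

  val pairs  = lines.map(line => (line.split(",")(0), 1))              // like mapToPair: exactly one pair per input
  val words  = lines.flatMap(line => line.split(" "))                   // like flatMap: zero or more values per input
  val wordKV = lines.flatMap(line => line.split(" ").map(w => (w, 1)))  // like flatMapToPair: zero or more pairs per input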

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, Have you tried the current branch 1.0? If not, can you give it a try and see if the problem can be resolved? Thanks, Yin On Thu, Jul 24, 2014 at 11:17 AM, Yin Huai wrote: > Hi Sarath, > > I will try to reproduce the problem. > > Thanks, > > Yin > > > On Wed, Jul 23, 2014 at 11:32

Re: NotSerializableException in Spark Streaming

2014-07-24 Thread Nicholas Chammas
Yep, here goes! Here are my environment vitals: - Spark 1.0.0 - EC2 cluster with 1 slave spun up using spark-ec2 - twitter4j 3.0.3 - spark-shell called with --jars argument to load spark-streaming-twitter_2.10-1.0.0.jar as well as all the twitter4j jars. Now, while I’m in the S

Re: streaming sequence files?

2014-07-24 Thread Barnaby
I have the streaming program writing sequence files. I can find one of the files and load it in the shell using: scala> val rdd = sc.sequenceFile[String, Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930") 14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(

Re: Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Thanks, this is what I needed :) I should have searched more... Something I noticed though: after the SparkContext is initialized, I had to wait for a few seconds until sc.getExecutorStorageStatus.length returns the correct number of workers in my cluster (otherwise it returns 1, for the driver)..

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-24 Thread Tathagata Das
You can set the Java option "-Dsun.io.serialization.extendedDebugInfo=true" to have more information about the object being printed. It will help you trace down how the SparkContext is getting included in some kind of closure. TD On Thu, Jul 24, 2014 at 9:48 AM, lihu wrote: > ​Which code do y
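For illustration, a hedged sketch of one way to pass that JVM flag to the executors via SparkConf (the driver's own JVM needs the flag set at launch time instead):

  import org.apache.spark.SparkConf
  val conf = new SparkConf()
    .setAppName("debug-serialization")
    .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")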

mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread abhiguruvayya
Can any one help me understand the key difference between mapToPair vs flatMapToPair vs flatMap functions and also when to apply these functions in particular. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mapToPair-vs-flatMapToPair-vs-flatMap-function-usa

cache changes precision

2014-07-24 Thread Ron Gonzalez
Hi,   I'm doing the following:   def main(args: Array[String]) = {     val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")     val sc = new SparkContext(sparkConf)     val conf = new Configuration()     val job = new Job(conf)     val path = new Path("/tmp/a.avro");     va

KMeans: expensiveness of large vectors

2014-07-24 Thread durin
As a source, I have a textfile with n rows that each contain m comma-separated integers. Each row is then converted into a feature vector with m features each. I've noticed that, given the same total filesize and number of features, a larger number of columns is much more expensive for training a

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
Whoops! Just realized I was retrying the function even on success. I didn't pay enough attention to the output from my calls. Slightly updated definitions: class RetryFunction[-A](nTries: Int,f: A => Unit) extends Function[A,Unit] { def apply(a: A): Unit = { var tries = 0 var success =
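A self-contained sketch along those lines (hedged; not the exact code from the thread):

  class RetryFunction[-A](nTries: Int, f: A => Unit) extends Function1[A, Unit] with Serializable {
    def apply(a: A): Unit = {
      var tries = 0
      var success = false
      while (!success && tries < nTries) {
        tries += 1
        try { f(a); success = true }
        catch { case e: Exception => if (tries >= nTries) throw e }   // give up after nTries attempts
      }
    }
  }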

Re: Configuring Spark Memory

2014-07-24 Thread John Omernik
So this is good information for standalone, but how is memory distributed within Mesos? There's coarse-grained mode where the executor stays active, and there's fine-grained mode where it appears each task is its own process in Mesos. How do memory allocations work in these cases? Thanks! On Thu,

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
Hi Art, I have some advice that isn't spark-specific at all, so it doesn't *exactly* address your questions, but you might still find it helpful. I think using an implicit to add your retrying behavior might be useful. I can think of two options: 1. enriching RDD itself, e.g. to add a .retryForeach
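For option 1, a hedged sketch of what enriching RDD with a .retryForeach could look like (names are illustrative):

  object RddRetrySyntax {
    implicit class RetryingRDD[A](rdd: org.apache.spark.rdd.RDD[A]) {
      def retryForeach(nTries: Int)(f: A => Unit): Unit =
        rdd.foreach { a =>
          var attempt = 0
          var done = false
          while (!done) {
            attempt += 1
            try { f(a); done = true }
            catch { case e: Exception => if (attempt >= nTries) throw e }   // rethrow once out of retries
          }
        }
    }
  }
  // import RddRetrySyntax._   then:   myRdd.retryForeach(3)(writeRecord)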

Kmeans: set initial centers explicitly

2014-07-24 Thread SK
Hi, The mllib.clustering.kmeans implementation supports a random or parallel initialization mode to pick the initial centers. Is there a way to specify the initial centers explicitly? It would be useful to have a setCenters() method where we can explicitly specify the initial centers. (For e.g. R

Emacs Setup Anyone?

2014-07-24 Thread Steve Nunez
Anyone out there have a good configuration for emacs? Scala-mode sort of works, but I'd love to see a fully-supported spark-mode with an inferior shell. Searching didn't turn up much of anything. Any emacs users out there? What setup are you using? Cheers, - SteveN -- CONFIDENTIALITY NOTICE

Re: Getting the number of slaves

2014-07-24 Thread Evan R. Sparks
Try sc.getExecutorStorageStatus().length. SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will give you back an object per executor - the StorageStatus objects are what drive a lot of the Spark Web UI. https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.Sp
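A minimal sketch (the driver also reports a StorageStatus, hence the -1; as noted in this thread, executors can take a few seconds to register after startup):

  val numWorkers = sc.getExecutorStorageStatus.length - 1   // subtract the driver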

Spark Training at Scala By the Bay with Databricks, Fast Track to Scala

2014-07-24 Thread Alexy Khrabrov
Scala By the Bay (www.scalabythebay.org) is happy to confirm that our Spark training on August 11-12 will be run by Databricks and By the Bay together. It will be focused on Scala, and is the first Spark Training at a major Scala venue. Spark is written in Scala, with the unified data pipeline an

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23, 2014 at 11:32 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi Michael, > > Sorry for the delayed response. > > I'm using Spark 1.0.1 (pre-built version for hadoop 1). I'm running spark > prog

Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Hi, Is there a way to get the number of slaves/workers during runtime? I searched online but didn't find anything :/ The application I'm working on will run on different clusters corresponding to different deployment stages (beta -> prod). It would be great to get the number of slaves currently in u

continuing processing when errors occur

2014-07-24 Thread Art Peel
Our system works with RDDs generated from Hadoop files. It processes each record in a Hadoop file and for a subset of those records generates output that is written to an external system via RDD.foreach. There are no dependencies between the records that are processed. If writing to the external s

GraphX canonical conflation issues

2014-07-24 Thread e5c
Hi there, This issue has been mentioned in: http://apache-spark-user-list.1001560.n3.nabble.com/Java-IO-Stream-Corrupted-Invalid-Type-AC-td6925.html However I'm starting a new thread since the issue is distinct from the above topic's designated subject. I'm test-running canonical conflation o

Re: Starting with spark

2014-07-24 Thread Sean Owen
As was already noted - you have to enable Spark in the Quickstart VM as it's not on by default. Just choose to start the service in CM. For anyone that's had any issue with it, please let me know offline. My experience has been that it works out of the box. There's likely a simple resolution, o

Re: akka 2.3.x?

2014-07-24 Thread Matei Zaharia
This is being tracked here: https://issues.apache.org/jira/browse/SPARK-1812, since it will also be needed for cross-building with Scala 2.11. Maybe we can do it before that. Probably too late for 1.1, but you should open an issue for 1.2. In that JIRA I linked, there's a pull request from a mo

rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
I'm trying to run a simple pipeline using PySpark, version 1.0.1 I've created an RDD over a parquetFile and am mapping the contents with a transformer function and now wish to write the data out to HDFS. All of the executors fail with the same stack trace (below) I do get a directory on HDFS, bu

Re: Starting with spark

2014-07-24 Thread Kostiantyn Kudriavtsev
Hi Sam, I tried Spark on Cloudera a couple of months ago, and there were a lot of issues… Fortunately, I was able to switch to Hortonworks and everything works perfectly. In general, you can try two modes: standalone and via YARN. Personally, I found using Spark via YARN more comfortable, especially for adm

Re: akka 2.3.x?

2014-07-24 Thread yardena
We are also eagerly waiting for akka 2.3.4 support as we use Akka (and Spray) directly in addition to Spark. Yardena -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/akka-2-3-x-tp10513p10597.html Sent from the Apache Spark User List mailing list archive at

Re: Starting with spark

2014-07-24 Thread Jerry
Hi Sameer, I think it is much easier to start using Spark in standalone mode on a single machine. Last time I tried cloudera manager to deploy spark, it wasn't very straightforward and I hit a couple of obstacles along the way. However, standalone mode is very easy to start exploring spark. Bes

Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
Whoops, I was mistaken in my original post last year. By default, there is one executor per node per Spark Context, as you said. "spark.executor.memory" is the amount of memory that the application requests for each of its executors. SPARK_WORKER_MEMORY is the amount of memory a Spark Worker is wil
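For reference, a hedged sketch of where each knob lives in standalone mode:

  import org.apache.spark.SparkConf
  // Application side: how much memory each of this app's executors requests
  val conf = new SparkConf().setAppName("myApp").set("spark.executor.memory", "8g")
  // Cluster side: SPARK_WORKER_MEMORY in conf/spark-env.sh caps the total a Worker can grant to executors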

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Great - thanks for the clarification Aaron. The offer stands for me to write some documentation and an example that covers this without leaving *any* room for ambiguity. -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:09 PM, Aaron David

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-24 Thread Tathagata Das
You will have to define your own stream-to-iterator function and use socketStream. The function should return custom-delimited objects as the bytes are continuously coming in. When data is insufficient, the function should block. TD On Jul 23, 2014 6:52 PM, "kytay" wrote: > Hi TD > > You are righ
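A hedged sketch of such a converter, here just line-delimited text (assuming ssc is a StreamingContext; a real custom-delimited format would parse its own framing and block until a complete record is available):

  import java.io.{BufferedReader, InputStream, InputStreamReader}
  import org.apache.spark.storage.StorageLevel

  def bytesToRecords(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)   // readLine blocks until data arrives
  }
  val records = ssc.socketStream("localhost", 9999, bytesToRecords, StorageLevel.MEMORY_AND_DISK_SER)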

Re: GraphX Pregel implementation

2014-07-24 Thread Arun Kumar
Hi, While using the Pregel API for iterations, how do I figure out which superstep the iteration is currently in? Regards Arun On Thu, Jul 17, 2014 at 4:24 PM, Arun Kumar wrote: > Hi > > > > I am trying to implement the belief propagation algorithm in GraphX using the > Pregel API. > > *def* pregel[A] > >

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-24 Thread lihu
Which code did you use? Is it caused by your own code or something in Spark itself? On Tue, Jul 22, 2014 at 8:50 AM, hsy...@gmail.com wrote: > I have the same problem > > > On Sat, Jul 19, 2014 at 12:31 AM, lihu wrote: > >> Hi, >> Everyone. I have a piece of following code. When I run i

Spark got stuck with a loop

2014-07-24 Thread Denis RP
Hi, I ran spark standalone mode on a cluster and it went well for approximately one hour, then the driver's output stopped with the following: 14/07/24 08:07:36 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 36 to spark@worker5.local:47416 14/07/24 08:07:36 INFO M

Re: Starting with spark

2014-07-24 Thread Marco Shaw
First thing... Go into the Cloudera Manager and make sure that the Spark service (master?) is started. Marco On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed wrote: > Hello All, > > I am new user of spark, I am using *cloudera-quickstart-vm-5.0.0-0-vmware* > for execute sample examples of Spark

GraphX for pyspark?

2014-07-24 Thread Eric Friedman
I understand that GraphX is not yet available for pyspark. I was wondering if the Spark team has set a target release and timeframe for doing that work? Thank you, Eric

Re: Spark Function setup and cleanup

2014-07-24 Thread Yosi Botzer
In my case I want to reach HBase. For every record with userId I want to get some extra information about the user and add it to the result record for further processing On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang wrote: > If you want to connect to DB in program, you can use JdbcRDD ( > https://git

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Thank you Nishkam, I have read your code. So, for the sake of my understanding, it seems that for each spark context there is one executor per node? Can anyone confirm this? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam R

Re: new error for me

2014-07-24 Thread phoenix bai
I am currently facing the same problem. error snapshot as below: 14-07-24 19:15:30 WARN [pool-3-thread-1] SendingConnection: Error finishing connection to r64b22034.tt.net/10.148.129.84:47525 java.net.ConnectException: Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Nativ

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
If you want to connect to a DB in your program, you can use JdbcRDD ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ) 2014-07-24 18:32 GMT+08:00 Yosi Botzer : > Hi, > > I am using the Java api of Spark. > > I wanted to know if there is a way to run s
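A hedged sketch of what JdbcRDD usage looks like (connection string, query, and bounds are placeholders; the SQL must contain the two '?' bound parameters):

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val users = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass"),
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    1L, 1000000L, 10,                                  // lowerBound, upperBound, numPartitions
    rs => (rs.getLong("id"), rs.getString("name"))     // map each ResultSet row
  )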

Re: save to HDFS

2014-07-24 Thread lmk
Thanks Akhil. I was able to view the files. Actually I was trying to list the same using regular ls and since it did not show anything I was concerned. Thanks for showing me the right direction. Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/s

Re: save to HDFS

2014-07-24 Thread Akhil Das
This piece of code saveAsHadoopFile[TextOutputFormat[NullWritable,Text]]("hdfs://masteripaddress:9000/root/test-app/test1/") Saves the RDD into HDFS, and yes you can physically see the files using the hadoop command (hadoop fs -ls /root/test-app/test1 - yes you need to log in to the cluster). In

Re: save to HDFS

2014-07-24 Thread lmk
Hi Akhil, I am sure that the RDD that I saved is not empty. I have tested it using take. But is there no way that I can see this saved physically like we do in the normal context? Can't I view this folder as I am already logged into the cluster? And, should I run hadoop fs -ls hdfs://masteripaddres

Re: Starting with spark

2014-07-24 Thread Akhil Das
Here's the complete overview http://spark.apache.org/docs/latest/ And here's the quick start guidelines http://spark.apache.org/docs/latest/quick-start.html I would suggest downloading the Spark pre-compiled binaries

Re: save to HDFS

2014-07-24 Thread Akhil Das
Are you sure the RDD that you were saving isn't empty!? Are you seeing a _SUCCESS file in this location? hdfs://masteripaddress:9000/root/test-app/test1/ (Do hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/) Thanks Best Regards On Thu, Jul 24, 2014 at 4:24 PM, lmk wrote: > Hi

save to HDFS

2014-07-24 Thread lmk
Hi, I have a scala application which I have launched into a spark cluster. I have the following statement trying to save to a folder in the master: saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs://masteripaddress:9000/root/test-app/test1/") The application is executed successfully an

Starting with spark

2014-07-24 Thread Sameer Sayyed
Hello All, I am a new user of Spark, and I am using *cloudera-quickstart-vm-5.0.0-0-vmware* to execute sample examples of Spark. I am very sorry for the silly and basic question. I am not able to deploy and execute the sample examples of Spark. Please suggest *how to start with Spark*. Please help me Than

Re: streaming sequence files?

2014-07-24 Thread Sean Owen
Can you just call fileStream or textFileStream in the second app, to consume files that appear in HDFS / Tachyon from the first job? On Thu, Jul 24, 2014 at 2:43 AM, Barnaby wrote: > If I save an RDD as a sequence file such as: > > val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) >
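A hedged sketch of the consuming side (directory and types are placeholders matching the word-count example):

  import org.apache.hadoop.io.{IntWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

  val counts = ssc.fileStream[Text, IntWritable, SequenceFileInputFormat[Text, IntWritable]](
    "tachyon://localhost:19998/files/WordCounts"
  ).map { case (k, v) => (k.toString, v.get) }   // copy out of the reused Writables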

Spark Function setup and cleanup

2014-07-24 Thread Yosi Botzer
Hi, I am using the Java api of Spark. I wanted to know if there is a way to run some code in a manner that is like the setup() and cleanup() methods of Hadoop Map/Reduce The reason I need it is because I want to read something from the DB according to each record I scan in my Function, and I wou
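For illustration, one common hedged pattern is mapPartitions: do the "setup" once per partition, consume the iterator, then do the "cleanup" (createConnection and lookupUser are hypothetical helpers, records a hypothetical RDD):

  val enriched = records.mapPartitions { iter =>
    val conn = createConnection()                                            // setup: once per partition
    val out = iter.map(rec => (rec, lookupUser(conn, rec.userId))).toList    // materialize before closing
    conn.close()                                                             // cleanup
    out.iterator
  }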

Re: Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-24 Thread Sean Owen
Yeah reduce() will leave you with one big collection of sets on the driver. Maybe the set of all identifiers isn't so big -- a hundred million Longs even isn't so much. I'm glad to hear cartesian works, but can that scale? You're making an RDD of N^2 elements initially, which is just vast. On Thu, J

spark streaming actor receiver doesn't play well with kryoserializer

2014-07-24 Thread Alan Ngai
It looks like when you configure SparkConf to use the KryoSerializer in combination with using an ActorReceiver, bad things happen. I modified the ActorWordCount example program from val sparkConf = new SparkConf().setAppName("ActorWordCount") to val sparkConf = new SparkConf()
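For reference, the change described is presumably along these lines (hedged sketch of how Kryo is normally enabled):

  val sparkConf = new SparkConf()
    .setAppName("ActorWordCount")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")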

Re: Kyro deserialisation error

2014-07-24 Thread Guillaume Pitel
Hi, We've got the same problem here (randomly happens) : Unable to find class: 6 4 ڗ4Ú» 8 &44Úº*Q|T4⛇` j4 Ǥ4ꙴg8 4 ¾4Ú»  4 4Ú» pE4ʽ4ں*WsѴμˁ4ڻ4ʤ4ցbל4ڻ& 4[͝4[ۦ44ڻ!~44