Re: why does the driver's connection to the master fail?

2014-10-20 Thread randylu
My application runs LDA (a topic model, trained with Gibbs sampling); it's hard for me to explain LDA here, so please search for it on Google if needed. I did increase spark.akka.frameSize to 1GB (even 5GB), both in the master's/workers' spark-defaults.conf and in SparkConf, but it has no effect at all. I'm no

Re: org/I0Itec/zkclient/serialize/ZkSerializer ClassNotFound

2014-10-20 Thread skane
I'm having the same problem with Spark 1.0.0. I got the " JavaKafkaWordCount.java" example working on my workstation running Spark locally after doing a build, but when I tried to get the example running on YARN, I got the same error. I used the "uber jar" that was created during the build process

Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode?

2014-10-20 Thread Soumya Simanta
I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions at http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually using branch-1.1, and I can start the master using ./sbin/start-master.sh

Re: why does the driver's connection to the master fail?

2014-10-20 Thread Akhil Das
You could try setting spark.akka.frameSize while creating the SparkContext, but it's strange that the message says your master is dead; usually it's the other way around and an executor dies. Can you also explain the behavior of your application (what exactly are you doing with the 8Gb of data)? E

Re: default parallelism bug?

2014-10-20 Thread Kevin Jung
I use Spark 1.1.0 and set these options to spark-defaults.conf spark.scheduler.mode FAIR spark.cores.max 48 spark.default.parallelism 72 Thanks, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/default-parallelism-bug-tp16787p16894.html Sent from the A

spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-20 Thread tridib
Hello Experts, I have two tables built using jsonFile(). I can successfully run join queries on these tables, but once I cacheTable(), every join query fails. Here is the stack trace: java.lang.NullPointerException at org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryCol

Re: How to not write empty RDD partitions in RDD.saveAsTextFile()

2014-10-20 Thread Yi Tian
I think you could use `repartition` to make sure there are no empty partitions. You could also try `coalesce` to combine partitions, but it can't guarantee that no empty partitions remain. Best Regards, Yi Tian tianyi.asiai...@gmail.com On Oct 18, 2014, at 20:30, jan.zi...@centrum
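A minimal Scala sketch of the suggestion above, assuming a spark-shell session (so `sc` exists); paths and partition counts are illustrative:

    // A filter can easily leave some partitions empty.
    val lines = sc.textFile("hdfs:///input/path").filter(_.nonEmpty)

    // repartition() shuffles the data into exactly 8 roughly even partitions,
    // so none of them should end up empty (given enough records).
    lines.repartition(8).saveAsTextFile("hdfs:///output/repartitioned")

    // coalesce() avoids a full shuffle but only merges existing partitions,
    // so it cannot guarantee that no empty partitions remain.
    lines.coalesce(8).saveAsTextFile("hdfs:///output/coalesced")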

Re: default parallelism bug?

2014-10-20 Thread Yi Tian
Could you show your Spark version and the value of `spark.default.parallelism` you are setting? Best Regards, Yi Tian tianyi.asiai...@gmail.com On Oct 20, 2014, at 12:38, Kevin Jung wrote: > Hi, > I usually use file on hdfs to make PairRDD and analyze it by using > combineByKey,reduceByK

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Yes, SPARK-3853 just got merged 11 days ago. It should be OK in 1.2.0. And for the first approach, it would be OK after SPARK-4003 is merged. -Original Message- From: tridib [mailto:tridib.sama...@live.com] Sent: Tuesday, October 21, 2014 11:09 AM To: u...@spark.incubator.apache.org Subj

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
The exception from the second approach has been resolved by SPARK-3853. Thanks, Daoyuan -Original Message- From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Tuesday, October 21, 2014 11:06 AM To: tridib; u...@spark.incubator.apache.org Subject: RE: spark sql: timestamp in json - fails

RE: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Spark 1.1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864p16888.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - T

RE: Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
In addition, how do I convert an Iterable[Iterable[T]] to an RDD[T]? Thanks, Kevin. From: Dai, Kevin [mailto:yun...@ebay.com] Sent: October 21, 2014 10:58 To: user@spark.apache.org Subject: Convert Iterable to RDD Hi, All Is there any way to convert an Iterable to an RDD? Thanks, Kevin.
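For reference, a hedged Scala sketch of one way to do both conversions, assuming a spark-shell session and that the Iterable fits in driver memory (sc.parallelize materializes it there); names are illustrative:

    import org.apache.spark.rdd.RDD

    val it: Iterable[Int] = Seq(1, 2, 3)
    val rdd: RDD[Int] = sc.parallelize(it.toSeq)                 // Iterable[T] -> RDD[T]

    val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3, 4))
    val flat: RDD[Int] = sc.parallelize(nested.flatten.toSeq)    // Iterable[Iterable[T]] -> RDD[T]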

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
That's weird; I think we have that pattern match in enforceCorrectType. What version of Spark are you using? Thanks, Daoyuan -Original Message- From: tridib [mailto:tridib.sama...@live.com] Sent: Tuesday, October 21, 2014 11:03 AM To: u...@spark.incubator.apache.org Subject: Re: spark s

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Ilya Ganelin
Hey Steve - the way to do this is to use the coalesce() function to coalesce your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-00000 containing all the data. -Ilya Ganelin On Mon, Oct 20, 2014 at 11:01 PM, jay vyas wrote: > sounds more like
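A short Scala sketch of the coalesce-to-one-partition approach (output path assumed); note that this funnels all data through a single task:

    val results = sc.parallelize(Seq("line 1", "line 2", "line 3"))   // stand-in for the real RDD

    // One partition => a single part-00000 file containing all of the data.
    results.coalesce(1).saveAsTextFile("hdfs:///output/single-file")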

Re: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Stack trace for my second case: 2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0) scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$) at org.ap

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
sounds more like a use case for using "collect"... and writing out the file in your program? On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis wrote: > Sorry I missed the discussion - although it did not answer the question - > In my case (and I suspect the askers) the 100 slaves are doing a lot of >

Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
Hi, All Is there any way to convert an Iterable to an RDD? Thanks, Kevin.

Spark SQL: sqlContext.jsonFile date type detection and performance

2014-10-20 Thread tridib
Hi Spark SQL team, I am trying to explore automatic schema detection for JSON documents. I have a few questions: 1. What should the date format be for fields to be detected as a date type? 2. Is automatic schema inference slower than applying a specific schema? 3. At this moment I am parsing the JSON myself using a map Fun

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Seems I made a mistake… From: Wang, Daoyuan Sent: Tuesday, October 21, 2014 10:35 AM To: 'Yin Huai' Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org Subject: RE: spark sql: timestamp in json - fails I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that together

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that together. Thanks, Daoyuan From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Tuesday, October 21, 2014 10:13 AM To: Wang, Daoyuan Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org Subject: Re: spark sql: tim

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Seems the second approach does not go through applySchema. So, I was wondering if there is an issue related to our JSON apis in Java. On Mon, Oct 20, 2014 at 10:04 PM, Wang, Daoyuan wrote: > I think this has something to do with my recent work at > https://issues.apache.org/jira/browse/SPARK-40

Help with an error

2014-10-20 Thread Sunandan Chakraborty
Hi, I am trying to use Spark to perform some basic text processing on news articles. Recently I have been facing issues with code which ran perfectly well on the same data before. I am pasting the last few lines, including the exception message. I am using Python. Can anybody suggest a remedy? Thanks, Sun

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I think this has something to do with my recent work at https://issues.apache.org/jira/browse/SPARK-4003 You can check PR https://github.com/apache/spark/pull/2850 . Thanks, Daoyuan From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Tuesday, October 21, 2014 10:00 AM To: Michael Armbrust Cc: tr

Re: why does the driver's connection to the master fail?

2014-10-20 Thread randylu
The cluster also runs other applications every hour as normal, so the master is always running. No matter how many cores I use or how much input data I feed it (as long as it is big enough), the application just fails about 1.1 hours later. -- View this message in context: http://apache-spark-user-list.1001560.n3.

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Hi Tridib, For the second approach, can you attach the complete stack trace? Thanks, Yin On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust wrote: > I think you are running into a bug that will be fixed by this PR: > https://github.com/apache/spark/pull/2850 > > On Mon, Oct 20, 2014 at 4:34 PM

Re: add external jars to spark-shell

2014-10-20 Thread Denny Lee
--jars (ADD_JARS) uses special class loading for Spark, while --driver-class-path (SPARK_CLASSPATH) is captured by the startup scripts and appended to the classpath settings that are used to start the JVM running the driver. You can reference https://www.concur.com/blog/en-us/connect-tableau-to-sparksql on

RE: Shuffle files

2014-10-20 Thread Shao, Saisai
Hi Song, from what I know about sort-based shuffle, the number of files opened in parallel is normally much smaller than for hash-based shuffle. In hash-based shuffle, the number of files opened in parallel is C * R (where C is the number of cores used and R is the number of reducers); as you can see the file

Re: Shuffle files

2014-10-20 Thread Chen Song
My observation is the opposite. When my job runs with the default spark.shuffle.manager, I don't see this exception; however, when it runs with the SORT-based one, I start seeing this error. How would that be possible? I am running my job on YARN, and I noticed that the YARN process limits (cat /proc/$PID/limits

Re: worker_instances vs worker_cores

2014-10-20 Thread Anny Chen
Thanks a lot Andrew! Yeah I actually realized that later. I made a silly mistake here. On Mon, Oct 20, 2014 at 6:03 PM, Andrew Ash wrote: > Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of spark workers > running on a single box. If you change the number you change how the > hardware

Re: worker_instances vs worker_cores

2014-10-20 Thread Andrew Ash
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of the Spark worker running on a single box. If you change the number, you change how the hardware you have is split up (useful for breaking large servers into <32GB heaps each, which perform better), but it doesn't change the amount of hardware you h

add external jars to spark-shell

2014-10-20 Thread Chuang Liu
Hi: I am using Spark 1.1 and want to add external jars to spark-shell. I dug around and found others are doing it in two ways. *Method 1* bin/spark-shell --jars "" --master ... *Method 2* ADD_JARS= SPARK_CLASSPATH= bin/spark-shell --master ... What is the difference between these two m

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Michael Armbrust
I think you are running into a bug that will be fixed by this PR: https://github.com/apache/spark/pull/2850 On Mon, Oct 20, 2014 at 4:34 PM, tridib wrote: > Hello Experts, > After repeated attempt I am unable to run query on map json date string. I > tried two approaches: > > *** Approach 1 ***

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am using globs, though: raw = sc.textFile("/path/to/dir/*/*"), and I have tons of files, so 1 file per partition should not be a problem. On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > The biggest danger with gzipped files is this: > > >>> raw = sc.textFi

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
The biggest danger with gzipped files is this: >>> raw = sc.textFile("/path/to/file.gz", 8) >>> raw.getNumPartitions() 1 You think you’re telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition. I
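A hedged Scala sketch of the usual workaround (path and partition count assumed): read the gzipped file, then repartition so that later stages run in parallel even though the read itself cannot:

    val raw = sc.textFile("hdfs:///path/to/file.gz")   // gzip is not splittable: 1 file => 1 partition
    val parallel = raw.repartition(32)                 // one shuffle so later stages can use all cores
    println(parallel.partitions.length)                // 32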

spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Hello Experts, after repeated attempts I am unable to run a query on a JSON date string mapped to a timestamp. I tried two approaches: *** Approach 1 *** created a Bean class with a timestamp field. When I try to run it I get scala.MatchError: class java.sql.Timestamp (of class java.lang.Class). Here is the code: import j
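For context, a hedged Scala sketch of the explicit-schema route from the Spark SQL programming guide (field names, input format, and path are assumptions); the rest of this thread indicates 1.1.0 had bugs in exactly this area (SPARK-3853, SPARK-4003), so it may only behave on later versions:

    import java.sql.Timestamp
    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("created", TimestampType, nullable = true)))

    // Build Row objects with a real java.sql.Timestamp instead of relying on inference.
    val rows = sc.textFile("hdfs:///events.csv").map(_.split(",")).map { a =>
      Row(a(0), Timestamp.valueOf(a(1)))               // expects "yyyy-MM-dd HH:mm:ss[.f...]"
    }
    sqlContext.applySchema(rows, schema).registerTempTable("events")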

Re: example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
Fixed by recompiling. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/example-jar-caused-exception-when-running-pi-py-spark-1-1-tp16849p16862.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
Sorry I missed the discussion - although it did not answer the question - In my case (and I suspect the asker's) the 100 slaves are doing a lot of useful work but the generated output is small enough to be handled by a single process. Many of the large data problems I have worked on process a lot of da

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
Hi Nicholas, Gzipping is an impressive guess! Yes, they are. My data sets are too large to make repartitioning viable, but I could try it on a subset. I generally have many more partitions than cores. This was happening before I started setting those configs. Thanks, Daniel On Mon, Oct 20, 20

Re: Error while running Streaming examples - no snappyjava in java.library.path

2014-10-20 Thread Buntu Dev
Thanks Akhil. On Mon, Oct 20, 2014 at 1:57 AM, Akhil Das wrote: > Its a known bug in JDK7 and OSX's naming convention, here's how to resolve > it: > > 1. Get the Snappy jar file from > http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ > 2. Copy the appropriate one to your project'

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with defaults all around? On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler wrote: >

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Sean Owen
This was covered a few days ago: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html The multiple output files is actually essential for parallelism, and certainly not a bad idea. You don't want 100 distributed workers writing to 1 file

worker_instances vs worker_cores

2014-10-20 Thread anny9699
Hi, I have a question about the worker_instances setting and the worker_cores setting in an AWS EC2 cluster. I understand it is a cluster, and the default setting in the cluster is *SPARK_WORKER_CORES = 8 SPARK_WORKER_INSTANCES = 1*. However, after I changed it to *SPARK_WORKER_CORES = 8 SPARK_WORKER_INS

How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
At the end of a set of computations I have a JavaRDD<String>. I want a single file where each string is printed in order. The data is small enough that it is acceptable to handle the printout on a single processor, but it may be large enough that using collect to generate a list would be unacceptable. The sa

saveAsSequenceFile with codec and compression type

2014-10-20 Thread gpatcham
Hi All, I'm trying to save an RDD as a sequence file and am not able to set the compression type (BLOCK or RECORD). Can anyone let me know how we can set the compression type? Here is the code I'm using: RDD.saveAsSequenceFile(target, Some(classOf[org.apache.hadoop.io.compress.GzipCodec])) Thanks -- View this mes
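saveAsSequenceFile does not expose the SequenceFile compression type directly. A hedged workaround (a sketch only; key/value types, path, and behaviour across Spark/Hadoop versions are assumptions to verify) is to drop down to saveAsHadoopFile with an explicit JobConf:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.SequenceFile.CompressionType
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}

    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))        // stand-in data
      .map { case (k, v) => (new Text(k), new Text(v)) }               // explicit Writables

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("mapred.output.compress", "true")
    jobConf.set("mapred.output.compression.codec", classOf[GzipCodec].getCanonicalName)
    jobConf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)   // or RECORD

    pairs.saveAsHadoopFile("hdfs:///output/seq",
      classOf[Text], classOf[Text], classOf[SequenceFileOutputFormat[Text, Text]], jobConf)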

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I launch the cluster using vanilla spark-ec2 scripts. I just specify the number of slaves and instance type On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler wrote: > I usually run interactively from the spark-shell. > My data definitely has more than enough partitions to keep all the workers > bus

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all the workers busy. When I first launch the cluster I do: + cat <> ~/spark/conf/spark-defaults.conf spark.serializer org.apache

example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py example in the cluster without using the example.jar, it works, but if I add the example.jar as the driver class (something like the following), it fails with an exception. Could anyone help with this? -- what is the cause of the problem?

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Perhaps your RDD is not partitioned enough to utilize all the cores in your system. Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report on how many partitions your RDDs have? On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler wrote: >

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniil Osipov
How are you launching the cluster, and how are you submitting the job to it? Can you list any Spark configuration parameters you provide? On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler wrote: > > I am launching EC2 clusters using the spark-ec2 scripts. > My understanding is that this configures

Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures Spark to use the available resources. I can see that Spark will use the available memory on larger instance types. However, I have never seen Spark running at more than 400% (using 100% on 4 cores) on ma

Re: ALS implicit error pyspark

2014-10-20 Thread Gen
Hi everyone, according to Xiangrui Meng (I think he is the author of ALS), this problem is caused by Kryo serialization: /In PySpark 1.1, we switched to Kryo serialization by default. However, ALS code requires special registration with Kryo in order to work. The error happens when there is

Saving very large data sets as Parquet on S3

2014-10-20 Thread Daniel Mahler
I am trying to convert some JSON logs to Parquet and save them on S3. In principle this is just: import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") Th

Re: RDD Cleanup

2014-10-20 Thread maihung
Hi Prem, I am experiencing the same problem on Spark 1.0.2 and Job Server 0.4.0 Did you find a solution for this problem? Thank you, Hung -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p16843.html Sent from the Apache Spark User List ma

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi Michael, Thanks again for the reply. Was hoping it was something I was doing wrong in 1.1.0, but I’ll try master. Thanks, -Terry From: Michael Armbrust mailto:mich...@databricks.com>> Date: Monday, October 20, 2014 at 12:11 PM To: Terry Siu mailto:terry@smartfocus.com>> Cc: "user@spark.a

Re: Attaching schema to RDD created from Parquet file

2014-10-20 Thread Michael Armbrust
This has to be done manually: rdd2.map(row => Person(row.getString(0), row.getInt(1))) On Fri, Oct 17, 2014 at 9:30 AM, Akshat Aranya wrote: > Hi, > > How can I convert an RDD loaded from a Parquet file into its original type: > > case class Person(name: String, age: Int) > > val rdd: RDD[Perso
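Expanding the one-liner above into a hedged, self-contained Scala sketch for a spark-shell session (the Parquet path is an assumption):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rdd2 = sqlContext.parquetFile("hdfs:///people.parquet")   // a SchemaRDD of generic Rows

    // Rebuild the original case class manually, field by field.
    val people = rdd2.map(row => Person(row.getString(0), row.getInt(1)))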

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Michael Armbrust
Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release. On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu wrote: > Hi all, > > I’m getting a TreeNodeException for unresolved attributes when I do a > simple se

spark sql not able to find classes with --jars option

2014-10-20 Thread sadhan
When I update the classpath in bin/spark-class by providing the dependency jar, everything works fine, but when I try to provide the same jar through the --jars option, it throws an error while running SQL queries saying that it cannot find the relevant SerDe class files. I guess this is ok for standalone mode (u

CustomReceiver : ActorOf vs ActorSelection

2014-10-20 Thread vvarma
I have been trying to implement a CustomReceiverStream using Akka actors. I have a feeder actor which produces messages and a Receiver actor (I give this to the Spark Streaming context to get an actor stream) which subscribes to the feeder. When I create the feeder actor within the scope of the receive

Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
Thanks for the pointers. I will look into the oryx2 design and see whether we need a spray/akka-http based backend... I feel we will, especially when we have a model database for a number of scenarios (say 100 scenarios, each building a different ALS model). I am not sure if we really need a full-blown databas

java.lang.OutOfMemoryError: Java heap space during reduce operation

2014-10-20 Thread ayandas84
Hi, *In a reduce operation I am trying to accumulate a list of SparseVectors. The code is given below:* val WNode = trainingData.reduce{ (node1: Node, node2: Node) => val wNode = new Node(num1, num2) wNode.WhatList ++= (node1.WList)

SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi all, I’m getting a TreeNodeException for unresolved attributes when I do a simple select from a schemaRDD generated by a join in Spark 1.1.0. A little background first. I am using a HiveContext (against Hive 0.12) to grab two tables, join them, and then perform multiple INSERT-SELECT with GR

Re: How to aggregate data in Apache Spark

2014-10-20 Thread Davies Liu
You also could use Spark SQL: from pyspark.sql import Row, SQLContext row = Row('id', 'C1', 'C2', 'C3') # convert each data = sc.textFile("test.csv").map(lambda line: line.split(',')) sqlContext = SQLContext(sc) rows = data.map(lambda r: row(*r)) sqlContext.inferSchema(rows).registerTempTable("da

Re: How to emit multiple keys for the same value?

2014-10-20 Thread DB Tsai
You can do this using flatMap, which returns a Seq of (key, value) pairs. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 20, 2014 at 9:31 AM, HARIPRIYA AYYALASOMAYAJULA < aharipriy
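A minimal Scala sketch of the flatMap approach for the (k, v) -> (k, v), (k+1, v), ..., (k+n, v) expansion asked about in this thread (n and the sample data are assumptions):

    val n = 2
    val input = sc.parallelize(Seq((1, "a"), (5, "b")))   // stand-in (k, v) pairs

    // flatMap lets one input record emit several (key, value) pairs.
    val expanded = input.flatMap { case (k, v) => (0 to n).map(i => (k + i, v)) }

    expanded.collect()   // Array((1,a), (2,a), (3,a), (5,b), (6,b), (7,b))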

Re: How to emit multiple keys for the same value?

2014-10-20 Thread Boromir Widas
flatMap should help, it returns a Seq for every input. On Mon, Oct 20, 2014 at 12:31 PM, HARIPRIYA AYYALASOMAYAJULA < aharipriy...@gmail.com> wrote: > Hello, > > I am facing a problem with implementing this - My mapper should emit > multiple keys for the same value -> for every input (k, v) it sh

How to emit multiple keys for the same value?

2014-10-20 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, I am facing a problem with implementing this: my mapper should emit multiple keys for the same value, i.e. for every input (k, v) it should emit (k, v), (k+1, v), (k+2, v), ..., (k+n, v). In MapReduce it was pretty straightforward: I used a for loop and performed a Context.write within it. Thi

Spark Streaming occasionally hangs after processing first batch

2014-10-20 Thread t1ny
Hi all, Spark Streaming occasionally (not always) hangs indefinitely on my program right after the first batch has been processed. As you can see in the following screenshots of the Spark Streaming monitoring UI, it hangs on the map stages that correspond (I assume) to the second batch that is bei

Spark-jobserver for java apps

2014-10-20 Thread Tomer Benyamini
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps looks like things are working smoothly, but for java apps, I have an issue with implementing t

Re: why fetch failed

2014-10-20 Thread DB Tsai
I ran into the same issue when the dataset is very big. Marcelo from Cloudera found that it may be caused by SPARK-2711, so their Spark 1.1 release reverted SPARK-2711, and the issue is gone. See https://issues.apache.org/jira/browse/SPARK-3633 for detail. You can checkout Cloudera's version her

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin, Sorry for the delay. I’ll try the code change when I get a chance, but Michael’s initial response did solve my problem. In the meantime, I’m hitting another issue with SparkSQL, about which I will probably post another message if I can’t figure out a workaround. Thanks, -Terry From: Yin Huai

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
One detail, even forcing partitions (/repartition/), spark is still holding some tasks; if I increase the load of the system (increasing /spark.streaming.receiver.maxRate/), even if all workers are used, the one with the receiver gets twice as many tasks compared with the other workers. Total del

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver…. mn > On Oct 20, 2014, at 3:07 AM, Gerard Maas wrote: > > Pinging TD -- I'm sure you know :-) > > -kr, Gerard. >

Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
MLlib relies on breeze for much of its linear algebra, which in turn relies on netlib-java. netlib-java will attempt to load a native BLAS at runtime and then attempt to load its own precompiled version. Failing that, it will default back to a Java version that it has built in. The Java version

Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
No, I believe unpersist acts immediately. On Mon, Oct 20, 2014 at 10:13 AM, marylucy wrote: > thank you for your reply! > is the unpersist operation lazy? if yes, how to decrease memory size as quickly > as possible > On Oct 20, 2014, 21:26, "Nicholas Chammas" wrote: > I believe it won't show up there
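For reference, a hedged Scala sketch: RDD.unpersist takes a blocking flag, so the caller can wait until the cached blocks are actually removed (behaviour worth verifying on your Spark version; the path is an assumption):

    val input = sc.textFile("hdfs:///people/testinput/").cache()
    input.count()                     // an action materializes the cached blocks

    input.unpersist(blocking = true)  // returns only after the blocks have been dropped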

Re: How to show RDD size

2014-10-20 Thread marylucy
Thank you for your reply! Is the unpersist operation lazy? If yes, how can I decrease memory usage as quickly as possible? > On Oct 20, 2014, 21:26, "Nicholas Chammas" wrote: > > I believe it won't show up there until you trigger an action that causes the > RDD to actually be cached. Remember that certain oper

Re: Designed behavior when master is unreachable.

2014-10-20 Thread preeze
Hi Andrew, The behavior that I see now is that under the hood it tries to reconnect endlessly. While this lasts, the thread that tries to fire a new task is blocked at JobWaiter.awaitResult() and never gets released. The full stacktrace for spark-1.0.2 is: "jmsContainer-7" prio=10 tid=0x7f18f

Re: What executes on worker and what executes on driver side

2014-10-20 Thread Saurabh Wadhawan
What about: http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E Regards - Saurabh Wadhawan On 20-Oct-2014, at 4:56 pm, Kamal Banga mailto:ban

Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
I believe it won't show up there until you trigger an action that causes the RDD to actually be cached. Remember that certain operations in Spark are *lazy*, and caching is one of them. Nick On Mon, Oct 20, 2014 at 9:19 AM, marylucy wrote: > in spark-shell, I do as follows > val input = sc.t
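A tiny Scala sketch of the point above, using the snippet from the original question: nothing appears in the Storage tab until an action runs.

    val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
    input.cache()     // lazy: nothing shows up in the web UI's Storage tab yet
    input.count()     // the first action materializes the RDD; its size then appears under Storage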

How to show RDD size

2014-10-20 Thread marylucy
In spark-shell, I do as follows: val input = sc.textFile("hdfs://192.168.1.10/people/testinput/") input.cache() In the web UI, I cannot see any RDD in the Storage tab. Can anyone tell me how to show the RDD size? Thank you -

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Guillaume Pitel
Well, reading your logs, here is what happens: you do a combineByKey (so you probably have a join somewhere), which spills to disk because it's too big. To spill to disk it serializes, and the blocks are > 2GB. From a 2GB dataset, it's easy to expand to several TB. Increase parallelism, make

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Guillaume Pitel
Hi, The array size you (or the serializer) try to allocate is just too big for the JVM. No configuration can help: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split your problem further by increasing parallelism. Guillaume Hi, I’m using S

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Akhil Das
Try setting SPARK_EXECUTOR_MEMORY=5g (not sure how many workers you have). You can also set the executor memory while creating the sparkContext (like *sparkContext.set("spark.executor.memory","5g")* ) Thanks Best Regards On Mon, Oct 20, 2014 at 5:01 PM, Arian Pasquali wrote: > Hi Akhil, >
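A hedged note on the suggestion above: spark.executor.memory is normally set on the SparkConf before the SparkContext is constructed, roughly like this (app name and value are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-example")
      .set("spark.executor.memory", "5g")   // must be set before the context is created

    val sc = new SparkContext(conf)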

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi Akhil, thanks for your help, but I was originally running without the -Xmx option. With it I was just trying to push the limit of my heap size, but I was obviously doing it wrong. Arian Pasquali http://about.me/arianpasquali 2014-10-20 12:24 GMT+01:00 Akhil Das : > Hi Arian, > > You will get this e

Re: What executes on worker and what executes on driver side

2014-10-20 Thread Kamal Banga
1. All RDD operations are executed on workers, so reading a text file or executing val x = 1 will happen on a worker (link). 2. a. Without broadcast: let's say you have 'n' nodes. You can set Hadoop's replication factor to n

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Akhil Das
Hi Arian, You will get this exception because you are trying to create an array that is larger than the maximum contiguous block of memory in your Java VM's heap. Here, since you are setting the worker memory to *5Gb* and you are exporting the *_OPTS as *8Gb*, your application actually thinks it has 8G

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi, I’m using Spark 1.1.0 and I’m having some issues setting up memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration (https://spark.apache.org/docs/1.1.0/configuration.html). My server has 30G of memory and these are my cu

Re: Spark Concepts

2014-10-20 Thread Kamal Banga
1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in conf/spark-env.sh) is used to set the number of worker instances to run on each machine (default is 1). If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker

RDD to Multiple Tables SparkSQL

2014-10-20 Thread critikaled
Hi, I have an RDD which I want to register as multiple tables based on a key: val context = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(context) import sqlContext.createSchemaRDD case class KV(key: String, id: String, value: String) val logsRDD = conte

Re: MLlib linking error Mac OS X

2014-10-20 Thread npomfret
I'm getting the same warning on my Mac, accompanied by what appears to be pretty low CPU usage (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.html); I wonder if they are connected? I've used jblas on a Mac several times, and it always just works perfec

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
Thanks Akhil Das-2: I actually tried setting spark.default.parallelism but it had no effect :-/ I am running standalone and performing a mix of map/filter/foreachRDD. I had to force parallelism with repartition to get both workers to process tasks, but I do not think this should be required (and I am n

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Fengyun RAO
Thank you, Guillaume. My dataset is not that large; it's only ~2GB in total. 2014-10-20 16:58 GMT+08:00 Guillaume Pitel : > Hi, > > It happened to me with blocks which take more than 1 or 2 GB once > serialized > > I think the problem was that during serialization, a Byte Array is > created, and arrays

Re: How to aggregate data in Apache Spark

2014-10-20 Thread Gen
Hi, I will write the code in Python: {code:title=test.py} data = sc.textFile(...).map(...) ## Please make sure that the rdd is like [[id, c1, c2, c3], [id, c1, c2, c3], ...] keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3]))) keypair = keypair.reduceByKey(add) out = keypair.map(lambda l: l

Re: why does the driver's connection to the master fail?

2014-10-20 Thread randylu
Dear Akhil Das-2, my application runs in standalone mode with 50 machines. It's okay if the input file is small, but if I increase the input to 8GB, the application runs just several iterations and then prints the following error logs: 14/10/20 17:15:28 WARN AppClient$ClientActor: Connection to akka.t

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Gerard Maas
Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas wrote: > Hi, > > We have been implementing several Spark Streaming jobs that are basically > processing data and inserting it into Cassandra, sorting it among different > keyspaces. > > We've been fo

Re: why fetch failed

2014-10-20 Thread Akhil Das
I used to hit this issue when my data size was too large and the number of partitions was too large (> 1200). I got rid of it by: - Reducing the number of partitions - Setting the following while creating the sparkContext: .set("spark.rdd.compress", "true") .set("spark.storage.memoryF

Re: Spark SQL on XML files

2014-10-20 Thread Akhil Das
One approach would be to convert those XML files into JSON files and use jsonRDD; another approach would be to convert the XML files into Parquet files and use parquetFile. Thanks Best Regards On Sun, Oct 19, 2014 at 9:38 AM, gtinside wrote: > Hi , > > I have bunch of Xml files and I want to ru

Re: why does the driver's connection to the master fail?

2014-10-20 Thread Akhil Das
What is the application that you are running, and what is your cluster setup? Given the logs, it looks like the master is dead for some reason. Thanks Best Regards On Sun, Oct 19, 2014 at 2:48 PM, randylu wrote: > In additional, driver receives serveral DisassociatedEvent m

Re: Error while running Streaming examples - no snappyjava in java.library.path

2014-10-20 Thread Akhil Das
It's a known bug with JDK7 and OS X's naming convention; here's how to resolve it: 1. Get the Snappy jar file from http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ 2. Copy the appropriate one to your project's class path. Thanks Best Regards On Sun, Oct 19, 2014 at 10:18 PM, bdev

What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Fengyun RAO
The exception drives me crazy because it occurs randomly. I didn't know which line of my code causes this exception, and I didn't even understand what "KryoException: java.lang.NegativeArraySizeException" means or implies. 14/10/20 15:59:01 WARN scheduler.TaskSetManager: Lost task 32.2 in stag

Re: Spark Streaming scheduling control

2014-10-20 Thread Akhil Das
What operations are you performing, and what is your cluster configuration? If you are doing an operation like groupBy, reduceBy, join etc., then you could try providing the level of parallelism. If you give 16, then each of your workers will mostly get 8 tasks to execute. Thanks Best Regards On Mo

Re: How to write a RDD into One Local Existing File?

2014-10-20 Thread Akhil Das
If you don't want part-xxx files in the output but a single file, then you should repartition (or coalesce) the RDD into 1 partition (this will be a bottleneck since you are disabling the parallelism; it's like giving everything to 1 machine to process). You are better off merging those part-xxx files afterwards spar
