Re: Spark memory settings on yarn

2014-08-20 Thread centerqi hu
thanks Marcelo Vanzin I see. org.apache.spark.executor.CoarseGrainedExecutorBackend will be executed Thanks 2014-08-21 12:34 GMT+08:00 Marcelo Vanzin : > That command line you mention in your e-mail doesn't look like > something started by Spark. Spark would start one of > ApplicationMaster, Exec

Re: Re: How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Zhanfeng Huo
Hi, friend: I set env variables in conf/spark-defaults.conf (not spark-default.conf) and use the bin/spark-submit shell to submit the spark application. It works. In addition, you can use bin/spark-submit with the param --properties-file (FILE: Path to a file from which to load extra properties. If not

RE: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread Shao, Saisai
Hi, StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark to combine the power of Catalyst and Spark Streaming, to offer people the ability to manipulate SQL on top of DStream as you wanted; this keeps the same semantics as SparkSQL and offers a SchemaDStrea

Re: DStream cannot write to text file

2014-08-20 Thread Mayur Rustagi
provide the full path of where to write (like hdfs:// etc) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Aug 21, 2014 at 8:29 AM, cuongpham92 wrote: > Hi, > I tried to write to text file from DStream in Spark Stre
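A minimal sketch of the suggestion above; the DStream name and HDFS URI are placeholders, and note that the streaming method is saveAsTextFiles(prefix, suffix), which writes one output directory per batch:

    // `lines` is a placeholder for the DStream from the original post
    lines.saveAsTextFiles("hdfs://namenode:8020/user/spark/streaming/out", "txt")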

Re: Mapping with extra arguments

2014-08-20 Thread Mayur Rustagi
You can add that as part of your RDD, so the output of your map operation generates the input of your next map operation.. of course the obscure logic of generating that data has to be a map .. another way is a nested def def factorial(number: Int) : Int = { def factorialWithAccumulator(accumulator:
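A minimal sketch of the two options above (closure capture versus carrying the value inside the RDD); sc is assumed to be a SparkContext and the numbers are illustrative:

    val factor = 10                                            // the "extra argument"
    val viaClosure = sc.parallelize(1 to 5).map(_ * factor)    // the closure captures `factor`
    val viaData = sc.parallelize(1 to 5)
      .map(x => (x, factor))                                   // carry the value inside the RDD
      .map { case (x, f) => x * f }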

Re: Mapping with extra arguments

2014-08-20 Thread zzl
def foo(extra_arg): …. def bar(row): # your code here return bar then pass foo(extra_arg) to spark map function. -- Best Regards! On Thursday, August 21, 2014 at 2:33 PM, TJ Klein wrote: > Hi, > > I am using Spark in Python. I wonder if there is a possibility for pass

Mapping with extra arguments

2014-08-20 Thread TJ Klein
Hi, I am using Spark in Python. I wonder if there is a possibility for passing extra arguments to the mapping function. In my scenario, after each map I update parameters, which I want to use in the following new iteration of mapping. Any idea? Thanks in advance. -Tassilo -- View this messa

Merging two Spark SQL tables?

2014-08-20 Thread Evan Chan
Is it possible to merge two cached Spark SQL tables into a single table so it can be queried with one SQL statement? I.e., can you do schemaRdd1.union(schemaRdd2), then register the new schemaRdd and run a query over it? Ideally, both schemaRdd1 and schemaRdd2 would be cached, so the union should run
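A minimal sketch of the idea, assuming both tables share the same schema and using the 1.0.x-era API (SchemaRDD.unionAll and registerAsTable); schemaRdd1, schemaRdd2 and sqlContext are taken from the question:

    val merged = schemaRdd1.unionAll(schemaRdd2)   // unionAll preserves the SchemaRDD type and schema
    merged.registerAsTable("merged")
    sqlContext.sql("SELECT COUNT(*) FROM merged")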

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread praveshjain1991
Oh right. Got it. Thanks Also found this link on that discussion: https://github.com/thunderain-project/StreamSQL Does this provide more features than Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp1253

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread Tobias Pfeiffer
Hi, On Thu, Aug 21, 2014 at 3:11 PM, praveshjain1991 wrote: > > The part that you mentioned "*/the variable `result ` is of type > DStream[Row]. That is, the meta-information from the SchemaRDD is lost and, > from what I understand, there is then no way to learn about the column > names > of the

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread tianyi
Thanks for help On Aug 21, 2014, at 10:56, Yin Huai wrote: > If you want to filter the table name, you can use > > hc.sql("show tables").filter(row => !"test".equals(row.getString(0 > > Seems making functionRegistry transient can fix the error. > > > On Wed, Aug 20, 2014 at 8:53 PM,

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread praveshjain1991
Hi, Thanks for the reply and the link. It's working now. From the discussion on the link, I understand that there are some shortcomings while using SQL over streaming. The part that you mentioned "*/the variable `result ` is of type DStream[Row]. That is, the meta-information from the SchemaRDD

Re: How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Akhil Das
One approach would be to set these environment variables inside the spark-env.sh in all workers; then you can access them using System.getenv("WHATEVER") Thanks Best Regards On Wed, Aug 20, 2014 at 9:49 PM, Darin McBeath wrote: > Can't seem to figure this out. I've tried several different
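A minimal sketch of reading such a variable from executor-side code; the variable name MY_SETTING is illustrative and is assumed to have been exported in conf/spark-env.sh on every worker:

    // conf/spark-env.sh on each worker would contain:  export MY_SETTING=foo
    val tagged = sc.parallelize(1 to 10).map { x =>
      val setting = sys.env.getOrElse("MY_SETTING", "unset")   // equivalent to System.getenv("MY_SETTING")
      (setting, x)
    }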

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread Tobias Pfeiffer
Hi, On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 wrote: > > Using Spark SQL with batch data works fine so I'm thinking it has to do > with > how I'm calling streamingcontext.start(). Any ideas what is the issue? Here > is the code: > Please have a look at http://apache-spark-user-list.100

Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread praveshjain1991
I am trying to run SQL queries over streaming data in spark. This looks pretty straightforward, but when I try it, I get the error "table not found: <tablename>". It is unable to find the table I've registered. Using Spark SQL with batch data works fine so I'm thinking it has to do with how I'm calling

Re: Spark memory settings on yarn

2014-08-20 Thread Marcelo Vanzin
That command line you mention in your e-mail doesn't look like something started by Spark. Spark would start one of ApplicationMaster, ExecutableRunner or CoarseGrainedSchedulerBackend, not "org.apache.hadoop.mapred.YarnChild". On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu wrote: > Spark memory se

Re: Web UI doesn't show some stages

2014-08-20 Thread Zhan Zhang
Try to answer your another question. One sortByKey is triggered by rangePartition which does sample to calculate the range boundaries, which again triggers the first reduceByKey. The second sortByKey is doing the real work to sort based on the partition calculated, which again trigger the reduc

RE: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
PR is https://github.com/apache/spark/pull/2074. -- From: Yin Huai Sent: 8/20/2014 10:56 PM To: Vida Ha Cc: tianyi ; Fengyun RAO ; user@spark.apache.org Subject: Re: Got NotSerializableException when access broadcast variable If you want to filter the table name, y

DStream cannot write to text file

2014-08-20 Thread cuongpham92
Hi, I tried to write to text file from DStream in Spark Streaming, using DStream.saveAsTextFile("test","output"), but it did not work. Any suggestions? Thanks in advance. Cuong -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DStream-cannot-write-to-text-file

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
If you want to filter the table name, you can use hc.sql("show tables").filter(row => !"test".equals(row.getString(0 Seems making functionRegistry transient can fix the error. On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha wrote: > Hi, > > I doubt the the broadcast variable is your problem, sin
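A minimal sketch of the same filter with the parentheses the archive truncated restored; hc is the HiveContext from this thread:

    val withoutTest = hc.sql("show tables").filter(row => !"test".equals(row.getString(0)))
    withoutTest.collect().foreach(println)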

Spark memory settings on yarn

2014-08-20 Thread centerqi hu
The Spark memory settings are very confusing to me. My code is as follows. spark-1.0.2-bin-2.4.1/bin/spark-submit --class SimpleApp \ --master yarn \ --deploy-mode cluster \ --queue sls_queue_1 \ --num-executors 3 \ --driver-memory 6g \ --executor-memory 10g \ --executor-cores 5 \ target/scala-2.

Spark QL and protobuf schema

2014-08-20 Thread Dmitriy Lyubimov
Hello, is there any known work to adapt protobuf schema to Spark QL data sourcing? If not, would it be of interest to contribute one? thanks. -d

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Vida Ha
Hi Chris, We have a knowledge base article to explain what's happening here: https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md Let me know if the article is not clear enough - I would be happy to edit and improve it. -Vida On Wed,

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Vida Ha
Hi, I doubt the broadcast variable is your problem, since you are seeing: org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3 We have a knowledgebase article that explains why this happens - it's a

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Marcelo Vanzin
My guess is that your test is trying to serialize a closure referencing "connectionInfo"; that closure will have a reference to the test instance, since the instance is needed to execute that method. Try to make the "connectionInfo" method local to the method where it's needed, or declare it in an
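An illustrative sketch of the suggestion; the suite, test name, and connection string below are all hypothetical:

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class SaveSuite extends FunSuite {
      test("writes rows") {
        val sc = new SparkContext("local", "test")
        // a local val: the closure captures only this String, not the suite instance
        // (a member val/def would drag the whole suite, including scalatest's
        // AssertionsHelper, into closure serialization)
        val connectionInfo = "host:1234"
        sc.parallelize(1 to 10).foreach(x => println(s"$connectionInfo -> $x"))
        sc.stop()
      }
    }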

Re: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Michael Armbrust
You could use the programmatic API to make the hive queries directly. On Wed, Aug 20, 2014 at 9:47 AM, Tam, Ken K wrote: > What is the best way to run Hive queries in 1.0.2? In my case. Hive > queries will be invoked fr

java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Chris Jones
New to Apache Spark, trying to build a scalatest. Below is the error I'm consistently seeing. Somehow Spark is trying to load a scalatest AssertionHelper class which is not serializable. The scalatest I have specified doesn't even have any assertions in it. I added the JVM flag  -Dsun.io.serializa

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi Ankur, thank you for your response. I already looked at the sample code you sent. And I think the modification you are referring to is on the "tryMatch" function of the PartialMatch class. I noticed you have a case in there that checks for a pattern match, and I think that's the code I need to m

Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage. In
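A minimal sketch: the map and filter below pipeline into a single stage, and toDebugString prints the lineage:

    val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
    println(rdd.toDebugString)   // chained lineage; no shuffle in between, so one stage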

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed, Aug
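A minimal sketch of broadcasting the centers, with toy one-dimensional centers standing in for the real ones:

    val centers = Array(0.0, 5.0, 10.0)
    val bcCenters = sc.broadcast(centers)
    val assignments = sc.parallelize(Seq(1.0, 4.5, 9.0)).map { p =>
      val cs = bcCenters.value                                    // read the broadcast value on the executor
      cs.zipWithIndex.minBy { case (c, _) => math.abs(c - p) }._2 // index of the nearest center
    }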

Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM, Gr

Re: GraphX question about graph traversal

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:34:50 -0700, Cesar Arevalo wrote: > I would like to get the type B vertices that are connected through type A > vertices where the edges have a score greater than 5. So, from the example > above I would like to get V1 and V4. It sounds like you're trying to find paths in the grap

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
Thanks, I’ll go ahead and disable that setting for now. From: Aaron Davidson <ilike...@gmail.com> Date: Wednesday, August 20, 2014 at 3:20 PM To: Silvio Fiorito <silvio.fior...@granturing.com> Cc: "user@spark.apache.org" <user@spark.apache.org>

Re: Personalized Page rank in graphx

2014-08-20 Thread Ankur Dave
At 2014-08-20 10:57:57 -0700, Mohit Singh wrote: > I was wondering if Personalized Page Rank algorithm is implemented in graphx. > If the talks and presentation were to be believed > (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) > it is.. but cant find

Small input split sizes

2014-08-20 Thread David Rosenstrauch
I'm still bumping up against this issue: spark (and shark) are breaking my inputs into 64MB-sized splits. Anyone know where/how to configure spark so that it either doesn't split the inputs, or at least uses a much larger split size? (E.g., 512MB.) Thanks, DR On 07/15/2014 05:58 PM, David
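One possible knob, under the assumption that the input goes through Hadoop's FileInputFormat (the property name is from the old mapred API; adjust to the Hadoop version in use, and the path is a placeholder):

    sc.hadoopConfiguration.setLong("mapred.min.split.size", 512L * 1024 * 1024)  // ask for >= 512MB splits
    val data = sc.textFile("hdfs:///path/to/input")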

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hey, thanks for your response. And I had seen the triplets, but I'm not quite sure how the triplets would get me that V1 is connected to V4. Maybe I need to spend more time understanding it, I guess. -Cesar On Wed, Aug 20, 2014 at 10:56 AM, glxc wrote: > I don't know if Pregel would be neces

Re: How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Benyi Wang
I can do that in my application, but I really want to know how I can do it in spark-shell because I usually prototype in spark-shell before I put the code into an application. On Wed, Aug 20, 2014 at 12:47 PM, Sameer Tilak wrote: > Hi Wang, > Have you tried doing this in your application? > >

RE: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
Was able to resolve the parsing issue. Thanks! From: ssti...@live.com To: user@spark.apache.org Subject: FW: Decision tree: categorical variables Date: Wed, 20 Aug 2014 12:48:10 -0700 From: ssti...@live.com To: men...@gmail.com Subject: RE: Decision tree: categorical variables Date: Wed, 20 Au

FW: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
From: ssti...@live.com To: men...@gmail.com Subject: RE: Decision tree: categorical variables Date: Wed, 20 Aug 2014 12:09:52 -0700 Hi Xiangrui, My data is in the following format: 0,1,5,A,8,1,M 0,1,5,B,4,1,M 1,0,2,B,7,0,U 0,1,3,C,8,0,M 0,0,5,C,1,0,M 1,1,5,C,8,0,U 0,0,5,B,8,0,M 1,0,3,B,2,1,M 0,1,5,B,8

RE: How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Sameer Tilak
Hi Wang, Have you tried doing this in your application? conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "yourpackage.MyKryoRegistrator") You then don't need to specify it via the command line. Date: Wed, 20 Aug 2014 12:25:14 -0

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Ok Marcelo, Thanks for the quick and thorough replies. I’ll keep an eye on these tickets and the mailing list to see how things move along. mn On Aug 20, 2014, at 1:33 PM, Marcelo Vanzin wrote: > Hi, > > On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell wrote: >> Specifying the driver-class-p

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Hi, On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell wrote: > Specifying the driver-class-path yields behavior like > https://issues.apache.org/jira/browse/SPARK-2420 and > https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a > can of worms here if I also need to replace the gu

How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Benyi Wang
I want to use opencsv's CSVParser to parse csv lines using a script like below in spark-shell: import au.com.bytecode.opencsv.CSVParser; import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator import org.apache.hadoop.fs.{Path, FileSystem} class MyKryoRegistrator
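A minimal sketch of such a registrator for opencsv's CSVParser, following the snippet above. Because the serializer settings are read when the SparkContext is created, in spark-shell they have to be supplied at launch (for example via a properties file, or the --conf flag shown later in this thread) rather than set on the existing sc:

    import au.com.bytecode.opencsv.CSVParser
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[CSVParser])   // register the non-serializable parser class with Kryo
      }
    }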

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Aaron Davidson
This is likely due to a bug in shuffle file consolidation (which you have enabled) which was hopefully fixed in 1.1 with this patch: https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd Until 1.0.3 or 1.1 are released, the simplest solution is to disable spark.shuffle.co

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Marcelo, Specifying the driver-class-path yields behavior like https://issues.apache.org/jira/browse/SPARK-2420 and https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a can of worms here if I also need to replace the guava dependencies. Wouldn’t calling “./make-distributio

Personalized Page rank in graphx

2014-08-20 Thread Mohit Singh
Hi, I was wondering if Personalized Page Rank algorithm is implemented in graphx. If the talks and presentation were to be believed ( https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf) it is.. but can't find the algo code ( https://github.com/amplab/graphx/tree

Re: GraphX question about graph traversal

2014-08-20 Thread glxc
I don't know if Pregel would be necessary since it's not iterative You could filter the graph by looking at edge triplets, and testing if source =B, dest =A, and edge value > 5 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-question-about-graph-trav
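A minimal sketch of that triplet filter, assuming String vertex types and numeric edge scores as in the original question; graph is the Graph from that thread:

    val matches = graph.triplets.filter(t => t.srcAttr == "B" && t.dstAttr == "A" && t.attr > 5)
    val bVertexIds = matches.map(_.srcId).distinct().collect()   // type-B vertices with a qualifying edge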

Spark-job error on writing result into hadoop w/ switch_user=false

2014-08-20 Thread Jongyoul Lee
Hi, I've used hdfs 2.3.0-cdh5.0.1, mesos 0.19.1 and spark 1.0.2 that is re-compiled. For a security reason, we run hdfs and mesos as the user hdfs, which is an account name not in the root group, and a non-root user submits a spark job on mesos. With no-switch_user, a simple job, which only reads data from hdf

MLlib: issue with increasing maximum depth of the decision tree

2014-08-20 Thread Sameer Tilak
Hi All,My dataset is fairly small -- a CSV file with around half million rows and 600 features. Everything works when I set maximum depth of the decision tree to 5 or 6. However, I get this error for larger values of that parameter -- For example when I set it to 10. Have others encountered a s

GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hi All: I have a question about how to do the following operation in GraphX. Suppose I have a graph with the following vertices and scores on the edges: (V1 {type:"B"})->(V2 {type:"A"})-->(V3 {type:"A"})<-(V4 {type:"B"}), with edge scores 100, 10, and 100 (left to right)

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
Ah, sorry, forgot to talk about the second issue. On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell wrote: > However, now the Spark jobs running in the ApplicationMaster on a given node > fails to find the active resourcemanager. Below is a log excerpt from one > of the assigned nodes. As all the j

Re: spark-submit with HA YARN

2014-08-20 Thread Marcelo Vanzin
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell wrote: > An “unaccepted” reply to this thread from Dean Chen suggested to build Spark > with a newer version of Hadoop (2.4.1) and this has worked to some extent. > I’m now able to submit jobs (omitting an explicit > “yarn.resourcemanager.address” prop

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-20 Thread Steve Lewis
I have made a little progress - by downloading a prebuilt version of Spark I can call spark-shell.cmd and bring up a spark shell. In the shell things run. Next I go to my development environment and try to run JavaWordCount i try -Dspark.master=spark://local[*]:55519 -Dspark.master=spark://Asterix:

RE: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Tam, Ken K
What is the best way to run Hive queries in 1.0.2? In my case. Hive queries will be invoked from a middle tier webapp. I am thinking to use the Hive JDBC driver. Thanks, Ken From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, August 20, 2014 9:38 AM To: Tam, Ken K Cc: user@s

Re: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Michael Armbrust
No. It'll be part of 1.1. On Wed, Aug 20, 2014 at 9:35 AM, Tam, Ken K wrote: > Is Spark SQL Thrift Server part of the 1.0.2 release? If not, which > release is the target? > > > > Thanks, > > Ken >

Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Tam, Ken K
Is Spark SQL Thrift Server part of the 1.0.2 release? If not, which release is the target? Thanks, Ken

Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is a long running Spark Streaming job running in YARN, Spark v1.0.2 on CDH5. The jobs will run for about 34-37 hours then die due to this FileNotFoundException. There’s very little CPU or RAM usage, I’m running 2 x cores, 2 x executors, 4g memory, YARN cluster mode. Here’s the stack trace

Spark exception while reading different inputs

2014-08-20 Thread durga
Hi, I am using the below program in spark-shell to load and filter data from the data sets. I am getting exceptions if I run the program multiple times; if I restart the shell it works fine. 1) please let me know what I am doing wrong. 2) Also is there a way to make the program better

How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Darin McBeath
Can't seem to figure this out.  I've tried several different approaches without success. For example, I've tried setting spark.executor.extraJavaOptions in the spark-default.conf (prior to starting the spark-shell) but this seems to have no effect. Outside of spark-shell (within a java applicat

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
Yes, I’m pretty sure my YARN and HDFS HA configuration is correct. I can use the UIs and HDFS command line tools with HA support as expected (failing over namenodes and resourcemanagers, etc) so I believe this to be a Spark issue. Like I mentioned earlier, if i manipulate the “yarn.resourcemana

Advantage of using cache()

2014-08-20 Thread Grzegorz Białek
Hi, I tried to write a small program which shows that using cache() can speed up execution, but results with and without cache were similar. Could you help me with this issue? I tried to compute an rdd and use it later in two places, and I thought that in the second usage this rdd would be recomputed, but it isn't: val
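A minimal sketch of an experiment where caching should be visible: an expensive transformation followed by two actions (the input path is a placeholder):

    val parsed = sc.textFile("hdfs:///some/large/input")
      .map(line => line.split(",").length)   // stand-in for an expensive computation
      .cache()
    parsed.count()   // first action computes the partitions and populates the cache
    parsed.count()   // second action is served from the cache instead of re-reading the input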

Potential Thrift Server Bug on Spark SQL,perhaps with cache table?

2014-08-20 Thread John Omernik
I am working with Spark SQL and the Thrift server. I ran into an interesting bug, and I am curious on what information/testing I can provide to help narrow things down. My setup is as follows: Hive 0.12 with a table that has lots of columns (50+) stored as rcfile. Spark-1.1.0-SNAPSHOT with Hive

Re: Segmented fold count

2014-08-20 Thread fil
> > > Could I write groupCount() in Scala, and then use it from Pyspark? Care > to > > supply an example, I'm finding them hard to find :) > > It's doable, but not so convenient. If you really care about the > performance > difference, you should write your program in Scala. > Is it possible to wr

Web UI doesn't show some stages

2014-08-20 Thread Grzegorz Białek
Hi, I am wondering why in web UI some stages (like join, filter) are not visible. For example this code: val simple = sc.parallelize(Array.range(0,100)) val simple2 = sc.parallelize(Array.range(0,100)) val toJoin = simple.map(x => (x, x.toString + x.toString)) val rdd = simple2 .map(x =>

RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Thank you for the reply. I implemented my InputDStream to return None when there's no data. After changing it to return an empty RDD, the exception is gone. I am curious as to why all other processing worked correctly with my old, incorrect implementation, with or without data? My actual codes, witho
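A minimal sketch of that fix inside a custom InputDStream; the record type and fetchBatch() are placeholders for the real source:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class MyInputDStream(ssc_ : StreamingContext) extends InputDStream[String](ssc_) {
      override def start(): Unit = {}
      override def stop(): Unit = {}

      private def fetchBatch(time: Time): Seq[String] = Seq.empty   // placeholder for the real fetch

      override def compute(validTime: Time): Option[RDD[String]] = {
        val records = fetchBatch(validTime)
        // return a (possibly empty) RDD rather than None, so downstream count()/foreachRDD
        // always receive an RDD for every batch interval
        Some(ssc_.sparkContext.parallelize(records))
      }
    }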

[no subject]

2014-08-20 Thread Cường Phạm

Re: RDD Row Index

2014-08-20 Thread Sean Owen
zipWithIndex() will give you something like an index for each element in the RDD. If your files are small, you can use SparkContext.wholeTextFiles() to load an RDD where each element is (filename, content). Maybe that's what you are looking for if you are really looking to extract an ID from the fil
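A minimal sketch of both suggestions (the path is a placeholder):

    val indexed = sc.parallelize(Seq("a", "b", "c")).zipWithIndex()   // RDD[(String, Long)]
    val byFile = sc.wholeTextFiles("hdfs:///path/to/small/files")     // RDD[(filename, content)]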

RE: Hi

2014-08-20 Thread Shao, Saisai
Hi, Actually several Java task threads run in a single executor, not separate processes, so each executor will only have one JVM runtime, which is shared by the different task threads. Thanks Jerry From: rapelly kartheek [mailto:kartheek.m...@gmail.com] Sent: Wednesday, August 20, 2014 5:29 PM To: user@s

Hi

2014-08-20 Thread rapelly kartheek
Hi I have this doubt: I understand that each java process runs on a different JVM instance. Now, if I have a single executor on my machine and run several java processes, then there will be several JVM instances running. Now, process_local means the data is located on the same JVM as the task tha

RE: OutOfMemory Error

2014-08-20 Thread Shao, Saisai
Hi Meethu, The spark.executor.memory is the Java heap size of forked executor process. Increasing the spark.executor.memory can actually increase the runtime heap size of executor process. For the details of Spark configurations, you can check: http://spark.apache.org/docs/latest/configuration

RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread Shao, Saisai
Hi, I don't think there's an NPE issue when using DStream/count() even when there is no data fed into Spark Streaming. I tested using Kafka in my local settings, both are OK with and without data consumed. Actually you can see the details in ReceiverInputDStream, even there is no data in this batch

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
And duh, of course, you can do the setup in that new RDD as well :) On Wed, Aug 20, 2014 at 1:59 AM, Victor Tso-Guillen wrote: > How about this: > > val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition > }) > new RDD[V](prev) { > protected def getPartitions = prev.partit

Re: hdfs read performance issue

2014-08-20 Thread Gurvinder Singh
I got some time to look into it. It appears that Spark (latest git) is doing this operation much more often compared to the Aug 1 version. Here is the log from the operation I am referring to 14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not found, computing it 14/08/19 12:37:26 INFO r

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-20 Thread Victor Tso-Guillen
How about this: val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition }) new RDD[V](prev) { protected def getPartitions = prev.partitions def compute(split: Partition, context: TaskContext) = { context.addOnCompleteCallback(() => /*cleanup()*/) firstParent[V].iter
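A simpler, if less memory-friendly, variant of the same per-partition idea: materialize each partition so the cleanup step runs after the last element. setup(), process(), and cleanup() are placeholders for the real resource handling:

    val result = rdd.mapPartitions { iter =>
      setup()                                  // e.g. open a connection, once per partition
      val out = iter.map(process).toVector     // materialize so cleanup happens after the last element
      cleanup()                                // e.g. close the connection
      out.iterator
    }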

Re: OutOfMemory Error

2014-08-20 Thread MEETHU MATHEW
 Hi , How to increase the heap size? What is the difference between spark executor memory and heap size? Thanks & Regards, Meethu M On Monday, 18 August 2014 12:35 PM, Akhil Das wrote: I believe spark.shuffle.memoryFraction is the one you are looking for. spark.shuffle.memoryFraction

Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi, What's the difference between amplab docker and spark docker? Thanks, Josh

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread tianyi
Thanks for help. I run this script again with "bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer” in the console, I can see: scala> sc.getConf.getAll.foreach(println) (spark.tachyonStore.folderName,spark-eaabe986-03cb-41bd-bde5-993c7db3f048) (spark.driver.host,1

Broadcast vs simple variable

2014-08-20 Thread Julien Naour
Hi, I have a question about broadcast. I'm working on a clustering algorithm close to KMeans. It seems that KMeans broadcast clusters centers at each step. For the moment I just use my centers as Array that I call directly in my map at each step. Could it be more efficient to use broadcast instead

Re: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Looking at the source codes of DStream.scala > /** >* Return a new DStream in which each RDD has a single element generated > by counting each RDD >* of this DStream. >*/ > def count(): DStream[Long] = { > this.map(_ => (null, 1L)) > .transform(_.union(context.sparkCon

Accessing to elements in JavaDStream

2014-08-20 Thread cuongpham92
Hi, I am a newbie to Spark Streaming, and I am quite confused about JavaDStream in Spark Streaming. In my situation, after catching a message "Hello world" from Kafka in a JavaDStream, I want to access the JavaDStream and change this message to "Hello John", but I could not figure out how to do it. Any ide

[Spark SQL] How to select first row in each GROUP BY group?

2014-08-20 Thread Fengyun RAO
I have a table with 4 columns: a, b, c, time What I need is something like: SELECT a, b, GroupFirst(c) FROM t GROUP BY a, b GroupFirst means "the first" item of column c in each group, and by "the first" I mean the one with the minimal "time" in that group. In Oracle/SQL Server, we could write: WITH summary AS (
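One possible workaround at the RDD level that avoids window functions entirely, assuming rows of (a, b, c, time); the record type and sample data below are made up:

    import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (needed pre-1.3)

    case class Rec(a: String, b: String, c: String, time: Long)
    val rows = sc.parallelize(Seq(Rec("x", "y", "first", 1L), Rec("x", "y", "second", 2L)))
    val firstPerGroup = rows
      .map(r => ((r.a, r.b), r))
      .reduceByKey((l, r) => if (l.time <= r.time) l else r)   // keep the row with the smallest time
      .values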

RDD Row Index

2014-08-20 Thread TJ Klein
Hi, I wonder if there is something like a (row) index of the elements in the RDD. Specifically, my RDD is generated from a series of files, where the value corresponds to the file contents. Ideally, I would like the keys to be an enumeration of the file number e.g. (0,),(1,). Any idea? T