Re: SQLCtx cacheTable

2014-08-04 Thread Gurvinder Singh
On 08/04/2014 10:57 PM, Michael Armbrust wrote: > If mesos is allocating a container that is exactly the same as the max > heap size then that is leaving no buffer space for non-heap JVM memory, > which seems wrong to me. > This can be a cause. I am now wondering how mesos picks up the size and set

Re: Visualizing stage & task dependency graph

2014-08-04 Thread Zongheng Yang
I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2] http://spark-summit.org

Re: Visualizing stage & task dependency graph

2014-08-04 Thread Rahul Kumar Singh
One way to do that is to use RDD.toDebugString to check the dependency graph and it also gives a good idea regarding stages. On Mon, Aug 4, 2014 at 8:55 PM, rpandya wrote: > Is there a way to visualize the task dependency graph of an application, > during or after its execution? The list of sta
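A minimal sketch of the suggestion above (the file path and the pipeline itself are placeholders): toDebugString prints the RDD lineage, with indentation marking the shuffle boundaries that become stage boundaries.

    val conf = new org.apache.spark.SparkConf().setAppName("lineage-demo")
    val sc = new org.apache.spark.SparkContext(conf)

    // Hypothetical word-count pipeline; any RDD works the same way.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the dependency chain of the RDD; each indentation level
    // corresponds to a shuffle (i.e. a stage boundary).
    println(counts.toDebugString)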

Re: about spark and using machine learning model

2014-08-04 Thread Xiangrui Meng
Some extra work is needed to close the loop. One related example is streaming linear regression added by Jeremy very recently: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala You can use a model trained offline to
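A rough sketch of "closing the loop" with a model trained offline and then applied to a stream. The host/port, paths and CSV parsing are placeholders, and it assumes a Spark 1.x MLlib regression model; the streaming-specific estimator Xiangrui mentions is a separate, newer API.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("offline-model-on-stream"))

    // 1) Train offline on historical data (path and parsing are placeholders).
    val training = sc.textFile("hdfs:///data/training.csv").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }
    val model = LinearRegressionWithSGD.train(training, 100)

    // 2) Apply the trained model to records arriving on a stream.
    val ssc = new StreamingContext(sc, Seconds(10))
    val predictions = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .map(features => model.predict(features))
    predictions.print()

    ssc.start()
    ssc.awaitTermination()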

about spark and using machine learning model

2014-08-04 Thread Hoai-Thu Vuong
Hello everybody! I'm getting started with Spark and MLlib. I have successfully built a small cluster and followed the tutorial. However, I would like to ask how to use the model that is trained by MLlib. I understand that with data we can train a model such as a classifier model, then u

Visualizing stage & task dependency graph

2014-08-04 Thread rpandya
Is there a way to visualize the task dependency graph of an application, during or after its execution? The list of stages on port 4040 is useful, but still quite limited. For example, I've found that if I don't cache() the result of one expensive computation, it will get repeated 4 times, but it i

Re: Can't see anything on the storage panel of application UI

2014-08-04 Thread binbinbin915
Actually, if you don't use a method like persist or cache, it does not even store the RDD to disk. Every time you use this RDD, it is just recomputed from the original one. In logistic regression from MLlib, they don't persist the changed input, so I can't see the RDD in the web GUI. I have cha

Re: Create a new object by given classtag

2014-08-04 Thread Matei Zaharia
To get the ClassTag object inside your function with the original syntax you used (T: ClassTag), you can do this: def read[T: ClassTag](): T = {   val ct = classTag[T]   ct.runtimeClass.newInstance().asInstanceOf[T] } Passing the ClassTag with : ClassTag lets you have an implicit parameter that
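Putting the two variants from this thread side by side, a small self-contained sketch (it only works for classes with a public no-argument constructor, as noted above):

    import scala.reflect.{classTag, ClassTag}

    // Context-bound form: `T: ClassTag` brings an implicit ClassTag[T] into scope.
    def read[T: ClassTag](): T = {
      val ct = classTag[T]
      ct.runtimeClass.newInstance().asInstanceOf[T]
    }

    // Equivalent form with the implicit parameter written out explicitly.
    def newFoo[T]()(implicit ct: ClassTag[T]): T =
      ct.runtimeClass.newInstance().asInstanceOf[T]

    val a = read[java.util.ArrayList[String]]()   // ok: public no-arg constructor
    val b = newFoo[StringBuilder]()               // ok as well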

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread Tathagata Das
Are you able to run it locally? If not, can you try creating an all-inclusive jar with all transitive dependencies together (sbt assembly) and then try running the app? Then this will be a self contained environment, which will help us debug better. TD On Mon, Aug 4, 2014 at 5:06 PM, durin wro

Re: Unit Test for Spark Streaming

2014-08-04 Thread JiajiaJing
This helps a lot!! Thank you very much! Jiajia -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Test-for-Spark-Streaming-tp11394p11396.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Unit Test for Spark Streaming

2014-08-04 Thread Tathagata Das
Appropriately timed question! Here is the PR that adds a real unit test for Kafka stream in Spark Streaming. Maybe this will help! https://github.com/apache/spark/pull/1751/files On Mon, Aug 4, 2014 at 6:30 PM, JiajiaJing wrote: > Hello Spark Users, > > I have a spark streaming program that stre

Unit Test for Spark Streaming

2014-08-04 Thread JiajiaJing
Hello Spark Users, I have a Spark Streaming program that streams data from Kafka topics and outputs it as Parquet files on HDFS. Now I want to write a unit test for this program to make sure the output data is correct (i.e. not missing any data from Kafka). However, I have no idea about how to do this,

Re: Substring in Spark SQL

2014-08-04 Thread Michael Armbrust
Yeah, there will likely be a community preview build soon for the 1.1 release. Benchmarking that will both give you better performance and help QA the release. Bonus points if you turn on codegen for Spark SQL (experimental feature) when benchmarking and report bugs: "SET spark.sql.codegen=true"

RE: Substring in Spark SQL

2014-08-04 Thread Cheng, Hao
From the log, I noticed the "substr" was added on July 15th; the 1.0.1 release should be earlier than that. The community is now working on releasing 1.1.0, and some performance improvements were also added. Probably you can try that for your benchmark. Cheng Hao -Original Message--

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
In the WebUI "Environment" tab, the section "Classpath Entries" lists the following ones as part of System Classpath: /foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop /foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar /foo/spark-master-2014-07-28/co

Re: Low Level Kafka Consumer for Spark

2014-08-04 Thread Jonathan Hodges
Hi Yan, That is a good suggestion. I believe non-Zookeeper offset management will be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled for September. https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management That should make this fairly easy to imple

Re: Spark Streaming : Could not compute split, block not found

2014-08-04 Thread Tathagata Das
Aaah sorry, I should have been more clear. Can you give me INFO (DEBUG even better) level logs since the start of the program? I need to see how the cleaning up code is managing to delete the block. TD On Fri, Aug 1, 2014 at 10:26 PM, Kanwaldeep wrote: > Here is the log file. > streaming.gz >

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread Tathagata Das
3.0.3 is being used: https://github.com/apache/spark/blob/master/external/twitter/pom.xml Are you sure you are deploying twitter4j 3.0.3, and there is no other version of twitter4j on the path? TD On Mon, Aug 4, 2014 at 4:48 PM, durin wrote: > Using 3.0.3 (downloaded from http://mvnrepositor

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j ) changes the error to Exception in thread "Thread-55" java.lang.NoClassDefFoundError: twitter4j/StatusListener at org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)

java.lang.IllegalArgumentException: Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer"

2014-08-04 Thread Sameer Tilak
Hi All, I am trying to move away from spark-shell to spark-submit and have been making some code changes. However, I am now having problem with serialization. It used to work fine before the code update. Not sure what I did wrong. However, here is the code JaccardScore.scala packa

Re: Create a new object by given classtag

2014-08-04 Thread Marcelo Vanzin
Hello, Try something like this: scala> def newFoo[T]()(implicit ct: ClassTag[T]): T = ct.runtimeClass.newInstance().asInstanceOf[T] newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T scala> newFoo[String]() res2: String = "" scala> newFoo[java.util.ArrayList[String]]() res5: java.util.Array

Re: Configuration setup and Connection refused

2014-08-04 Thread alamin.ishak
Hi all, Any help would be much appreciated. Thanks, Al On Mon, Aug 4, 2014 at 7:09 PM, Al Amin wrote: > Hi all, > > I have setup 2 nodes (master and slave1) on stand alone mode. Tried > running SparkPi example and its working fine. However when I move on to > wordcount its giving me below er

Re: spark streaming kafka

2014-08-04 Thread Tathagata Das
1. Does your cluster have access to the machines that run kafka? 2. Is there any error in logs? If so can you please post them? TD On Mon, Aug 4, 2014 at 1:12 PM, salemi wrote: > Hi, > > I have the following driver and it works when I run it in the local[*] mode > but if I execute it in a stand

Re: Low Level Kafka Consumer for Spark

2014-08-04 Thread Yan Fang
Another suggestion that may help is that, you can consider use Kafka to store the latest offset instead of Zookeeper. There are at least two benefits: 1) lower the workload of ZK 2) support replay from certain offset. This is how Samza deals with the Kafka offse

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread Tathagata Das
Spark Streaming's TwitterUtils depends on twitter4j 3.0.3 (if I remember correctly). That may be the issue here. TD On Mon, Aug 4, 2014 at 11:08 AM, durin wrote: > I am using the latest Spark master and additionally, I am loading these jars: > - spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar

Running Examples

2014-08-04 Thread cetaylor
Hello, I downloaded and built Spark 1.0.1 using sbt/sbt assembly. Once built I attempted to go through a couple of examples. I could run Spark interactively through the Scala shell, and the example sc.parallelize(1 to 1000).count() returned correctly with 1000. Then I attempted to run the example us

Re: Create a new object by given classtag

2014-08-04 Thread Peng Wei
Hi Matei, Thanks for your reply. Do you know how to write the code to implement what you mentioned? I tried a lot, for example, "val obj = ClassTag[T].runtimeClass.newInstance", but none works. Thanks very much. On Mon, Aug 4, 2014 at 2:42 PM, Matei Zaharia wrote: > This is not something inhe

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Holden Karau
Hi Steve, The _ notation can be a bit confusing when starting with Scala, so we can rewrite it to avoid using it here. Instead of val numUsers = ratings.map(_._2.user) we can write val numUsers = ratings.map(x => x._2.user). ratings is a key-value RDD (which is an RDD comprised of tuples) and so
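For illustration, the two equivalent spellings next to each other, assuming ratings: RDD[(Long, Rating)] as in the MovieLensALS example (the key is derived from the timestamp):

    import org.apache.spark.mllib.recommendation.Rating

    // Underscore shorthand ...
    val numUsers  = ratings.map(_._2.user).distinct().count()
    // ... and the same thing with an explicit parameter name.
    val numUsers2 = ratings.map(x => x._2.user).distinct().count()

    // Rating is a simple case class, so `.user` and `.product` are just field accesses.
    val numMovies = ratings.map { case (_, rating) => rating.product }.distinct().count()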

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Sean Owen
ratings is an RDD of Rating objects. You can see them created as the second element of the tuple. It's a simple case class: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L66 This is just accessing the user and product field of the

Re: Streaming + SQL : How to register a DStream content as a table and access it

2014-08-04 Thread Tathagata Das
There are other threads on the mailing list that have the solution. For example, http://apache-spark-user-list.1001560.n3.nabble.com/SQL-streaming-td9668.html It's recommended that you search through them before posting :) On Mon, Aug 4, 2014 at 2:45 PM, salemi wrote: > Hi, > > I was wondering if
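For convenience, a minimal sketch of the pattern those threads describe: register each batch's RDD as a temporary table inside foreachRDD. It assumes an existing SparkContext sc and a Spark 1.1-style API (registerTempTable; earlier releases used registerAsTable); the Event case class, host and port are placeholders.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Event(id: String, value: Double)   // hypothetical record type

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD             // implicit RDD[Event] -> SchemaRDD

    val ssc = new StreamingContext(sc, Seconds(10))
    val events = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(id, value) = line.split(',')
      Event(id, value.toDouble)
    }

    // Each batch interval, expose the batch's contents to Spark SQL.
    events.foreachRDD { rdd =>
      rdd.registerTempTable("events")
      sqlContext.sql("SELECT id, SUM(value) FROM events GROUP BY id")
        .collect()
        .foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()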

Re: Memory & compute-intensive tasks

2014-08-04 Thread rpandya
This one turned out to be another problem with my app configuration, not with Spark. The compute task was dependent on the local filesystem, and config errors on 8 of 10 of the nodes made them fail early. The Spark wrapper was not checking the process exit value, so it appeared as if they were prod

MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Steve Nunez
Can one of the Scala experts please explain this bit of pattern magic from the Spark ML tutorial: _._2.user ? As near as I can tell, this is applying the _2 function to the wildcard, and then applying the 'user' function to that. In a similar way the 'product' function is applied in the next line,

Substring in Spark SQL

2014-08-04 Thread Tom
Hi, I am trying to run the Big Data Benchmark , and I am stuck at Query 2 for Spark SQL using Spark 1.0.1: SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) When I look into the sourcecode, it seems that "sub

Streaming + SQL : How to register a DStream content as a table and access it

2014-08-04 Thread salemi
Hi, I was wondering if you can give me an example of how to register a DStream content as a table and access it. Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-SQL-How-to-resgister-a-DStream-content-as-a-table-and-access-it-tp11372.

Re: Create a new object by given classtag

2014-08-04 Thread Matei Zaharia
This is not something inherently supported by ClassTags. The best you can do is get the Class object for it (which is part of the ClassTag API) and create an instance through reflection. This will work as long as it has a public constructor with no parameters. Matei On August 4, 2014 at 2:00:2

Re: Writing to RabbitMQ

2014-08-04 Thread Tathagata Das
Can you show us the code that you are using to write to RabbitMQ. I fear that this is a relatively common problem where you are using something like this. dstream.foreachRDD(rdd => { // create connection / channel to source rdd.foreach(element => // write using channel }) This is not the
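A sketch of the shape TD recommends instead: open the connection on the worker side, once per partition, rather than once on the driver. createChannel and send are hypothetical stand-ins for the actual RabbitMQ client calls.

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { elements =>
        // Runs on the executor, so nothing here needs to be serialized from the driver.
        val channel = createChannel()            // hypothetical: open the connection/channel
        try {
          elements.foreach(element => send(channel, element))   // hypothetical write
        } finally {
          channel.close()
        }
      }
    }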

Spark SQL JDBC

2014-08-04 Thread John Omernik
I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with the JDBC thrift server. I have everything compiled correctly, I can access data in spark-shell on yarn from my hive installation. Cached tables, etc all work. When I execute ./sbin/start-thriftserver.sh I get the error bel

Create a new object by given classtag

2014-08-04 Thread Parthus
Hi there, I was wondering if somebody could tell me how to create an object with a given ClassTag so as to make the function below work. The only thing to do is to write one line that creates an object of class T. I tried new T but it does not work. Would it be possible to give me one Scala line to f

Re: SQLCtx cacheTable

2014-08-04 Thread Michael Armbrust
If mesos is allocating a container that is exactly the same as the max heap size then that is leaving no buffer space for non-heap JVM memory, which seems wrong to me. The problem here is that cacheTable is more aggressive about grabbing large ByteBuffers during caching (which it later releases wh

Re: [SparkStreaming]StackOverflow on restart

2014-08-04 Thread Tathagata Das
What is your checkpoint interval of the updateStateByKey's DStream? Did you modify it? Also, do you have a simple program and the step-by-step process by which I can reproduce the issue? If not, can you give me the full DEBUG level logs of the program before and after restart? TD On Mon, Aug 4, 2

Re: Spark Training Course?

2014-08-04 Thread Matei Zaharia
This looks pretty comprehensive to me. A few quick suggestions: - On the VM part: we've actually been avoiding this in all the Databricks training efforts because the VM itself can be annoying to install and it makes it harder for people to really use Spark for development (they can learn it, b

spark streaming kafka

2014-08-04 Thread salemi
Hi, I have the following driver and it works when I run it in the local[*] mode, but if I execute it in a standalone cluster then I don't get any data from Kafka. Does anybody know why that might be? val sparkConf = new SparkConf().setAppName("KafkaMessageReceiver") val sc = new

Re: Can't see anything on the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
Good idea Andrew... Using this feature allowed me to debug that my app wasn't caching properly-- the UI is working as designed for me in 1.0. It might be a good idea to say "no cached blocks" instead of an empty page... just a thought... On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or-2 [via Apache Spa

Re: Can't see anything on the storage panel of application UI

2014-08-04 Thread Andrew Or
Hi all, Could you check with `sc.getExecutorStorageStatus` to see if the blocks are in fact present? This returns a list of StorageStatus objects, and you can check whether each status' `blocks` is non-empty. If the blocks do exist, then this is likely a bug in the UI. There have been a couple of
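A quick way to run Andrew's check from the shell, as a sketch assuming a live SparkContext sc:

    // One StorageStatus per executor; `blocks` maps BlockId -> BlockStatus.
    sc.getExecutorStorageStatus.foreach { status =>
      println(s"${status.blockManagerId}: ${status.blocks.size} cached blocks")
    }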

Re: Can't see anything on the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
I am (not) seeing this also... No items in the storage UI page. using 1.0 with HDFS... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-see-any-thing-one-the-storage-panel-of-application-UI-tp10296p11361.html Sent from the Apache Spark User List mailing

Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin wrote: > I wonder how spark parameters, e.g., degree of parallelism, affect Pregel > performance? Specifically, sendmessage, mergemessage, and vertexprogram? > > I have tried label propagation on a 300,000 edges graph, and I found that no > parallelism is much f

RE: creating a distributed index

2014-08-04 Thread Daniel, Ronald (ELS-SDG)
At the Spark Summit, the Typesafe people had a toy implementation of a full-text index that you could use as a starting point. The bare code is available in github at https://github.com/deanwampler/spark-workshop/blob/eb077a734aad166235de85494def8fe3d4d2ca66/src/main/scala/spark/InvertedIndex5b.

Re: Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread Sean Owen
(- incubator list, + user list) (Answer copied from original posting at http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit/m-p/16396#U16396 -- let's follow up one place. If it's not specific to CDH, this is a good place t

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Nicholas Chammas
Patrick, that was the problem. Individual partitions were too big to fit in memory. I also believe the Snappy compression codec in Hadoop is not splittable. > This means that each of your JSON files is read in its entirety as one > spark partition. If you have files that are larger than the standa
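One workaround sketch when each non-splittable input file becomes a single huge partition: re-split the data after reading and before caching. The path, partition count, and storage level are placeholders.

    import org.apache.spark.storage.StorageLevel

    val raw = sc.textFile("s3n://bucket/path/*.json.snappy")   // one partition per file
    val resplit = raw.repartition(400)                          // spread records over smaller partitions
    resplit.persist(StorageLevel.MEMORY_AND_DISK)
    resplit.count()                                             // materialize the cache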

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron Gonzalez
One key thing I forgot to mention is that I changed the avro version to 1.7.7 to get AVRO-1476. I took a closer look at the jars, and what I noticed is that the assembly jars that work do not have the org.apache.avro.mapreduce package packaged into the assembly. For spark-1.0.1, org.apache.avro

Configuration setup and Connection refused

2014-08-04 Thread alamin.ishak
Hi all, I have setup 2 nodes (master and slave1) on stand alone mode. Tried running SparkPi example and its working fine. However when I move on to wordcount its giving me below error: 14/08/04 21:40:33 INFO storage.MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=311387750 14/08

Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
I am using the latest Spark master and additionally, I am loading these jars: - spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar - twitter4j-core-4.0.2.jar - twitter4j-stream-4.0.2.jar My simple test program that I execute in the shell looks as follows: import org.apache.spark.streaming._ impo

Re: creating a distributed index

2014-08-04 Thread Philip Ogren
After playing around with mapPartition I think this does exactly what I want. I can pass in a function to mapPartition that looks like this: def f1(iter: Iterator[String]): Iterator[MyIndex] = { val idx: MyIndex = new MyIndex() while (iter.hasNext) {
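Philip's function written out as a sketch, where MyIndex and its add method are placeholder names for whatever per-partition index structure is being built:

    def f1(iter: Iterator[String]): Iterator[MyIndex] = {
      val idx: MyIndex = new MyIndex()
      while (iter.hasNext) {
        idx.add(iter.next())      // hypothetical: add one record to the index
      }
      Iterator(idx)               // yield exactly one index object per partition
    }

    // One MyIndex per partition; lines is assumed to be an RDD[String].
    val indexes = lines.mapPartitions(f1)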

Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool thanks! On Monday, August 4, 2014 8:58 AM, kriskalish wrote: Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply "true". va

Re: Spark job tracker.

2014-08-04 Thread abhiguruvayya
I am trying to create a asynchronous thread using Java executor service and launching the javaSparkContext in this thread. But it is failing with exit code 0(zero). I basically want to submit spark job in one thread and continue doing something else after submitting. Any help on this? Thanks. --

Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread buntu
I got a 40 node cdh 5.1 cluster and attempting to run a simple spark app that processes about 10-15GB raw data but I keep running into this error: java.lang.OutOfMemoryError: GC overhead limit exceeded Each node has 8 cores and 2GB memory. I notice the heap size on the executors is set to 512

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Hmm. Fair enough. I hadn't given that answer much thought and on reflection think you're right in that a profile would just be a bad hack. On 8/4/14, 10:35, "Sean Owen" wrote: >What would such a profile do though? In general building for a >specific vendor version means setting hadoop.version

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
What would such a profile do though? In general building for a specific vendor version means setting hadoop.version and/or yarn.version. Any hard-coded value is unlikely to match what a particular user needs. Setting protobuf versions and so on is already done by the generic profiles. In a similar

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
The profile does set it automatically: https://github.com/apache/spark/blob/master/pom.xml#L1086 yarn.version should default to hadoop.version It shouldn't hurt, and should work, to set to any other specific version. If one HDP version works and another doesn't, are you sure the repo has the desir

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
I don’t think there is an hwx profile, but there probably should be. - Steve From: Patrick Wendell Date: Monday, August 4, 2014 at 10:08 To: Ron's Yahoo! Cc: Ron's Yahoo! , Steve Nunez , , "d...@spark.apache.org" Subject: Re: Issues with HDP 2.4.0.2.1.3.0-563 Ah I see, yeah you might nee

Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread lmk
Thanks Patrick. But why am I getting a Bad Digest error when I am saving large amount of data to s3? /Loss was due to org.apache.hadoop.fs.s3.S3Exception org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/spark_test%2Fsmaato_one_day_phase_2%2Fsmaato_

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
Ah I see, yeah you might need to set hadoop.version and yarn.version. I thought the profile set this automatically. On Mon, Aug 4, 2014 at 10:02 AM, Ron's Yahoo! wrote: > I meant yarn and hadoop defaulted to 1.0.4 so the yarn build fails since > 1.0.4 doesn't exist for yarn... > > Thanks, > Ron

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
I meant yarn and hadoop defaulted to 1.0.4 so the yarn build fails since 1.0.4 doesn’t exist for yarn... Thanks, Ron On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! wrote: > That failed since it defaulted the versions for yarn and hadoop > I’ll give it a try with just 2.4.0 for both yarn and hadoop

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
That failed since it defaulted the versions for yarn and hadoop I’ll give it a try with just 2.4.0 for both yarn and hadoop… Thanks, Ron On Aug 4, 2014, at 9:44 AM, Patrick Wendell wrote: > Can you try building without any of the special `hadoop.version` flags and > just building only with -P

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
Can you try building without any of the special `hadoop.version` flags and just building only with -Phadoop-2.4? In the past users have reported issues trying to build random spot versions... I think HW is supposed to be compatible with the normal 2.4.0 build. On Mon, Aug 4, 2014 at 8:35 AM, Ron'

Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread Patrick Wendell
You are hitting this issue: https://issues.apache.org/jira/browse/SPARK-2075 On Mon, Jul 28, 2014 at 5:40 AM, lmk wrote: > Hi > I was using saveAsTextFile earlier. It was working fine. When we migrated > to > spark-1.0, I started getting the following error: > java.lang.ClassNotFoundException:

Re: Spark Training Course?

2014-08-04 Thread Victor Tso-Guillen
I made a few small comments. Still a relative newbie, but hope it helps! On Mon, Aug 4, 2014 at 9:08 AM, Jörn Franke wrote: > Hi Chris, > > I am currently working out a university course on Bi, Nosql (key/value, > columnar, graph,document,search), big data (lambda architecture, hadoop, > spark)

Spark SQL cache table using different storage level

2014-08-04 Thread chutium
it seems sqlContext.cacheTable() will always cache the target table using MEMORY_ONLY ? MEMORY_ONLY without serialization will blow up the size of RDD, we want to use MEM_SER or MEM_DISK to cache tables in Spark SQL, is this possible? Thanks -- View this message in context: http://apache-spa

Re: Using countApproxDistinct in pyspark

2014-08-04 Thread Diederik
Dear Davies, Thanks so much for your instructions! It worked like a charm. Best, Diederik On Wed, Jul 30, 2014 at 1:27 AM, Davies Liu-2 [via Apache Spark User List] < ml-node+s1001560n10917...@n3.nabble.com> wrote: > Hey Diederik, > > The data in rdd._jrdd.rdd() is serialized by pickle in batch

Re: Spark Training Course?

2014-08-04 Thread Jörn Franke
Hi Chris, I am currently putting together a university course on BI, NoSQL (key/value, columnar, graph, document, search), and big data (lambda architecture, Hadoop, Spark). Your work looks quite ambitious. You could also elaborate on how you integrate different data sources with the Spark cluster (kafka,

Re: Computing mean and standard deviation by key

2014-08-04 Thread kriskalish
Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply "true". val grouped = rdd.groupByKey().mapValues { mcs => val values = mcs
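Extending that snippet to the mean and standard deviation asked about in the thread title, a sketch assuming rdd: RDD[(String, Double)] (for very large groups a combineByKey-based version would avoid materializing each group in memory):

    val stats = rdd.groupByKey().mapValues { values =>
      val n = values.count(_ => true)
      val mean = values.sum / n
      val variance = values.map(v => (v - mean) * (v - mean)).sum / n
      (mean, math.sqrt(variance))            // (mean, population standard deviation)
    }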

[SparkStreaming]StackOverflow on restart

2014-08-04 Thread Yana Kadiyska
Hi Spark users, I'm trying to get a pretty simple streaming program going. My context is created via StreamingContext.getOrCreate(checkpointDir,createFn) creating a context works fine but when trying to start from a checkpoint I get a stack overflow. Any pointers what could be going wrong? My ba

Re: creating a distributed index

2014-08-04 Thread Philip Ogren
This looks like a really cool feature and it seems likely that this will be extremely useful for things we are doing. However, I'm not sure it is quite what I need here. With an inverted index you don't actually look items up by their keys but instead try to match against some input string.

RE: Data from Mysql using JdbcRDD

2014-08-04 Thread chutium
have a look on this commit: https://github.com/apache/spark/pull/1612/files#diff-0 try this: -stmt.setLong(1, part.lower) -stmt.setLong(2, part.upper) +val parameterCount = stmt.getParameterMetaData.getParameterCount +if (parameterCount > 0) stmt.setLong(1, part.lower) +i

Re: Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?

2014-08-04 Thread Martin Goodson
Just an update on this - I have benchmarked on a cluster built with spark-ec2 and again found that reading from hdfs is not much faster than from s3 (about 20%). Does anyone know how to check that data locality is being used by spark on my cluster? Is it surprising that access to HDFS on local di

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
Thanks, I ensured that $SPARK_HOME/pom.xml had the HDP repository under the repositories element. I also confirmed that if the build couldn’t find the version, it would fail fast so it seems as if it’s able to get the versions it needs to build the distribution. I ran the following (generated fr

Re: Spark on HDFS with replication

2014-08-04 Thread Deep Pradhan
I understand that RDDs don't need replication but I just wanted to know about the relation between the storage of RDDs and the HDFS On Mon, Aug 4, 2014 at 3:32 PM, Stanley Shi wrote: > RDD don't *need* replication; but it doesn't harm if the underlying > things has replication. > > > On Mon, Au

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Provided you've got the HWX repo in your pom.xml, you can build with this line: mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0.2.1.1.0-385 -DskipTests clean package I haven't tried building a distro, but it should be similar. - SteveN On 8/4/14, 1:25, "Sean Owen" wrote: >For a

[GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Bin
Hi all, I wonder how Spark parameters, e.g., degree of parallelism, affect Pregel performance? Specifically, sendmessage, mergemessage, and vertexprogram? I have tried label propagation on a 300,000 edges graph, and I found that no parallelism is much faster than a parallelism of 5 or 500. Looki

Spark Training Course?

2014-08-04 Thread Chris London
Hey Everyone, I'm thinking of creating an instructional video training course for Spark. I don't know if I actually plan on publishing it or not, my goal is by creating this course I will become intimately familiar with Spark. I was wondering if you had a second you could look over my outline and

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread Fengyun RAO
object LogParserWrapper { private val logParser = { val settings = new ... val builders = new new LogParser(builders, settings) } def getParser = logParser } object MySparkJob { def main(args: Array[String]) { val sc = new SparkContext()
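The same wrapper pattern laid out as a sketch; LogParser, ParserSettings and ParserBuilders are placeholders for the non-serializable third-party classes, and the paths are hypothetical. The parser is created once per JVM (i.e. once per executor) and is never shipped from the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogParserWrapper {
      private val logParser = {
        val settings = new ParserSettings()
        val builders = new ParserBuilders()
        new LogParser(builders, settings)
      }
      def getParser = logParser
    }

    object MySparkJob {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("log-parsing"))
        val lines = sc.textFile("hdfs:///logs/input")
        val parsed = lines.mapPartitions { iter =>
          val parser = LogParserWrapper.getParser   // resolved on the executor, not serialized
          iter.map(line => parser.parse(line))
        }
        parsed.saveAsTextFile("hdfs:///logs/output")
      }
    }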

Can we throttle the individual queries using SPARK

2014-08-04 Thread Mahesh Govind
Hi, Can we throttle individual queries in Spark, so that one query will not hog the system resources? Regards Mahesh

Re: access hdfs file name in map()

2014-08-04 Thread Roberto Torella
Thanks, Simon! It helped a lot :D Ciao, r- Xu (Simon) Chen wrote > Hi Roberto, > > Ultimately, the info you need is set here: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69 > > Being a spark newbie, I extended org.apache.spark.rdd.Ha

Re: Spark on HDFS with replication

2014-08-04 Thread Stanley Shi
RDDs don't *need* replication, but it doesn't hurt if the underlying storage has replication. On Mon, Aug 4, 2014 at 5:51 PM, Deep Pradhan wrote: > Hi, > Spark can run on top of HDFS. > While Spark talks about the RDDs which do not need replication because the > partitions can be built with the h

Spark on HDFS with replication

2014-08-04 Thread Deep Pradhan
Hi, Spark can run on top of HDFS. While Spark talks about RDDs, which do not need replication because the partitions can be rebuilt with the help of lineage, HDFS inherently has replication. How do these two concepts go together? Thank You

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread DB Tsai
You can try to define a wrapper class for your parser, and create an instance of your parser in companion object as a singleton object. Thus, even you create an object of wrapper in mapPartition every time, each JVM will have only a single instance of your parser object. Sincerely, DB Tsai --

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread Fengyun RAO
Thanks, Sean! It works, but as the link in 2 - Why Is My Spark Job so Slow and Only Using a Single Thread? says " parser instance is now a singleton created in the scope of our driver program" w

Re: Compiling Spark master (6ba6c3eb) with sbt/sbt assembly

2014-08-04 Thread Larry Xiao
Sorry, I mean I tried the command ./sbt/sbt clean and now it works. Is it because of cached components not being recompiled? On 8/4/14, 4:44 PM, Larry Xiao wrote: I guessed ./sbt/sbt clean and it works fine now. On 8/4/14, 11:48 AM, Larry Xiao wrote: On the latest pull today (6ba6c3ebfe9a473

Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread lmk
Anyone has any thoughts on this? Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p11313.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Compiling Spark master (6ba6c3eb) with sbt/sbt assembly

2014-08-04 Thread Larry Xiao
I guessed ./sbt/sbt clean and it works fine now. On 8/4/14, 11:48 AM, Larry Xiao wrote: On the latest pull today (6ba6c3ebfe9a47351a50e45271e241140b09bf10) meet assembly problem. $ ./sbt/sbt assembly Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME. Note, this will be overridden by -j

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread Sean Owen
The parser does not need to be serializable. In the line: lines.map(line => JSONParser.parse(line)) ... the parser is called, but there is no parser object with state that needs to be serialized. Are you sure it does not work? The error message alluded to originally refers to an object not shown

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-04 Thread Sean Owen
I'm guessing you have the Jackson classes in your assembly but so does Spark. Its classloader wins, and does not contain the class present in your app's version of Jackson. Try spark.files.userClassPathFirst ? On Mon, Aug 4, 2014 at 6:28 AM, Ryan Braley wrote: > Hi Folks, > > I have an assembly
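If it helps, one way to try that setting, as a sketch assuming it is set programmatically (the option was experimental around 1.0/1.1, so the exact name and behavior may vary by version):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")
      .set("spark.files.userClassPathFirst", "true")   // prefer classes from the user jar on executors
    val sc = new org.apache.spark.SparkContext(conf)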

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
For any Hadoop 2.4 distro, yes, set hadoop.version but also set -Phadoop-2.4. http://spark.apache.org/docs/latest/building-with-maven.html On Mon, Aug 4, 2014 at 9:15 AM, Patrick Wendell wrote: > For hortonworks, I believe it should work to just link against the > corresponding upstream version.

Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju wrote: > I am running spark 1.0.0, Tachyon 0.5 and Ha

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For hortonworks, I believe it should work to just link against the corresponding upstream version. I.e. just set the Hadoop version to "2.4.0" Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! wrote: > Hi, > Not sure whose issue this is, but if I run make-distribution

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell wrote: > It seems possible that you are running out of memory

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
It seems possible that you are running out of memory unrolling a single partition of the RDD. This is something that can cause your executor to OOM, especially if the cache is close to being full so the executor doesn't have much free memory left. How large are your executors? At the time of failur

Re: RDD to DStream

2014-08-04 Thread Aniket Bhatnagar
The use case for converting an RDD into a DStream is that I want to simulate a stream from already persisted data for testing analytics. It is trivial to create an RDD from any persisted data, but not so much a DStream. Hence my idea to create a DStream from an RDD. For example, let's say you are tryi
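One sketch of simulating a stream from persisted data in a test, using queueStream: split an existing RDD into chunks and feed one chunk per batch. This is only an approximation of the idea discussed here (randomSplit does not preserve record order), and the batch count is arbitrary.

    import scala.collection.mutable.Queue
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext

    def simulateStream[T: ClassTag](ssc: StreamingContext, data: RDD[T], numBatches: Int) = {
      val queue = new Queue[RDD[T]]()
      // Split the persisted data into roughly equal chunks, one per batch interval.
      data.randomSplit(Array.fill(numBatches)(1.0)).foreach(chunk => queue += chunk)
      ssc.queueStream(queue)   // emits one queued RDD per batch by default
    }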

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread Fengyun RAO
Thanks, Ron. The problem is that the "parser" is written in another package which is not serializable. In mapreduce, I could create the "parser" in the map setup() method. Now in spark, I want to create it for each worker, and share it among all the tasks on the same work node. I know different

Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
Hi, Not sure whose issue this is, but if I run make-distribution using HDP 2.4.0.2.1.3.0-563 as the hadoop version (replacing it in make-distribution.sh), I get a strange error with the exception below. If I use a slightly older version of HDP (2.4.0.2.1.2.0-402) with make-distribution, using
