How to enforce RDD to be cached?

2014-12-03 Thread shahab
Hi, I noticed that rdd.cache() does not happen immediately; rather, due to Spark's lazy evaluation, it happens only at the moment you perform some map/reduce actions. Is this true? If this is the case, how can I enforce Spark to cache immediately at its cache() statement? I need this to perfor

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
And one more thing, the given tuples (1, 1.0) (2, 1.0) (3, 2.0) (4, 2.0) (5, 0.0) are part of an RDD and not just tuples. graph.vertices returns me the above tuples, which are part of a VertexRDD. On Wed, Dec 3, 2014 at 3:43 PM, Deep Pradhan wrote: > This is just an example but if my graph

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread Akhil Das
Set the spark.storage.memoryFraction flag to 1 while creating the sparkContext to utilize up to 73Gb of your memory; the default is 0.6 and hence you are getting 33.6Gb. Also set rdd.compression and StorageLevel as MEMORY_ONLY_SER if your data is larger than your available memory. (you could try MEM
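
A minimal sketch of the two settings mentioned above, assuming a Spark 1.x SparkConf; the fraction value and input path are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Give the block manager a larger share of the heap (default is 0.6).
    val conf = new SparkConf()
      .setAppName("CacheTuning")
      .set("spark.storage.memoryFraction", "0.8")
      .set("spark.rdd.compress", "true")   // compress serialized cached partitions

    val sc = new SparkContext(conf)

    // Store the RDD serialized, which is much smaller than deserialized Java objects.
    val data = sc.textFile("hdfs:///some/input")
    data.persist(StorageLevel.MEMORY_ONLY_SER)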

Re: WordCount fails in .textFile() method

2014-12-03 Thread Akhil Das
Try running it in local mode. Looks like a jar conflict/missing. SparkConf conf = new SparkConf().setAppName("JavaWordCount"); conf.set("spark.io.compression.codec","org.apache.spark.io.LZ4CompressionCodec"); conf.setMaster("local[2]").setSparkHome(System.getenv("SPARK_HOME")); JavaSparkContext

Re: Spark with HBase

2014-12-03 Thread Akhil Das
You could go through these to start with http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase http://stackoverflow.com/questions/25189527/how-to-process-a-range-of-hbase-rows-using-spark Thanks Best Regards On Wed, Dec 3, 2014 at 11

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Dibyendu Bhattacharya
Hi, Yes, as Jerry mentioned, the Spark -3129 ( https://issues.apache.org/jira/browse/SPARK-3129) enabled the WAL feature which solves the Driver failure problem. The way 3129 is designed , it solved the driver failure problem agnostic of the source of the stream ( like Kafka or Flume etc) But with

How does 2.3.4-spark differ from typesafe 2.3.4 akka?

2014-12-03 Thread dresnick
Using sbt-assembly I'm creating a fat jar that includes spark and akka. I've encountered this error: [error] /home/dev/.ivy2/cache/com.typesafe.akka/akka-actor_2.10/jars/akka-actor_2.10-2.3.4.jar:akka/util/ByteIterator$$anonfun$getLongPart$1.class [error] /home/dev/.ivy2/cache/org.spark-project.ak

Re: pySpark saveAsSequenceFile append overwrite

2014-12-03 Thread Akhil Das
You can't append to a file with Spark using the native saveAs* calls; it always checks whether the directory already exists and, if so, it throws an error. People usually use hadoop's getmerge utility to combine the output. Thanks Best Regards On Tue, Dec 2, 2014 at 8:10 PM, Csaba Ragany wrote
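
For the overwrite part of the question, a common workaround is to delete the existing output directory before saving; a Scala sketch for illustration, assuming rdd is a pair RDD of types supported by saveAsSequenceFile and the path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = new Path("/data/output")
    val fs = FileSystem.get(new Configuration())

    // saveAs* refuses to write into an existing directory, so for overwrite
    // semantics remove the old output first, then save.
    if (fs.exists(out)) fs.delete(out, true)
    rdd.saveAsSequenceFile(out.toString)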

getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Hafiz Mujadid
Hi Experts! Is there a way to read the first N messages from a Kafka stream, put them in some collection, return them to the caller for visualization, and then close Spark Streaming? I will be glad to hear from you and will be thankful to you. Currently I have the following code that def getsample(p

Spark SQL UDF returning a list?

2014-12-03 Thread Jerry Raj
Hi, Can a UDF return a list of values that can be used in a WHERE clause? Something like: sqlCtx.registerFunction("myudf", { Array(1, 2, 3) }) val sql = "select doc_id, doc_value from doc_table where doc_id in myudf()" This does not work: Exception in thread "main" java.l

Re: Spark with HBase

2014-12-03 Thread Ted Yu
Which hbase release are you running ? If it is 0.98, take a look at: https://issues.apache.org/jira/browse/SPARK-1297 Thanks On Dec 2, 2014, at 10:21 PM, Jai wrote: > I am trying to use Apache Spark with a pseudo-distributed Hadoop HBase > Cluster and I am looking for some links regarding the

Does count() evaluate all mapped functions?

2014-12-03 Thread Tobias Pfeiffer
Hi, I have an RDD and a function that should be called on every item in this RDD once (say it updates an external database). So far, I used rdd.map(myFunction).count() or rdd.mapPartitions(iter => iter.map(myFunction)) but I am wondering if this always triggers the call of myFunction in both c
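
If the map result is never used and only the side effect matters, foreach (or foreachPartition) is the idiomatic action; a sketch, assuming myFunction is the side-effecting call from the question:

    // Runs myFunction once per element, purely for its side effect.
    rdd.foreach(item => myFunction(item))

    // Per-partition variant, handy when myFunction needs shared setup,
    // e.g. one database connection per partition.
    rdd.foreachPartition { iter =>
      iter.foreach(item => myFunction(item))
    }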

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
This is just an example but if my graph is big, there will be so many tuples to handle. I cannot manually do val a: RDD[(Int, Double)] = sc.parallelize(List( (1, 1.0), (2, 1.0), (3, 2.0), (4, 2.0), (5, 0.0))) for all the vertices in the graph. What should I do in that

Re: WordCount fails in .textFile() method

2014-12-03 Thread Akhil Das
dump your classpath, looks like you have multiple versions of guava jars in the classpath. Thanks Best Regards On Wed, Dec 3, 2014 at 2:30 PM, Rahul Swaminathan < rahul.swaminat...@duke.edu> wrote: > I’ve tried that and the same error occurs. Do you have any other > suggestions? > > Thanks! >

Re: WordCount fails in .textFile() method

2014-12-03 Thread Rahul Swaminathan
I've tried that and the same error occurs. Do you have any other suggestions? Thanks! Rahul From: Akhil Das mailto:ak...@sigmoidanalytics.com>> Date: Wednesday, December 3, 2014 at 3:55 AM To: Rahul Swaminathan mailto:rahul.swaminat...@duke.edu>> Cc: "u...@spark.incubator.apache.org

Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD

2014-12-03 Thread sahanbull
Hi Guys, I am trying to use SparkSQL to convert an RDD to SchemaRDD so that I can save it in parquet format. A record in my RDD has the following format: RDD1 { field1:5, field2: 'string', field3: {'a':1, 'c':2} } I am using field3 to represent a "sparse vector" and it can have keys

Re: Spark SQL table Join, one task is taking long

2014-12-03 Thread Cheng Lian
Hey Venkat, This behavior seems reasonable. According to the table names, I guess here DAgents should be the fact table and ContactDetails is the dim table. Below is an explanation of a similar query; you may see src as DAgents and src1 as ContactDetails. 0: jdbc:hive2://localhos

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-03 02:13:49 -0800, Deep Pradhan wrote: > We cannot do sc.parallelize(List(VertexRDD)), can we? There's no need to do this, because every VertexRDD is also a pair RDD: class VertexRDD[VD] extends RDD[(VertexId, VD)] You can simply use graph.vertices in place of `a` in my example.

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
At 2014-12-02 22:01:20 -0800, Deep Pradhan wrote: > I have a graph which returns the following on doing graph.vertices > (1, 1.0) > (2, 1.0) > (3, 2.0) > (4, 2.0) > (5, 0.0) > > I want to group all the vertices with the same attribute together, like into > one RDD or something. I want all the vert
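
A sketch of one way to do this grouping, treating graph.vertices as an RDD[(VertexId, Double)] and keying by the attribute (the SparkContext._ import brings the pair-RDD functions into scope on Spark 1.x):

    import org.apache.spark.SparkContext._
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Key each vertex by its attribute, then collect the ids that share it.
    val byAttr: RDD[(Double, Iterable[VertexId])] =
      graph.vertices
        .map { case (id, attr) => (attr, id) }
        .groupByKey()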

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream("some_hdfs_location") to pick up new files from an HDFS location. I am seeing a pretty strange behavior though. textFileStream() is not detecting new files when I "move" them from a location within hdfs to the location at which textFileStream() is checking for new file

Re: getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Akhil Das
You could do something like: val stream = kafkaStream.getStream().repartition(1).mapPartitions(x => x.take(10)) Here stream will have 10 elements from the kafkaStream. Thanks Best Regards On Wed, Dec 3, 2014 at 1:05 PM, Hafiz Mujadid wrote: > Hi Experts! > > Is there a way to read fir

Re: getting first N messages from a Kafka topic using Spark Streaming

2014-12-03 Thread Hafiz Mujadid
Hi Akhil! Thanks for your response. Can you please suggest how to return this sample from a function to the caller and then stop Spark Streaming? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/getting-firs-N-messages-froma-Kafka-topic-using-Spark-S
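
One hedged way to hand a bounded sample back to the caller and then shut streaming down: collect into a driver-side buffer inside foreachRDD and stop once it is full. The names ssc, kafkaStream and the sample size are assumptions:

    import scala.collection.mutable.ArrayBuffer

    val n = 10
    val sample = ArrayBuffer[String]()

    // foreachRDD runs on the driver, so the buffer can be filled directly.
    kafkaStream.foreachRDD { rdd =>
      if (sample.size < n) sample ++= rdd.take(n - sample.size)
    }

    ssc.start()
    // Wait until enough elements have arrived, then stop streaming
    // while keeping the underlying SparkContext alive.
    while (sample.size < n) Thread.sleep(100)
    ssc.stop(stopSparkContext = false)
    // `sample` can now be returned to the caller for visualization.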

Re: How to enforce RDD to be cached?

2014-12-03 Thread Daniel Darabos
On Wed, Dec 3, 2014 at 10:52 AM, shahab wrote: > Hi, > > I noticed that rdd.cache() is not happening immediately rather due to lazy > feature of Spark, it is happening just at the moment you perform some > map/reduce actions. Is this true? > Yes, this is correct. If this is the case, how can I

Re: Announcing Spark 1.1.1!

2014-12-03 Thread rzykov
Andrew and developers, thank you for excellent release! It fixed almost all of our issues. Now we are migrating to Spark from Zoo of Python, Java, Hive, Pig jobs. Our Scala/Spark jobs often failed on 1.1. Spark 1.1.1 works like a Swiss watch. -- View this message in context: http://apache-spar

converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Hafiz Mujadid
Hi everyone! I want to convert a DStream[String] into an RDD[String]. I could not find how to do this. var data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2) val streams = data.window(S

collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
The following code is failing on the collect. If I don't do the collect and go with a JavaRDD it works fine. Except I really would like to collect. At first I was getting an error regarding JDI threads and an index being 0. Then it just started locking up. I'm running the spark context locally o

Re: Logging problem in Spark when using Flume Log4jAppender

2014-12-03 Thread QiaoanChen
Here is the jstack command output, maybe this is helpful: js.js -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Logging-problem-in-Spark-when-using-Flume-Log4jAppender-tp19140p20255.h

RE: collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
I didn't realize I get a nice stack trace if not running in debug mode. Basically, I believe Document has to be serializable. But since the question has already been asked, what are the other requirements for objects within an RDD that I should be aware of? Serializable is very understandable. Ho

Re: converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Sean Owen
DStream.foreachRDD gives you an RDD[String] for each interval of course. I don't think it makes sense to say a DStream can be converted into one RDD since it is a stream. The past elements are inherently not supposed to stick around for a long time, and future elements aren't known. You may conside
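
A sketch of the foreachRDD pattern described above, assuming stream is a DStream[String]:

    // Each batch interval yields one RDD[String]; work with it here.
    stream.foreachRDD { rdd =>
      val count = rdd.count()
      if (count > 0) {
        // e.g. save the batch, update state, or feed another system
        println(s"processing a batch of $count lines")
      }
    }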

Re: converting DStream[String] into RDD[String] in spark streaming

2014-12-03 Thread Hafiz Mujadid
Thanks Dear, It is good to save this data to HDFS and then load back into an RDD :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/converting-DStream-String-into-RDD-String-in-spark-streaming-tp20253p20258.html Sent from the Apache Spark User List mailing l

Re: How to enforce RDD to be cached?

2014-12-03 Thread Paolo Platter
Yes, otherwise you can try: rdd.cache().count() and then run your benchmark Paolo From: Daniel Darabos Sent: Wednesday, 3 December 2014 12:28 To: shahab Cc: user@spark.apache.org

Re: Low Level Kafka Consumer for Spark

2014-12-03 Thread Luis Ángel Vicente Sánchez
My main complain about the WAL mechanism in the new reliable kafka receiver is that you have to enable checkpointing and for some reason, even if spark.cleaner.ttl is set to a reasonable value, only the metadata is cleaned periodically. In my tests, using a folder in my filesystem as the checkpoint

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread Sean Owen
take(1000) merely takes the first 1000 elements of an RDD. I don't imagine that's what the OP means. filter() is how you select a subset of elements to work with. Yes, this requires evaluating the predicate on all 10M elements, at least once. I don't think you could avoid this in general, right, in

Failed fetch: "Could not get block(s)"

2014-12-03 Thread Al M
I am using Spark 1.1.1. I am seeing an issue that only appears when I run in standalone clustered mode with at least 2 workers. The workers are on separate physical machines. I am performing a simple join on 2 RDDs. After the join I run first() on the joined RDD (in Scala) to get the first resul

Re: MLlib Naive Bayes classifier confidence

2014-12-03 Thread Sean Owen
Probabilities won't sum to 1 since this expression doesn't incorporate the probability of the evidence, I imagine? it's constant across classes so is usually excluded. It would appear as a "- log(P(evidence))" term. On Tue, Dec 2, 2014 at 10:44 AM, MariusFS wrote: > Are we sure that exponentiatin

Spark MOOC by Berkeley and Databricks

2014-12-03 Thread Marco Didonna
Hello everybody, in case you missed it, Databricks and Berkeley have announced a free MOOC on Spark and another one on scalable machine learning using Spark. Both courses are free, but if you want a verified certificate of completion you need to donate at least $50. I did it, it's a great invest

what is the best way to implement mini batches?

2014-12-03 Thread ll
hi. what is the best way to pass through a large dataset in small, sequential mini batches? for example, with 1,000,000 data points and a mini batch size of 10, we would need to do some computation on these mini batches: (0..9), (10..19), (20..29), ... (N-9..N) RDD.repartition(N/10).mapParti
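
A hedged sketch of one way to form fixed-size mini batches without repartitioning by count: index every element and group by index / batchSize. The names points and batchSize are assumptions:

    import org.apache.spark.SparkContext._

    val batchSize = 10L

    // zipWithIndex gives each element a stable position; integer division
    // of that index assigns it to a mini batch.
    val miniBatches = points
      .zipWithIndex()
      .map { case (p, idx) => (idx / batchSize, p) }
      .groupByKey()                      // RDD[(batchId, Iterable[point])]

    // Walk the batches in order on the driver, one at a time.
    miniBatches.sortByKey().toLocalIterator.foreach { case (batchId, batch) =>
      // run the per-batch computation here
    }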

Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-03 Thread Ian Wilkinson
Hi, I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). In the following I provide the query (as query dsl): import org.elasticsearch.spark._ object TryES { val sparkConf = new SparkConf().setAppName("Campaigns") sparkConf.set("es.nodes", ":9200") sparkConf.set("es.nodes.discov
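
For reference, the elasticsearch-hadoop Spark support accepts the query DSL either per call or via the es.query property; a sketch with the node address, index/type and query as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._

    val sparkConf = new SparkConf().setAppName("Campaigns")
    sparkConf.set("es.nodes", "localhost:9200")

    val sc = new SparkContext(sparkConf)

    // Full query DSL passed alongside the index/type resource
    // (it could also be set globally as "es.query").
    val query = """{"query": {"match": {"campaign": "spring"}}}"""
    val campaigns = sc.esRDD("campaigns/docs", query)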

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-03 Thread cjdc
Ideas? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-0-RDD-from-snappy-compress-avro-file-tp19998p20267.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-03 Thread Bharath Ravi Kumar
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And yes, I've been following the JIRA for the new ALS implementation. I'll try it out when it's ready for testing. . On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng wrote: > Hi Bharath, > > You can try setting a small item bloc

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
hmm.. 33.6gb is the sum of the memory used by the two RDDs that are cached. You're right, when I put serialized RDDs in the cache the memory footprint for these rdds becomes a lot smaller. Serialized memory footprint shown below: RDD Name  Storage Level  Cached Partitions  Fraction Cached  S

Serializing with Kryo NullPointerException - Java

2014-12-03 Thread Robin Keunen
Hi all, I am having troubles using Kryo and being new to this kind of serialization, I am not sure where to look. Can someone please help me? :-) Here is my custom class: public class *DummyClass* implements KryoSerializable { private static final Logger LOGGER = LoggerFactory.getLogger(
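
If implementing KryoSerializable directly stays troublesome, the usual Spark 1.x route is a KryoRegistrator plus the Kryo serializer setting; a sketch, where the registrator's package name is a placeholder and DummyClass is the class from the question:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      // Register every class that travels through the shuffle or the cache.
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[DummyClass])
      }
    }

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "my.package.MyRegistrator")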

Re: what is the best way to implement mini batches?

2014-12-03 Thread Alex Minnaar
I am trying to do the same thing and also wondering what the best strategy is. Thanks From: ll Sent: Wednesday, December 3, 2014 10:28 AM To: u...@spark.incubator.apache.org Subject: what is the best way to implement mini batches? hi. what is the best w

[SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Yana Kadiyska
Hi folks, I'm wondering if someone has successfully used wildcards with a parquetFile call? I saw this thread and it makes me think no? http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCACA1tWLjcF-NtXj=pqpqm3xk4aj0jitxjhmdqbojj_ojybo...@mail.gmail.com%3E I have a set

Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
I think, the memory calculation is correct, what I didn't account for is the memory used. I am still puzzled as how I can successfully process the RDD in spark. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-understanding-Not-enough-space-to-cache-rdd

GraphX Pregel halting condition

2014-12-03 Thread Jay Hutfles
I'm trying to implement a graph algorithm that does a form of path searching. Once a certain criteria is met on any path in the graph, I wanted to halt the rest of the iterations. But I can't see how to do that with the Pregel API, since any vertex isn't able to know the state of other arbitrary

Re: heterogeneous cluster setup

2014-12-03 Thread Victor Tso-Guillen
I don't have a great answer for you. For us, we found a common divisor, not necessarily a whole gigabyte, of the available memory of the different hardware and used that as the amount of memory per worker and scaled the number of cores accordingly so that every core in the system has the same amoun

dockerized spark executor on mesos?

2014-12-03 Thread Dick Davies
Just wondered if anyone had managed to start spark jobs on mesos wrapped in a docker container? At present (i.e. very early testing) I'm able to submit executors to mesos via spark-submit easily enough, but they fall over as we don't have a JVM on our slaves out of the box. I can push one out via

Best way to have some singleton per worker

2014-12-03 Thread Ashic Mahtab
Hello, I was wondering what the best way is to have some form of singleton per worker...something that'll be instantiated once at the start of a job for each worker, and shut down when all the work on that node is completed. For instance, say I have a client library that initiates a single s

Monitoring Spark

2014-12-03 Thread Isca Harmatz
hello, I'm running Spark on a standalone station and I'm trying to view the event log after the run is finished. I turned on the event log as the site said (spark.eventLog.enabled set to true) but I can't find the log files or get the web UI to work. any idea on how to do this? thanks Isca

Re: SchemaRDD + SQL , loading projection columns

2014-12-03 Thread Vishnusaran Ramaswamy
Thanks for the help.. Let me find more info on how to enable statistics in parquet. -Vishnu Michael Armbrust wrote > There is not a super easy way to do what you are asking since in general > parquet needs to read all the data in a column. As far as I understand it > does not have indexes that w

Re: How to enforce RDD to be cached?

2014-12-03 Thread shahab
Daniel and Paolo, thanks for the comments. best, /Shahab On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter wrote: > Yes, > > otherwise you can try: > > rdd.cache().count() > > and then run your benchmark > > Paolo > > *From:* Daniel Darabos > *Sent:* Wednesday, 3 December 2014 12

MLLib: loading saved model

2014-12-03 Thread Sameer Tilak
Hi All, I am using LinearRegressionWithSGD and then I save the model weights and intercept. The file that contains the weights has this format: 1.204550.13560.000456.. The intercept is 0 since I am using train without setting the intercept, so it can be ignored for the moment. I would now like to initialize a

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD

2014-12-03 Thread Davies Liu
inferSchema() will work better than jsonRDD() in your case, >>> from pyspark.sql import Row >>> srdd = sqlContext.inferSchema(rdd.map(lambda x: Row(**x))) >>> srdd.first() Row( field1=5, field2='string', field3={'a'=1, 'c'=2}) On Wed, Dec 3, 2014 at 12:11 AM, sahanbull wrote: > Hi Guys, > > I am

Re: How to enforce RDD to be cached?

2014-12-03 Thread dsiegel
shahabm wrote > I noticed that rdd.cache() is not happening immediately rather due to lazy > feature of Spark, it is happening just at the moment you perform some > map/reduce actions. Is this true? Yes, .cache() is a transformation (lazy evaluation) shahabm wrote > If this is the case, how can
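
A minimal sketch of the pattern suggested in this thread: mark the RDD for caching, then run a cheap action so its partitions are materialized before the benchmark starts:

    val cached = rdd.cache()   // lazy: only marks the RDD for caching
    cached.count()             // action: forces evaluation and populates the cache
    // later actions on `cached` now read from memory instead of recomputing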

Re: How can I read an avro file in HDFS in Java?

2014-12-03 Thread Prannoy
Hi, Try using sc.newAPIHadoopFile("", AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class, your Configuration) You will get the Avro related classes by importing org.apache.avro.* Thanks. On Tue, Dec 2, 2014 at 9:23 PM, leaviva [via Apache Spark User List] < ml-node+s10015

Re: WordCount fails in .textFile() method

2014-12-03 Thread Rahul Swaminathan
For others who may be having a similar problem: The error below occurs when using Yarn, which uses an earlier version of Guava compared to Spark 1.1.0. When packaging using Maven, if you put the Yarn dependency above the Spark dependency, the earlier version of guava is the one that gets recogn

Re: [SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Michael Armbrust
It won't work until this is merged: https://github.com/apache/spark/pull/3407 On Wed, Dec 3, 2014 at 9:25 AM, Yana Kadiyska wrote: > Hi folks, > > I'm wondering if someone has successfully used wildcards with a > parquetFile call? > > I saw this thread and it makes me think no? > http://mail-arc

Re: object xxx is not a member of package com

2014-12-03 Thread Prannoy
Hi, Add the jars in the external library of you related project. Right click on package or class -> Build Path -> Configure Build Path -> Java Build Path -> Select the Libraries tab -> Add external library -> Browse to com.xxx.yyy.zzz._ -> ok Clean and build your project, most probably you will b

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
>> nsareen wrote >>> 1) Does filter function scan every element saved in RDD? if my RDD >>> represents 10 Million rows, and if i want to work on only 1000 of them, >>> is >>> there an efficient way of filtering the subset without having to scan >>> every element ? using .take(1000) may be a biase

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
also available is .sample(), which will randomly sample your RDD with or without replacement, and returns an RDD. .sample() takes a fraction, so it doesn't return an exact number of elements. eg. rdd.sample(true, .0001, 1) -- View this message in context: http://apache-spark-user-list.100156

Re: Insert new data into specific partition of an RDD

2014-12-03 Thread dsiegel
I'm not sure about .union(), but at least in the case of .join(), as long as you have hash partitioned the original RDDs and persisted them, calls to .join() take advantage of already knowing which partition the keys are on, and will not repartition rdd1. val rdd1 = log.partitionBy(new HashPartit
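
A hedged sketch of the co-partitioning idea, with the second RDD name and partition count as assumptions: give both pair RDDs the same HashPartitioner and persist them, so the later join reuses the known partitioning instead of reshuffling:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val partitioner = new HashPartitioner(100)

    val left  = log.partitionBy(partitioner).persist()
    val right = updates.partitionBy(partitioner).persist()

    // Both sides share the partitioner, so this join does not repartition `left`.
    val joined = left.join(right)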

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Romi Kuntsman
About version compatibility and upgrade path - can the Java application dependencies and the Spark server be upgraded separately (i.e. will 1.1.0 library work with 1.1.1 server, and vice versa), or do they need to be upgraded together? Thanks! *Romi Kuntsman*, *Big Data Engineer* http://www.tot

Alternatives to groupByKey

2014-12-03 Thread ameyc
Hi, So my Spark app needs to run a sliding window through a time series dataset (I'm not using Spark streaming), and then run different types of aggregations on a per-window basis. Right now I'm using a groupByKey() which gives me Iterables for each window. There are a few concerns I have with this

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Andrew Or
By the Spark server do you mean the standalone Master? It is best if they are upgraded together because there have been changes to the Master in 1.1.1. Although it might "just work", it's highly recommended to restart your cluster manager too. 2014-12-03 13:19 GMT-08:00 Romi Kuntsman : > About ve

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
Yeah this is currently broken for 1.1.1. I will submit a fix later today. 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman : > +Andrew > > Actually I think this is because we haven't uploaded the Spark binaries to > cloudfront / pushed the change to mesos/spark-ec2. > > Andrew, can you take care

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei > On Dec 3, 2014, at 9:46 AM, Dick Davies wrote: > > Just wondered if anyone had managed to start spark > jobs on mesos wrapped in a docker container? > > At present

Re: Announcing Spark 1.1.1!

2014-12-03 Thread Aaron Davidson
Because this was a maintenance release, we should not have introduced any binary backwards or forwards incompatibilities. Therefore, applications that were written and compiled against 1.1.0 should still work against a 1.1.1 cluster, and vice versa. On Wed, Dec 3, 2014 at 1:30 PM, Andrew Or wrote

Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
This should be fixed now. Thanks for bringing this to our attention. 2014-12-03 13:31 GMT-08:00 Andrew Or : > Yeah this is currently broken for 1.1.1. I will submit a fix later today. > > 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman < > shiva...@eecs.berkeley.edu>: > > +Andrew >> >> Actually

How to create a new SchemaRDD which is not based on original SparkPlan?

2014-12-03 Thread Tim Chou
Hi All, My question is about the lazy execution mode of SchemaRDD, I guess. I know lazy evaluation is good; however, I still have this requirement. For example, here is the first SchemaRDD, named results (select * from table where num > 1 and num < 4): results: org.apache.spark.sql.SchemaRDD = SchemaRDD[59] at RDD

Re: Alternatives to groupByKey

2014-12-03 Thread Nathan Kronenfeld
I think it would depend on the type and amount of information you're collecting. If you're just trying to collect small numbers for each window, and don't have an overwhelming number of windows, you might consider using accumulators. Just make one per value per time window, and for each data poin

Spark executor lost

2014-12-03 Thread S. Zhou
We are using Spark job server to submit spark jobs (our spark version is 0.9.1). After running the spark job server for a while, we often see the following errors (executor lost) in the spark job server log. As a consequence, the spark driver (allocated inside spark job server) gradually loses ex

Re: Alternatives to groupByKey

2014-12-03 Thread Xuefeng Wu
I have a similar requirement: take top N by key. Right now I use groupByKey, but one key can group more than half the data in some datasets. Yours, Xuefeng Wu 吴雪峰 敬上 > On December 4, 2014, at 7:26 AM, Nathan Kronenfeld > wrote: > > I think it would depend on the type and amount of information you're > co

RE: Spark executor lost

2014-12-03 Thread Ganelin, Ilya
You want to look further up the stack (there are almost certainly other errors before this happens) and those other errors may give you a better idea of what is going on. Also, if you are running on yarn you can run "yarn logs -applicationId " to get the logs from the data nodes. Sent with Good

Re: Alternatives to groupByKey

2014-12-03 Thread Koert Kuipers
do these requirements boil down to a need for foldLeftByKey with sorting of the values? https://issues.apache.org/jira/browse/SPARK-3655 On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu wrote: > I have similar requirememt,take top N by key. right now I use > groupByKey,but one key would group more
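
For the top-N-per-key case specifically, aggregateByKey with a bounded per-key buffer avoids materializing whole groups the way groupByKey does; a sketch, where pairs is assumed to be an RDD[(K, Double)] and n is the cut-off:

    import org.apache.spark.SparkContext._

    val n = 10

    // Keep at most `n` largest values per key; both merge steps stay bounded.
    val topNByKey = pairs.aggregateByKey(List.empty[Double])(
      (acc, v) => (v :: acc).sorted(Ordering[Double].reverse).take(n),
      (a, b)   => (a ::: b).sorted(Ordering[Double].reverse).take(n)
    )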

single key-value pair fitting in memory

2014-12-03 Thread dsiegel
Hi, In the talk "A Deeper Understanding of Spark Internals", it was mentioned that for some operators, spark can spill to disk across keys (in 1.1 - .groupByKey(), .reduceByKey(), .sortByKey()), but that as a limitation of the shuffle at that time, each single key-value pair must fit in memory.

SQL query in scala API

2014-12-03 Thread Arun Luthra
I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called "users") of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I ca
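
With plain PairRDDFunctions, one hedged approach is to aggregate a (count, set-of-users) pair per zip; it works as long as the distinct users per zip fit in memory. Here users is assumed to be an RDD[(String, String)] of (zip, userId) pairs:

    import org.apache.spark.SparkContext._

    val stats = users.aggregateByKey((0L, Set.empty[String]))(
      (acc, user) => (acc._1 + 1, acc._2 + user),      // fold a user into (count, set)
      (a, b)      => (a._1 + b._1, a._2 ++ b._2)       // merge partial results
    ).mapValues { case (count, set) => (count, set.size) }  // (COUNT(user), COUNT(DISTINCT user))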

Re: Spark SQL UDF returning a list?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj wrote: > > Exception in thread "main" java.lang.RuntimeException: [1.57] failure: > ``('' expected but identifier myudf found > > I also tried returning a List of Ints, that did not work either. Is there > a way to write a UDF that returns a list? >

Re: Spark executor lost

2014-12-03 Thread Ted Yu
bq. to get the logs from the data nodes Minor correction: the logs are collected from machines where node managers run. Cheers On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya wrote: > You want to look further up the stack (there are almost certainly other > errors before this happens) and thos

How can a function running on a slave access the Executor

2014-12-03 Thread Steve Lewis
I have been working on balancing work across a number of partitions and find it would be useful to access information about the current execution environment much of which (like Executor ID) are available if there was a way to get the current executor or the Hadoop TaskAttempt context - does any o

Re: textFileStream() issue?

2014-12-03 Thread Tobias Pfeiffer
Hi, On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain wrote: > > I am trying to use textFileStream("some_hdfs_location") to pick new files > from a HDFS location.I am seeing a pretty strange behavior though. > textFileStream() is not detecting new files when I "move" them from a > location with in hd

Re: Alternatives to groupByKey

2014-12-03 Thread Xuefeng Wu
Looks good. My concern about foldLeftByKey is that it looks like it breaks consistency with foldLeft on RDD and aggregateByKey on PairRDD. Yours, Xuefeng Wu 吴雪峰 敬上 > On December 4, 2014, at 7:47 AM, Koert Kuipers wrote: > > foldLeftByKey ---

Re: Best way to have some singleton per worker

2014-12-03 Thread Tobias Pfeiffer
Hi, On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab wrote: > > I've been doing this with foreachPartition (i.e. have the parameters for > creating the singleton outside the loop, do a foreachPartition, create the > instance, loop over entries in the partition, close the partition), but > it's quite
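
One common hedged pattern is a JVM-wide lazy singleton held in a Scala object: it is initialized once per executor JVM the first time a task touches it and reused by later partitions. SomeClient, its constructor arguments, and its methods are placeholders:

    object ClientHolder {
      // Initialized at most once per executor JVM, on first use.
      lazy val client: SomeClient = {
        val c = new SomeClient("service-host", 1234)
        sys.addShutdownHook { c.close() }   // clean up when the executor JVM exits
        c
      }
    }

    rdd.foreachPartition { iter =>
      val client = ClientHolder.client      // shared across partitions on one executor
      iter.foreach(item => client.send(item))
    }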

Re: dockerized spark executor on mesos?

2014-12-03 Thread Kyle Ellrott
I'd like to tag a question onto this; has anybody attempted to deploy spark under Kubernetes https://github.com/googlecloudplatform/kubernetes or Kubernetes mesos ( https://github.com/mesosphere/kubernetes-mesos ) . On Wednesday, December 3, 2014, Matei Zaharia wrote: > I'd suggest asking about t

wordcount accross several files

2014-12-03 Thread BC
I'm trying to run wordcount on several files, but I'm stuck failing to pass the output from one file to another. Any help would be appreciated. sc = SparkContext() for datafile in inputfiles: lines = sc.textFile(indir + "/" + datafile, 1) counts = lines.flatMap(lambda x: x.split(' ')) \

reading dynamoDB with spark

2014-12-03 Thread Tyson
Hi, I'm trying to read data from a DynamoDB table with Spark, but after I run this code I get the error message below. I use Spark 1.1.1 and emr-core-1.1.jar, emr-ddb-hive-1.0.jar and emr-ddb-hadoop-1.0.jar. val sparkConf = SparkConf().setAppName("DynamoRdeader").setMaster("local[4]") val ctx

Spark SQL with a sorted file

2014-12-03 Thread Jerry Raj
Hi, If I create a SchemaRDD from a file that I know is sorted on a certain field, is it possible to somehow pass that information on to Spark SQL so that SQL queries referencing that field are optimized? Thanks -Jerry - To un

Re: Having problem with Spark streaming with Kinesis

2014-12-03 Thread A.K.M. Ashrafuzzaman
Guys, In my local machine it consumes a stream of Kinesis with 3 shards. But in EC2 it does not consume from the stream. Later we found that the EC2 machine had 2 cores and my local machine had 4 cores. I am using a single machine in spark standalone mode. And we got a larger machine f

Re: SQL query in scala API

2014-12-03 Thread Cheng Lian
You may do this: |table("users").groupBy('zip)('zip, count('user), countDistinct('user)) | On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API

spark-submit on YARN is slow

2014-12-03 Thread Tobias Pfeiffer
Hi, I am using spark-submit to submit my application to YARN in "yarn-cluster" mode. I have both the Spark assembly jar file as well as my application jar file put in HDFS and can see from the logging output that both files are used from there. However, it still takes about 10 seconds for my appli

cannot submit python files on EC2 cluster

2014-12-03 Thread chocjy
Hi, I am using spark with version number 1.1.0 on an EC2 cluster. After I submitted the job, it returned an error saying that a python module cannot be loaded due to missing files. I am using the same command that used to work on an private cluster before for submitting jobs and all the source fil

RE: Spark SQL UDF returning a list?

2014-12-03 Thread Cheng, Hao
Yes I agree, and it may also be ambiguous in semantics: a list of objects vs. a list with a single List object. I've also tested that; it seems a. There is a bug in registerFunction, which doesn't support a UDF without arguments. ( I just created a PR for this: https://github.com/apache/spark/

RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try to write your own Relation with filter push down or use the ParquetRelation2 for workaround. (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala) Cheng Hao -Original Message- From: Jerry Raj [mailto:jerry@gma

Issue in executing Spark Application from Eclipse

2014-12-03 Thread Stuti Awasthi
Hi All, I have a standalone Spark(1.1) cluster on one machine and I have installed scala Eclipse IDE (scala 2.10) on my desktop. I am trying to execute a spark code to execute over my standalone cluster but getting errors. Please guide me to resolve this. Code: val logFile = "" // Should be so

MLLIB model export: PMML vs MLLIB serialization

2014-12-03 Thread sourabh
Hi All, I am doing model training using Spark MLLIB inside our hadoop cluster. But prediction happens in a different realtime synchronous system(Web application). I am currently exploring different options to export the trained Mllib models from spark. 1. *Export model as PMML:* I found the pro

Re: netty on classpath when using spark-submit

2014-12-03 Thread Tobias Pfeiffer
Markus, On Tue, Nov 11, 2014 at 10:40 AM, M. Dale wrote: > > I never tried to use this property. I was hoping someone else would jump > in. When I saw your original question I remembered that Hadoop has > something similar. So I searched and found the link below. A quick JIRA > search seems to

Re: Issue in executing Spark Application from Eclipse

2014-12-03 Thread Sonal Goyal
Seems like there is an issue with your standalone cluster as can be seen from the master logs. Are your DNS entries correct? Best Regards, Sonal Founder, Nube Technologies On Thu, Dec 4, 2014 at 11:50 AM, Stuti Awasthi wrote: >

Re: Filter using the Vertex Ids

2014-12-03 Thread Deep Pradhan
But groupByKey() gives me the error saying that it is not a member of org.apache.spark.rdd.RDD[(Double, org.apache.spark.graphx.VertexId)] when run in the graphx directory of spark-1.0.0. This error does not come when I use the same in the interactive shell. On Wed, Dec 3, 2014 at 3:49 PM, Ankur

Re: Filter using the Vertex Ids

2014-12-03 Thread Ankur Dave
To get that function in scope you have to import org.apache.spark.SparkContext._ Ankur On Wednesday, December 3, 2014, Deep Pradhan wrote: > But groupByKey() gives me the error saying that it is not a member of > org.apache.spark,rdd,RDD[(Double, org.apache.spark.graphx.VertexId)] > -- Ankur
