Re: java.io.IOException: Filesystem closed

2014-12-01 Thread Akhil Das
What is the application that you are submitting? It looks like you might have used the FileSystem (fs) inside the app and then closed it there. Thanks Best Regards On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek wrote: > Hi, > > I face the following exception when I submit a spark application. The log > fil

java.io.IOException: Filesystem closed

2014-12-01 Thread rapelly kartheek
Hi, I face the following exception when I submit a Spark application. The log file shows: 14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.

Re: numpy arrays and spark sql

2014-12-01 Thread Davies Liu
applySchema() only accepts an RDD of Row/list/tuple; it does not work with numpy.array. After applySchema(), the Python RDD will be pickled and unpickled in the JVM, so you will not get any benefit from using numpy.array. It will work if you convert the ndarray into a list: schemaRDD = sqlContext.applySchema(r

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
You may be able to construct RDDs directly from an iterator - I'm not sure - you may have to subclass RDD yourself. On 1 December 2014 at 18:40, Keith Simmons wrote: > Yep, that's definitely possible. It's one of the workarounds I was > considering. I was just curious if there was a simpler (and perhap

Re: Calling spark from a java web application.

2014-12-01 Thread ryaminal
If you are able to use YARN in your hadoop cluster, then the following technique is pretty straightforward: http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/ We use this in our system and it's super easy to execute from our Tomcat application. -- View this message in context: ht

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-01 Thread Bharath Ravi Kumar
Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen wrote: > (It won't be that, since you see that the error occur when reading a >

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a simpler (and perhaps more efficient) approach. Keith On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg wrote: > Could you modify your function so that it streams through the files record

numpy arrays and spark sql

2014-12-01 Thread Joseph Winston
This works as expected in the 1.1 branch: from pyspark.sql import * rdd = sc.parallelize([range(0, 10), range(10, 20), range(20, 30)]) # define the schema schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10" fields = [StructField(field_name, IntegerType(), True

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
Could you modify your function so that it streams through the files record by record and outputs them to hdfs, then read them all in as RDDs and take the union? That would only use bounded memory. On 1 December 2014 at 17:19, Keith Simmons wrote: > Actually, I'm working with a binary format. Th
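
A rough sketch of this write-out-then-union idea in Scala, assuming a hypothetical recordIterator(file) that reads the custom binary format one record at a time, a hypothetical toLine(record) serializer, an illustrative HDFS stagingDir for the intermediate text files, and the fileNames collection from the original snippet (none of these helper names come from the thread):

    import org.apache.hadoop.fs.{FileSystem, Path}

    def recordIterator(file: String): Iterator[Array[Byte]] = ???  // hypothetical reader for the binary format
    def toLine(rec: Array[Byte]): String = ???                     // hypothetical record-to-text serializer
    val stagingDir = "/tmp/converted"                              // illustrative HDFS directory

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fileNames.foreach { file =>
      // stream one record at a time into an intermediate text file, so memory stays bounded
      val out = fs.create(new Path(stagingDir, new Path(file).getName))
      try recordIterator(file).foreach(r => out.writeBytes(toLine(r) + "\n"))
      finally out.close()
    }
    // read the converted files back in and take the union
    val all = sc.union(fileNames.map(f => sc.textFile(stagingDir + "/" + new Path(f).getName)))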

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Zhan Zhang
Please check whether https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the problem for launching the AM. Thanks. Zhan Zhang On Dec 1, 2014, at 4:49 PM, Mohammad Islam wrote: > Hi, > How do I pass Java options (such as "-XX:MaxMetaspaceSize=100M") when > launching the AM or task

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Mohammad Islam
Thanks Tobias for the answer. Does it work for the "driver" as well? Regards, Mohammad On Monday, December 1, 2014 5:30 PM, Tobias Pfeiffer wrote: Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up last week) and s

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Tobias Pfeiffer
Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up last week) and spark.executor.extraJavaOptions at < http://spark.apache.org/docs/latest/configuration.html#runtime-environment>. Tobias
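
For reference, a hedged example of supplying these options at submit time (the class and jar names are placeholders); driver JVM options generally need to be known before the driver starts, so spark-submit --conf or spark-defaults.conf is the usual place for them:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-XX:MaxMetaspaceSize=100M" \
      --conf "spark.executor.extraJavaOptions=-XX:MaxMetaspaceSize=100M" \
      --class com.example.MyApp my-app.jar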

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Actually, I'm working with a binary format. The API allows reading out a single record at a time, but I'm not sure how to get those records into Spark (without reading everything into memory from a single file at once). On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg wrote: > file => transform file

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
> > file => transform file into a bunch of records What does this function do exactly? Does it load the file locally? Spark supports RDDs exceeding global RAM (cf. the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each fi

Passing Java Options to Spark AM launching

2014-12-01 Thread Mohammad Islam
Hi, How do I pass Java options (such as "-XX:MaxMetaspaceSize=100M") when launching the AM or task containers? This is related to running Spark on YARN (Hadoop 2.3.0). In the MapReduce case, setting a property such as "mapreduce.map.java.opts" does the job. Any help would be highly appreciated.

Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
This is a long shot, but... I'm trying to load a bunch of files spread out over HDFS into an RDD, and in most cases it works well, but for a few very large files, I exceed available memory. My current workflow basically works like this: context.parallelize(fileNames).flatMap { file => transform

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Xuefeng Wu
Hi Debasish, I found test code in map translate, would it collect all products too? + val sortedProducts = products.toArray.sorted(ord.reverse) Yours, Xuefeng Wu 吴雪峰 敬上 > On 2 Dec 2014, at 1:33 AM, Debasish Das wrote: > > rdd.top collects it on master... > > If you want topk for a key run m

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Yep. No localhost. Usually, I use hdfs:///user/data to indicate I want HDFS, or file:///user/data to indicate a local file directory. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, December 01, 2014 5:06 PM To: Bui, Tri Cc: Benjamin Cuthbert; user@spark.

Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes but you can't follow three slashes with host:port. No host probably defaults to whatever is found in your HDFS config. On Mon, Dec 1, 2014 at 11:02 PM, Bui, Tri wrote: > For the streaming example I am working on, Its accepted ("hdfs:///user/data") > without the localhost info. > > Let me dig

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
For the streaming example I am working on, it's accepted ("hdfs:///user/data") without the localhost info. Let me dig through my HDFS config. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, December 01, 2014 4:50 PM To: Benjamin Cuthbert Cc: user@spark.

Re: hdfs streaming context

2014-12-01 Thread Benjamin Cuthbert
Thanks Sean, that worked: just removing the /* and leaving it as /user/data. Seems to be streaming in. > On 1 Dec 2014, at 22:50, Sean Owen wrote: > > Yes, in fact, that's the only way it works. You need > "hdfs://localhost:8020/user/data", I believe. > > (No it's not correct to write "hdfs://
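
A minimal sketch of the form that ended up working in this thread (host, port and path taken from the example above; adjust to your cluster):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // full URI with host:port and the directory only -- no trailing /*
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")
    lines.print()
    ssc.start()
    ssc.awaitTermination()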

Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes, in fact, that's the only way it works. You need "hdfs://localhost:8020/user/data", I believe. (No it's not correct to write "hdfs:///...") On Mon, Dec 1, 2014 at 10:41 PM, Benjamin Cuthbert wrote: > All, > > Is it possible to stream on HDFS directory and listen for multiple files? > > I hav

Re: hdfs streaming context

2014-12-01 Thread Andy Twigg
Have you tried just passing a path to ssc.textFileStream() ? It monitors the path for new files by looking at mtime/atime; all new/touched files in the time window appear as an RDD in the DStream. On 1 December 2014 at 14:41, Benjamin Cuthbert wrote: > All, > > Is it possible to stream on HDFS d

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Try ("hdfs:///localhost:8020/user/data/*") With 3 "/". Thx tri -Original Message- From: Benjamin Cuthbert [mailto:cuthbert@gmail.com] Sent: Monday, December 01, 2014 4:41 PM To: user@spark.apache.org Subject: hdfs streaming context All, Is it possible to stream on HDFS director

hdfs streaming context

2014-12-01 Thread Benjamin Cuthbert
All, Is it possible to stream on an HDFS directory and listen for multiple files? I have tried the following val sparkConf = new SparkConf().setAppName("HdfsWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = ssc.textFileStream("hdfs://localhost:8020/user/data/*") lines.fi

Spark SQL table Join, one task is taking long

2014-12-01 Thread Venkat Subramanian
Environment: Spark 1.1, 4 Node Spark and Hadoop Dev cluster - 6 cores, 32 GB Ram each. Default serialization, Standalone, no security Data was sqooped from relational DB to HDFS and Data is partitioned across HDFS uniformly. I am reading a fact table about 8 GB in size and one small dim table fro

StreamingLinearRegressionWithSGD

2014-12-01 Thread Joanne Contact
Hi Gurus, I did not look at the code yet. I wonder if StreamingLinearRegressionWithSGD is equivalent to LinearRegressionWithSGD

RE: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-12-01 Thread Bui, Tri
Thanks Yanbo! That works! The only issue is that it won’t print the predicted value for lp.features, from the code line below. model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() It prints the test input data correctly, but it keeps on printing “0.0” as the predicted value
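
For context, a hedged sketch of the usual wiring (numFeatures, trainingData and testData are assumed to exist as in the thread); constant 0.0 predictions usually mean the weights never move from their initial values, so the initial weight vector and step size are worth checking:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(Array.fill(numFeatures)(0.0)))  // start from explicit weights
      .setStepSize(0.1)                                                // tune to the feature scale
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()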

Remove added jar from spark context

2014-12-01 Thread ankits
Hi, Is there a way to REMOVE a jar (added via addJar) from a Spark context? We have a long-running context used by the Spark jobserver, but after trying to update versions of classes already on the class path via addJars, the context still runs the old versions. It would be helpful if I could remove

Re: How to use FlumeInputDStream in spark cluster?

2014-12-01 Thread Ping Tang
Thank you very much for your reply. I have a cluster of 8 nodes: m1, m2, m3 .. m8. m1 is configured as the Spark master node; the rest of the nodes are all worker nodes. I also configured m3 as the History Server, but the history server fails to start. I ran FlumeEventCount on m1 using the right hostname

Minimum cluster size for empirical testing

2014-12-01 Thread Valdes, Pablo
Hi everyone, I’m interested in empirically measuring how much faster Spark works in comparison to Hadoop for certain problems and input corpora I currently work with (I’ve read Matei Zaharia’s “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” paper and I wa

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
No, it should support any data source that has a schema and can produce rows. On Mon, Dec 1, 2014 at 1:34 AM, Niranda Perera wrote: > Hi Michael, > > About this new data source API, what type of data sources would it > support? Does it have to be RDBMS necessarily? > > Cheers > > On Sat, Nov 29,

How to Integrate openNLP with Spark

2014-12-01 Thread Nikhil
Hi, I am using openNLP NER (Token Name Finder) for parsing unstructured data. In order to speed up my process (to quickly train models and analyze documents from the models), I want to use Spark, and I saw on the web that it is possible to connect openNLP with Spark using UIMAFit, but I

Spark Summit East CFP - 5 days until deadline

2014-12-01 Thread Scott walent
The inaugural Spark Summit East (spark-summit.org/east), an event to bring the Apache Spark community together, will be in New York City on March 18, 2015. The call for submissions is currently open, but will close this Friday December 5, at 11:59pm PST. The summit is looking for talks that will

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-12-01 Thread Judy Nash
Have you checked out the wiki here? http://spark.apache.org/docs/latest/building-with-maven.html A couple things I did differently from you: 1) I got the bits directly from github (https://github.com/apache/spark/). Use branch 1.1 for spark 1.1 2) execute maven command on cmd (powershell misses

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects it on the master... If you want top-k per key, run map / mapPartitions, use a bounded priority queue, and reduceByKey the queues. I experimented with top-k from Algebird and a bounded priority queue wrapped over a Java priority queue (the Spark default)... the BPQ is faster. A code example is here: h
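
A rough sketch of that bounded-structure idea, using a small sorted list per key in place of a real bounded priority queue (scores is a hypothetical RDD[(Int, Double)] of (age, score) pairs, k = 10); the result stays an RDD, so nothing is collected to the driver:

    val k = 10
    val topPerKey = scores.aggregateByKey(List.empty[Double])(
      (buf, v) => (v :: buf).sorted(Ordering[Double].reverse).take(k),  // per partition: keep only the k best
      (b1, b2) => (b1 ++ b2).sorted(Ordering[Double].reverse).take(k)   // merging partitions: still only k
    )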

packaging from source gives protobuf compatibility issues.

2014-12-01 Thread akhandeshi
scala> textFile.count() java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteReques tProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; I tried ./make-distribution.sh -Dhadoop.version=2.5.0 and /usr/local/apache-

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, suppose I keep a batch size of 3 minutes. In one batch there can be incoming records with any timestamp, so it is difficult to keep track of when the 3-minute interval starts and ends. I am doing the output operation on worker nodes in foreachPartition, not in the driver (foreachRDD), so I cannot use any

Re: Mllib native netlib-java/OpenBLAS

2014-12-01 Thread agg212
Thanks for your reply, but I'm still running into issues installing/configuring the native libraries for MLlib. Here are the steps I've taken, please let me know if anything is incorrect. - Download Spark source - unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean package ` - Run `sbt/s

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, you can associate all the messages of a 3-minute interval with a unique key, then group by that key and finally add up. Thanks On Dec 1, 2014 9:02 PM, "pankaj" wrote: > Hi, > > My incoming message has a timestamp as one field and I have to perform > aggregation over a 3-minute time slice. > > Message
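
A minimal sketch of that keying idea (the Event case class, its fields and the events RDD are illustrative, with ts an epoch-millisecond timestamp parsed from the record):

    case class Event(itemId: String, itemType: String, ts: Long)

    val sliceMs = 3 * 60 * 1000L
    val countsPerSlice = events
      .map(e => ((e.itemId, e.ts - e.ts % sliceMs), 1L))  // key = (item, start of its 3-minute slice)
      .reduceByKey(_ + _)                                  // add up within each slice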

Re: Spark Job submit

2014-12-01 Thread Matt Narrell
Or setting the HADOOP_CONF_DIR property. Either way, you must have the YARN configuration available to the submitting application to allow for the use of “yarn-client” or “yarn-cluster”. The attached stack trace below doesn’t provide any information as to why the job failed. mn > On Nov 27, 20
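
A hedged example of that setup (the config path, class and jar names are placeholders); the directory must contain the cluster's yarn-site.xml and core-site.xml:

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    spark-submit --master yarn-client --class com.example.MyApp my-app.jar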

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Ritesh Kumar Singh
For converting an Array or any List to an RDD, we can try using: sc.parallelize(groupedScore) // or whatever the name of the list variable is On Mon, Dec 1, 2014 at 8:14 PM, Xuefeng Wu wrote: > Hi, I have a problem, it is easy in Scala code, but I can not take the top > N from RDD as RDD

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-01 Thread cjdc
BTW, the same error from above also happens on 1.1.0 (just tested) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-0-RDD-from-snappy-compress-avro-file-tp19998p20106.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
Yes, the processing causes the most stress. But this is parallelizable by splitting the input source. My problem is that once the heavy preprocessing is done, I’m in a „micro-update“ mode so to speak (the user-interactive part of the whole workflow). Then the map is rendered directly from the SQLite fi

Re: ensuring RDD indices remain immutable

2014-12-01 Thread rok
True, though I was hoping to avoid having to sort... maybe there's no way around it. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ensuring-RDD-indices-remain-immutable-tp20094p20104.html Sent from the Apache Spark User List mailing list archive at

Problem creating EC2 cluster using spark-ec2

2014-12-01 Thread Dave Challis
I've been trying to create a Spark cluster on EC2 using the documentation at https://spark.apache.org/docs/latest/ec2-scripts.html (with Spark 1.1.1). Running the script successfully creates some EC2 instances, HDFS etc., but appears to fail to copy the actual files needed to run Spark across. I

Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, my incoming message has a timestamp as one field and I have to perform aggregation over a 3-minute time slice. Message sample: "Item ID" "Item Type" "timeStamp" 1 X 1-12-2014:12:01 1 X 1-12-2014:12:02 1 X

Re: ensuring RDD indices remain immutable

2014-12-01 Thread Sean Owen
I think the robust thing to do is sort the RDD, and then zipWithIndex. Even if the RDD is recomputed, the ordering and thus assignment of IDs should be the same. On Mon, Dec 1, 2014 at 2:36 PM, rok wrote: > I have an RDD that serves as a feature look-up table downstream in my > analysis. I create
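
A small sketch of that suggestion (featureNames is a hypothetical RDD[String] of feature keys); sorting first makes the index assigned by zipWithIndex the same even if the RDD is recomputed:

    val featureNames = sc.parallelize(Seq("colour", "size", "weight"))
    val featureIndex = featureNames.sortBy(identity).zipWithIndex()  // (feature, stable id)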

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
not applicable to your problem, but interesting enough to share on this thread: http://basepub.dauphine.fr/bitstream/handle/123456789/5260/SD-Rtree.PDF?sequence=2 On Mon Dec 01 2014 at 3:48:14 PM andy petrella wrote: > Indeed. However, I guess the important load and stress is in the > processing

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Indeed. However, I guess the important load and stress is in the processing of the 3D data (DEM or the like) into geometries/shades/whatever. Hence you can use Spark (GeoTrellis can be tricky for 3D, poke @lossyrob for more info) to perform these operations, then keep an RDD of only the resulting geome

How take top N of top M from RDD as RDD

2014-12-01 Thread Xuefeng Wu
Hi, I have a problem. It is easy in Scala code, but I cannot take the top N from an RDD as an RDD. There are 1 Student Score records; the task is to take the top 10 ages, and then take the top 10 from each age, so the result is 100 records. The Scala code is here, but how can I do it with an RDD, *for RDD.take returns an Array, but

Re: Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
… Sorry, I forgot to mention why I’m basically bound to SQLite. The workflow involves more data processing steps than I mentioned. There are several tools in the chain which either rely on SQLite as an exchange format, or processing steps like data cleaning that are done orders of magnitude faster / or using

Re: Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
Thank you for mentioning GeoTrellis. I haven’t heard of this before. We have many custom tools and steps; I’ll check how our tools fit in. The end result is actually a 3D map for native OpenGL-based rendering on iOS / Android [1]. I’m using GeoPackage, which is basically SQLite with R-Tree and

ensuring RDD indices remain immutable

2014-12-01 Thread rok
I have an RDD that serves as a feature look-up table downstream in my analysis. I create it using zipWithIndex(), and because I suppose that the elements of the RDD could end up in a different order if it is regenerated at any point, I cache it to try to ensure that the (feature --> index) mapp

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Not quite sure which geo processing you're doing: is it raster or vector? More info would be appreciated so I can help you further. Meanwhile I can try to give some hints; for instance, did you consider GeoMesa? Since you need a WMS (or the like), did you

Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
Hi all, I need some advice on whether Spark is the right tool for my zoo. My requirements share commonalities with „big data“, workflow coordination and „reactive“ event-driven data processing (as in, for example, Haskell Arrows), which doesn’t make it any easier to decide on a tool set. NB: I have

Re: Setting network variables in spark-shell

2014-12-01 Thread Shixiong Zhu
Don't set `spark.akka.frameSize` to 1. The max value of `spark.akka.frameSize` is 2047. The unit is MB. Best Regards, Shixiong Zhu 2014-12-01 0:51 GMT+08:00 Yanbo : > > Try to use spark-shell --conf spark.akka.frameSize=1 > > On 1 Dec 2014, at 12:25 AM, Brian Dolan wrote: > > Howdy Folks, > > Wha
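
For example, with 128 as an arbitrary illustrative value in MB within the (0, 2047] range, either spark-shell --conf spark.akka.frameSize=128 on the command line, or in code:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.akka.frameSize", "128")  // MB, must not exceed 2047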

Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
I see you have no worker machines to execute the job [image: Inline image 1] You haven't configured your Spark cluster properly. A quick fix to get it running would be to run it in local mode; for that, change this line: JavaStreamingContext jssc = *new* JavaStreamingContext("spark://192.168.88.130:7
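
A Scala sketch of that quick fix ("local[2]" so the receiver gets one thread and processing gets another; the app name and batch interval are illustrative); the thread's original code uses the Java API, where the change is the same idea:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "KafkaSparkStreaming", Seconds(2))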

RE: Kryo exception for CassandraSQLRow

2014-12-01 Thread Ashic Mahtab
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra Connector version 1.1.0 final fixed the guava back-compat issue. Maybe taking out the guava exclusions might help? Date: Mon, 1 Dec 2014 10:48:25 +0100 Subject: Kryo exception for CassandraSQLRow From: shahab.mok...@gmail.com

RE: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi, the Spark master is working, and I have given the same URL in the code: [inline image] The warning is gone, and the new log is: --- Time: 141742785 ms --- INFO [sparkDriver-akka.actor.d

Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
It says: 14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory. A quick guess would be that you are giving the wrong master URL (spark://192.168.88.130:7077). Open the w

Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi, I am integrating Kafka and Spark, using spark-streaming. I have created a topic as a kafka producer: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test I am publishing messages in kafka and trying to read them using spark-streami

Spark 1.1.0: weird spark-shell behavior

2014-12-01 Thread Reinis Vicups
Hello, I have two weird effects when working with spark-shell: 1. This code executed in spark-shell causes an exception below. At the same time it works perfectly when submitted with spark-submit! : import org.apache.hadoop.hbase.{HConstants, HBaseConfiguration} import org.apache.hadoop.hbas

Kryo exception for CassandraSQLRow

2014-12-01 Thread shahab
I am using the Cassandra-Spark connector to pull data from Cassandra, process it and write it back to Cassandra. Now I am getting the following exception, and apparently it is a Kryo serialisation issue. Does anyone know what the reason is and how this can be solved? I also tried to register "org.apache.spark.s

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-12-01 Thread Ishwardeep Singh
Hi Judy, Thank you for your response. When I try to compile using maven "mvn -Dhadoop.version=1.2.1 -DskipTests clean package" I get an error "Error: Could not find or load main class" . I have maven 3.0.4. And when I run command "sbt package" I get the same exception as earlier. I have done t

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Niranda Perera
Hi Michael, About this new data source API, what type of data sources would it support? Does it necessarily have to be an RDBMS? Cheers On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust wrote: > You probably don't need to create a new kind of SchemaRDD. Instead I'd > suggest taking a look at th

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-01 Thread cjdc
Hi Vikas and Simone, thanks for the replies. Yeah I understand this would be easier with 1.2 but this is completely out of my control. I really have to work with 1.0.0. About Simone's approach, during the imports I get: /scala> import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat, AvroK

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new JIRA and linking it to that one? On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar wrote: > The workaround was to wrap the map returned by spark libraries into HashMap > and then broadcast

akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-01 Thread Alexey Romanchuk
Hello spark users! I found lots of strange messages in driver log. Here it is: 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25] ERROR akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A1

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread lokeshkumar
The workaround was to wrap the map returned by the Spark libraries in a HashMap and then broadcast it. Could anyone please let me know if there is an issue open for this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-a
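
A small sketch of that workaround (countsFromApi stands in for whatever java.util.Map the Spark Java API returned; the names are illustrative): copy it into a plain HashMap before broadcasting, so the wrapper class is not needed to deserialize it:

    val plainMap = new java.util.HashMap[String, java.lang.Long](countsFromApi)  // defensive copy of the wrapper
    val broadcastCounts = sc.broadcast(plainMap)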

merge elements in a Spark RDD under custom condition

2014-12-01 Thread Pengcheng YIN
Hi Pros, I want to merge elements in a Spark RDD when two elements satisfy a certain condition. Suppose there is an RDD[Seq[Int]], where some Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD, and store the result in a new RDD. For ex