Hi Jim,
this is definitely strange. It sure sounds like a bug, but it is also a
very commonly used code path, so at the very least you must be hitting a
corner case. Could you share a little more info with us? What version of
Spark are you using? How big is the object you are trying to broa
at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
> at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>
0) -> x}.partitionBy(new
> org.apache.spark.HashPartitioner(10))
> (0 until 5).foreach { idx =>
> val otherData = sc.parallelize(1 to (idx * 100)).map{ x => (x % 10) ->
> x}.partitionBy(new org.apache.spark.HashPartitioner(10))
> println(idx + " ---> " + o
only an answer to one of your questions:
What about log statements in the
> partition processing functions? Will their log statements get logged into
> a
> file residing on a given 'slave' machine, or will Spark capture this log
> output and divert it into the log file of the driver's machine?
>
Hi Zhipeng,
yes, your understanding is correct. The "SER" portion just refers to how
it's stored in memory. On disk, the data always has to be serialized.
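As a rough illustration (plain Java serialization here, nothing Spark-specific), anything written to disk has to pass through a serializer to become a byte stream:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Round-trip a small object through a byte stream, the way any on-disk
// representation must be produced and read back.
val obj = List(1, 2, 3)
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(obj)
oos.close()
val bytes = bos.toByteArray // the serialized, "on-disk" form
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val back = ois.readObject().asInstanceOf[List[Int]]
assert(back == List(1, 2, 3))
```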
On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng
wrote:
> Hi Todd, Howard,
>
>
>
> Thanks for your reply, I might not present my question
I agree with Richard. It looks like the issue here is shuffling, and
shuffle data is always written to disk, so the issue is definitely not that
all the output of flatMap has to be stored in memory.
If at all possible, I'd first suggest upgrading to a newer version of Spark
-- even in 1.2, there we
It launches two jobs because it doesn't know ahead of time how big your RDD
is, so it doesn't know what the sampling rate should be. After counting
all the records, it can determine what the sampling rate should be -- then
it does another pass through the data, sampling by the rate it's just
determ
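A hedged sketch of that two-pass idea with plain Scala collections (the variable names are made up for illustration, not Spark's actual implementation):

```scala
// Pass 1 ("job" 1): count the records, since the total isn't known up front.
val data = (1 to 1000).toList
val desiredSampleSize = 100
val total = data.size

// Pass 2 ("job" 2): derive the sampling rate from the count, then sample.
val rate = desiredSampleSize.toDouble / total
val sample = data.filter(_ => scala.util.Random.nextDouble() < rate)
assert(rate == 0.1)
```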
yikes.
Was this a one-time thing? Or does it happen consistently? Can you turn
on debug logging for o.a.s.scheduler (dunno if it will help, but maybe ...)
On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das
wrote:
> Hi
>
> My Spark job (running in local[*] with spark 1.4.1) reads data from a
> thrift
}
> }
>
> partitionsArray
>
>
>
>
> Thanks
> Best Regards
>
> On Wed, Aug 12, 2015 at 10:57 PM, Imran Rashid
> wrote:
>
>> yikes.
>>
>> Was this a one-time thing? Or does it happen consistently? can you turn
>> on debug logging
just looking at the thread dump from your original email, the 3 executor
threads are all trying to load classes. (One thread is actually loading
some class, and the others are blocked waiting to load a class, most likely
trying to load the same thing.) That is really weird, definitely not
somethi
sorry, by "repl" I mean "spark-shell", I guess I'm used to them being used
interchangeably. From that thread dump, the one thread that isn't stuck is
trying to get classes specifically related to the shell / repl:
java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketR
perhaps this is https://issues.apache.org/jira/browse/SPARK-24578?
that was reported as a performance issue, not OOMs, but it's in the exact
same part of the code and the change was to reduce the memory pressure
significantly.
On Mon, Jul 16, 2018 at 1:43 PM, Bryan Jeffrey
wrote:
> Hello.
>
> I
Serga, can you explain a bit more why you want this ability?
If the node is really bad, wouldn't you want to decomission the NM entirely?
If you've got heterogeneous resources, then node labels seem like they would
be more appropriate -- and I don't feel great about adding workarounds for
the node-la
Severity: Important
Vendor: The Apache Software Foundation
Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1
Description:
When using PySpark, it's possible for a different local user to connect to
the Spark application and imperson
I received some questions about what the exact change was which fixed the
issue, and the PMC decided to post info in jira to make it easier for the
community to track. The relevant details are all on
https://issues.apache.org/jira/browse/SPARK-26802
On Mon, Jan 28, 2019 at 1:08 PM Imran Rashid
We haven't seen many of these, but we have seen it a couple of times --
there is ongoing work under SPARK-26089 to address the issue we know about,
namely that we don't detect corruption in large shuffle blocks.
Do you believe the cases you have match that -- does it appear to be
corruption in lar
Severity: Important
Vendor: The Apache Software Foundation
Versions affected:
All Spark 1.x, Spark 2.0.x, Spark 2.1.x, and 2.2.x versions
Spark 2.3.0 to 2.3.2
Description:
Prior to Spark 2.3.3, in certain situations Spark would write user data to
local disk unencrypted, even if spark.io.encryp
I think you should also just be able to provide an input format that never
splits the input data. This has come up before on the list, but I couldn't
find it.*
I think this should work, but I can't try it out at the moment. Can you
please try and let us know if it works?
class TextFormatNoSplit
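A sketch of what that class might look like, assuming the old `org.apache.hadoop.mapred` API (untested, and it needs the Hadoop jars on the classpath):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.TextInputFormat

// Returning false tells the input format never to split a file,
// so each input file becomes exactly one partition.
class TextFormatNoSplit extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}
```

You would then pass this class to `sc.hadoopFile` in place of the usual `TextInputFormat`.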
Spark can definitely process data with optional fields. It kinda depends
on what you want to do with the results -- it's more of an object design /
knowing scala types question.
Eg., scala has a built in type Option specifically for handling optional
data, which works nicely in pattern matching & f
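For example (plain Scala, no Spark needed), an Option field in a case class handles records where the field may be absent:

```scala
// A record with an optional field, handled via pattern matching.
case class Record(name: String, age: Option[Int])

val records = Seq(Record("a", Some(30)), Record("b", None))
val described = records.map {
  case Record(n, Some(age)) => s"$n is $age"
  case Record(n, None)      => s"$n has no age"
}
assert(described == Seq("a is 30", "b has no age"))
```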
I'm not sure about this, but I suspect the answer is: spark doesn't
guarantee a stable sort, nor does it plan to in the future, so the
implementation has more flexibility.
But you might be interested in the work being done on secondary sort, which
could give you the guarantees you want:
https://i
I'm not an expert on streaming, but I think you can't do anything like this
right now. It seems like a very sensible use case, though, so I've created
a jira for it:
https://issues.apache.org/jira/browse/SPARK-5467
On Wed, Jan 28, 2015 at 8:54 AM, YaoPau wrote:
> The TwitterPopularTags example
Michael,
you are right, there is definitely some limit at 2GB. Here is a trivial
example to demonstrate it:
import org.apache.spark.storage.StorageLevel
val d = sc.parallelize(1 to 1e6.toInt, 1).map{i => new
Array[Byte](5e3.toInt)}.persist(StorageLevel.DISK_ONLY)
d.count()
It gives the same err
require us to divide the
> transfer of a very large block into multiple smaller blocks.
>
>
>
> On Tue, Feb 3, 2015 at 3:00 PM, Imran Rashid wrote:
>
>> Michael,
>>
>> you are right, there is definitely some limit at 2GB. Here is a trivial
>> example to dem
d to run some of our jobs on it ... But
> that is forked off 1.1 actually).
>
> Regards
> Mridul
>
>
> On Tuesday, February 3, 2015, Imran Rashid wrote:
>
>> Thanks for the explanations, makes sense. For the record looks like this
>> was worked on a while
I think you are interested in secondary sort, which is still being worked
on:
https://issues.apache.org/jira/browse/SPARK-3655
On Tue, Feb 3, 2015 at 4:41 PM, Nitin kak wrote:
> I thought thats what sort based shuffled did, sort the keys going to the
> same partition.
>
> I have tried (c1, c2)
Many operations in Spark are lazy -- most likely your collect() statement
is actually forcing evaluation of several steps earlier in the pipeline.
The logs & the UI might give you some info about all the stages that are
being run when you get to collect().
I think collect() is just fine if you ar
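The same laziness can be seen with a plain Scala view, as a rough analogy for an RDD pipeline that only runs when an action like collect() forces it:

```scala
var evaluated = 0

// Like RDD transformations, a view's map does no work when it is defined.
val pipeline = (1 to 5).view.map { x => evaluated += 1; x * 2 }
assert(evaluated == 0) // nothing has run yet

// Forcing the view is the analogue of calling collect().
val result = pipeline.toList
assert(evaluated == 5)
assert(result == List(2, 4, 6, 8, 10))
```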
Hi Michael,
judging from the logs, it seems that those tasks are just working a really
long time. If you have long running tasks, then you wouldn't expect the
driver to output anything while those tasks are working.
What is unusual is that there is no activity during all that time the tasks
are
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
evaluation of RDDs. After you do your partitionBy, are you sure those RDDs
are actually materialized & cached somewhere? eg., if you just did
You need to import the implicit conversions to PairRDDFunctions with
import org.apache.spark.SparkContext._
(note that this requirement will go away in 1.3:
https://issues.apache.org/jira/browse/SPARK-4397)
On Thu, Feb 12, 2015 at 9:36 AM, Vladimir Protsenko
wrote:
> Hi. I am stuck with how to
) does a shuffle read of about
> 1GB in size, though.
>
> The getPartitions-method does not exist on the resulting RDD (I am using
> the Python API). There is however foreachPartition(). What is the line
>
> joinedRdd.getPartitions.foreach{println}
>
> supposed to do?
>
I wonder if the issue is that these lines just need to add
preservesPartitioning = true
?
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38
I am getting the feeling this is an issue w/ pyspark
On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote:
> ah, sorry I am not
The important thing here is the master's memory, that's where you're
getting the GC overhead limit. The master is updating its UI to include
your finished app when your app finishes, which would cause a spike in
memory usage.
I wouldn't expect the master to need a ton of memory just to serve the
>>> The feature works as expected in Scala/Java, but not implemented in
>>> Python.
>>>
>>> On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid
>>> wrote:
>>>
>>>> I wonder if the issue is that these lines just need to add
>>>> pr
unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476
as Sean suggested, you need to think of some other way of doing the same
thing, even if it's just breaking your one big broadcast var into a few
smaller ones
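A minimal sketch of the chunking step (the `sc.broadcast(chunk)` calls themselves are Spark-side and omitted here; the chunk size of 4 is arbitrary):

```scala
// Split one large lookup table into smaller pieces, each of which could be
// broadcast separately instead of as one giant broadcast variable.
val bigTable: Array[Int] = (1 to 10).toArray
val chunks: List[Array[Int]] = bigTable.grouped(4).toList
assert(chunks.length == 3)
assert(chunks.head.sameElements(Array(1, 2, 3, 4)))
```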
On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen wrote:
>
this is more-or-less the best you can do now, but as has been pointed out,
accumulators don't quite fit the bill for counters. There is an open issue
to do something better, but no progress on that so far
https://issues.apache.org/jira/browse/SPARK-603
On Fri, Feb 13, 2015 at 11:12 AM, Mark Hams
(trying to repost to the list w/out URLs -- rejected as spam earlier)
Hi,
Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable. Here are some more scalable
alternatives:
1. Approximate solutions
1a. Sample the data. Just sample s
Hi Emre,
there shouldn't be any difference in which files get processed w/ print()
vs. foreachRDD(). In fact, if you look at the definition of print(), it is
just calling foreachRDD() underneath. So there is something else going on
here.
We need a little more information to figure out exactly w
Hi Darin,
When you say you "see 400GB of shuffle writes" from the first code snippet,
what do you mean? There is no action in that first set, so it won't do
anything. By itself, it won't do any shuffle writing, or anything else for
that matter.
Most likely, the .count() on your second code snip
e splittable). In reality, that's
> what I would really want to do in the first place.
>
> Thanks again for your insights.
>
> Darin.
>
> --
> *From:* Imran Rashid
> *To:* Darin McBeath
> *Cc:* User
> *Sent:* Tuesday, February 1
a JavaRDD is just a wrapper around a normal RDD defined in scala, which is
stored in the "rdd" field. You can access everything that way. The
JavaRDD wrappers just provide some interfaces that are a bit easier to work
with in Java.
If this is at all convincing, here's me demonstrating it inside
(that implements the core
> functionality). I've also put the relevant methods from the my utility
> classes for completeness.
>
> I am as perplexed as you are as to why forcing the output via foreachRDD
> ended up in different behaviour compared to simply using print() meth
Hi Tom,
there are a couple of things you can do here to make this more efficient.
First, I think you can replace your self-join with a groupByKey. On your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
This reduces the amount of data that needs to be shuffled, and tha
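The grouping itself, demonstrated on the example pairs with plain Scala collections:

```scala
// Grouping (1,2), (1,3), (4,3) by key gives each key's values once,
// which is the shape groupByKey would produce.
val edges = Seq((1, 2), (1, 3), (4, 3))
val grouped = edges.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
assert(grouped == Map(1 -> Seq(2, 3), 4 -> Seq(3)))
```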
Hi Joe,
The issue is not that you have input partitions that are bigger than 2GB --
it's just that they are getting cached. You can see in the stack trace, the
problem is when you try to read data out of the DiskStore:
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
Also, just b
«size of expanded file» is actually the size of all
> concatenated input files (probably about 800 GB)? In that case should I
> multiply it by the number of files? Or perhaps I'm barking up completely
> the wrong tree.
>
> Joe
>
>
>
>
> On 19 February 2015 at 14:44,
almost all your data is going to one task. You can see that the shuffle
read for task 0 is 153.3 KB, and for most other tasks it's just 26B (which
is probably just some header saying there are no actual records). You need
to ensure your data is more evenly distributed before this step.
On Thu, Fe
if you have duplicate values for a key, join creates all pairs. E.g., if you
have 2 values for key X in rdd A & 2 values for key X in rdd B, then A.join(B)
will have 4 records for key X
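The all-pairs behavior can be sketched with plain Scala collections (a join is essentially this filtered cross product, per key):

```scala
// 2 values for key X on each side => 2 * 2 = 4 joined records.
val a = Seq("X" -> 1, "X" -> 2)
val b = Seq("X" -> 10, "X" -> 20)
val joined = for {
  (ka, va) <- a
  (kb, vb) <- b
  if ka == kb
} yield (ka, (va, vb))
assert(joined.size == 4)
assert(joined.toSet == Set(("X", (1, 10)), ("X", (1, 20)), ("X", (2, 10)), ("X", (2, 20))))
```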
On Thu, Feb 19, 2015 at 3:39 PM, Darin McBeath
wrote:
> Consider the following left outer join
>
> potentialDailyMod
the more scalable alternative is to do a join (or a variant like cogroup,
leftOuterJoin, subtractByKey etc. found in PairRDDFunctions)
the downside is this requires a shuffle of both your RDDs
On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary
wrote:
> Hi,
>
> I have two RDD's with csv data as b
The error msg is telling you the exact problem, it can't find
"ProgramSIM", the thing you are trying to run
Lost task 3520.3 in stage 0.0 (TID 11, compute3.research.dev):
java.io.IOException: Cannot run program "ProgramSIM": error=2, No such file or directory
On Thu, Feb 19, 2015 at 5:52 PM, a
yeah, this is just the totally normal message when spark executes
something. The first time something is run, all of its tasks are
"missing". I would not worry about cases when all tasks aren't "missing"
if you're new to spark, its probably an advanced concept that you don't
care about. (and wou
sortByKey() is probably the easiest way:
import org.apache.spark.SparkContext._
joinedRdd.map{case(word, (file1Counts, file2Counts)) => (file1Counts,
(word, file1Counts, file2Counts))}.sortByKey()
On Mon, Feb 23, 2015 at 10:41 AM, Anupama Joshi
wrote:
> Hi ,
> To simplify my problem -
> I
I think you're getting tripped up by lazy evaluation and the way stage
boundaries work (admittedly it's pretty confusing in this case).
It is true that up until recently, if you unioned two RDDs with the same
partitioner, the result did not have the same partitioner. But that was
just fixed here:
htt
the spark history server and the yarn history server are totally
independent. Spark knows nothing about yarn logs, and vice versa, so
unfortunately there isn't any way to get all the info in one place.
On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams
wrote:
> Looks like in my tired stat
Hi Yiannis,
Broadcast variables are meant for *immutable* data. They are not meant for
data structures that you intend to update. (It might *happen* to work when
running local mode, though I doubt it, and it would probably be a bug if it
did. It will certainly not work when running on a cluster
Hi Tristan,
at first I thought you were just hitting another instance of
https://issues.apache.org/jira/browse/SPARK-1391, but I actually think it's
entirely related to kryo. Would it be possible for you to try serializing
your object using kryo, without involving spark at all? If you are
unfamil
Hi Tushar,
The most scalable option is probably for you to consider doing some
approximation. Eg., sample the data first to come up with the bucket
boundaries. Then you can assign data points to buckets without needing to
do a full groupByKey. You could even have more passes which corrects any
error
any chance your input RDD is being read from hdfs, and you are running into
this issue (in the docs on SparkContext#hadoopFile):
* '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
object for each
* record, directly caching the returned RDD or directly passing it to an
aggr
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
// or whatever persistence makes more sense for you ...
var done = false
while (!done) {
  val res = grouped.flatMap(F)
  res.collect.foreach(func)
  done = criteria
}
On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan
wrote:
> H
Hi Yong,
mostly correct except for:
>
>- Since we are doing reduceByKey, shuffling will happen. Data will be
>shuffled into 1000 partitions, as we have 1000 unique keys.
>
> no, you will not get 1000 partitions. Spark has to decide how many
partitions to use before it even knows how many
no, it does not give you transitive dependencies. You'd have to walk the
tree of dependencies yourself, but that should just be a few lines.
On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet wrote:
> I see the "rdd.dependencies()" function, does that include ALL the
> dependencies of an RDD? Is it s
Why would you want to use spark to sequentially process your entire data
set? The entire purpose is to let you do distributed processing -- which
means letting partitions get processed simultaneously by different cores /
nodes.
that being said, occasionally in a bigger pipeline with a lot of
dist
the scala syntax for arrays is Array[T], not T[], so you want to use
something like:
kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]])
kryo.register(classOf[Array[Short]])
nonetheless, Spark should take care of this itself. I'll look into it
later today.
On Mon, Mar 2, 2015
This doesn't involve spark at all, I think this is entirely an issue with
how scala deals w/ primitives and boxing. Often it can hide the details
for you, but IMO it just leads to far more confusing errors when things
don't work out. The issue here is that your map has value type Any, which
leads
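A small illustration of the boxing (plain Scala, no Spark involved): with value type Any, the primitive comes back boxed and needs a cast or a match:

```scala
// The Int value is boxed to java.lang.Integer once the map's value type is Any.
val m: Map[String, Any] = Map("count" -> 1, "name" -> "spark")
val v: Any = m("count")
assert(v.isInstanceOf[Int]) // Scala papers over the boxed representation

// Pattern matching is the safe way to recover the primitive.
val extracted = m("count") match {
  case i: Int => i
  case _      => 0
}
assert(extracted == 1)
```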
You can set the number of partitions dynamically -- it's just a parameter to
a method, so you can compute it however you want, it doesn't need to be
some static constant:
val dataSizeEstimate = yourMagicFunctionToEstimateDataSize()
val numberOfPartitions =
yourConversionFromDataSizeToNumPartitions(
It should be *possible* to do what you want ... but if I understand you
right, there isn't really any very easy way to do it. I think you would
need to write your own subclass of RDD, which has its own logic on how the
input files get put divided among partitions. You can probably subclass
Hadoop
I am not entirely sure I understand your question -- are you saying:
* scoring a sample of 50k events is fast
* taking the top N scores of 77M events is slow, no matter what N is
?
if so, this shouldn't come as a huge surprise. You can't find the top
scoring elements (no matter how small N is)
this is a very interesting use case. First of all, it's worth pointing out
that if you really need to process the data sequentially, fundamentally you
are limiting the parallelism you can get. Eg., if you need to process the
entire data set sequentially, then you can't get any parallelism. If you
ngs don't break. I want
> to benefit from the MapOutputTracker fix in 1.2.0.
>
> On Tue, Mar 3, 2015 at 5:41 AM, Imran Rashid wrote:
>
>> the scala syntax for arrays is Array[T], not T[], so you want to use
>> something:
>>
>> kryo.register(classOf[Array[o
is your data skewed? Could it be that there are a few keys with a huge
number of records? You might consider outputting
(recordA, count)
(recordB, count)
instead of
recordA
recordA
recordA
...
you could do this with:
val input = sc.textFile(...)
val pairsCounts = input.map{ x => (x, 1) }.reduceByKey{ _ + _ }
did you forget to specify the main class w/ "--class Main"? though if that
was it, you should at least see *some* error message, so I'm confused
myself ...
On Wed, Mar 11, 2015 at 6:53 AM, Aung Kyaw Htet wrote:
> Hi Everyone,
>
> I am developing a scala app, in which the main object does not ca
Hi Jonathan,
you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
(not yet available) and https://github.com/tresata/spark-sorted (not part
of spark, but it is available right now). Hopefully that's what you are
looking for. To the best of my knowledge that covers what is a
lyCompressedMapStatus])
>
> If I don't register it, I get a runtime error saying that it needs to be
> registered (the error is only when I turn on kryo).
>
> However the code is running smoothly with kryo turned off.
>
> On Wed, Mar 11, 2015 at 5:38 PM, Imran Rashid
>
if I try to fake/enforce the partition in my own way.
>
> Regards,
>
> Shuai
>
> On Wed, Mar 11, 2015 at 8:09 PM, Imran Rashid
> wrote:
>
>> It should be *possible* to do what you want ... but if I understand you
>> right, there isn't really any very easy way t
Hi Shuai,
On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng wrote:
> Sorry I response late.
>
> Zhan Zhang's solution is very interesting and I look at into it, but it is
> not what I want. Basically I want to run the job sequentially and also gain
> parallelism. So if possible, if I have 1000 parti
p my own MyGroupingRDD class? I am
> not very clear how to do that, any place I can find an example? I never
> create my own RDD class before (not RDD instance J). But this is very
> valuable approach to me so I am desired to learn.
>
>
>
> Regards,
>
>
>
> Shuai
&g
I'm not super familiar w/ S3, but I think the issue is that you want to use
a different output committer with "object" stores, which don't have a
simple move operation. There have been a few other threads on S3 &
outputcommitters. I think the most relevant for you is most probably this
open JIRA:
Interesting, on another thread, I was just arguing that the user should
*not* open the files themselves and read them, b/c then they lose all the
other goodies we have in HadoopRDD, eg. the metric tracking.
I think this encourages Pat's argument that we might actually need better
support for this
Hi Thomas,
sorry for such a late reply. I don't have any super-useful advice, but
this seems like something that is important to follow up on. to answer
your immediate question, No, there should not be any hard limit to the
number of tasks that MapOutputTracker can handle. Though of course as
t
Hi Yong,
yes I think your analysis is correct. I'd imagine almost all serializers
out there will just convert a string to its utf-8 representation. You
might be interested in adding compression on top of a serializer, which
would probably bring the string size down in almost all cases, but then
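A quick sketch of the effect with stock java.util.zip (a highly redundant string compresses dramatically; real savings depend on the data):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Serialize a string to its UTF-8 bytes, then compress the bytes with gzip.
val s = "spark " * 1000
val raw = s.getBytes("UTF-8")
val bos = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(bos)
gz.write(raw)
gz.close()
val compressed = bos.toByteArray
assert(compressed.length < raw.length)
```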
I think you are running into a combo of
https://issues.apache.org/jira/browse/SPARK-5928
and
https://issues.apache.org/jira/browse/SPARK-5945
The standard solution is to just increase the number of partitions you are
creating. textFile(), reduceByKey(), and sortByKey() all take an optional
second
I think you should see some other errors before that, from
NettyBlockTransferService, with a msg like "Exception while beginning
fetchBlocks". There might be a bit more information there. There are an
assortment of possible causes, but first let's just make sure you have all
the details from the o
I think writing to hdfs and reading it back again is totally reasonable.
In fact, in my experience, writing to hdfs and reading back in actually
gives you a good opportunity to handle some other issues as well:
a) instead of just writing as an object file, I've found it's helpful to
write in a form
you also need to register *array*s of MyObject. so change:
conf.registerKryoClasses(Array(classOf[MyObject]))
to
conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))
On Wed, Mar 25, 2015 at 2:44 AM, donhoff_h <165612...@qq.com> wrote:
> Hi, experts
>
> I wrote a very
On the worker side, it all happens in Executor. The task result is
computed here:
https://github.com/apache/spark/blob/b45059d0d7809a986ba07a447deb71f11ec6afe4/core/src/main/scala/org/apache/spark/executor/Executor.scala#L210
then it's serialized along with some other goodies, and finally sent ba
yes, it sounds like a good use of an accumulator to me
val counts = sc.accumulator(0L)
rdd.map{x =>
counts += 1
x
}.saveAsObjectFile(file2)
On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
> Sean
>
>
>
> Yes I know that I can use persist() to
Those funny class names come from scala's specialization -- it's compiling a
different version of OpenHashMap for each primitive you stick in the type
parameter. Here's a super simple example:
➜ ~ more Foo.scala
class Foo[@specialized X]
➜ ~ scalac Foo.scala
➜ ~ ls Foo*.cl
broadcast variables count towards "spark.storage.memoryFraction", so they
use the same "pool" of memory as cached RDDs.
That being said, I'm really not sure why you are running into problems, it
seems like you have plenty of memory available. Most likely it's got
nothing to do with broadcast varia
Hi Robert,
A lot of task metrics are already available for individual tasks. You can
get these programmatically by registering a SparkListener, and you can also
view them in the UI. Eg., for each task, you can see runtime,
serialization time, amount of shuffle data read, etc. I'm working on als
Hi twinkle,
To be completely honest, I'm not sure, I had never heard "spark.task.cpus"
before. But I could imagine two different use cases:
a) instead of just relying on spark's creation of tasks for parallelism, a
user wants to run multiple threads *within* a task. This is sort of going
agains
Interesting, my gut instinct is the same as Sean's. I'd suggest debugging
this in plain old scala first, without involving spark. Even just in the
scala shell, create one of your Array[T], try calling .toSet and calling
.distinct. If those aren't the same, then it's got nothing to do with
Spark.
Shuffle write could be a good indication of skew, but it looks like the
task in question hasn't generated any shuffle write yet, because it's still
working on the shuffle-read side. So I wouldn't read too much into the
fact that the shuffle write is 0 for a task that is still running.
The shuffle
Hi Arun,
It can be hard to use kryo with required registration because of issues
like this -- there isn't a good way to register all the classes that you
need transitively. In this case, it looks like one of your classes has a
reference to a ClassTag, which in turn has a reference to some anonymo
HI Shuai,
I don't think this is a bug with kryo, it's just a subtlety with how kryo
works. I *think* that it would also work if you changed your
PropertiesUtil class to either (a) remove the no-arg constructor or (b)
instead of extending properties, you make it a contained member variable.
I wish
(+dev)
Hi Justin,
short answer: no, there is no way to do that.
I'm just guessing here, but I imagine this was done to eliminate
serialization problems (eg., what if we got an error trying to serialize
the user exception to send from the executors back to the driver?).
Though, actually that isn'
ot resolve symbol ClassTag$$anon$1
>
> Hence I am not any closer to making this work. If you have any further
> suggestions, they would be most welcome.
>
> arun
>
>
> On Tue, Apr 14, 2015 at 2:33 PM, Imran Rashid
> wrote:
>
>> Hi Arun,
>>
>> It can
this is a really strange exception ... I'm especially surprised that it
doesn't work w/ java serialization. Do you think you could try to boil it
down to a minimal example?
On Wed, Apr 15, 2015 at 8:58 AM, Jeetendra Gangele
wrote:
> Yes Without Kryo it did work out.when I remove kryo registrati
list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-td20034.html
>
> Can you please suggest any work around I am broad casting HashMap return
> from RDD.collectasMap().
>
> On 15 April 2015 at 19:33, Imran Ra
it's the latter -- after Spark gets to the end of the iterator (or if it
hits an exception)
so your example is good, that is exactly what it is intended for.
On Fri, Apr 17, 2015 at 12:23 PM, Akshat Aranya wrote:
> Hi,
>
> I'm trying to figure out when TaskCompletionListeners are called -- are
>
rator(rdd3.partitions(1), context)));
> 1
> }
>
> I was wondering if you had any ideas on what I am doing wrong, or how I
> can properly send the serialized version of the RDD and function to my
> other program. My thought is that I might need to add more jars to the
> build pa
if you can store the entire sample for one partition in memory, I think you
just want:
val sample1 = rdd.sample(true, 0.01, 42)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)
val sample2 = rdd.sample(true, 0.01, 43)
  .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)
...
On Fri, Apr 17, 2015 at 3:05 AM, Aurélien
are you calling sc.stop() at the end of your applications?
The history server only displays completed applications, but if you don't
call sc.stop(), it doesn't know that those applications have been stopped.
Note that in spark 1.3, the history server can also display running
applications (includi