ok so let me try again ;-)
I don't think that the page size calculation matters apart from hitting the
allocation limit earlier if the page size is too large.
If a task is going to need X bytes, it is going to need X bytes. In this
case, for at least one of the tasks, X > maxMemory / numActiveTasks.
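To make that concrete with assumed numbers: if the shuffle pool has
maxMemory = 128MB and 32 active tasks, each task is capped at
128MB / 32 = 4MB, so a single 4MB (4194304-byte) page request already
exhausts a task's entire allowance, and the next request fails.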
Can you paste the entire stacktrace of the error? In your original email
you only included the last function call.
Maybe I'm missing something here, but I still think the bad heuristic is
the issue.
Some operators pre-reserve memory before running anything in order to avoid
starvation. For examp
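A hedged sketch of that pre-reservation pattern (everything here is
hypothetical, not Spark's actual API):

// Sketch only: reserve a page's worth of memory up front, before touching
// any input, so a task that starts is guaranteed room to work.
def runOperator(tryToAcquire: Long => Long, input: Iterator[Int]): Int = {
  val pageSizeBytes = 4L * 1024 * 1024
  if (tryToAcquire(pageSizeBytes) < pageSizeBytes)  // fail fast, before any work
    throw new java.io.IOException(s"Unable to acquire $pageSizeBytes bytes of memory")
  input.sum  // stand-in for the real per-row processing
}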
I'm not sure what we can do here. Nested RDDs are a pain to implement,
support, and explain. The programming model is not well explored.
Maybe a UDAF interface that allows going through the data twice?
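To illustrate the idea (a purely hypothetical interface; nothing like this
exists in Spark today):

trait TwoPassAggregate[IN, S1, S2, OUT] {
  def zero1: S1
  def update1(s: S1, in: IN): S1   // first scan over the data
  def zero2(firstPass: S1): S2     // carry the pass-1 result forward
  def update2(s: S2, in: IN): S2   // second scan
  def finish(s: S2): OUT
}

// Exact variance is the classic two-pass case: pass 1 finds the mean,
// pass 2 sums squared deviations from it.
object Variance extends TwoPassAggregate[Double, (Double, Long), (Double, Double, Long), Double] {
  def zero1 = (0.0, 0L)
  def update1(s: (Double, Long), x: Double) = (s._1 + x, s._2 + 1)
  def zero2(p: (Double, Long)) = (p._1 / p._2, 0.0, p._2)
  def update2(s: (Double, Double, Long), x: Double) =
    (s._1, s._2 + (x - s._1) * (x - s._1), s._3)
  def finish(s: (Double, Double, Long)) = s._2 / s._3
}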
On Mon, Sep 14, 2015 at 4:36 PM, sim wrote:
> I'd like to get some feedback on an API design
I see what you are saying. Full stack trace:
java.io.IOException: Unable to acquire 4194304 bytes of memory
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS
I agree that this is an issue, but I am afraid supporting RDD nesting
would be hard and would perhaps require rearchitecting Spark. For now, you
may want to use workarounds like storing each group in a separate file,
processing each file as a separate RDD, and finally merging the results
into a single RDD.
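A minimal sketch of that workaround (paths and element types assumed; keys
must be safe to use in file names):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Write each group to its own file, re-read each file as a separate RDD,
// then merge the per-group results at the end.
def perGroupWorkaround(sc: SparkContext, rdd: RDD[(String, Int)], dir: String): RDD[Int] = {
  val keys = rdd.keys.distinct().collect()
  keys.foreach { k =>
    rdd.filter(_._1 == k).values.saveAsObjectFile(s"$dir/$k")  // one dir per group
  }
  sc.union(keys.map(k => sc.objectFile[Int](s"$dir/$k")))      // merge back
}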
I know it's painf
Hi,
That reminds me of a previous discussion about splitting an RDD into
several RDDs:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-split-into-multiple-RDDs-td11877.html
There you can see simple code to convert an RDD[(K, V)] into a
Map[K, RDD[V]] through several filters. On top of t
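Condensed, the approach from that thread looks roughly like this (note it
makes one pass over the data per distinct key):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Split an RDD[(K, V)] into Map[K, RDD[V]] with one filter per key.
def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  val keys = rdd.keys.distinct().collect()
  keys.map(k => k -> rdd.filter { case (key, _) => key == k }.values).toMap
}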
I'm not sure the problem is quite as bad as you state. Both sampleByKey and
sampleByKeyExact are implemented using a function from
StratifiedSamplingUtils which does one of two things depending on whether
the exact implementation is needed. The exact version requires double the
number of lines of c
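For reference, the two entry points being compared (the RDD pairs and the
per-key fractions here are assumed):

// pairs: RDD[(String, Double)] (assumed)
val fractions = Map("a" -> 0.1, "b" -> 0.5)
// one pass, per-key random sampling, counts correct only in expectation
val approx = pairs.sampleByKey(withReplacement = false, fractions)
// additional pass(es) to hit the expected sample sizes exactly
val exact = pairs.sampleByKeyExact(withReplacement = false, fractions)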
good morning, denizens of the aether!
your hard working build system (and some associated infrastructure)
has been in need of some updates and housecleaning for quite a while
now. we will be splitting the maintenance over two mornings to
minimize impact.
here's the plan:
7am-9am wednesday, 9-24
so forcing the ShuffleMemoryManager to assume 32 cores, and therefore
calculate a page size of 1MB, makes the tests pass.
How can we determine the correct value to use in getPageSize rather than
Runtime.getRuntime.availableProcessors()?
On 16 September 2015 at 10:17, Pete Robbins wrote:
> I see what you are saying. Full stack trace:
Thanks Shane and Jon for the heads up.
On Wednesday, September 16, 2015, shane knapp wrote:
> good morning, denizens of the aether!
>
> your hard working build system (and some associated infrastructure)
> has been in need of some updates and housecleaning for quite a while
> now. we will be splitting the maintenance over two mornings to
> minimize impact.
SparkEnv for the driver was created in SparkContext. The default
parallelism field is set to the number of slots (max number of active
tasks). Maybe we can just use the default parallelism to compute that in
local mode.
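A hedged sketch of that suggestion (the constants and the power-of-two
helper are assumed, mirroring the existing heuristic; not an actual patch):

// Size pages from the number of task slots rather than physical cores,
// falling back to availableProcessors when no slot count is known.
def nextPowerOfTwo(n: Long): Long =
  if (n <= 1) 1L else java.lang.Long.highestOneBit(n - 1) << 1

def pageSizeFor(maxMemory: Long, slotCount: Option[Int]): Long = {
  val minPage = 1L << 20   // 1MB floor
  val maxPage = 64L << 20  // 64MB ceiling
  val slots = slotCount.getOrElse(Runtime.getRuntime.availableProcessors())
  val safetyFactor = 16    // assumed safety factor
  math.min(maxPage, math.max(minPage, nextPowerOfTwo(maxMemory / slots / safetyFactor)))
}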
On Wednesday, September 16, 2015, Pete Robbins wrote:
> so forcing the ShuffleMemoryManager to assume 32 cores, and therefore
> calculate a page size of 1MB, makes the tests pass.
SparkR streaming is mentioned around page 17 of the PDF below; can anyone
share the source code? (I could not find it on GitHub.)
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
Thanks,
Renyi.
You should reach out to the speakers directly.
On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote:
> SparkR streaming is mentioned around page 17 of the PDF below; can anyone
> share the source code? (I could not find it on GitHub.)
>
>
>
> https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf
I think Hao posted a link to the source code in the description of
https://issues.apache.org/jira/browse/SPARK-6803
On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin wrote:
> You should reach out to the speakers directly.
>
>
> On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote:
>>
>> SparkR streaming is mentioned around page 17 of the PDF below; can
>> anyone share the source code?
> 630am-10am thursday, 9-24-15:
> * jenkins update to 1.629 (we're a few months behind in versions, and
> some big bugs have been fixed)
> * jenkins master and worker system package updates
> * all systems get a reboot (lots of hanging java processes have been
> building up over the months)
> * bu
got it, thanks a lot!
On Wed, Sep 16, 2015 at 10:14 AM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu> wrote:
> I think Hao posted a link to the source code in the description of
> https://issues.apache.org/jira/browse/SPARK-6803
>
> On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin wrote:
> >
How do executors communicate with the driver in Spark? I understand that
it's done using Akka actors and messages are exchanged as
CoarseGrainedSchedulerMessage, but I'd really appreciate it if someone
could explain the entire process in a bit more detail.
Hi,
I want to do a temporal join operation on a DStream across RDDs. My
question is: are RDDs from the same DStream always computed on the same
worker (except during failover)?
thanks,
Renyi.
This is "expected" in the sense that DataFrame operations can get
re-ordered under the hood by the optimizer. For example, if the optimizer
deems it is cheaper to apply the 2nd filter first, it might re-arrange the
filters. In reality, it doesn't do that. I think this is too confusing and
violates
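To picture the reordering being described (a sketch; df with a numeric
column x, and the UDF body, are stand-ins):

import org.apache.spark.sql.functions.{col, udf}

val slowCheck = udf((x: Long) => x % 7 == 0)  // stand-in for an expensive predicate
val result = df.filter(col("x") > 10).filter(slowCheck(col("x")))
// Logically the two filters form one conjunction, so the optimizer may
// evaluate the predicates in either order; nothing guarantees slowCheck
// only ever sees rows with x > 10.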
I've tended to use Strings. Params can be created with a validator
(isValid) which can ensure users get an immediate error if they try to pass
an unsupported String. Not as nice as compile-time errors, but easier on
the APIs.
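A minimal sketch of that pattern (the param name and values are made up):

import org.apache.spark.ml.param.{Param, ParamValidators}

// Inside a class that mixes in Params: reject unsupported strings at set time.
final val supportedModelTypes: Array[String] = Array("multinomial", "bernoulli")
final val modelType: Param[String] = new Param[String](this, "modelType",
  s"model type, one of: ${supportedModelTypes.mkString(", ")}",
  ParamValidators.inArray(supportedModelTypes))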
On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang wrote:
> We usually w
Hi Joseph,
Strings sounds reasonable. However, there is no StringParam (only
StringArrayParam). Should I create a new param type? Also, how can the
user get all possible values of a String parameter?
Best regards, Alexander
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Wednesday, Sep
There was a long thread about enums initiated by Xiangrui several months
back in which the final consensus was to use Java enums. Is that
discussion (/decision) applicable here?
2015-09-16 17:43 GMT-07:00 Ulanov, Alexander :
> Hi Joseph,
>
>
>
> Strings sounds reasonable. However, there is no StringParam (only
> StringArrayParam).
@Alexander It's worked for us to use Param[String] directly. (I think
it's b/c String is exactly java.lang.String, rather than a Scala version of
it, so it's still Java-friendly.) In other classes, I've added a static
list (e.g., NaiveBayes.supportedModelTypes), though there isn't consistent
cov
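The static-list half of that pattern, sketched (hypothetical object; it
also answers the discoverability question above, since users can simply
enumerate the values):

object MyClassifier {
  final val supportedModelTypes: Array[String] = Array("multinomial", "bernoulli")
}

MyClassifier.supportedModelTypes.foreach(println)  // user-side discovery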
Just wanted to bring this email up again in case there were any thoughts.
Having all the information from the web UI accessible through a supported
json API is very important to us; are there any objections to us adding a v2
API to Spark?
Thanks!
From: Kevin Chen
Date: Friday, September 11, 2015
We actually hit a similar problem in a real case; see
https://issues.apache.org/jira/browse/SPARK-10474
After checking the source code, the external sort memory management
strategy seems to be the root cause of the issue.
Currently, we allocate the 4MB (page size) buffer up front at the beginni
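Roughly, the failure path looks like this (a simplification; the
shuffleMemoryManager handle and the per-task cap are assumed to behave as
in 1.5):

// Each sort task reserves a full page up front; with a per-task cap of
// maxMemory / numActiveTasks, a late-arriving task may be granted < 4MB.
val pageSizeBytes = 4L * 1024 * 1024
val granted = shuffleMemoryManager.tryToAcquire(pageSizeBytes)
if (granted < pageSizeBytes) {
  shuffleMemoryManager.release(granted)
  throw new java.io.IOException(s"Unable to acquire $pageSizeBytes bytes of memory")
}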
Do we need to increment the version number if it is just strict additions?
On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen wrote:
> Just wanted to bring this email up again in case there were any thoughts.
> Having all the information from the web UI accessible through a supported
> json API is very important to us; are there any objections to us adding
> a v2 API to Spark?