Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
OK, so let me try again ;-) I don't think the page size calculation matters, apart from hitting the allocation limit earlier if the page size is too large. If a task is going to need X bytes, it is going to need X bytes. In this case, for at least one of the tasks, X > maxMemory / numActiveTasks
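
Pete's point can be made concrete with a little arithmetic. Under a simplified model (an assumption here, not Spark's exact accounting) where each of N concurrently active tasks is capped at an equal share of the pool, a task that needs more than maxMemory / numActiveTasks fails no matter how pages are sized:

```python
def can_satisfy(task_bytes, max_memory, num_active_tasks):
    """Simplified model: each active task is capped at an equal share
    of the pool. If a task needs more than its share, no page-size
    choice can save it; smaller pages only delay the failure."""
    per_task_cap = max_memory // num_active_tasks
    return task_bytes <= per_task_cap

# With a (hypothetical) 128 MB pool and 32 active tasks, each cap is 4 MB.
assert can_satisfy(4 * 1024**2, 128 * 1024**2, 32)
assert not can_satisfy(5 * 1024**2, 128 * 1024**2, 32)
```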

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
Can you paste the entire stack trace of the error? In your original email you only included the last function call. Maybe I'm missing something here, but I still think the bad heuristic is the issue. Some operators pre-reserve memory before running anything in order to avoid starvation. For examp

Re: RDD API patterns

2015-09-16 Thread Reynold Xin
I'm not sure what we can do here. Nested RDDs are a pain to implement, support, and explain. The programming model is not well explored. Maybe a UDAF interface that allows going through the data twice? On Mon, Sep 14, 2015 at 4:36 PM, sim wrote: > I'd like to get some feedback on an API design
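
One way to read the "going through the data twice" idea is an aggregator that makes two sequential passes per group, with no nested RDDs involved. A minimal Spark-free sketch (plain Python lists stand in for grouped data; two-pass variance is just an example aggregation):

```python
def two_pass_variance(values):
    # Pass 1: compute the mean.
    n = len(values)
    mean = sum(values) / n
    # Pass 2: re-traverse the same data using the pass-1 result.
    return sum((v - mean) ** 2 for v in values) / n

groups = {"a": [1.0, 2.0, 3.0], "b": [4.0, 4.0]}
result = {k: two_pass_variance(vs) for k, vs in groups.items()}
# Population variance of [1, 2, 3] is 2/3; of [4, 4] it is 0.
```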

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
I see what you are saying. Full stack trace: java.io.IOException: Unable to acquire 4194304 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalS

Re: RDD API patterns

2015-09-16 Thread Aniket
I agree that this is an issue, but I'm afraid supporting RDD nesting would be hard and would perhaps need rearchitecting Spark. For now, you may want to use workarounds like storing each group in a separate file, processing each file as a separate RDD, and finally merging the results into a single RDD. I know its painf

Re: RDD API patterns

2015-09-16 Thread Juan Rodríguez Hortalá
Hi, That reminds me of a previous discussion about splitting an RDD into several RDDs: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-split-into-multiple-RDDs-td11877.html. There you can see simple code to convert an RDD[(K, V)] into a Map[K, RDD[V]] through several filters. On top of t
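
The filter-per-key approach from that thread can be sketched without Spark; here plain Python lists stand in for RDDs, and each key costs one full scan of the data, which is the pattern's main drawback:

```python
def split_by_key(pairs):
    """Turn a list of (k, v) pairs into {k: [v, ...]} using one
    filter (full scan) per key, mimicking RDD[(K, V)] -> Map[K, RDD[V]]."""
    keys = {k for k, _ in pairs}
    return {k: [v for k2, v in pairs if k2 == k] for k in keys}

data = [("a", 1), ("b", 2), ("a", 3)]
assert split_by_key(data) == {"a": [1, 3], "b": [2]}
```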

Re: RDD API patterns

2015-09-16 Thread robineast
I'm not sure the problem is quite as bad as you state. Both sampleByKey and sampleByKeyExact are implemented using a function from StratifiedSamplingUtils, which does one of two things depending on whether the exact implementation is needed. The exact version requires double the number of lines of c
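
The non-exact variant amounts to per-stratum Bernoulli sampling, which fits in a few lines; the exact variant needs extra bookkeeping to hit target counts precisely. A sketch in plain Python (hypothetical data and fractions, not the StratifiedSamplingUtils code):

```python
import random

def sample_by_key(pairs, fractions, seed=0):
    """Per-stratum Bernoulli sampling: keep each (k, v) with
    probability fractions[k]. An 'exact' variant would need extra
    work to hit the target counts precisely."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

data = [("a", i) for i in range(1000)] + [("b", i) for i in range(1000)]
sampled = sample_by_key(data, {"a": 0.1, "b": 0.5}, seed=42)
counts = {"a": sum(k == "a" for k, _ in sampled),
          "b": sum(k == "b" for k, _ in sampled)}
# counts is roughly {"a": ~100, "b": ~500} — Bernoulli, not exact.
```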

JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread shane knapp
good morning, denizens of the aether! your hard-working build system (and some associated infrastructure) has been in need of some updates and housecleaning for quite a while now. we will be splitting the maintenance over two mornings to minimize impact. here's the plan: 7am-9am wednesday, 9-23

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Pete Robbins
so forcing the ShuffleMemoryManager to assume 32 cores, and therefore calculate a page size of 1MB, passes the tests. How can we determine the correct value to use in getPageSize rather than Runtime.getRuntime.availableProcessors()? On 16 September 2015 at 10:17, Pete Robbins wrote: > I see what y

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread Reynold Xin
Thanks Shane and Jon for the heads up. On Wednesday, September 16, 2015, shane knapp wrote: > good morning, denizens of the aether! > > your hard working build system (and some associated infrastructure) > has been in need of some updates and housecleaning for quite a while > now. we will be sp

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
SparkEnv for the driver was created in SparkContext. The default parallelism field is set to the number of slots (max number of active tasks). Maybe we can just use the default parallelism to compute that in local mode. On Wednesday, September 16, 2015, Pete Robbins wrote: > so forcing the Shuff
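
For illustration, a heuristic of the shape being discussed might look like the following (the safety factor of 16, the 1MB/64MB clamping bounds, and the power-of-two rounding are all assumptions for this sketch, not Spark's actual getPageSize): divide the pool across the slots, round down to a power of two, and clamp.

```python
def page_size(max_memory, cores,
              min_page=1 * 1024**2, max_page=64 * 1024**2,
              safety_factor=16):
    """Illustrative page-size heuristic (assumed constants): give each
    core a fraction of the pool, round down to a power of two, clamp."""
    per_core = max_memory // (cores * safety_factor)
    size = 1 << max(per_core.bit_length() - 1, 0)  # largest power of two <= per_core
    return min(max(size, min_page), max_page)

# With these assumed constants and a 512 MB pool:
assert page_size(512 * 1024**2, 8) == 4 * 1024**2    # 8 slots  -> 4 MB pages
assert page_size(512 * 1024**2, 32) == 1 * 1024**2   # 32 slots -> 1 MB pages
```

Under these assumed constants, forcing 32 slots yields the 1MB pages Pete observed, which is why the choice of "cores" (availableProcessors vs. the actual number of task slots) matters so much.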

SparkR streaming source code

2015-09-16 Thread Renyi Xiong
SparkR streaming is mentioned around page 17 of the PDF below; can anyone share the source code? (I could not find it on GitHub) https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-19-Hao-Lin-Haichuan-Wang.pdf Thanks, Renyi.

Re: SparkR streaming source code

2015-09-16 Thread Reynold Xin
You should reach out to the speakers directly. On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote: > SparkR streaming is mentioned at about page 17 in below pdf, can anyone > share source code? (could not find it on GitHub) > > > > https://spark-summit.org/2015-east/wp-content/uploads/2015/03/S

Re: SparkR streaming source code

2015-09-16 Thread Shivaram Venkataraman
I think Hao posted a link to the source code in the description of https://issues.apache.org/jira/browse/SPARK-6803 On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin wrote: > You should reach out to the speakers directly. > > > On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote: >> >> SparkR streami

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread shane knapp
> 630am-10am thursday, 9-24-15: > * jenkins update to 1.629 (we're a few months behind in versions, and > some big bugs have been fixed) > * jenkins master and worker system package updates > * all systems get a reboot (lots of hanging java processes have been > building up over the months) > * bu

Re: SparkR streaming source code

2015-09-16 Thread Renyi Xiong
got it, thanks a lot! On Wed, Sep 16, 2015 at 10:14 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I think Hao posted a link to the source code in the description of > https://issues.apache.org/jira/browse/SPARK-6803 > > On Wed, Sep 16, 2015 at 10:06 AM, Reynold Xin wrote: > >

Communication between executors and drivers

2015-09-16 Thread Muhammad Haseeb Javed
How do executors communicate with the driver in Spark? I understand that it's done using Akka actors, and that messages are exchanged as CoarseGrainedSchedulerMessage, but I'd really appreciate it if someone could explain the entire process in a bit more detail.

Spark streaming DStream state on worker

2015-09-16 Thread Renyi Xiong
Hi, I want to do a temporal join operation on a DStream across RDDs. My question is: are RDDs from the same DStream always computed on the same worker (except on failover)? thanks, Renyi.

Re: And.eval short circuiting

2015-09-16 Thread Reynold Xin
This is "expected" in the sense that DataFrame operations can get re-ordered under the hood by the optimizer. For example, if the optimizer deems it is cheaper to apply the 2nd filter first, it might re-arrange the filters. In reality, it doesn't do that. I think this is too confusing and violates
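
The hazard is easy to reproduce outside Spark: if two filters can be reordered, a predicate the user guarded with an earlier filter may run on rows the guard would have removed (hypothetical data, plain Python standing in for DataFrame filters):

```python
rows = [0, 2, 50]

guard = lambda x: x != 0          # filter 1: drop zeros
pred = lambda x: 100 / x > 10     # filter 2: assumes filter 1 ran first

# User's order: the guard protects the division.
safe = [x for x in rows if guard(x) and pred(x)]
assert safe == [2]

# Reordered: pred runs on the unguarded row and raises.
try:
    [x for x in rows if pred(x) and guard(x)]
    raised = False
except ZeroDivisionError:
    raised = True
assert raised
```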

Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
I've tended to use Strings. Params can be created with a validator (isValid) which can ensure users get an immediate error if they try to pass an unsupported String. Not as nice as compile-time errors, but easier on the APIs. On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang wrote: > We usually w

RE: Enum parameter in ML

2015-09-16 Thread Ulanov, Alexander
Hi Joseph, Strings sound reasonable. However, there is no StringParam (only StringArrayParam). Should I create a new param type? Also, how can the user get all possible values of a String parameter? Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Wednesday, Sep

Re: Enum parameter in ML

2015-09-16 Thread Stephen Boesch
There was a long thread about enums initiated by Xiangrui several months back, in which the final consensus was to use Java enums. Is that discussion (/decision) applicable here? 2015-09-16 17:43 GMT-07:00 Ulanov, Alexander : > Hi Joseph, > > > > Strings sounds reasonable. However, there is no

Re: Enum parameter in ML

2015-09-16 Thread Joseph Bradley
@Alexander It's worked for us to use Param[String] directly. (I think it's because String is exactly java.lang.String, rather than a Scala version of it, so it's still Java-friendly.) In other classes, I've added a static list (e.g., NaiveBayes.supportedModelTypes), though there isn't consistent cov
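
The pattern Joseph describes can be sketched generically (hypothetical names; this is not the spark.ml Param API): a string parameter that carries its supported values and validates at set time, so an unsupported value fails immediately rather than at fit time.

```python
class StringParam:
    """Illustrative string parameter with an isValid-style check and
    a user-discoverable list of supported values."""

    def __init__(self, name, supported_values, default):
        self.name = name
        self.supported_values = tuple(supported_values)  # discoverable by users
        self.set(default)

    def set(self, value):
        if value not in self.supported_values:  # immediate error on bad input
            raise ValueError(
                f"{self.name} must be one of {self.supported_values}, got {value!r}")
        self.value = value

model_type = StringParam("modelType", ["multinomial", "bernoulli"], "multinomial")
model_type.set("bernoulli")      # ok
# model_type.set("gaussian")     # would raise ValueError
```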

Re: New Spark json endpoints

2015-09-16 Thread Kevin Chen
Just wanted to bring this email up again in case there were any thoughts. Having all the information from the web UI accessible through a supported json API is very important to us; are there any objections to us adding a v2 API to Spark? Thanks! From: Kevin Chen Date: Friday, September 11, 20

RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
We actually met a similar problem in a real case; see https://issues.apache.org/jira/browse/SPARK-10474 After checking the source code, the external sort memory management strategy seems to be the root cause of the issue. Currently, we allocate the 4MB (page size) buffer as initial in the beginni

Re: New Spark json endpoints

2015-09-16 Thread Reynold Xin
Do we need to increment the version number if it is just strict additions? On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen wrote: > Just wanted to bring this email up again in case there were any thoughts. > Having all the information from the web UI accessible through a supported > json API is ver