JobScheduler: Error generating jobs for time for custom InputDStream

2015-09-15 Thread Juan Rodríguez Hortalá
Hi, sorry to insist, but does anyone have any thoughts on this? Or can someone at least point me to documentation of DStream.compute() so I can understand when I should return None for a batch? Thanks, Juan 2015-09-14 20:51 GMT+02:00 Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com>: > Hi, >
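For reference, the method being asked about has the signature def compute(validTime: Time): Option[RDD[T]]. Below is a minimal, hypothetical sketch of a custom InputDStream (the class name and data field are illustrative, not from the original thread) showing the two possible return shapes; whether None or Some(empty RDD) is the right choice for a batch with no data is exactly the question raised here:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    // Hypothetical custom input stream, for illustration only.
    class SketchInputDStream(ssc: StreamingContext, data: Seq[Int])
      extends InputDStream[Int](ssc) {

      override def start(): Unit = {}
      override def stop(): Unit = {}

      // Some(rdd) when the batch has data, None when there is nothing to
      // produce for this batch interval.
      override def compute(validTime: Time): Option[RDD[Int]] = {
        if (data.isEmpty) None
        else Some(ssc.sparkContext.parallelize(data))
      }
    }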

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
It is exactly the issue here, isn't it? We are using memory / N, where N should be the maximum number of active tasks. In the current master, we use the number of cores to approximate the number of tasks -- but it turned out to be a bad approximation in tests because it is set to 32 to increase co
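As a rough, illustrative sketch of the heuristic under discussion (not the actual Spark code): the per-task cap is roughly memory / N, so approximating N with the configured thread count (32 in this test setup) instead of the number of truly active tasks shrinks each task's share considerably:

    // Illustrative sketch only; not the actual Spark implementation.
    def perTaskCapBytes(totalExecutionMemory: Long, estimatedActiveTasks: Int): Long =
      totalExecutionMemory / estimatedActiveTasks

    val total = 512L * 1024 * 1024       // e.g. 512 MB of execution memory
    perTaskCapBytes(total, 8)            // ~64 MB per task with 8 truly active tasks
    perTaskCapBytes(total, 32)           // ~16 MB per task when N is taken to be 32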

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Oops... I meant to say "The page size calculation is NOT the issue here" On 16 September 2015 at 06:46, Pete Robbins wrote: > The page size calculation is the issue here as there is plenty of free > memory, although there is maybe a fair bit of wasted space in some pages. > It is that when we ha

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
The page size calculation is the issue here as there is plenty of free memory, although there is maybe a fair bit of wasted space in some pages. It is that when we have a lot of tasks, each is only allowed to reach 1/n of the available memory, and several of the tasks bump into that limit. With task

Re: pyspark streaming DStream compute

2015-09-15 Thread Davies Liu
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong wrote: > Can anybody help understand why pyspark streaming uses py4j callback to > execute python code while pyspark batch uses worker.py? There are two kinds of callbacks in pyspark streaming: 1) one operates on RDDs; it takes an RDD and returns a new RDD

pyspark streaming DStream compute

2015-09-15 Thread Renyi Xiong
Can anybody help me understand why pyspark streaming uses a py4j callback to execute Python code while pyspark batch uses worker.py? Regarding pyspark streaming, is the py4j callback only used for DStream, with worker.py still used for RDD? Thanks, Renyi.

RE: And.eval short circuiting

2015-09-15 Thread Zack Sampson
I see. We're having problems with code like this (forgive my noob scala):

    val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
    df
      .filter($"animals".rlike(".*"))
      .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
      .collect()

This code throws a
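One defensive rewrite, sketched below under the assumption that both filter predicates may end up evaluated against the same row (so the UDF can still see the null "animals" value): guard against null inside the UDF itself, here using functions.udf rather than the callUDF form. The name longerThanTwo is illustrative.

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._   // assumes a spark-shell style sqlContext in scope

    val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")

    // Null-safe predicate: does not rely on the rlike filter having already
    // removed the null row before the UDF runs.
    val longerThanTwo = udf { value: String => value != null && value.length > 2 }

    df.filter($"animals".rlike(".*"))
      .filter(longerThanTwo($"animals"))
      .collect()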

Re: Predicate push-down bug?

2015-09-15 Thread Ram Sriharsha
Hi Ravi This does look like a bug.. I have created a JIRA to track it here: https://issues.apache.org/jira/browse/SPARK-10623 Ram On Tue, Sep 15, 2015 at 10:47 AM, Ram Sriharsha wrote: > Hi Ravi > > Can you share more details? What Spark version are you running? > > Ram > > On Tue, Sep 15, 20

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
Maybe we can change the heuristics in memory calculation to use SparkContext.defaultParallelism if it is local mode. On Tue, Sep 15, 2015 at 10:28 AM, Pete Robbins wrote: > Yes and at least there is an override by setting spark.sql.test.master to > local[8] , in fact local[16] worked on my 8 c
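A sketch of the kind of heuristic change being floated here, purely illustrative and not a patch: use SparkContext.defaultParallelism as the task-count estimate when running in local mode, and fall back to a configured value otherwise (the config key and fallback below are assumptions):

    import org.apache.spark.SparkContext

    // Illustrative only; not the actual Spark heuristic.
    def estimatedConcurrentTasks(sc: SparkContext): Int =
      if (sc.isLocal) sc.defaultParallelism
      else sc.getConf.getInt("spark.executor.cores", 1)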

Re: Predicate push-down bug?

2015-09-15 Thread Ram Sriharsha
Hi Ravi Can you share more details? What Spark version are you running? Ram On Tue, Sep 15, 2015 at 10:32 AM, Ravi Ravi wrote: > Turning on predicate pushdown for ORC datasources results in a > NoSuchElementException: > > scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")

Predicate push-down bug?

2015-09-15 Thread Ravi Ravi
Turning on predicate pushdown for ORC data sources results in a NoSuchElementException:

    scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
    df: org.apache.spark.sql.DataFrame = [name: string]

    scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    scala> df.explai
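For anyone trying to reproduce this, a fuller sketch along the lines of the snippet above, assuming a Hive-enabled sqlContext and a hypothetical "people" table with "name" and "age" columns (the path and data are illustrative):

    import sqlContext.implicits._   // spark-shell style, Hive-enabled sqlContext assumed

    // Write a small ORC data set and expose it as the 'people' table.
    val people = Seq(("Alice", 10), ("Bob", 20)).toDF("name", "age")
    people.write.format("orc").save("/tmp/people_orc")
    sqlContext.read.format("orc").load("/tmp/people_orc").registerTempTable("people")

    // Enable ORC filter pushdown, then run the query from the report.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
    val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
    df.explain(true)
    df.collect()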

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Yes, and at least there is an override by setting spark.sql.test.master to local[8]; in fact local[16] worked on my 8 core box. I'm happy to use this as a workaround, but the hard-coded 32 will fail running build/tests on a clean checkout if you only have 8 cores. On 15 September 2015 at 17:40, M

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Marcelo Vanzin
That test explicitly sets the number of executor cores to 32.

    object TestHive
      extends TestHiveContext(
        new SparkContext(
          System.getProperty("spark.sql.test.master", "local[32]"),

On Mon, Sep 14, 2015 at 11:22 PM, Reynold Xin wrote: > Yea I think this is where the heuristics is faili
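To make the override path explicit, a small hedged sketch: the master is read from the spark.sql.test.master system property with local[32] as the default, so setting that property on the test JVM (for example via -Dspark.sql.test.master=local[8]; the exact plumbing depends on how the tests are launched) reduces the thread count:

    // Mirrors the lookup quoted above; illustrative only.
    val master = System.getProperty("spark.sql.test.master", "local[32]")

    // Setting the property before TestHive initializes switches the
    // SparkContext from 32 local threads to 8.
    System.setProperty("spark.sql.test.master", "local[8]")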

Fwd: Null Value in DecimalType column of DataFrame

2015-09-15 Thread Dirceu Semighini Filho
Hi Yin, I posted here because I think it's a bug. So it will return null, and I can get a NullPointerException, as I was getting. Is this really the expected behavior? I have never seen something return null in other Scala tools that I have used. Regards, 2015-09-14 18:54 GMT-03:00 Yin Huai : > btw, move
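For context on what "returning null" can look like here, a small hedged sketch (illustrative values, not from the original report): casting into a DecimalType whose precision cannot hold the value yields null rather than an error, and downstream code that assumes a non-null value can then hit a NullPointerException:

    import org.apache.spark.sql.types.DecimalType
    import sqlContext.implicits._   // spark-shell style sqlContext assumed in scope

    val df = Seq(Tuple1(12345.678)).toDF("value")

    // DecimalType(5, 3) cannot represent 12345.678, so the cast yields null
    // instead of failing.
    df.select($"value".cast(DecimalType(5, 3)).as("decimal_value")).show()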

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
This is the culprit: https://issues.apache.org/jira/browse/SPARK-8406 "2. Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concu

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Pete Robbins
Ok, so it looks like the max number of active tasks reaches 30. I'm not setting anything, as it is a clean environment with a clean Spark code checkout. I'll dig further to see why so many tasks are active. Cheers, On 15 September 2015 at 07:22, Reynold Xin wrote: > Yea I think this is where the he