Updating Spark metrics.

2015-05-11 Thread Archit Thakur
Hi, I can see that the executor sends TaskMetrics in the heartbeat every 10 sec (configurable), but in JobProgressListener there is a check for task completion: if the task is not completed, it doesn't update the metrics on the UI. Is there any specific reason for this? I don't think there would be much of a perf
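For illustration, a minimal sketch of the completion-driven path being described, using only the public 1.x listener API (the heartbeat path is internal); the class name here is an assumption, not Spark's actual JobProgressListener code:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Sketch: a listener that, like the UI, only sees metrics once a task
  // has completed; metrics carried by in-flight heartbeats never reach it.
  class CompletedTaskMetricsListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val metrics = taskEnd.taskMetrics
      if (metrics != null) {
        println(s"finished task ran for ${metrics.executorRunTime} ms")
      }
    }
  }

  // Registered with: sc.addSparkListener(new CompletedTaskMetricsListener())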

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Ted Yu
In Row#equals() there is: while (i < len) { if (apply(i) != that.apply(i)) { ... Shouldn't '!=' be !apply(i).equals(that.apply(i))? Cheers On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > This is really strange. > > >>> # Spark 1.3.1 > >>> print type(resu
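For context: in Scala, unlike Java, != on values is defined as the negation of ==, which null-checks and then delegates to equals, so the two spellings agree and diverge only where reference identity is intended. A quick self-contained check:

  object RowEqualsCheck extends App {
    val a: AnyRef = new String("spark")  // two distinct references
    val b: AnyRef = new String("spark")  // holding equal contents
    println(a != b)          // false: Scala's != delegates to equals
    println(!a.equals(b))    // false: same answer as !=
    println(a ne b)          // true: reference inequality, Java's !=
  }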

[PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Nicholas Chammas
This is really strange.
>>> # Spark 1.3.1
>>> print type(results)
>>> a = results.take(1)[0]
>>> print type(a)
>>> print pyspark.sql.types.Row
>>> print type(a) == pyspark.sql.types.Row
False
>>> print isinstance(a, pyspark.sql.types.Row)
False
If I set a as follows, then the type checks p

Re: Recent Spark test failures

2015-05-11 Thread Steve Loughran
> On 7 May 2015, at 01:41, Andrew Or wrote: > > Dear all, > > I'm sure you have all noticed that the Spark tests have been fairly > unstable recently. I wanted to share a tool that I use to track which tests > have been failing most often in order to prioritize fixing these flaky > tests. > >

Re: Recent Spark test failures

2015-05-11 Thread Ted Yu
Makes sense. Having high determinism in these tests would make the Jenkins builds stable. On Mon, May 11, 2015 at 1:08 PM, Andrew Or wrote: > Hi Ted, > > Yes, those two options can be useful, but in general I think the standard > to set is that tests should never fail. It's actually the worst if tes

Re: Recent Spark test failures

2015-05-11 Thread Andrew Or
Hi Ted, Yes, those two options can be useful, but in general I think the standard to set is that tests should never fail. It's actually the worst if tests fail sometimes but not others, because we can't reproduce them deterministically. Using -M and -A actually tolerates flaky tests to a certain e
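For reference, -M and -A here are ScalaTest Runner flags: -M <file> memorizes failing suites in a file, and -A <file> runs the recorded suites again. A hedged sbt sketch (flag behaviour per ScalaTest's docs; the file path is arbitrary):

  // build.sbt fragment: record failures, then rerun the recorded ones
  testOptions in Test ++= Seq(
    Tests.Argument(TestFrameworks.ScalaTest, "-M", "target/failed-tests.txt"),
    Tests.Argument(TestFrameworks.ScalaTest, "-A", "target/failed-tests.txt")
  )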

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be any of the following: 1. hash probing itself is inherently expensive compared with the rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a power-of
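To make point 3 concrete, a small sketch (not Spark's actual AppendOnlyMap code) of quadratic probing via the triangular sequence on a power-of-2 table; with such a capacity the cumulative offsets 0, 1, 3, 6, 10, ... visit every slot exactly once:

  // Yields the probe sequence for a hash in a table of size 2^capacityLog2.
  def probeSlots(hash: Int, capacityLog2: Int): Iterator[Int] = {
    val mask = (1 << capacityLog2) - 1
    Iterator.iterate((hash & mask, 1)) { case (pos, i) => ((pos + i) & mask, i + 1) }
      .map(_._1)
      .take(mask + 1)  // capacity distinct slots before the sequence repeats
  }

  // e.g. probeSlots("key".hashCode, 4).toList enumerates all 16 slots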

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
The following worked for me as a workaround for distinct:
val pf = sqlContext.parquetFile("hdfs://file")
val distinctValuesOfColumn4 = pf.rdd.aggregate[scala.collection.mutable.HashSet[String]](new scala.collection.mutable.HashSet[String]())(
  (s, v) => s += v.getString(4),
  (s1, s2) => s1 ++= s2
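The preview above is cut off mid-expression; a self-contained version of the same idea, assuming the 1.3-era API (sqlContext.parquetFile, positional getString) and a hypothetical file path:

  import scala.collection.mutable.HashSet

  val pf = sqlContext.parquetFile("hdfs://file")  // hypothetical path
  val distinctValuesOfColumn4 = pf.rdd.aggregate(HashSet.empty[String])(
    (s, v) => s += v.getString(4),  // per partition: collect column 4's values
    (s1, s2) => s1 ++= s2           // merge the per-partition sets on the driver
  )

Note this avoids a shuffle entirely, but only helps when the set of distinct values fits in driver memory.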

RE: Easy way to convert Row back to case class

2015-05-11 Thread Ulanov, Alexander
Thank you for suggestions! From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, May 08, 2015 11:10 AM To: Will Benton Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Easy way to convert Row back to case class In 1.4, you can do row.getInt("colName") In 1.5, some variant of this
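A hedged sketch of the conversion being discussed, using the name-based accessors mentioned for 1.4 (here via Row.getAs[T]) and a hypothetical case class; the field names are assumptions:

  import org.apache.spark.sql.Row

  case class Person(name: String, age: Int)  // hypothetical target type

  // Rebuild the case class by pulling typed fields out of the Row by name.
  def toPerson(row: Row): Person =
    Person(row.getAs[String]("name"), row.getAs[Int]("age"))

  // val people = df.rdd.map(toPerson)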

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
Hi, Could you suggest an alternative way of implementing distinct, e.g. via fold or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to overflow of the Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark shuffle each. My dataset is 2B rows; the field which I'm performi

Re: [SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Reynold Xin
Sorry it's hard to give a definitive answer due to the lack of details (I'm not sure what exactly is entailed to have this PortableDataStream), but the answer is probably no if we need to change some existing code and expose a whole new data type to users. On Mon, May 11, 2015 at 9:02 AM, Eron Wr

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi Iulian, I was able to successfully compile in Eclipse after, on the command line, using sbt avro:generate followed by an sbt clean compile (and then a full clean compile in Eclipse). Thanks for your help! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabbl

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On Mon,

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Mridul Muralidharan
For tiny/small clusters (particularly single-tenant), you can set it to a lower value. But for anything reasonably large or multi-tenant, the request storm can be bad if a large enough number of applications start aggressively polling the RM. That is why the interval is configurable. - Mridul On Mo
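A hedged example of lowering the interval for a small, single-tenant cluster; spark.yarn.scheduler.heartbeat.interval-ms is the 1.x property for the interval being discussed, and the value here is illustrative:

  import org.apache.spark.SparkConf

  // Default is 5000 ms in Spark 1.x; busy multi-tenant RMs should keep it high.
  val conf = new SparkConf()
    .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")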

[SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Eron Wright
Hello, I'm working on SPARK-7400 for DataFrame support for PortableDataStream, i.e. the data type associated with the RDD from sc.binaryFiles(...). Assuming a patch is available soon, what is the likelihood of inclusion in Spark 1.4? Thanks

Re: Intellij Spark Source Compilation

2015-05-11 Thread Iulian Dragoș
Oh, I see. So then try to run one build on the command line first (or try sbt avro:generate, though I’m not sure it’s enough). I just noticed that I have an additional source folder target/scala-2.10/src_managed/main/compiled_avro for spark-streaming-flume-sink. I guess I built the project once and

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi, Thanks Iulian. Yeah, I was kind of anticipating I could just ignore old-deps ultimately. However, even after doing a clean and build all, I still get the following error in Eclipse: not found: type EventBatch, at line 72 of SparkAvroCallbackHandler.sc

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Zoltán Zvara
Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up? On Fri, May 8, 2015 at 12:14 PM Taeyun Kim wrote: > I think so. > > In fact, t

Re: Intellij Spark Source Compilation

2015-05-11 Thread Iulian Dragoș
Hi, `old-deps` is not really a project, so you can simply skip it (or close it). The rest should work fine (clean and build all). On Sat, May 9, 2015 at 10:27 PM, rtimp wrote: > Hi Iulian, > > Thanks for the reply! > > With respect to eclipse, I'm doing this all with a fresh download of the > s

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Michal Haris
This is the stack trace of the worker thread:
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insert

Re: PySpark DataFrame: Preserving nesting when selecting a nested field

2015-05-11 Thread Reynold Xin
In 1.4, you can use the "struct" function to create a struct, e.g. you can explicitly select out the "version" column, and then create a new struct named "settings". The current semantics of select closely follow relational databases' SQL, which is well understood and defined. I wouldn't a
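A hedged sketch of this suggestion, using the struct function from org.apache.spark.sql.functions (added in 1.4); apart from "settings" and "version" from the thread, the names are assumptions:

  import org.apache.spark.sql.functions.{col, struct}

  // Select the nested field, then re-wrap it so the result keeps a
  // struct column named "settings" instead of a flattened "version".
  val nested = df.select(
    struct(col("settings.version").as("version")).as("settings")
  )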

Re: Change for submitting to yarn in 1.3.1

2015-05-11 Thread Mridul Muralidharan
That works when it is launched from the same process, which is unfortunately not our case :-) - Mridul On Sun, May 10, 2015 at 9:05 PM, Manku Timma wrote: > sc.applicationId gives the yarn appid. > > On 11 May 2015 at 08:13, Mridul Muralidharan wrote: >> >> We had a similar requirement, and as a s