Updating Spark metrics.

2015-05-11 Thread Archit Thakur
Hi, I can see that the executor sends TaskMetrics in the heartbeat every 10 sec (configurable), but in JobProgressListener there is a check for task completion: if the task is not completed, it doesn't update the metrics on the UI. Is there any specific reason for this? I don't think there would be much of a perf
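For illustration, a minimal sketch of the completion-driven path being described, using only the public 1.x listener API (the heartbeat path is internal); the class name here is an assumption, not Spark's actual JobProgressListener code:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Sketch: a listener that, like the UI, only sees metrics once a task
  // has completed; metrics carried by in-flight heartbeats never reach it.
  class CompletedTaskMetricsListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val metrics = taskEnd.taskMetrics
      if (metrics != null) {
        println(s"finished task ran for ${metrics.executorRunTime} ms")
      }
    }
  }

  // Registered with: sc.addSparkListener(new CompletedTaskMetricsListener())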

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Ted Yu
In Row#equals() there is: while (i < len) { if (apply(i) != that.apply(i)) { ... Shouldn't '!=' be !apply(i).equals(that.apply(i))? Cheers On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > This is really strange. > > >>> # Spark 1.3.1 > >>> print type(resu
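For context: in Scala, unlike Java, != on values is defined as the negation of ==, which null-checks and then delegates to equals, so the two spellings agree and diverge only where reference identity is intended. A quick self-contained check:

  object RowEqualsCheck extends App {
    val a: AnyRef = new String("spark")  // two distinct references
    val b: AnyRef = new String("spark")  // holding equal contents
    println(a != b)          // false: Scala's != delegates to equals
    println(!a.equals(b))    // false: same answer as !=
    println(a ne b)          // true: reference inequality, Java's !=
  }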

[PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Nicholas Chammas
This is really strange.
>>> # Spark 1.3.1
>>> print type(results)
>>> a = results.take(1)[0]
>>> print type(a)
>>> print pyspark.sql.types.Row
>>> print type(a) == pyspark.sql.types.Row
False
>>> print isinstance(a, pyspark.sql.types.Row)
False
If I set a as follows, then the type checks p

Re: Recent Spark test failures

2015-05-11 Thread Steve Loughran
> On 7 May 2015, at 01:41, Andrew Or wrote: > > Dear all, > > I'm sure you have all noticed that the Spark tests have been fairly > unstable recently. I wanted to share a tool that I use to track which tests > have been failing most often in order to prioritize fixing these flaky > tests. > >

Re: Recent Spark test failures

2015-05-11 Thread Ted Yu
Makes sense. Having high determinism in these tests would make the Jenkins builds stable. On Mon, May 11, 2015 at 1:08 PM, Andrew Or wrote: > Hi Ted, > > Yes, those two options can be useful, but in general I think the standard > to set is that tests should never fail. It's actually the worst if tes

Re: Recent Spark test failures

2015-05-11 Thread Andrew Or
Hi Ted, Yes, those two options can be useful, but in general I think the standard to set is that tests should never fail. It's actually the worst if tests fail sometimes but not others, because we can't reproduce them deterministically. Using -M and -A actually tolerates flaky tests to a certain e
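For reference, -M and -A here are ScalaTest Runner flags: -M <file> memorizes failing suites in a file, and -A <file> runs the recorded suites again. A hedged sbt sketch (flag behaviour per ScalaTest's docs; the file path is arbitrary):

  // build.sbt fragment: record failures, then rerun the recorded ones
  testOptions in Test ++= Seq(
    Tests.Argument(TestFrameworks.ScalaTest, "-M", "target/failed-tests.txt"),
    Tests.Argument(TestFrameworks.ScalaTest, "-A", "target/failed-tests.txt")
  )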

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be any of the following: 1. hash probing itself is inherently expensive compared with the rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a power-of
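To make point 3 concrete, a small sketch (not Spark's actual AppendOnlyMap code) of quadratic probing via the triangular sequence on a power-of-2 table; with such a capacity the cumulative offsets 0, 1, 3, 6, 10, ... visit every slot exactly once:

  // Yields the probe sequence for a hash in a table of size 2^capacityLog2.
  def probeSlots(hash: Int, capacityLog2: Int): Iterator[Int] = {
    val mask = (1 << capacityLog2) - 1
    Iterator.iterate((hash & mask, 1)) { case (pos, i) => ((pos + i) & mask, i + 1) }
      .map(_._1)
      .take(mask + 1)  // capacity distinct slots before the sequence repeats
  }

  // e.g. probeSlots("key".hashCode, 4).toList enumerates all 16 slots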

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
The following worked for me as a workaround for distinct:
val pf = sqlContext.parquetFile("hdfs://file")
val distinctValuesOfColumn4 = pf.rdd.aggregate[scala.collection.mutable.HashSet[String]](new scala.collection.mutable.HashSet[String]())(
  (s, v) => s += v.getString(4),
  (s1, s2) => s1 ++= s2
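The preview above is cut off mid-expression; a self-contained version of the same idea, assuming the 1.3-era API (sqlContext.parquetFile, positional getString) and a hypothetical file path:

  import scala.collection.mutable.HashSet

  val pf = sqlContext.parquetFile("hdfs://file")  // hypothetical path
  val distinctValuesOfColumn4 = pf.rdd.aggregate(HashSet.empty[String])(
    (s, v) => s += v.getString(4),  // per partition: collect column 4's values
    (s1, s2) => s1 ++= s2           // merge the per-partition sets on the driver
  )

Note this avoids a shuffle entirely, but only helps when the set of distinct values fits in driver memory.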

RE: Easy way to convert Row back to case class

2015-05-11 Thread Ulanov, Alexander
Thank you for suggestions! From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, May 08, 2015 11:10 AM To: Will Benton Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Easy way to convert Row back to case class In 1.4, you can do row.getInt("colName") In 1.5, some variant of this
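A hedged sketch of the conversion being discussed, using the name-based accessors mentioned for 1.4 (here via Row.getAs[T]) and a hypothetical case class; the field names are assumptions:

  import org.apache.spark.sql.Row

  case class Person(name: String, age: Int)  // hypothetical target type

  // Rebuild the case class by pulling typed fields out of the Row by name.
  def toPerson(row: Row): Person =
    Person(row.getAs[String]("name"), row.getAs[Int]("age"))

  // val people = df.rdd.map(toPerson)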

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
Hi, Could you suggest an alternative way of implementing distinct, e.g. via fold or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to overflow of the Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark shuffle each. My dataset is 2B rows; the field which I'm performi

Re: [SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Reynold Xin
Sorry it's hard to give a definitive answer due to the lack of details (I'm not sure what exactly is entailed to have this PortableDataStream), but the answer is probably no if we need to change some existing code and expose a whole new data type to users. On Mon, May 11, 2015 at 9:02 AM, Eron Wr

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi Iulian, I was able to successfully compile in Eclipse after, on the command line, using sbt avro:generate followed by an sbt clean compile (and then a full clean compile in Eclipse). Thanks for your help! -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabbl

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On Mon,

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Mridul Muralidharan
For tiny/small clusters (particularly single-tenant), you can set it to a lower value. But for anything reasonably large or multi-tenant, the request storm can be bad if a large enough number of applications start aggressively polling the RM. That is why the interval is configurable. - Mridul On Mo
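A hedged example of lowering the interval for a small, single-tenant cluster; spark.yarn.scheduler.heartbeat.interval-ms is the 1.x property for the interval being discussed, and the value here is illustrative:

  import org.apache.spark.SparkConf

  // Default is 5000 ms in Spark 1.x; busy multi-tenant RMs should keep it high.
  val conf = new SparkConf()
    .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")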

[SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Eron Wright
Hello, I'm working on SPARK-7400 for DataFrame support for PortableDataStream, i.e. the data type associated with the RDD from sc.binaryFiles(...). Assuming a patch is available soon, what is the likelihood of inclusion in Spark 1.4? Thanks

Re: Intellij Spark Source Compilation

2015-05-11 Thread Iulian Dragoș
Oh, I see. So then try to run one build on the command line first (or try sbt avro:generate, though I’m not sure it’s enough). I just noticed that I have an additional source folder target/scala-2.10/src_managed/main/compiled_avro for spark-streaming-flume-sink. I guess I built the project once and

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi, Thanks Iulian. Yeah, I was kind of anticipating I could just ignore old-deps ultimately. However, even after doing a clean and build all, I still get the following error in Eclipse: not found: type EventBatch, at line 72 of SparkAvroCallbackHandler.sc

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Zoltán Zvara
Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up? On Fri, May 8, 2015 at 12:14 PM Taeyun Kim wrote: > I think so. > > In fact, t

Re: Intellij Spark Source Compilation

2015-05-11 Thread Iulian Dragoș
Hi, `old-deps` is not really a project, so you can simply skip it (or close it). The rest should work fine (clean and build all). On Sat, May 9, 2015 at 10:27 PM, rtimp wrote: > Hi Iulian, > > Thanks for the reply! > > With respect to eclipse, I'm doing this all with a fresh download of the > s

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Michal Haris
This is the stack trace of the worker thread:
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insert

Re: PySpark DataFrame: Preserving nesting when selecting a nested field

2015-05-11 Thread Reynold Xin
In 1.4, you can use the "struct" function to create a struct, e.g. you can explicitly select out the "version" column, and then create a new struct named "settings". The current semantics of select closely follow relational databases' SQL, which is well understood and defined. I wouldn't a
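A hedged sketch of this suggestion, using the struct function from org.apache.spark.sql.functions (added in 1.4); apart from "settings" and "version" from the thread, the names are assumptions:

  import org.apache.spark.sql.functions.{col, struct}

  // Select the nested field, then re-wrap it so the result keeps a
  // struct column named "settings" instead of a flattened "version".
  val nested = df.select(
    struct(col("settings.version").as("version")).as("settings")
  )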

Re: Change for submitting to yarn in 1.3.1

2015-05-11 Thread Mridul Muralidharan
That works when it is launched from the same process, which is unfortunately not our case :-) - Mridul On Sun, May 10, 2015 at 9:05 PM, Manku Timma wrote: > sc.applicationId gives the yarn appid. > > On 11 May 2015 at 08:13, Mridul Muralidharan wrote: >> >> We had a similar requirement, and as a s