Hi,
I can see that the executor sends the TaskMetrics in the heartbeat every 10
sec (configurable), but in JobProgressListener there is a check for task
completion. If the task is not completed, it doesn't update the metrics on
the UI.
Is there any specific reason for this? I don't think there would be much of a
perf hit.
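For reference, a minimal sketch of lowering that interval per application, assuming the 1.x config key spark.executor.heartbeatInterval (value in milliseconds):

  import org.apache.spark.{SparkConf, SparkContext}

  // Assumption: spark.executor.heartbeatInterval controls the metrics heartbeat
  // discussed above; 10000 ms is the default, 5000 ms here is only illustrative.
  val conf = new SparkConf()
    .setAppName("heartbeat-interval-demo")
    .set("spark.executor.heartbeatInterval", "5000")
  val sc = new SparkContext(conf)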
In Row#equals():
  while (i < len) {
    if (apply(i) != that.apply(i)) {
Should '!=' be !apply(i).equals(that.apply(i))?
Cheers
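A side note on the '!=' question above, as a small sketch of plain Scala semantics (not Spark code): for reference types, a != b is !(a == b), and == null-checks the left side and then delegates to equals(), so the two forms differ only in null handling.

  case class Cell(v: Int)
  val x: Any = Cell(1)
  val y: Any = Cell(1)
  assert(x == y)     // == delegates to equals() after a null check
  assert(!(x != y))  // != is simply the negation of ==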
On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> This is really strange.
>
> >>> # Spark 1.3.1
> >>> print type(resu
This is really strange.
>>> # Spark 1.3.1
>>> print type(results)
>>> a = results.take(1)[0]
>>> print type(a)
>>> print pyspark.sql.types.Row
>>> print type(a) == pyspark.sql.types.Row
False
>>> print isinstance(a, pyspark.sql.types.Row)
False
If I set a as follows, then the type checks p
> On 7 May 2015, at 01:41, Andrew Or wrote:
>
> Dear all,
>
> I'm sure you have all noticed that the Spark tests have been fairly
> unstable recently. I wanted to share a tool that I use to track which tests
> have been failing most often in order to prioritize fixing these flaky
> tests.
>
>
Makes sense.
Having high determinism in these tests would make the Jenkins builds stable.
On Mon, May 11, 2015 at 1:08 PM, Andrew Or wrote:
> Hi Ted,
>
> Yes, those two options can be useful, but in general I think the standard
> to set is that tests should never fail. It's actually the worst if tes
Hi Ted,
Yes, those two options can be useful, but in general I think the standard
to set is that tests should never fail. It's actually the worst if tests
fail sometimes but not others, because we can't reproduce them
deterministically. Using -M and -A actually tolerates flaky tests to a
certain extent.
Looks like it is spending a lot of time doing hash probing. It could be one
or more of the following:
1. hash probing itself is inherently expensive compared with the rest of your
workload
2. murmur3 doesn't work well with this key distribution
3. quadratic probing (triangular sequence) with a power-of-2 table size
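As an illustration of point 3 (a sketch of the general technique only, not Spark's actual implementation): quadratic probing via the triangular sequence 1, 3, 6, 10, ... in a power-of-2 sized table.

  // Sketch: find a free slot for keyHash; capacity must be a power of 2 and the
  // table must have at least one free slot, otherwise this loops forever.
  def probe(keyHash: Int, capacity: Int, occupied: Int => Boolean): Int = {
    val mask = capacity - 1
    var pos = keyHash & mask
    var delta = 1
    while (occupied(pos)) {
      pos = (pos + delta) & mask  // cumulative offsets are triangular numbers
      delta += 1
    }
    pos
  }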
The following worked for me as a workaround for distinct:
val pf = sqlContext.parquetFile("hdfs://file")
val distinctValuesOfColumn4 =
  pf.rdd.aggregate[scala.collection.mutable.HashSet[String]](
    new scala.collection.mutable.HashSet[String]())(
    (s, v) => s += v.getString(4),
    (s1, s2) => s1 ++= s2)
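One possible variation on the same idea (a sketch only, assuming the same parquet file and column index as above): dedupe within each partition first so that only per-partition distinct values are shuffled.

  val distinctCol4 = pf.rdd
    .mapPartitions(iter => iter.map(_.getString(4)).toSet.iterator)
    .distinct()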
Thank you for the suggestions!
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, May 08, 2015 11:10 AM
To: Will Benton
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Easy way to convert Row back to case class
In 1.4, you can do
row.getInt("colName")
In 1.5, some variant of this
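As a rough sketch of doing the conversion by hand today (the case class and field names are hypothetical, and I'm assuming the by-name getAs accessor is available):

  import org.apache.spark.sql.Row

  case class Person(name: String, age: Int)

  def toPerson(row: Row): Person =
    Person(row.getAs[String]("name"), row.getAs[Int]("age"))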
Hi,
Could you suggest an alternative way of implementing distinct, e.g. via fold
or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to
overflow of the Spark shuffle disk. I have 7 nodes, each with 300GB dedicated
to Spark shuffle. My dataset is 2B rows; the field on which I'm performing
the distinct
Sorry, it's hard to give a definitive answer due to the lack of details (I'm
not sure what exactly is entailed in having this PortableDataStream), but the
answer is probably no if we need to change some existing code and expose a
whole new data type to users.
On Mon, May 11, 2015 at 9:02 AM, Eron Wright wrote:
Hi Iulian,
I was able to compile successfully in Eclipse after running sbt avro:generate
followed by sbt clean compile on the command line (and then a full clean
compile in Eclipse). Thanks for your help!
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that
it's configurable, but I think we need to provide a decent out-of-the-box
experience. For comparison, the MapReduce equivalent is 1 second.
I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.
-Sandy
On Mon,
For tiny/small clusters (particularly single-tenant), you can set it to a
lower value.
But for anything reasonably large or multi-tenant, the request storm
can be bad if a large enough number of applications start aggressively
polling the RM.
That is why the interval is configurable.
- Mridul
On Mo
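For the small single-tenant case, a sketch of what tuning that interval might look like (config key and default as I understand them for 1.x YARN mode; the value is illustrative):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")  // default ~5000 ms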
Hello,
I'm working on SPARK-7400 for DataFrame support for PortableDataStream, i.e.
the data type associated with the RDD from sc.binaryFiles(...).
Assuming a patch is available soon, what is the likelihood of inclusion in
Spark 1.4?
Thanks
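For context, a small sketch of the existing RDD-level API being referenced (the path is illustrative):

  // sc.binaryFiles returns RDD[(String, PortableDataStream)]
  val files = sc.binaryFiles("hdfs:///data/blobs")
  val sizes = files.mapValues(_.toArray().length)  // materialize each stream as Array[Byte]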
Oh, I see. So then try to run one build on the command line first (or try sbt
avro:generate, though I’m not sure it’s enough). I just noticed that I have
an additional source folder target/scala-2.10/src_managed/main/compiled_avro
for spark-streaming-flume-sink. I guess I built the project once and
Hi,
Thanks Iulian. Yeah, I was kind of anticipating I could just ignore old-deps
ultimately. However, even after doing a clean and build all, I still get the
following:
Description                  Location  Resource                        Path  Type
not found: type EventBatch   line 72   SparkAvroCallbackHandler.scala
Isn't this issue something that should be improved? Based on the following
discussion, there are two places where YARN's heartbeat interval is
respected on job start-up, but do we really need to respect it on start-up?
On Fri, May 8, 2015 at 12:14 PM Taeyun Kim wrote:
> I think so.
>
> In fact, t
Hi,
`old-deps` is not really a project, so you can simply skip it (or close
it). The rest should work fine (clean and build all).
On Sat, May 9, 2015 at 10:27 PM, rtimp wrote:
> Hi Iulian,
>
> Thanks for the reply!
>
> With respect to eclipse, I'm doing this all with a fresh download of the
> s
This is the stack trace of the worker thread:
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insert
In 1.4, you can use the "struct" function to create a struct, e.g. you can
explicitly select out the "version" column, and then create a new struct
named "settings".
The current semantics of select closely follow relational databases' SQL,
which is well understood and well defined. I wouldn't a
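A sketch of that reshaping (the DataFrame df and the nested field names other than "version" are hypothetical):

  import org.apache.spark.sql.functions.{col, struct}

  val reshaped = df.select(
    col("settings.version").as("version"),
    struct(col("settings.a"), col("settings.b")).as("settings"))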
That works when it is launched from the same process - which is
unfortunately not our case :-)
- Mridul
On Sun, May 10, 2015 at 9:05 PM, Manku Timma wrote:
> sc.applicationId gives the yarn appid.
>
> On 11 May 2015 at 08:13, Mridul Muralidharan wrote:
>>
>> We had a similar requirement, and as a s
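For the in-process case mentioned above, a minimal sketch (on YARN this returns the "application_..." ID; retrieving it from outside the driver process is the part that doesn't work for this use case):

  val appId = sc.applicationId
  println(s"Spark application id: $appId")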