[Spark Core] How do Spark workers exchange data in standalone mode?

2017-02-27 Thread vdukic
Hello All,

I want to know more about data exchange between Spark workers in
standalone mode. Every time a task reads the result of another task,
I want to log that event.

The information I need:
- source task / stage
- destination task / stage
- size of the data transfer

So far I've managed to do something similar by changing two methods in
Spark Core:

To find out which task produced which partition / block, I added the
following to SortShuffleWriter.scala#SortShuffleWriter#write:

logError(s"""PRODUCED SORT:
            |BlockId: ${blockId.shuffleId} ${blockId.mapId}
            |PartitionId: ${context.partitionId()}
            |TaskAttemptId: ${context.taskAttemptId()}
            |StageId: ${context.stageId()}
            |""".stripMargin)

To find out which task consumed which partition / block, I added the
following to ShuffleBlockFetcherIterator.scala#ShuffleBlockFetcherIterator#sendRequest:

blockIds.foreach { blockId =>
  logError(s"""CONSUMED:
              |BlockId: ${blockId}
              |PartitionId: ${context.partitionId()}
              |TaskAttemptId: ${context.taskAttemptId()}
              |StageId: ${context.stageId()}
              |Address: ${address}
              |Size: ${sizeMap(blockId)}
              |""".stripMargin)
}

Using these two changes, I managed to partially reconstruct the
communication graph, but there are a couple of problems:
1. I cannot match every PRODUCED entry with its CONSUMED entries.
2. The amount of data (the "size" field) does not match the real traffic
numbers I got from the OS. On the other hand, it does match the
Shuffle Read/Write numbers on the Spark History Server.
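For concreteness, the offline join I do over these log lines looks roughly
like the following Python sketch. The record fields mirror what the two
snippets above log, the values are made up for illustration, and it assumes
a ShuffleBlockId prints as shuffle_<shuffleId>_<mapId>_<reduceId>, so the
key for matching a CONSUMED entry back to its PRODUCED entry is
(shuffleId, mapId):

```python
import re
from collections import defaultdict

# Hypothetical parsed log records, mirroring the fields logged above.
# PRODUCED entries carry (shuffleId, mapId) plus the producing task/stage;
# CONSUMED entries carry the shuffle block id plus the consuming task/stage.
produced = [
    {"shuffleId": 0, "mapId": 1, "taskAttemptId": 10, "stageId": 1},
]
consumed = [
    # Assumes ShuffleBlockId renders as "shuffle_<shuffleId>_<mapId>_<reduceId>"
    {"blockId": "shuffle_0_1_2", "taskAttemptId": 20, "stageId": 2, "size": 4096},
]

BLOCK_ID = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)")

def build_edges(produced, consumed):
    """Join producer and consumer records on (shuffleId, mapId),
    summing transferred bytes per (producer task, consumer task) edge."""
    by_block = {(p["shuffleId"], p["mapId"]): p for p in produced}
    edges = defaultdict(int)
    for c in consumed:
        m = BLOCK_ID.match(c["blockId"])
        if not m:
            continue
        key = (int(m.group(1)), int(m.group(2)))
        p = by_block.get(key)
        if p is None:
            continue  # unmatched CONSUMED entry (problem 1 above)
        edges[(p["taskAttemptId"], c["taskAttemptId"])] += c["size"]
    return dict(edges)

print(build_edges(produced, consumed))  # {(10, 20): 4096}
```

Problem 1 above shows up here as CONSUMED entries whose (shuffleId, mapId)
never appears on the PRODUCED side.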

I've found an article that explains data exchange in Apache Flink to a
certain extent. Is there something similar for Spark?
https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

Thanks.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-How-do-Spark-workers-exchange-data-in-standalone-mode-tp21087.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-02-27 Thread Ryan Blue
I'd like to see more discussion on the issues I raised. I don't think there
was a response for why voting is limited to PMC members.

Tim was kind enough to reply with his rationale for a shepherd, but I don't
think that it justifies failing proposals. I think it boiled down to
"shepherds can be helpful", which isn't a good reason to require them in my
opinion. Sam also had some good comments on this and I think that there's
more to talk about.

That said, I'd rather not have this proposal fail because we're tired of
talking about it. If most people are okay with it as it stands and want a
vote, I'm fine testing this out and fixing it later.

rb

On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley 
wrote:

> The current draft LGTM.  I agree some of the various concerns may need to
> be addressed in the future, depending on how SPIPs progress in practice.
> If others agree, let's put it to a vote and revisit the proposal in a few
> months.
> Joseph
>
> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger 
> wrote:
>
>> It's been a week since any further discussion.
>>
>> Do PMC members think the current draft is OK to vote on?
>>
>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan 
>> wrote:
>> > I like the document and am happy to see the SPIP draft version.
>> > However, I feel the shepherd role is again a hurdle in the process
>> > improvement; it's as if everything depends only on the shepherd.
>> >
>> > I also want to add that an SPIP should be time-bound with a defined
>> > SLA, else it will defeat the purpose.
>> >
>> >
>> > Regards,
>> > Vaquar khan
>> >
>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue 
>> > wrote:
>> >>
>> >> > [The shepherd] can advise on technical and procedural considerations
>> for
>> >> > people outside the community
>> >>
>> >> The sentiment is good, but this doesn't justify requiring a shepherd
>> for a
>> >> proposal. There are plenty of people that wouldn't need this, would get
>> >> feedback during discussion, or would ask a committer or PMC member if
>> it
>> >> weren't a formal requirement.
>> >>
>> >> > if no one is willing to be a shepherd, the proposed idea is probably
>> not
>> >> > going to receive much traction in the first place.
>> >>
>> >> This also doesn't sound like a reason for needing a shepherd. Saying
>> that
>> >> a shepherd probably won't hurt the process doesn't give me an idea of
>> why a
>> >> shepherd should be required in the first place.
>> >>
>> >> What was the motivation for adding a shepherd originally? It may not be
>> >> bad and it could be helpful, but neither of those makes me think that
>> they
>> >> should be required or else the proposal fails.
>> >>
>> >> rb
>> >>
>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter
>> >> wrote:
>> >>>
>> >>> The doc looks good to me.
>> >>>
>> >>> Ryan, the role of the shepherd is to make sure that someone
>> >>> knowledgeable with Spark processes is involved: this person can advise
>> >>> on technical and procedural considerations for people outside the
>> >>> community. Also, if no one is willing to be a shepherd, the proposed
>> >>> idea is probably not going to receive much traction in the first
>> >>> place.
>> >>>
>> >>> Tim
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger 
>> >>> wrote:
>> >>> > Reynold, thanks, LGTM.
>> >>> >
>> >>> > Sean, great concerns.  I agree that behavior is largely cultural and
>> >>> > writing down a process won't necessarily solve any problems one way
>> or
>> >>> > the other.  But one outwardly visible change I'm hoping for out of
>> >>> > this is a way for people who have a stake in Spark, but can't follow
>> >>> > jiras closely, to go to the Spark website, see the list of proposed
>> >>> > major changes, contribute discussion on issues that are relevant to
>> >>> > their needs, and see a clear direction once a vote has passed.  We
>> >>> > don't have that now.
>> >>> >
>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>> >>> > changes they don't like anyway, so might as well be up front about
>> the
>> >>> > reality of the situation.
>> >>> >
>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen 
>> wrote:
>> >>> >> The text seems fine to me. Really, this is not describing a
>> >>> >> fundamentally
>> >>> >> new process, which is good. We've always had JIRAs, we've always
>> been
>> >>> >> able
>> >>> >> to call a VOTE for a big question. This just writes down a sensible
>> >>> >> set of
>> >>> >> guidelines for putting those two together when a major change is
>> >>> >> proposed. I
>> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
>> >>> >>
>> >>> >> My only hesitation is that this seems to be perceived by some as a
>> new
>> >>> >> or
>> >>> >> different thing, that is supposed to solve some problems that
>> aren't
>> >>> >> otherwise solvable. I see mentioned problems like: clear process
>> for
>> >>> >> managing work, public communication, more committers, some sort of
>> >>> >> binding
>> >>> >> outcome and deadline.
>> >>> >>
>> >>> >> 

Re: Spark Improvement Proposals

2017-02-27 Thread Sean Owen
To me, no new process is being invented here, on purpose, and so we should
just rely on whatever governs any large JIRA or vote, because SPIPs are
really just guidance for making a big JIRA.

http://apache.org/foundation/voting.html suggests that PMC members have the
binding votes in general, and for code-modification votes in particular,
which is what this is. Absent a strong reason to diverge from that, I'd go
with that.

(PS: On reading this, I didn't realize that the guidance was that releases
are blessed just by majority vote. Oh well, not that it has mattered.)

I also don't see a need to require a shepherd, because JIRAs don't have
such a process. Besides, I can't really see a situation where nobody with a
binding vote cares to endorse the SPIP, yet three people vote for it and
nobody objects.

Perhaps downgrade this to "strongly suggested, so that you don't waste your
time."

Or, implicitly, that proposing a SPIP calls a vote that lasts for, dunno, a
month. If fewer than 3 PMC vote for it, it doesn't pass anyway. If at least
1 does, OK, they're the shepherd(s). No new process.

On Mon, Feb 27, 2017 at 9:09 PM Ryan Blue  wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
>


[build system] emergency jenkins master reboot, got some wedged processes

2017-02-27 Thread shane knapp
the jenkins master is wedged and i'm going to reboot it to increase
its happiness.

more updates as they come.




Re: [build system] emergency jenkins master reboot, got some wedged processes

2017-02-27 Thread shane knapp
we're back and things are much snappier!  sorry for the downtime.

On Mon, Feb 27, 2017 at 1:58 PM, shane knapp  wrote:
> the jenkins master is wedged and i'm going to reboot it to increase
> its happiness.
>
> more updates as they come.




Re: Implementation of RNN/LSTM in Spark

2017-02-27 Thread Yuhao Yang
You're welcome to try and contribute to our BigDL:
https://github.com/intel-analytics/BigDL

It runs natively on Spark and is fast, as it leverages Intel MKL.

2017-02-23 4:51 GMT-08:00 Joeri Hermans :

> Hi Nikita,
>
> We are actively working on this: https://github.com/cerndb/dist-keras
> This will allow you to run Keras on Spark (with distributed optimization
> algorithms) through pyspark. I recommend checking the examples at
> https://github.com/cerndb/dist-keras/tree/master/examples. Be aware,
> however, that distributed optimization is a research topic with several
> approaches and caveats. I wrote a blog post on this if you'd like some
> additional information on the topic:
> https://db-blog.web.cern.ch/blog/joeri-hermans/2017-01-distributed-deep-learning-apache-spark-and-keras
>
> However, if you don't want to use a distributed optimization algorithm, we
> also support a "sequential trainer" which allows you to train a model on
> Spark dataframes.
>
> Kind regards,
>
> Joeri
> From: Nick Pentreath [nick.pentre...@gmail.com]
> Sent: 23 February 2017 13:39
> To: dev@spark.apache.org
> Subject: Re: Implementation of RNN/LSTM in Spark
>
> The short answer is there is none, and it is highly unlikely there will
> be one inside Spark MLlib any time in the near future.
>
> The best bets are to look at other DL libraries - for JVM there is
> Deeplearning4J and BigDL (there are others but these seem to be the most
> comprehensive I have come across) - that run on Spark. Also there are
> various flavours of TensorFlow / Caffe on Spark. And of course the libs
> such as Torch, Keras, Tensorflow, MXNet, Caffe etc. Some of them have Java
> or Scala APIs and some form of Spark integration out there in the community
> (in varying states of development).
>
> Integrations with Spark are a bit patchy currently but include the
> "XOnSpark" flavours mentioned above and TensorFrames (again, there may be
> others).
>
> On Thu, 23 Feb 2017 at 14:23 n1kt0 wrote:
> Hi,
> can anyone tell me what the current status about RNNs in Spark is?
>
>
>