On Wed, Sep 24, 2014 at 11:12 AM, Suresh Srinivas
<sur...@hortonworks.com> wrote:
> Features are done in a separate branch for two reasons: 1) During a feature
> development the branch may be not functional 2) The high level approach and
> design is not very clear and development can continue while that is being
> sorted out.

Up until the last two days we had no benchmarks at all, so we had no
way to evaluate whether this performance improvement actually improves
performance.  Andrew commented on this as well in this thread, and we
also raised the issue on the JIRA.  I am glad that a few benchmarks
have finally been posted.  In the future, I would appreciate it if
benchmarks were posted a day or two before a merge vote for a
performance improvement is called.  As it is, we are racing the clock
to figure out how well this works, and that puts the reviewers in an
unpleasant position.

> In case of this feature, clearly (2) is not an issue. We have
> had enough discussion about the approach. I also think this branch is ready
> for merge without rendering trunk not functional.

I agree that this can be merged without rendering trunk
non-functional.  I don't agree that we have reached consensus on all
of the high-level approach and design.

There are two areas where I think we need more clarification.  The
first is whether eviction strategies other than LRU can actually be
implemented.  Yes, there is a pluggable interface, but I am concerned
that it may not be flexible enough to express anything beyond LRU.  I
am working on a patch that implements 2Q now, so I will find out for
myself, I guess.
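
To make the flexibility question concrete, here is a rough sketch of
what a scan-resistant policy like 2Q needs to observe.  The interface
and names below are mine, purely illustrative; they are not the
branch's actual plug point:

// Hypothetical names -- a sketch of what a simplified 2Q eviction
// policy needs to observe, not the interface in the branch.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;

interface EvictionPolicy {
  void blockInserted(long blockId);   // replica written to RAM disk
  void blockAccessed(long blockId);   // replica read again
  void blockEvicted(long blockId);    // replica persisted and dropped
  Long chooseVictim();                // next replica to evict, or null
}

/**
 * Simplified 2Q: first-time blocks go into a FIFO (A1in); blocks seen
 * again after falling out of A1in (remembered in the A1out ghost list)
 * are promoted to an LRU of hot blocks (Am). A one-pass scan only
 * churns A1in and never displaces the hot set in Am.
 */
class TwoQueuePolicy implements EvictionPolicy {
  private final int a1inMax, a1outMax;
  private final Deque<Long> a1in = new ArrayDeque<>();
  private final LinkedHashSet<Long> a1out = new LinkedHashSet<>();
  private final LinkedHashMap<Long, Boolean> am =
      new LinkedHashMap<>(16, 0.75f, true);   // access-ordered LRU

  TwoQueuePolicy(int a1inMax, int a1outMax) {
    this.a1inMax = a1inMax;
    this.a1outMax = a1outMax;
  }

  public synchronized void blockInserted(long id) {
    if (a1out.remove(id)) {
      am.put(id, Boolean.TRUE);   // re-written after eviction: hot
    } else {
      a1in.addLast(id);           // never seen before: probation
    }
  }

  public synchronized void blockAccessed(long id) {
    am.get(id);  // refreshes recency if hot; A1in entries are NOT
                 // promoted on access, which is the scan resistance
  }

  public synchronized Long chooseVictim() {
    if (a1in.size() > a1inMax) {
      return a1in.peekFirst();    // probation queue over budget
    }
    for (Long hot : am.keySet()) {
      return hot;                 // least recently used hot block
    }
    return a1in.peekFirst();      // may be null if nothing is cached
  }

  public synchronized void blockEvicted(long id) {
    if (a1in.remove(id)) {
      a1out.add(id);              // remember it in the ghost list
      if (a1out.size() > a1outMax) {
        a1out.remove(a1out.iterator().next());
      }
    }
    am.remove(id);
  }
}

The parts that matter are the separate insert/access/evict callbacks
and the ghost list of recently evicted block ids.  If the plug point
only exposes a single "touch this block" notification, 2Q (and ARC)
cannot be expressed, and that is exactly what I want to pin down.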

The other area is how the memory management for HDFS-6581 fits in with
the memory management for HDFS-4949.  I am very concerned that this
has been handwaved away as future work.  If we create a configuration
mess for system administrators as a result, I will be sad.  This is
the kind of design that I would have liked to see done a while ago.  I
would certainly veto any merge to 2.6 until we have figured this out.
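
To spell out the configuration concern: as far as I can tell,
administrators end up sizing two unrelated memory budgets on the same
DataNode, and nothing ties them together.  Here is a rough sketch of
the kind of sanity check I would want (dfs.datanode.max.locked.memory
is the existing HDFS-4949 setting; the lazy-persist key and the
host-memory figure below are hypothetical placeholders):

// Illustration only -- not code from the branch. The lazy-persist key
// is a hypothetical stand-in for however the RAM disk gets bounded.
import org.apache.hadoop.conf.Configuration;

public class MemoryBudgetCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // HDFS-4949: memory the DataNode may mlock for cached replicas.
    long cacheBytes =
        conf.getLong("dfs.datanode.max.locked.memory", 0L);

    // HDFS-6581: memory backing lazily persisted replicas
    // (hypothetical key).
    long lazyPersistBytes =
        conf.getLong("dfs.datanode.lazypersist.max.memory", 0L);

    // Hypothetical placeholder for the host's physical memory.
    long hostBytes = 64L * 1024 * 1024 * 1024;

    // Today the two features are sized independently; nothing stops
    // their sum from overcommitting the machine.
    if (cacheBytes + lazyPersistBytes > hostBytes) {
      System.err.println("Cache (" + cacheBytes + ") + lazy persist ("
          + lazyPersistBytes + ") exceeds host memory " + hostBytes);
    }
  }
}

Whether the answer is a single shared budget, a cap on the sum, or
just clear documentation, I would like it decided before the config
keys are frozen.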

I commented about this a month ago:
https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106025&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106025
But my concerns were not addressed in the branch.

If these two design clarifications can be made in time for the merge,
I will be -0 on this.  I have opened HDFS-7141 and HDFS-7142 to
continue the discussion on these points.

> That leaves us with one
> objection that is being raised; the content is not complete. This did not
> prevent recently merged crypto file system work from being merged to trunk.
> We discussed and decided that it can be merged to trunk and we will finish
> the remaining work that is a must have in trunk before merging to branch-2.
> That is a reasonable approach for this feature as well. In fact there are a
> lot of bug fixes and improvements still happening on crypto work even after
> branch-2 merge (which is a good thing in my opinion, it shows that the
> testing is continuing and feature is still being improved).

I am OK with merging this before it is complete.  I was just pointing
out that this could have benefited from more time in development.  It
would have been a much smoother merge and less stressful for everyone.

>
> As regards to the eviction policy we have had discussion in the jira. I
> disagree that there exists a "usable" or "better" strategy that works well
> for all the work loads. I also disagree that additional policy to the
> existing LRU is needed, leave alone it being needed for trunk merge. New
> eviction policies need to be developed as people experiment with it. Hence
> making the eviction policy pluggable is great way to approach it.

LRU is a poor policy for scan workloads.  For example, see here:
http://www.cse.iitd.ernet.in/~sbansal/os/lec/l30.html
"In general, LRU works well, but there are certain workloads where LRU
performs very poorly. One well-known workload is a "scan" wherein a
large memory region (larger than physical memory) is accessed in quick
succession. e.g... Such scans are often quite common, as some thread
may want to go through all its data. And so, plain LRU (or CLOCK) does
not work as well in practice. Instead a variant of LRU that considers
both recency and frequency of access to a page, is typically used in
an OS (e.g., 2Q, CLOCK with Adaptive Replacement, etc.)."

So is your argument that we are not using this for scanning
workloads?  That seems tough to believe, and as I said earlier, you
have posted no system-level benchmarks, so how can I evaluate that
claim?
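
To make the scan problem concrete, here is a toy simulation (the
numbers are arbitrary, purely illustrative): a 100-entry LRU cache, a
50-block hot set touched on every round, and a 1000-block one-pass
scan in between.  Every scan pass flushes the hot set, so the hot
blocks miss on every round even though they are re-referenced
constantly; a scan-resistant policy such as 2Q would keep them
resident after the first round.

import java.util.LinkedHashMap;
import java.util.Map;

public class LruScanDemo {
  public static void main(String[] args) {
    final int capacity = 100;
    // Access-ordered map that evicts its least-recently-used entry.
    LinkedHashMap<Integer, Boolean> lru =
        new LinkedHashMap<Integer, Boolean>(capacity, 0.75f, true) {
          protected boolean removeEldestEntry(
              Map.Entry<Integer, Boolean> eldest) {
            return size() > capacity;
          }
        };

    long hits = 0, misses = 0;
    for (int round = 0; round < 100; round++) {
      // Hot set: 50 blocks touched every round; fits in the cache.
      for (int b = 0; b < 50; b++) {
        if (lru.get(b) != null) { hits++; }
        else { misses++; lru.put(b, Boolean.TRUE); }
      }
      // Scan: 1000 blocks read once each, larger than the cache.
      for (int b = 0; b < 1000; b++) {
        int id = 1000 + round * 1000 + b;
        if (lru.get(id) != null) { hits++; }
        else { misses++; lru.put(id, Boolean.TRUE); }
      }
    }
    // Under plain LRU the hot set is gone by the time each new round
    // starts, so its hit rate stays at zero.
    System.out.printf("hits=%d misses=%d hit rate=%.1f%%%n",
        hits, misses, 100.0 * hits / (hits + misses));
  }
}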

> There are some testing information posted on the jira, such as performance.
> But I think we need more details on how this feature is being tested.
> Arpit, can you please post test details. As for me, given how much work
> (both discussion and coding) has gone into it, and given it is a very
> important feature that allows experimentation to use memory tier, I would
> like to see this available in release 2.6.

Just to reiterate, I am -0 provided we can address HDFS-7141 and
HDFS-7142 before this gets set in stone.  I would hate to -1 this,
because it would mean that you could not call another vote for a week.
But I feel that it would have been better to float the idea of a merge
on the JIRA before actually calling it, to avoid having discussions
like this where we are racing the clock.

thanks,
Colin

>
> On Tue, Sep 23, 2014 at 6:09 PM, Colin McCabe <cmcc...@alumni.cmu.edu>
> wrote:
>
>> This seems like a really aggressive timeframe for a merge.  We still
>> haven't implemented:
>>
>> * Checksum skipping on read and write from lazy persisted replicas.
>> * Allowing mmaped reads from the lazy persisted data.
>> * Any eviction strategy other than LRU.
>> * Integration with cache pool limits (how do HDFS-4949 and lazy
>> persist replicas share memory)?
>> * Eviction from RAM disk via truncation (HDFS-6918)
>> * Metrics
>> * System testing to find out how useful this is, and what the best
>> eviction strategy is.
>>
>> I see why we might want to defer checksum skipping, metrics, allowing
>> mmap, eviction via truncation, and so forth until later.  But I feel
>> like we need to figure out how this will integrate with the memory
>> used by HDFS-4949 before we merge.  I also would like to see another
>> eviction strategy other than LRU, which is a very poor eviction
>> strategy for scanning workloads.  I mentioned this a few times on the
>> JIRA.
>>
>> I'd also like to get some idea of how much testing this has received
>> in a multi-node cluster.  What makes us confident that this is the
>> right time to merge, rather than in a week or two?
>>
>> best,
>> Colin
>>
>>
>> On Tue, Sep 23, 2014 at 4:55 PM, Arpit Agarwal <aagar...@hortonworks.com>
>> wrote:
>> > I have posted write benchmark results to the Jira.
>> >
>> > On Tue, Sep 23, 2014 at 3:41 PM, Arpit Agarwal
>> > <aagar...@hortonworks.com> wrote:
>> >
>> >> Hi Andrew, I said "it is not going to be a substantial fraction of
>> >> memory bandwidth". That is certainly not the same as saying it
>> >> won't be good or there won't be any improvement.
>> >>
>> >> Any time you have transfers over RPC or the network stack you will
>> >> not get close to the memory bandwidth even for intra-host
>> >> transfers.
>> >>
>> >> I'll add some micro-benchmark results to the Jira shortly.
>> >>
>> >> Thanks,
>> >> Arpit
>> >>
>> >> On Tue, Sep 23, 2014 at 2:33 PM, Andrew Wang <andrew.w...@cloudera.com>
>> >> wrote:
>> >>
>> >>> Hi Arpit,
>> >>>
>> >>> Here is the comment. It was certainly not my intention to misquote
>> >>> anyone.
>> >>>
>> >>> https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14138223&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14138223
>> >>>
>> >>> Quote:
>> >>>
>> >>> It would be nice to see that would could get a substantial fraction of
>> >>> memory bandwidth when writing to a single replica in-memory.
>> >>>
>> >>> The comparison will be interesting but I can tell you without
>> >>> measurement it is not going to be a substantial fraction of memory
>> >>> bandwidth. We are still going through DataTransferProtocol with
>> >>> all the copies and overhead that involves.
>> >>>
>> >>> When the goal is in-memory writes and we are unable to achieve a
>> >>> substantial fraction of memory bandwidth, to me that is "not good
>> >>> performance."
>> >>>
>> >>> I also looked through the subtasks, and AFAICT the only one
>> >>> related to improving this is deferring checksum computation. The
>> >>> benchmarking we did on HDFS-4949 showed that this only really
>> >>> helps when you're down to single copy or zero copies with SCR/ZCR.
>> >>> DTP reads didn't see much of an improvement, so I'd guess the same
>> >>> would be true for DTP writes.
>> >>>
>> >>> I think my above three questions are still open, as well as my question
>> >>> about why we're merging now, as opposed to when the performance of the
>> >>> branch is proven out.
>> >>>
>> >>> Thanks,
>> >>> Andrew
>> >>>
>> >>> On Tue, Sep 23, 2014 at 2:10 PM, Arpit Agarwal
>> >>> <aagar...@hortonworks.com> wrote:
>> >>>
>> >>> > Andrew, don't misquote me. Can you link the comment where I said
>> >>> > performance wasn't going to be good?
>> >>> >
>> >>> > I will add some preliminary write results to the Jira later
>> >>> > today.
>> >>> >
>> >>> > > What's the plan to improve write performance?
>> >>> > I described this in response to your and Colin's comments on
>> >>> > the Jira.
>> >>> >
>> >>> > For the benefit of folks not following the Jira, the immediate
>> >>> > task we'd like to get done post-merge is moving checksum
>> >>> > computation off the write path. Also see open subtasks of
>> >>> > HDFS-6581 for other planned perf improvements.
>> >>> >
>> >>> > Thanks,
>> >>> > Arpit
>> >>> >
>> >>> >
>> >>> > On Tue, Sep 23, 2014 at 1:07 PM, Andrew Wang
>> >>> > <andrew.w...@cloudera.com> wrote:
>> >>> >
>> >>> > > Hi Arpit,
>> >>> > >
>> >>> > > On HDFS-6581, I asked for write benchmarks on Sep 19th, and you
>> >>> > > responded that the performance wasn't going to be good. However,
>> >>> > > I thought the primary goal of this JIRA was to improve write
>> >>> > > performance, and write performance is listed as the first
>> >>> > > feature requirement in the design doc.
>> >>> > >
>> >>> > > So, this leads me to a few questions, which I also asked last
>> >>> > > week on the JIRA (I believe still unanswered):
>> >>> > >
>> >>> > > - What's the plan to improve write performance?
>> >>> > > - What kind of performance can we expect after the plan is
>> >>> > >   completed?
>> >>> > > - Can this expected performance be validated with a prototype?
>> >>> > >
>> >>> > > Even with these questions answered, I don't understand the need
>> >>> > > to merge this before the write optimization work is completed.
>> >>> > > Write perf is listed as a feature requirement, so the branch can
>> >>> > > reasonably be called not feature complete until it's shown to be
>> >>> > > faster.
>> >>> > >
>> >>> > > Thanks,
>> >>> > > Andrew
>> >>> > >
>> >>> > > On Tue, Sep 23, 2014 at 11:47 AM, Jitendra Pandey
>> >>> > > <jiten...@hortonworks.com> wrote:
>> >>> > >
>> >>> > > > +1. I have reviewed most of the code in the branch, and I think
>> >>> > > > its ready to be merged to trunk.
>> >>> > > >
>> >>> > > >
>> >>> > > > On Mon, Sep 22, 2014 at 5:24 PM, Arpit Agarwal
>> >>> > > > <aagar...@hortonworks.com> wrote:
>> >>> > > >
>> >>> > > > > HDFS Devs,
>> >>> > > > >
>> >>> > > > > We propose merging the HDFS-6581 development branch to trunk.
>> >>> > > > >
>> >>> > > > > The work adds support to write to HDFS blocks in memory. The
>> >>> > > > > target use case covers applications writing relatively small,
>> >>> > > > > intermediate data sets with low latency. We introduce a new
>> >>> > > > > CreateFlag for the existing CreateFile API. HDFS will
>> >>> > > > > subsequently attempt to place replicas of file blocks in
>> >>> > > > > local memory with disk writes occurring off the hot path. The
>> >>> > > > > current design is a simplification of original ideas from
>> >>> > > > > Sanjay Radia on HDFS-5851.
>> >>> > > > >
>> >>> > > > > Key goals of the feature were minimal API changes to reduce
>> >>> > > > > application burden and best effort data durability. The
>> >>> > > > > feature is optional and requires appropriate DN configuration
>> >>> > > > > from administrators.
>> >>> > > > >
>> >>> > > > > Design doc:
>> >>> > > > > https://issues.apache.org/jira/secure/attachment/12661926/HDFSWriteableReplicasInMemory.pdf
>> >>> > > > >
>> >>> > > > > Test plan:
>> >>> > > > > https://issues.apache.org/jira/secure/attachment/12669452/Test-Plan-for-HDFS-6581-Memory-Storage.pdf
>> >>> > > > >
>> >>> > > > > There are 28 resolved sub-tasks under HDFS-6581, 3 open tasks
>> >>> > > > > for tests+Jenkins issues and 7 open subtasks tracking planned
>> >>> > > > > improvements. The latest merge patch is 3300 lines of changed
>> >>> > > > > code of which 1300 lines is new and updated tests. Merging the
>> >>> > > > > branch to trunk will allow HDFS applications to start
>> >>> > > > > evaluating the feature. We will continue work on
>> >>> > > > > documentation, performance tuning and metrics in parallel with
>> >>> > > > > the vote and post-merge.
>> >>> > > > >
>> >>> > > > > Contributors to design and code include Xiaoyu Yao, Sanjay
>> >>> > > > > Radia, Jitendra Pandey, Tassapol Athiapinya, Gopal V, Bikas
>> >>> > > > > Saha, Vikram Dixit, Suresh Srinivas and Chris Nauroth.
>> >>> > > > >
>> >>> > > > > Thanks to Haohui Mai, Colin Patrick McCabe, Andrew Wang, Todd
>> >>> > > > > Lipcon, Eric Baldeschwieler and Vinayakumar B for providing
>> >>> > > > > useful feedback on HDFS-6581, HDFS-5851 and sub-tasks.
>> >>> > > > >
>> >>> > > > > The vote runs for the usual 7 days and will expire at 12am
>> >>> > > > > PDT on Sep 30.
>> >>> > > > > Here is my +1 for the merge.
>> >>> > > > >
>> >>> > > > > Regards,
>> >>> > > > > Arpit
>> >>> > > > >
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
>
>
