Re: why write to journal is happening after write to ledgerStorage?

Sijie Guo Mon, 01 May 2017 20:38:40 -0700

On Mon, May 1, 2017 at 7:04 PM, Yiming Zang <yz...@twitter.com.invalid>
wrote:


> What about long poll read? I know it's not in OSS version yet, but if we're
> going to support long poll read in open source version in the future we
> need to think about this. Based on my understanding, the long poll read
> request will be satisfied once the LAC is advanced to previousLAC+1.
>
> For example, the current lac=1 and client issue a long poll read request
> for entry 2 with previousLAC=1. Now we send a write request for entry 2,
> and when that entry is added to memTable and LAC is advanced to 2,


When adding entry 2, LAC is not 2. LAC is always less than the entry id.


> and that
> entry is not yet written to Journal, the long poll read request can already
> be satisfied, and client will get response back for entry 2, but in fact,
> entry 2 could even failed to be written to Journal, or bookie could crash
> at that time. Will that bring inconsistency?
>
> Correct me if I'm wrong.
>
> Best,
> Yiming
>
> On Mon, May 1, 2017 at 6:52 PM, Sijie Guo <guosi...@gmail.com> wrote:
>
> > On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri <
> > jujj...@gmail.com>
> > wrote:
> >
> > > The real problem/issue is - having extremely fast journal disk doesn't
> > > really mask write latencies from a slower ledger disk.
> > >
> >
> > In most of the case, it mask the write latencies from a slower ledger
> disk.
> > Because the write should only happen in the memory.
> >
> > The worse case here is the write is being throttled - that typically
> means
> > something really bad happening.
> >
> > In this case, a larger write buffer would help?
> >
> >
> > >
> > > To address this rate correctness issue, cant we read from journal if
> the
> > > entryid >= LAC (as we cache now on bookie) and journal read fails?
> > >
> >
> > First, the correctness isn't entryid >= LAC case, as client can't really
> > read beyond LAC. The correctness issue is on entryid <= LAC case: the
> entry
> > appears on journal but not in ledger storage.
> >
> > Second, the purpose of having a separate journal disk is to avoid reads
> on
> > journal that would impact writes. If we circle back reads on journals,
> this
> > would potentially  cause performance degradation on writes as well.
> >
> > Last, in order to be able to read journals, you basically still need to
> add
> > some indexed structures into memory, so you know where to look up the
> > entries. No matter you store an entry in memtable or just store an entry
> > pointer pointing back to journal, you will still hit same problem - as
> you
> > can keep everything in memory, which you have to write data back to disks
> > and the throttling would happen again.
> >
> > From all these three points, I don't see too much value about changing
> > this. Instead, the question would be simpler - can you increase the
> memory
> > buffer size? If you can't, that means your hardware's capacity can't
> > basically keep up with the incoming write traffic. More capacity is
> needed
> > then.
> >
> > - Sijie
> >
> >
> >
> > >
> > > On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <guosi...@gmail.com> wrote:
> > >
> > > > In the other to think about this,
> > > >
> > > > when 'throttling' happens,  it typically means:
> > > >
> > > > - the bookie doesn't have enough bandwidth/capacity to keep up with
> the
> > > > traffic.
> > > > - the disks on the bookie might have problems (e.g. slow down or
> other
> > > > hardware issues).
> > > >
> > > > Either case can happen. It might be worth to let the throttling kick
> > in,
> > > > rather than let journal disk accepting writes and putting ledger
> > storage
> > > > into worse state.
> > > >
> > > > - Sijie
> > > >
> > > > On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <guosi...@gmail.com>
> wrote:
> > > >
> > > > >
> > > > >
> > > > > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri <
> > > > > jujj...@gmail.com> wrote:
> > > > >
> > > > >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <
> > > > >> jujj...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <guosi...@gmail.com>
> > > wrote:
> > > > >> >
> > > > >> >> I don't think this is an inconsistent issue. The in memory
> update
> > > is
> > > > >> >> updating lac not current entry. Even the entry is added into
> > memory
> > > > but
> > > > >> >> this entry will not be readable after lac is advanced, lac is
> > > > advanced
> > > > >> >> only
> > > > >> >> after the next entry is added which happened after current
> entry
> > is
> > > > >> acked.
> > > > >> >>
> > > > >> >
> > > > >> > That is not true. You are talking about piggy-backed LAC only.
> But
> > > > with
> > > > >> > Explicit LAC
> > > > >> > you don't need next entry to move LAC on bookie.
> > > > >> >
> > > > >>
> > > > >> Sorry, I pushed send before finishing. :)
> > > > >>
> > > > >> So you don't need next entry to move LAC forward, but its client
> job
> > > to
> > > > >> move LAC forward.
> > > > >> Hence client need to send explicit LAC to update LAC after it hear
> > > back
> > > > >> from AckQuorum.
> > > > >> Hence Sijie is right on this part, it is not a consistency issue.
> :)
> > > > >>
> > > > >>
> > > > >> But never the less, I believe we need to change the order as it is
> > not
> > > > >> completely shielding
> > > > >> writes from other activity. @Sijie do you see any issue if we
> write
> > to
> > > > >> journal, ack to client
> > > > >> and the write to ledger ?
> > > > >>
> > > > >
> > > > > Based on my understanding about this email thread, the concern
> comes
> > > from
> > > > > the latency on write. However, it doesn't change any latency
> behavior
> > > if
> > > > > you add to journal first and add to memtable later. 'Throttling'
> will
> > > > still
> > > > > happen when you add entry to memtable.
> > > > >
> > > > > So the question would be "can we write to journal and back back
> > > immediate
> > > > > after written to journal, and add the entry to memtable in
> > background"?
> > > > >
> > > > > The answer would be "no". Because this would volatile the
> > correctness.
> > > It
> > > > > might end up a case - the lac is already advanced but the entry is
> > not
> > > > > found - it can happen in following sequence.
> > > > >
> > > > > - Client issue write entry N (lac = N-1)
> > > > > - Bookie write the entry to the journal and acknowledge. Entry N is
> > in
> > > > the
> > > > > journal but haven't been added to the memtable.
> > > > > - Client received the acknowledge and advanced LAC from N-1 to N.
> > > > > - Client write another entry N+1 (lac = N) to advance LAC.
> > > > > - Another client (reader) detects LAC is advanced from N-1 to N. it
> > > > > attempts to read entry N but N isn't added to ledger storage. (*The
> > > > > correctness is volatiled here*)
> > > > >
> > > > > So to summarize my thoughts on this:
> > > > >
> > > > > - The acknowledge should happen after both writing the entry to
> > journal
> > > > > and write the entry to memtable.
> > > > > - The order of writing the entry to journal and writing entry to
> > > memtable
> > > > > doesn't matter here.
> > > > > - Writing the entry to the memtable helps with tailing latency
> > (because
> > > > it
> > > > > will advance LAC first).
> > > > >
> > > > > - Sijie
> > > > >
> > > > >
> > > > >>
> > > > >> JV
> > > > >>
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> >> So adding the entry to memory doesn't expose any consistency
> > issue.
> > > > >> >>
> > > > >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" <
> > > > jujj...@gmail.com>
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang
> > > > <yz...@twitter.com.invalid
> > > > >> >
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> > Hi Andrey,
> > > > >> >> >
> > > > >> >> > That's a good point, and you're actually correct that if
> write
> > to
> > > > >> >> memTable
> > > > >> >> > got throttled somehow, the addEntry request latency will be
> > > > affected
> > > > >> a
> > > > >> >> lot.
> > > > >> >> > This actually happens a few times in production cluster.
> > > Normally,
> > > > >> the
> > > > >> >> idea
> > > > >> >> > of using Journal is to write data to the write-ahead log and
> > then
> > > > >> >> persist
> > > > >> >> > the actual data to disks or add to memTable. However, my
> > > > >> understanding
> > > > >> >> of
> > > > >> >> > why we choose to write entry to ledgerStorage first is to
> > improve
> > > > the
> > > > >> >> > tailing-read performance.
> > > > >> >> >
> > > > >> >> > In SortedLedgerStorage.java, we first add entry to memTable
> and
> > > > then
> > > > >> we
> > > > >> >> > update lastAddConfirmed, which means if there's a long poll
> > read
> > > > >> request
> > > > >> >> or
> > > > >> >> > readLastAddConfirmed request, it will immediately get
> satisfied
> > > for
> > > > >> the
> > > > >> >> > latest entry before we actually log the entry into Journal.
> So
> > > > >> >> tailing-read
> > > > >> >> > doesn't actually need to wait for any disk operation in
> > > Bookkeeper
> > > > >> >> > including Journal operation.
> > > > >> >> >
> > > > >> >> > public long addEntry(ByteBuffer entry) throws IOException {
> > > > >> >> > long ledgerId = entry.getLong();
> > > > >> >> > long entryId = entry.getLong();
> > > > >> >> > long lac = entry.getLong();
> > > > >> >> > entry.rewind();
> > > > >> >> > memTable.addEntry(ledgerId, entryId, entry, this);
> > > > >> >> > ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> > > > >> >> > return entryId;
> > > > >> >> > }
> > > > >> >> >
> > > > >> >> > But thinking about here, I'm wondering if it's actually safe
> to
> > > > >> update
> > > > >> >> the
> > > > >> >> > LAC before we write the entry to Journal. What if we tell the
> > > > client
> > > > >> the
> > > > >> >> > LAC has been updated but we actually failed to write the
> entry
> > to
> > > > >> >> Journal
> > > > >> >> > and Bookie crashed at that time? Would this bring any
> > > inconsistency
> > > > >> >> issue?
> > > > >> >> >
> > > > >> >>
> > > > >> >> Good point. This is indeed an inconsistency issue. BK
> guarantees
> > > "if
> > > > >> you
> > > > >> >> read once you can read it all the time".
> > > > >> >> If it is really done for LAC it is not really good idea.
> Unless I
> > > am
> > > > >> >> missing something, this must be changed ASAP.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >> JV
> > > > >> >>
> > > > >> >>
> > > > >> >> >
> > > > >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov <
> > > > >> >> andrey.yego...@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> >
> > > > >> >> > > Hi,
> > > > >> >> > >
> > > > >> >> > > Looking at the code in Bookie.java I noticed that write to
> > > > journal
> > > > >> >> (which
> > > > >> >> > > is supposed to be a write-ahead log as I understand)
> happened
> > > > after
> > > > >> >> write
> > > > >> >> > > to ledger storage.
> > > > >> >> > > This looks counter-intuitive, can someone explain why is it
> > > done
> > > > in
> > > > >> >> this
> > > > >> >> > > order?
> > > > >> >> > >
> > > > >> >> > > My primary concern is that ledger storage write can be
> > delayed
> > > > >> (i.e.
> > > > >> >> > > EntryMemTable's addEntry can do throttleWriters() in some
> > > cases)
> > > > >> thus
> > > > >> >> > > dragging overall client's view of add latency up even
> though
> > it
> > > > is
> > > > >> >> > possible
> > > > >> >> > > that journal's write (i.e. in case of dedicated journal
> disk)
> > > > will
> > > > >> >> > complete
> > > > >> >> > > faster.
> > > > >> >> > >
> > > > >> >> > >     private void addEntryInternal(LedgerDescriptor handle,
> > > > >> ByteBuffer
> > > > >> >> > > entry, WriteCallback cb, Object ctx)
> > > > >> >> > >
> > > > >> >> > >             throws IOException, BookieException {
> > > > >> >> > >
> > > > >> >> > >         long ledgerId = handle.getLedgerId();
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > > *// ledgerStorage.addEntry() is happening here*
> > > > >> >> > >
> > > > >> >> > >         long entryId = handle.addEntry(entry);
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > >         writeBytes.add(entry.remaining());
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> > > > >> >> > >
> > > > >> >> > > *// journal add entry is happening here*
> > > > >> >> > >
> > > > >> >> > > *// callback/response to client is sent after journal add
> is
> > > > done.*
> > > > >> >> > >
> > > > >> >> > >         journal.logAddEntry(entry, cb, ctx);
> > > > >> >> > >
> > > > >> >> > >     }
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > ----------
> > > > >> >> > > Andrey Yegorov
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Jvrao
> > > > >> >> ---
> > > > >> >> First they ignore you, then they laugh at you, then they fight
> > you,
> > > > >> then
> > > > >> >> you win. - Mahatma Gandhi
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Jvrao
> > > > >> > ---
> > > > >> > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > then
> > > > >> > you win. - Mahatma Gandhi
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jvrao
> > > > >> ---
> > > > >> First they ignore you, then they laugh at you, then they fight
> you,
> > > then
> > > > >> you win. - Mahatma Gandhi
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>

Re: why write to journal is happening after write to ledgerStorage?

Reply via email to