The real problem is that having an extremely fast journal disk doesn't really mask write latencies from a slower ledger disk.
To address this correctness issue, can't we read from the journal if the entryId >= LAC (as we cache it now on the bookie) and the ledger storage read fails?

On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <[email protected]> wrote:

> Another way to think about this: when 'throttling' happens, it typically
> means:
>
> - the bookie doesn't have enough bandwidth/capacity to keep up with the
>   traffic.
> - the disks on the bookie might have problems (e.g. slowdowns or other
>   hardware issues).
>
> Either case can happen. It might be worth letting the throttling kick in,
> rather than letting the journal disk accept writes and put ledger storage
> into a worse state.
>
> - Sijie
>
> On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <[email protected]> wrote:
>
> > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri
> > <[email protected]> wrote:
> >
> >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri
> >> <[email protected]> wrote:
> >>
> >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <[email protected]> wrote:
> >> >
> >> >> I don't think this is an inconsistency issue. The in-memory update
> >> >> updates the LAC, not the current entry. Even though the entry is
> >> >> added to memory, it will not be readable until the LAC is advanced,
> >> >> and the LAC is advanced only after the next entry is added, which
> >> >> happens after the current entry is acked.
> >> >
> >> > That is not true. You are talking about piggy-backed LAC only. But
> >> > with Explicit LAC you don't need the next entry to move the LAC on
> >> > the bookie.
> >>
> >> Sorry, I pushed send before finishing. :)
> >>
> >> So you don't need the next entry to move the LAC forward, but it's the
> >> client's job to move the LAC forward. Hence the client needs to send an
> >> explicit LAC to update the LAC after it hears back from the AckQuorum.
> >> Hence Sijie is right on this part, it is not a consistency issue. :)
> >>
> >> Nevertheless, I believe we need to change the order, as it is not
> >> completely shielding writes from other activity. @Sijie do you see any
> >> issue if we write to the journal, ack to the client, and then write to
> >> the ledger?
> >
> > Based on my understanding of this email thread, the concern comes from
> > the latency on write. However, it doesn't change any latency behavior if
> > you add to the journal first and add to the memtable later. 'Throttling'
> > will still happen when you add the entry to the memtable.
> >
> > So the question would be "can we write to the journal and ack back
> > immediately after the journal write, and add the entry to the memtable
> > in the background"?
> >
> > The answer would be "no", because this would violate correctness. It
> > might end up in a case where the LAC is already advanced but the entry
> > is not found. It can happen in the following sequence:
> >
> > - The client issues a write for entry N (lac = N-1).
> > - The bookie writes the entry to the journal and acknowledges. Entry N
> >   is in the journal but hasn't been added to the memtable.
> > - The client receives the acknowledgement and advances the LAC from N-1
> >   to N.
> > - The client writes another entry N+1 (lac = N) to advance the LAC.
> > - Another client (a reader) detects that the LAC has advanced from N-1
> >   to N. It attempts to read entry N, but N hasn't been added to ledger
> >   storage. (*Correctness is violated here*)
> >
> > So to summarize my thoughts on this:
> >
> > - The acknowledgement should happen after both writing the entry to the
> >   journal and writing the entry to the memtable.
> > - The order of writing the entry to the journal and writing the entry to
> >   the memtable doesn't matter here.
> > - Writing the entry to the memtable first helps with tailing latency
> >   (because it will advance the LAC first).
> >
> > - Sijie
> >
> >> JV
> >>
> >> >> So adding the entry to memory doesn't expose any consistency issue.
> >> >>
> >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri"
> >> >> <[email protected]> wrote:
> >> >>
> >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Hi Andrey,
> >> >> >
> >> >> > That's a good point, and you're correct that if the write to the
> >> >> > memTable gets throttled somehow, the addEntry request latency will
> >> >> > be affected a lot. This has actually happened a few times in our
> >> >> > production cluster. Normally, the idea of using a journal is to
> >> >> > write data to the write-ahead log and then persist the actual data
> >> >> > to disk or add it to the memTable. However, my understanding of
> >> >> > why we chose to write the entry to ledgerStorage first is to
> >> >> > improve tailing-read performance.
> >> >> >
> >> >> > In SortedLedgerStorage.java, we first add the entry to the
> >> >> > memTable and then update lastAddConfirmed, which means that a long
> >> >> > poll read request or readLastAddConfirmed request is satisfied
> >> >> > immediately for the latest entry, before we actually log the entry
> >> >> > into the journal. So a tailing read doesn't need to wait for any
> >> >> > disk operation in BookKeeper, including the journal operation.
> >> >> >
> >> >> >     public long addEntry(ByteBuffer entry) throws IOException {
> >> >> >         long ledgerId = entry.getLong();
> >> >> >         long entryId = entry.getLong();
> >> >> >         long lac = entry.getLong();
> >> >> >         entry.rewind();
> >> >> >         memTable.addEntry(ledgerId, entryId, entry, this);
> >> >> >         ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> >> >> >         return entryId;
> >> >> >     }
> >> >> >
> >> >> > But thinking about it, I'm wondering if it's actually safe to
> >> >> > update the LAC before we write the entry to the journal. What if
> >> >> > we tell the client the LAC has been updated, but we actually
> >> >> > failed to write the entry to the journal and the bookie crashed at
> >> >> > that time? Would this cause any inconsistency issue?
> >> >>
> >> >> Good point. This is indeed an inconsistency issue. BK guarantees "if
> >> >> you read once you can read it all the time". If it is really done
> >> >> for LAC, it is not a good idea. Unless I am missing something, this
> >> >> must be changed ASAP.
> >> >>
> >> >> Thanks,
> >> >> JV
> >> >>
> >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov
> >> >> > <[email protected]> wrote:
> >> >> >
> >> >> > > Hi,
> >> >> > >
> >> >> > > Looking at the code in Bookie.java, I noticed that the write to
> >> >> > > the journal (which is supposed to be a write-ahead log, as I
> >> >> > > understand it) happens after the write to ledger storage. This
> >> >> > > looks counter-intuitive. Can someone explain why it is done in
> >> >> > > this order?
> >> >> > >
> >> >> > > My primary concern is that the ledger storage write can be
> >> >> > > delayed (e.g. EntryMemTable's addEntry can call
> >> >> > > throttleWriters() in some cases), thus dragging up the client's
> >> >> > > overall view of add latency even though the journal write (e.g.
> >> >> > > with a dedicated journal disk) could complete faster.
> >> >> > >
> >> >> > >     private void addEntryInternal(LedgerDescriptor handle,
> >> >> > >             ByteBuffer entry, WriteCallback cb, Object ctx)
> >> >> > >             throws IOException, BookieException {
> >> >> > >         long ledgerId = handle.getLedgerId();
> >> >> > >         entry.rewind();
> >> >> > >         // ledgerStorage.addEntry() is happening here
> >> >> > >         long entryId = handle.addEntry(entry);
> >> >> > >
> >> >> > >         entry.rewind();
> >> >> > >         writeBytes.add(entry.remaining());
> >> >> > >
> >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> >> >> > >         // journal add entry is happening here
> >> >> > >         // callback/response to client is sent after journal add
> >> >> > >         // is done.
> >> >> > >         journal.logAddEntry(entry, cb, ctx);
> >> >> > >     }
> >> >> > >
> >> >> > > ----------
> >> >> > > Andrey Yegorov

--
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi
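
For reference, the sequence Sijie lists in the quoted thread (ack right after the journal write, memtable add in the background) can be sketched with a small toy model. This is only an illustration, not BookKeeper code: ToyBookie, journalWrite, applyToMemTable and readerFindsConfirmedEntry are made-up names, and the model assumes that the LAC piggybacked on entry N+1 becomes visible to readers as soon as that entry is journaled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the ack-ordering question; every name here is hypothetical. */
public class AckOrderingSketch {

    /** Minimal stand-in for a bookie serving a single ledger. */
    static class ToyBookie {
        private final Map<Long, String> journal = new HashMap<>();  // durable, not served to readers
        private final Map<Long, String> memTable = new HashMap<>(); // what readers can see
        private long lac = -1L;

        /** Makes the entry durable and records the LAC it carries (assumption of this sketch). */
        void journalWrite(long entryId, String payload, long piggybackedLac) {
            journal.put(entryId, payload);
            lac = Math.max(lac, piggybackedLac);
        }

        /** Makes an already journaled entry readable. */
        void applyToMemTable(long entryId) {
            memTable.put(entryId, journal.get(entryId));
        }

        long readLac() {
            return lac;
        }

        /** Returns the payload, or null if the entry is not in ledger storage yet. */
        String readEntry(long entryId) {
            return memTable.get(entryId);
        }
    }

    /**
     * Replays the sequence from the thread. With applyBeforeAck = false the
     * memtable add is treated as background work that has not run yet.
     * Returns true if a reader can read the entry the LAC points at.
     */
    static boolean readerFindsConfirmedEntry(boolean applyBeforeAck) {
        ToyBookie bookie = new ToyBookie();
        long n = 5L;

        // Client writes entry N carrying lac = N-1; the bookie acks afterwards.
        bookie.journalWrite(n, "entry-N", n - 1);
        if (applyBeforeAck) {
            bookie.applyToMemTable(n);
        }

        // Client saw the ack, advanced its LAC to N, and writes N+1 carrying lac = N.
        bookie.journalWrite(n + 1, "entry-N+1", n);
        if (applyBeforeAck) {
            bookie.applyToMemTable(n + 1);
        }

        // Reader observes LAC = N and tries to read entry N.
        long observedLac = bookie.readLac();
        return bookie.readEntry(observedLac) != null;
    }

    public static void main(String[] args) {
        System.out.println("ack after journal + memtable: " + readerFindsConfirmedEntry(true));
        System.out.println("ack after journal only:       " + readerFindsConfirmedEntry(false));
    }
}
```

In this model the first case prints true and the second prints false: once the ack is decoupled from the memtable add, a reader can observe LAC = N while entry N is still only in the journal, which is the gap a journal read fallback would have to cover.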
