On Mon, May 1, 2017 at 7:04 PM, Yiming Zang <yz...@twitter.com.invalid> wrote:
> What about long poll read? I know it's not in OSS version yet, but if we're > going to support long poll read in open source version in the future we > need to think about this. Based on my understanding, the long poll read > request will be satisfied once the LAC is advanced to previousLAC+1. > > For example, the current lac=1 and client issue a long poll read request > for entry 2 with previousLAC=1. Now we send a write request for entry 2, > and when that entry is added to memTable and LAC is advanced to 2, When adding entry 2, LAC is not 2. LAC is always less than the entry id. > and that > entry is not yet written to Journal, the long poll read request can already > be satisfied, and client will get response back for entry 2, but in fact, > entry 2 could even failed to be written to Journal, or bookie could crash > at that time. Will that bring inconsistency? > > Correct me if I'm wrong. > > Best, > Yiming > > On Mon, May 1, 2017 at 6:52 PM, Sijie Guo <guosi...@gmail.com> wrote: > > > On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri < > > jujj...@gmail.com> > > wrote: > > > > > The real problem/issue is - having extremely fast journal disk doesn't > > > really mask write latencies from a slower ledger disk. > > > > > > > In most of the case, it mask the write latencies from a slower ledger > disk. > > Because the write should only happen in the memory. > > > > The worse case here is the write is being throttled - that typically > means > > something really bad happening. > > > > In this case, a larger write buffer would help? > > > > > > > > > > To address this rate correctness issue, cant we read from journal if > the > > > entryid >= LAC (as we cache now on bookie) and journal read fails? > > > > > > > First, the correctness isn't entryid >= LAC case, as client can't really > > read beyond LAC. The correctness issue is on entryid <= LAC case: the > entry > > appears on journal but not in ledger storage. > > > > Second, the purpose of having a separate journal disk is to avoid reads > on > > journal that would impact writes. If we circle back reads on journals, > this > > would potentially cause performance degradation on writes as well. > > > > Last, in order to be able to read journals, you basically still need to > add > > some indexed structures into memory, so you know where to look up the > > entries. No matter you store an entry in memtable or just store an entry > > pointer pointing back to journal, you will still hit same problem - as > you > > can keep everything in memory, which you have to write data back to disks > > and the throttling would happen again. > > > > From all these three points, I don't see too much value about changing > > this. Instead, the question would be simpler - can you increase the > memory > > buffer size? If you can't, that means your hardware's capacity can't > > basically keep up with the incoming write traffic. More capacity is > needed > > then. > > > > - Sijie > > > > > > > > > > > > On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <guosi...@gmail.com> wrote: > > > > > > > In the other to think about this, > > > > > > > > when 'throttling' happens, it typically means: > > > > > > > > - the bookie doesn't have enough bandwidth/capacity to keep up with > the > > > > traffic. > > > > - the disks on the bookie might have problems (e.g. slow down or > other > > > > hardware issues). > > > > > > > > Either case can happen. It might be worth to let the throttling kick > > in, > > > > rather than let journal disk accepting writes and putting ledger > > storage > > > > into worse state. > > > > > > > > - Sijie > > > > > > > > On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <guosi...@gmail.com> > wrote: > > > > > > > > > > > > > > > > > > > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri < > > > > > jujj...@gmail.com> wrote: > > > > > > > > > >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri < > > > > >> jujj...@gmail.com> > > > > >> wrote: > > > > >> > > > > >> > > > > > >> > > > > > >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <guosi...@gmail.com> > > > wrote: > > > > >> > > > > > >> >> I don't think this is an inconsistent issue. The in memory > update > > > is > > > > >> >> updating lac not current entry. Even the entry is added into > > memory > > > > but > > > > >> >> this entry will not be readable after lac is advanced, lac is > > > > advanced > > > > >> >> only > > > > >> >> after the next entry is added which happened after current > entry > > is > > > > >> acked. > > > > >> >> > > > > >> > > > > > >> > That is not true. You are talking about piggy-backed LAC only. > But > > > > with > > > > >> > Explicit LAC > > > > >> > you don't need next entry to move LAC on bookie. > > > > >> > > > > > >> > > > > >> Sorry, I pushed send before finishing. :) > > > > >> > > > > >> So you don't need next entry to move LAC forward, but its client > job > > > to > > > > >> move LAC forward. > > > > >> Hence client need to send explicit LAC to update LAC after it hear > > > back > > > > >> from AckQuorum. > > > > >> Hence Sijie is right on this part, it is not a consistency issue. > :) > > > > >> > > > > >> > > > > >> But never the less, I believe we need to change the order as it is > > not > > > > >> completely shielding > > > > >> writes from other activity. @Sijie do you see any issue if we > write > > to > > > > >> journal, ack to client > > > > >> and the write to ledger ? > > > > >> > > > > > > > > > > Based on my understanding about this email thread, the concern > comes > > > from > > > > > the latency on write. However, it doesn't change any latency > behavior > > > if > > > > > you add to journal first and add to memtable later. 'Throttling' > will > > > > still > > > > > happen when you add entry to memtable. > > > > > > > > > > So the question would be "can we write to journal and back back > > > immediate > > > > > after written to journal, and add the entry to memtable in > > background"? > > > > > > > > > > The answer would be "no". Because this would volatile the > > correctness. > > > It > > > > > might end up a case - the lac is already advanced but the entry is > > not > > > > > found - it can happen in following sequence. > > > > > > > > > > - Client issue write entry N (lac = N-1) > > > > > - Bookie write the entry to the journal and acknowledge. Entry N is > > in > > > > the > > > > > journal but haven't been added to the memtable. > > > > > - Client received the acknowledge and advanced LAC from N-1 to N. > > > > > - Client write another entry N+1 (lac = N) to advance LAC. > > > > > - Another client (reader) detects LAC is advanced from N-1 to N. it > > > > > attempts to read entry N but N isn't added to ledger storage. (*The > > > > > correctness is volatiled here*) > > > > > > > > > > So to summarize my thoughts on this: > > > > > > > > > > - The acknowledge should happen after both writing the entry to > > journal > > > > > and write the entry to memtable. > > > > > - The order of writing the entry to journal and writing entry to > > > memtable > > > > > doesn't matter here. > > > > > - Writing the entry to the memtable helps with tailing latency > > (because > > > > it > > > > > will advance LAC first). > > > > > > > > > > - Sijie > > > > > > > > > > > > > > >> > > > > >> JV > > > > >> > > > > >> > > > > >> > > > > > >> > > > > > >> >> So adding the entry to memory doesn't expose any consistency > > issue. > > > > >> >> > > > > >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" < > > > > jujj...@gmail.com> > > > > >> >> wrote: > > > > >> >> > > > > >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang > > > > <yz...@twitter.com.invalid > > > > >> > > > > > >> >> wrote: > > > > >> >> > > > > >> >> > Hi Andrey, > > > > >> >> > > > > > >> >> > That's a good point, and you're actually correct that if > write > > to > > > > >> >> memTable > > > > >> >> > got throttled somehow, the addEntry request latency will be > > > > affected > > > > >> a > > > > >> >> lot. > > > > >> >> > This actually happens a few times in production cluster. > > > Normally, > > > > >> the > > > > >> >> idea > > > > >> >> > of using Journal is to write data to the write-ahead log and > > then > > > > >> >> persist > > > > >> >> > the actual data to disks or add to memTable. However, my > > > > >> understanding > > > > >> >> of > > > > >> >> > why we choose to write entry to ledgerStorage first is to > > improve > > > > the > > > > >> >> > tailing-read performance. > > > > >> >> > > > > > >> >> > In SortedLedgerStorage.java, we first add entry to memTable > and > > > > then > > > > >> we > > > > >> >> > update lastAddConfirmed, which means if there's a long poll > > read > > > > >> request > > > > >> >> or > > > > >> >> > readLastAddConfirmed request, it will immediately get > satisfied > > > for > > > > >> the > > > > >> >> > latest entry before we actually log the entry into Journal. > So > > > > >> >> tailing-read > > > > >> >> > doesn't actually need to wait for any disk operation in > > > Bookkeeper > > > > >> >> > including Journal operation. > > > > >> >> > > > > > >> >> > public long addEntry(ByteBuffer entry) throws IOException { > > > > >> >> > long ledgerId = entry.getLong(); > > > > >> >> > long entryId = entry.getLong(); > > > > >> >> > long lac = entry.getLong(); > > > > >> >> > entry.rewind(); > > > > >> >> > memTable.addEntry(ledgerId, entryId, entry, this); > > > > >> >> > ledgerCache.updateLastAddConfirmed(ledgerId, lac); > > > > >> >> > return entryId; > > > > >> >> > } > > > > >> >> > > > > > >> >> > But thinking about here, I'm wondering if it's actually safe > to > > > > >> update > > > > >> >> the > > > > >> >> > LAC before we write the entry to Journal. What if we tell the > > > > client > > > > >> the > > > > >> >> > LAC has been updated but we actually failed to write the > entry > > to > > > > >> >> Journal > > > > >> >> > and Bookie crashed at that time? Would this bring any > > > inconsistency > > > > >> >> issue? > > > > >> >> > > > > > >> >> > > > > >> >> Good point. This is indeed an inconsistency issue. BK > guarantees > > > "if > > > > >> you > > > > >> >> read once you can read it all the time". > > > > >> >> If it is really done for LAC it is not really good idea. > Unless I > > > am > > > > >> >> missing something, this must be changed ASAP. > > > > >> >> > > > > >> >> Thanks, > > > > >> >> JV > > > > >> >> > > > > >> >> > > > > >> >> > > > > > >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov < > > > > >> >> andrey.yego...@gmail.com> > > > > >> >> > wrote: > > > > >> >> > > > > > >> >> > > Hi, > > > > >> >> > > > > > > >> >> > > Looking at the code in Bookie.java I noticed that write to > > > > journal > > > > >> >> (which > > > > >> >> > > is supposed to be a write-ahead log as I understand) > happened > > > > after > > > > >> >> write > > > > >> >> > > to ledger storage. > > > > >> >> > > This looks counter-intuitive, can someone explain why is it > > > done > > > > in > > > > >> >> this > > > > >> >> > > order? > > > > >> >> > > > > > > >> >> > > My primary concern is that ledger storage write can be > > delayed > > > > >> (i.e. > > > > >> >> > > EntryMemTable's addEntry can do throttleWriters() in some > > > cases) > > > > >> thus > > > > >> >> > > dragging overall client's view of add latency up even > though > > it > > > > is > > > > >> >> > possible > > > > >> >> > > that journal's write (i.e. in case of dedicated journal > disk) > > > > will > > > > >> >> > complete > > > > >> >> > > faster. > > > > >> >> > > > > > > >> >> > > private void addEntryInternal(LedgerDescriptor handle, > > > > >> ByteBuffer > > > > >> >> > > entry, WriteCallback cb, Object ctx) > > > > >> >> > > > > > > >> >> > > throws IOException, BookieException { > > > > >> >> > > > > > > >> >> > > long ledgerId = handle.getLedgerId(); > > > > >> >> > > > > > > >> >> > > entry.rewind(); > > > > >> >> > > > > > > >> >> > > *// ledgerStorage.addEntry() is happening here* > > > > >> >> > > > > > > >> >> > > long entryId = handle.addEntry(entry); > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > entry.rewind(); > > > > >> >> > > > > > > >> >> > > writeBytes.add(entry.remaining()); > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > LOG.trace("Adding {}@{}", entryId, ledgerId); > > > > >> >> > > > > > > >> >> > > *// journal add entry is happening here* > > > > >> >> > > > > > > >> >> > > *// callback/response to client is sent after journal add > is > > > > done.* > > > > >> >> > > > > > > >> >> > > journal.logAddEntry(entry, cb, ctx); > > > > >> >> > > > > > > >> >> > > } > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > ---------- > > > > >> >> > > Andrey Yegorov > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> -- > > > > >> >> Jvrao > > > > >> >> --- > > > > >> >> First they ignore you, then they laugh at you, then they fight > > you, > > > > >> then > > > > >> >> you win. - Mahatma Gandhi > > > > >> >> > > > > >> > > > > > >> > > > > > >> > > > > > >> > -- > > > > >> > Jvrao > > > > >> > --- > > > > >> > First they ignore you, then they laugh at you, then they fight > > you, > > > > then > > > > >> > you win. - Mahatma Gandhi > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > > > > >> -- > > > > >> Jvrao > > > > >> --- > > > > >> First they ignore you, then they laugh at you, then they fight > you, > > > then > > > > >> you win. - Mahatma Gandhi > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Jvrao > > > --- > > > First they ignore you, then they laugh at you, then they fight you, > then > > > you win. - Mahatma Gandhi > > > > > >