Yes, in the /_ah/queue/deferred requests, we are almost certainly doing datastore/memcache work and very likely doing urlfetch work (I'm referring to this probabilistically, because we use deferred for many things).
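For context, deferred works by pickling the target callable plus its arguments and POSTing that payload to /_ah/queue/deferred, where a generic handler unpickles and invokes it; that's why the datastore/memcache/urlfetch work shows up under that one URL. A rough pure-Python sketch of the mechanism (simplified and illustrative, not the actual SDK source):

```python
import pickle

def defer_payload(func, *args, **kwargs):
    """Serialize a callable and its arguments, roughly as deferred does
    before enqueueing a POST to /_ah/queue/deferred."""
    return pickle.dumps((func, args, kwargs))

def run_deferred(payload):
    """What the /_ah/queue/deferred handler does with the task body:
    unpickle and invoke. Any datastore/memcache/urlfetch calls happen
    inside func, which is why those APIs appear in these requests."""
    func, args, kwargs = pickle.loads(payload)
    return func(*args, **kwargs)

# Hypothetical task function standing in for our real deferred work.
def add(a, b):
    return a + b

payload = defer_payload(add, 2, 3)
print(run_deferred(payload))  # 5
```

So a 30-second DeadlineExceededError on /_ah/queue/deferred can come from any of the many functions we defer, not one specific handler.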
BTW, appid steprep is Python.

j

On Sep 23, 2:21 pm, "Ikai Lan (Google)" <[email protected]> wrote:
> Jason, are you calling URLFetch/datastore/Memcache in that action?
>
> So far there are no reports of Java applications being impacted. We usually
> have a bit more flexibility when it comes to tearing down and spinning up
> Python instances, so I'm wondering if this is related.
>
> --
> Ikai Lan
> Developer Programs Engineer, Google App Engine
> Blogger: http://googleappengine.blogspot.com
> Reddit: http://www.reddit.com/r/appengine
> Twitter: http://twitter.com/app_engine
>
> On Thu, Sep 23, 2010 at 4:17 PM, Jason C <[email protected]> wrote:
>> When I refer to bursts of errors, this is an example of what I mean.
>> This is a query for any 500-series errors against appid: steprep. Note
>> how close together the log entries are (i.e., the burstiness). These
>> requests operate very quickly when App Engine is performing well.
>>
>> 09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms
>> 09-23 01:07PM 42.084 /_ah/queue/deferred 500 29485ms
>> 09-23 01:07PM 38.017 /_ah/queue/deferred 500 29523ms
>> 09-23 01:07PM 33.636 /_ah/queue/deferred 500 29417ms
>> 09-23 01:07PM 32.176 /_ah/queue/deferred 500 29787ms
>> 09-23 01:07PM 30.812 /_ah/queue/deferred 500 29451ms
>> 09-23 01:07PM 25.477 /_ah/queue/deferred 500 29626ms
>> 09-23 01:07PM 24.605 /_ah/queue/deferred 500 30025ms
>> 09-23 01:07PM 14.422 /mgmt/search/start/ 500 40086ms
>> 09-23 01:07PM 24.204 /_ah/queue/deferred 500 29768ms
>> 09-23 01:07PM 20.746 /_ah/queue/deferred 500 29968ms
>> 09-23 01:07PM 30.383 /exec-reports/queue/increment-all-time-data-counter 500 10003ms
>> 09-23 01:05PM 47.180 /_ah/queue/deferred 500 30070ms
>> 09-23 01:05PM 42.770 /mgmt/search/start/ 500 30053ms
>> 09-23 01:04PM 55.080 /_ah/queue/deferred 500 29404ms
>> 09-23 01:04PM 49.054 /_ah/queue/deferred 500 29589ms
>>
>> Anything around the 30-s mark is a DeadlineExceededError; anything
>> around the 10-s mark is the "Request was aborted after waiting too
>> long to attempt to service your request....".
>>
>> These bursts are _NOT_ rare; I can find them at will, including
>> similar bursts of the 10-s style:
>>
>> 09-23 12:52PM 07.445 /_ah/queue/deferred 500 10006ms
>> 09-23 12:52PM 06.510 /_ah/queue/deferred 500 10016ms
>> 09-23 12:52PM 06.303 /_ah/queue/deferred 500 10003ms
>> 09-23 12:52PM 05.902 /_ah/queue/deferred 500 10017ms
>> 09-23 12:52PM 05.641 /_ah/queue/deferred 500 10017ms
>> 09-23 12:52PM 04.872 /_ah/queue/deferred 500 10002ms
>> 09-23 12:52PM 04.211 /_ah/queue/deferred 500 10030ms
>> 09-23 12:52PM 03.759 /_ah/queue/deferred 500 10029ms
>> 09-23 12:52PM 03.652 /_ah/queue/deferred 500 10021ms
>>
>> I am running out of ideas of what I can do at my end to address this
>> issue. Does anyone from Google have suggestions? Is this an
>> infrastructure / autoscaling problem, or is it a problem at my end?
>>
>> On Sep 23, 11:06 am, Jason C <[email protected]> wrote:
>>> We get lots of bursts of loading failures. Most recently: 9.45a to
>>> 9.47a log time (2010-09-23). Appid: steprep
>>>
>>> We also get lots of ongoing 10-second timeouts. Using the logs, it
>>> seems to me that our user-facing requests are below the 1000ms
>>> threshold, so I don't know why we get so many 10-second timeouts
>>> (i.e., the "Request was aborted after waiting too long to attempt to
>>> service your request. This may happen sporadically when the App Engine
>>> serving cluster is under unexpectedly high or uneven load. If you see
>>> this message frequently, please contact the App Engine team." error).
>>>
>>> Our QPS varies throughout the day from 1 to around 12, though we have
>>> jobs that introduce short spikes of 50-60.
>>>
>>> j
>>>
>>> On Sep 23, 10:39 am, Eli Jones <[email protected]> wrote:
>>>> My app is extremely lightweight.
>>>> The average CPU usage for a cold start falls in these ranges:
>>>>
>>>> CPU_MS: 1030 - 1230
>>>> API_CPU_MS: 408
>>>>
>>>> The rough outline of the code that runs is:
>>>>
>>>> 1. Pull credentials from memcache and decrypt.
>>>> 2. Connect to gdata service, and get some data.
>>>> 3. Create 4 new entities (of same Model type) containing this fetched
>>>>    data and append to a list.
>>>> 4. Stick those entities in memcache as protobufs.
>>>> 5. Create a new entity (for a different Model) and append to the list
>>>>    used in Step 3.
>>>> 6. db.put(list) to the datastore. (Or keep trying until success.)
>>>> 7. If necessary, add a task to run at the next 5-minute mark.
>>>>
>>>> So, as long as memcache is working and the encrypted credentials are
>>>> there, there are not any Gets from the datastore. There is just the
>>>> Put that is putting 4 entities of one model type and 1 entity of
>>>> another model type all at once.
>>>>
>>>> timezone from pytz is used for correct timezone calculations at the
>>>> start of the day, to make sure that the initial task starts at the
>>>> correct time. (At this point I could move the "from pytz import
>>>> timezone" down to just the function that calculates the initial start
>>>> delay, but I avoid making changes to this code unless it's really
>>>> necessary; timezone was used in more than one place in the past, so
>>>> it was imported at the top of the file.)
>>>>
>>>> On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google) <[email protected]> wrote:
>>>>> It sounds like these are impacting Python developers - I haven't
>>>>> heard any Java app developers chime in yet.
>>>>>
>>>>> What's the ballpark of the QPS you guys are doing? <1qps, 10qps,
>>>>> 100qps, etc?
>>>>>
>>>>> --
>>>>> Ikai Lan
>>>>> Developer Programs Engineer, Google App Engine
>>>>> Blogger: http://googleappengine.blogspot.com
>>>>> Reddit: http://www.reddit.com/r/appengine
>>>>> Twitter: http://twitter.com/app_engine
>>>>>
>>>>> On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <[email protected]> wrote:
>>>>>> Also, a cold start for this task can take as little as 960 ms (where
>>>>>> it shows the log detail: "This request caused a new process to be
>>>>>> started for your application...").
>>>>>>
>>>>>> And, the task is doing a URLFetch, so I'm sure that is taking up a
>>>>>> chunk of the running time.
>>>>>>
>>>>>> One of my imports is pytz (no django or anything like that though),
>>>>>> so that could also be of interest.
>>>>>>
>>>>>> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <[email protected]> wrote:
>>>>>>> My app is in Python. The only thing it does between 6:30 AM PST and
>>>>>>> 1:00 PM PST is run a single chained task. The task just runs once
>>>>>>> every 5 minutes. (There are no other backend processes or users.)
>>>>>>>
>>>>>>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced
>>>>>>> the 10-second, 500 error (with 0 CPU_MS used), then the task
>>>>>>> successfully retried at 11:02:55 AM PST.
>>>>>>>
>>>>>>> There was another one (not mentioned in your list of times for
>>>>>>> yesterday) at 7:45 AM PST. It failed after 10 seconds with a 500
>>>>>>> error and 0 CPU_MS used. The task successfully retried at 7:47 AM
>>>>>>> PST. (For some reason, the tasks in my app wait 120 seconds for the
>>>>>>> first retry.)
>>>>>>>
>>>>>>> For some reason, the log doesn't contain the message that usually
>>>>>>> goes with this error (there is no detailed log message):
>>>>>>>
>>>>>>> "Request was aborted after waiting too long to attempt to service
>>>>>>> your request.
>>>>>>> This may happen sporadically when the App Engine serving cluster is
>>>>>>> under unexpectedly high or uneven load. If you see this message
>>>>>>> frequently, please contact the App Engine team."
>>>>>>>
>>>>>>> But, it has the exact same fingerprint. (It just seems that the
>>>>>>> logs are trimming the detailed messages faster than usual.)
>>>>>>>
>>>>>>> This chained task runs 79 times each day (this would be 82 requests
>>>>>>> counting these 3 errors), so these errors accounted for 3.66% of
>>>>>>> the requests during the day yesterday.
>>>>>>>
>>>>>>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google) <[email protected]> wrote:
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> We've been tracking latency spikes and intermittent batches of
>>>>>>>> aborted requests over the last 24-hour period. There were at least
>>>>>>>> 3 periods of spikes yesterday:
>>>>>>>>
>>>>>>>> ~1100 Pacific (moderate, brief period of high)
>>>>>>>> ~1300 Pacific (small)
>>>>>>>> ~1530 Pacific (small)
>>>>>>>>
>>>>>>>> Community reports seem to indicate that these are related to
>>>>>>>> loading requests. If you are being affected, can you report your
>>>>>>>> environment, language and whether these times correlate with your
>>>>>>>> experiences in this thread? Are there periods that this particular
>>>>>>>> report did not cover? Please provide measurements, times and
>>>>>>>> application IDs.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ikai Lan
>>>>>>>> Developer Programs Engineer, Google App Engine
>>>>>>>> Blogger: http://googleappengine.blogspot.com
>>>>>>>> Reddit: http://www.reddit.com/r/appengine
>>>>>>>> Twitter: http://twitter.com/app_engine
--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
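The "run at the next 5-minute mark" scheduling Eli describes upthread comes down to plain datetime arithmetic. A hedged sketch of one way to compute the target time (his actual code isn't shown in the thread, and the pytz timezone handling he mentions is omitted here):

```python
from datetime import datetime, timedelta

def next_five_minute_mark(now):
    """Return the next wall-clock time whose minute is a multiple of 5,
    with seconds and microseconds zeroed. A chained task would use this
    as the ETA when re-enqueueing itself."""
    floored = now.replace(second=0, microsecond=0)
    minutes_past = now.minute % 5
    # Round up to the next mark; if we're exactly on a mark, schedule
    # the following one so the chain always advances.
    if minutes_past or now.second or now.microsecond:
        advance = 5 - minutes_past
    else:
        advance = 5
    return floored + timedelta(minutes=advance)

now = datetime(2010, 9, 23, 13, 7, 42)
print(next_five_minute_mark(now))  # 2010-09-23 13:10:00
```

The returned datetime (or the delay until it) would then be passed as the task countdown/ETA when the chained task re-enqueues itself.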
