When I refer to bursts of errors, this is an example of what I mean. This is a query for any 500-series errors against appid: steprep. Note how close together the log entries are (i.e., the burstiness). These requests complete very quickly when App Engine is performing well.
09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms
09-23 01:07PM 42.084 /_ah/queue/deferred 500 29485ms
09-23 01:07PM 38.017 /_ah/queue/deferred 500 29523ms
09-23 01:07PM 33.636 /_ah/queue/deferred 500 29417ms
09-23 01:07PM 32.176 /_ah/queue/deferred 500 29787ms
09-23 01:07PM 30.812 /_ah/queue/deferred 500 29451ms
09-23 01:07PM 25.477 /_ah/queue/deferred 500 29626ms
09-23 01:07PM 24.605 /_ah/queue/deferred 500 30025ms
09-23 01:07PM 14.422 /mgmt/search/start/ 500 40086ms
09-23 01:07PM 24.204 /_ah/queue/deferred 500 29768ms
09-23 01:07PM 20.746 /_ah/queue/deferred 500 29968ms
09-23 01:07PM 30.383 /exec-reports/queue/increment-all-time-data-counter 500 10003ms
09-23 01:05PM 47.180 /_ah/queue/deferred 500 30070ms
09-23 01:05PM 42.770 /mgmt/search/start/ 500 30053ms
09-23 01:04PM 55.080 /_ah/queue/deferred 500 29404ms
09-23 01:04PM 49.054 /_ah/queue/deferred 500 29589ms

Anything around the 30-second mark is a DeadlineExceededError; anything around the 10-second mark is the "Request was aborted after waiting too long to attempt to service your request...." error. These bursts are _NOT_ rare; I can find them at will, including similar bursts of the 10-second style:

09-23 12:52PM 07.445 /_ah/queue/deferred 500 10006ms
09-23 12:52PM 06.510 /_ah/queue/deferred 500 10016ms
09-23 12:52PM 06.303 /_ah/queue/deferred 500 10003ms
09-23 12:52PM 05.902 /_ah/queue/deferred 500 10017ms
09-23 12:52PM 05.641 /_ah/queue/deferred 500 10017ms
09-23 12:52PM 04.872 /_ah/queue/deferred 500 10002ms
09-23 12:52PM 04.211 /_ah/queue/deferred 500 10030ms
09-23 12:52PM 03.759 /_ah/queue/deferred 500 10029ms
09-23 12:52PM 03.652 /_ah/queue/deferred 500 10021ms

I am running out of ideas of what I can do at my end to address this issue. Does anyone from Google have suggestions? Is this an infrastructure/autoscaling problem, or is it a problem at my end?

On Sep 23, 11:06 am, Jason C <[email protected]> wrote:
> We get lots of bursts of loading failures. Most recently: 9.45a to
> 9.47a log time (2010-09-23).
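For anyone scripting over downloaded logs, the two failure modes can be separated mechanically by their latency fingerprint. Below is a minimal Python sketch assuming the line layout shown above; the regex and the latency cutoffs are my own heuristics, not anything App Engine defines:

```python
import re

# Matches the tail of an entry such as:
#   "09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms"
# capturing the path, the HTTP status, and the latency in ms.
LOG_TAIL = re.compile(r'(\S+) (\d{3}) (\d+)ms$')

def classify(entry):
    """Bucket a 500 log entry by its latency fingerprint (heuristic)."""
    m = LOG_TAIL.search(entry.strip())
    if not m or m.group(2) != '500':
        return None
    path, latency_ms = m.group(1), int(m.group(3))
    if latency_ms >= 25000:
        return (path, 'DeadlineExceededError (~30s)')
    if 9500 <= latency_ms <= 10500:
        return (path, 'aborted waiting for an instance (~10s)')
    return (path, 'other 500')

print(classify('09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms'))
# -> ('/_ah/queue/deferred', 'DeadlineExceededError (~30s)')
```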
Appid: steprep

> We also get lots of ongoing 10-second timeouts. Using the logs, it
> seems to me that our user-facing requests are below the 1000ms
> threshold, so I don't know why we get so many 10-second timeouts
> (i.e., the "Request was aborted after waiting too long to attempt to
> service your request. This may happen sporadically when the App Engine
> serving cluster is under unexpectedly high or uneven load. If you see
> this message frequently, please contact the App Engine team." error).
>
> Our QPS varies throughout the day from 1 to around 12, though we have
> jobs that introduce short spikes of 50-60.
>
> j
>
> On Sep 23, 10:39 am, Eli Jones <[email protected]> wrote:
> > My app is extremely lightweight.
> >
> > The average CPU usage for a cold start falls in these ranges:
> >
> > CPU_MS: 1030 - 1230
> > API_CPU_MS: 408
> >
> > The rough outline of the code that runs is:
> >
> > 1. Pull credentials from memcache and decrypt.
> > 2. Connect to the gdata service, and get some data.
> > 3. Create 4 new entities (of the same Model type) containing this
> > fetched data and append them to a list.
> > 4. Stick those entities in memcache as protobufs.
> > 5. Create a new entity (for a different Model) and append it to the
> > list used in Step 3.
> > 6. db.put(list) to the datastore (or keep trying until success).
> > 7. If necessary, add a task to run at the next 5-minute mark.
> >
> > So, as long as memcache is working and the encrypted credentials are
> > there, there are no Gets from the datastore. There is just the Put
> > that is putting 4 entities of one model type and 1 entity of another
> > model type all at once.
> >
> > timezone from pytz is used for correct timezone calculations at the
> > start of the day, to make sure that the initial task starts at the
> > correct time. (At this point I could move the "from pytz import
> > timezone" down to just the function that calculates the initial start
> > delay, but I avoid making changes to this code unless it's really
> > necessary; timezone was used in more than one place in the past, so
> > it was imported at the top of the file.)
> >
> > On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google)
> > <[email protected]> wrote:
> > > It sounds like these are impacting Python developers - I haven't
> > > heard any Java app developers chime in yet.
> > >
> > > What's the ballpark of the QPS you guys are doing? <1qps, 10qps,
> > > 100qps, etc?
> > >
> > > --
> > > Ikai Lan
> > > Developer Programs Engineer, Google App Engine
> > > Blogger: http://googleappengine.blogspot.com
> > > Reddit: http://www.reddit.com/r/appengine
> > > Twitter: http://twitter.com/app_engine
> > >
> > > On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <[email protected]> wrote:
> > >> Also, a cold start for this task can take as little as 960 ms
> > >> (where it shows the log detail: "This request caused a new process
> > >> to be started for your application...").
> > >>
> > >> And, the task is doing a URLFetch, so I'm sure that is taking up a
> > >> chunk of the running time.
> > >>
> > >> One of my imports is pytz (no django or anything like that
> > >> though), so that could also be of interest.
> > >>
> > >> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <[email protected]> wrote:
> > >>> My app is in Python. The only thing it does between 6:30 AM PST
> > >>> and 1:00 PM PST is run a single chained task. The task just runs
> > >>> once every 5 minutes. (There are no other backend processes or
> > >>> users.)
> > >>>
> > >>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced
> > >>> the 10-second, 500 error (with 0 CPU_MS used), then the task
> > >>> successfully retried at 11:02:55 AM PST.
> > >>>
> > >>> There was another one (not mentioned in your list of times for
> > >>> yesterday) at 7:45 AM PST. Failed after 10 seconds with a 500
> > >>> error and 0 CPU_MS used.
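As a side note, step 7 of Eli's outline above (re-enqueueing the chained task at the next 5-minute mark) reduces to a small delay calculation. Here is a framework-free sketch; the helper name and the assumption that marks align to the top of the hour are mine:

```python
from datetime import datetime

def seconds_until_next_mark(now, interval_minutes=5):
    """Seconds from `now` until the next clean 5-minute boundary
    (:00, :05, :10, ...); if exactly on a mark, pick the next one."""
    interval = interval_minutes * 60
    elapsed = now.minute * 60 + now.second
    remainder = elapsed % interval
    return interval - remainder if remainder else interval

# A task retried at 11:02:55 would schedule its successor for 11:05:00:
print(seconds_until_next_mark(datetime(2010, 9, 23, 11, 2, 55)))  # -> 125
```

On App Engine the result would typically be passed as the task's countdown when the next task in the chain is enqueued.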
> > >>> The task successfully retried at 7:47 AM PST (for some reason,
> > >>> the tasks in my app wait 120 seconds for the first retry).
> > >>>
> > >>> For some reason, the log doesn't contain the message that usually
> > >>> goes with this error (there is no detailed log message):
> > >>>
> > >>> "Request was aborted after waiting too long to attempt to service
> > >>> your request. This may happen sporadically when the App Engine
> > >>> serving cluster is under unexpectedly high or uneven load. If you
> > >>> see this message frequently, please contact the App Engine team."
> > >>>
> > >>> But, it has the exact same fingerprint. (It just seems that the
> > >>> logs are trimming the detailed messages faster than usual.)
> > >>>
> > >>> This chained task runs 79 times each day (this would be 82
> > >>> requests counting these 3 errors), so these errors accounted for
> > >>> 3.66% of the requests during the day yesterday.
> > >>>
> > >>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google)
> > >>> <[email protected]> wrote:
> > >>>> Hey guys,
> > >>>>
> > >>>> We've been tracking latency spikes and intermittent batches of
> > >>>> aborted requests over the last 24-hour period. There were at
> > >>>> least 3 periods of spikes yesterday:
> > >>>>
> > >>>> ~1100 Pacific (moderate, brief period of high)
> > >>>> ~1300 Pacific (small)
> > >>>> ~1530 Pacific (small)
> > >>>>
> > >>>> Community reports seem to indicate that these are related to
> > >>>> loading requests. If you are being affected, can you report your
> > >>>> environment, language, and whether these times correlate with
> > >>>> your experiences in this thread? Are there periods that this
> > >>>> particular report did not cover? Please provide measurements,
> > >>>> times, and application IDs.
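As a quick sanity check on Eli's 3.66% figure above (all numbers taken from his message):

```python
# 79 scheduled runs per day; 3 failed attempts mean 82 total requests,
# since each failure is itself a logged request before the retry.
scheduled_runs = 79
failed_attempts = 3
total_requests = scheduled_runs + failed_attempts
error_rate = 100.0 * failed_attempts / total_requests
print(round(error_rate, 2))  # -> 3.66
```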
> > >>>> --
> > >>>> Ikai Lan
> > >>>> Developer Programs Engineer, Google App Engine
> > >>>> Blogger: http://googleappengine.blogspot.com
> > >>>> Reddit: http://www.reddit.com/r/appengine
> > >>>> Twitter: http://twitter.com/app_engine

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
