When I refer to bursts of errors, this is an example of what I mean. This is a query for any 500-series errors against appid: steprep. Note how close together the log entries are (i.e., the burstiness). These requests complete very quickly when App Engine is performing well.
09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms
09-23 01:07PM 42.084 /_ah/queue/deferred 500 29485ms
09-23 01:07PM 38.017 /_ah/queue/deferred 500 29523ms
09-23 01:07PM 33.636 /_ah/queue/deferred 500 29417ms
09-23 01:07PM 32.176 /_ah/queue/deferred 500 29787ms
09-23 01:07PM 30.812 /_ah/queue/deferred 500 29451ms
09-23 01:07PM 25.477 /_ah/queue/deferred 500 29626ms
09-23 01:07PM 24.605 /_ah/queue/deferred 500 30025ms
09-23 01:07PM 14.422 /mgmt/search/start/ 500 40086ms
09-23 01:07PM 24.204 /_ah/queue/deferred 500 29768ms
09-23 01:07PM 20.746 /_ah/queue/deferred 500 29968ms
09-23 01:07PM 30.383 /exec-reports/queue/increment-all-time-data-counter 500 10003ms
09-23 01:05PM 47.180 /_ah/queue/deferred 500 30070ms
09-23 01:05PM 42.770 /mgmt/search/start/ 500 30053ms
09-23 01:04PM 55.080 /_ah/queue/deferred 500 29404ms
09-23 01:04PM 49.054 /_ah/queue/deferred 500 29589ms

Anything around the 30-second mark is a DeadlineExceededError; anything around the 10-second mark is the "Request was aborted after waiting too long to attempt to service your request...." error. These bursts are _NOT_ rare; I can find them at will, including similar bursts of the 10-second style:

09-23 12:52PM 07.445 /_ah/queue/deferred 500 10006ms
09-23 12:52PM 06.510 /_ah/queue/deferred 500 10016ms
09-23 12:52PM 06.303 /_ah/queue/deferred 500 10003ms
09-23 12:52PM 05.902 /_ah/queue/deferred 500 10017ms
09-23 12:52PM 05.641 /_ah/queue/deferred 500 10017ms
09-23 12:52PM 04.872 /_ah/queue/deferred 500 10002ms
09-23 12:52PM 04.211 /_ah/queue/deferred 500 10030ms
09-23 12:52PM 03.759 /_ah/queue/deferred 500 10029ms
09-23 12:52PM 03.652 /_ah/queue/deferred 500 10021ms

I am running out of ideas of what I can do at my end to address this issue. Does anyone from Google have suggestions? Is this an infrastructure/autoscaling problem, or is it a problem at my end?

On Sep 23, 11:06 am, Jason C <[email protected]> wrote:
> We get lots of bursts of loading failures. Most recently: 9.45a to
> 9.47a log time (2010-09-23).
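For anyone scripting over downloaded logs, the two failure modes can be separated mechanically by their latency fingerprint. Below is a minimal Python sketch assuming the line layout shown above; the regex and the latency cutoffs are my own heuristics, not anything App Engine defines:

```python
import re

# Matches the tail of an entry such as:
#   "09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms"
# capturing the path, the HTTP status, and the latency in ms.
LOG_TAIL = re.compile(r'(\S+) (\d{3}) (\d+)ms$')

def classify(entry):
    """Bucket a 500 log entry by its latency fingerprint (heuristic)."""
    m = LOG_TAIL.search(entry.strip())
    if not m or m.group(2) != '500':
        return None
    path, latency_ms = m.group(1), int(m.group(3))
    if latency_ms >= 25000:
        return (path, 'DeadlineExceededError (~30s)')
    if 9500 <= latency_ms <= 10500:
        return (path, 'aborted waiting for an instance (~10s)')
    return (path, 'other 500')

print(classify('09-23 01:07PM 42.459 /_ah/queue/deferred 500 29768ms'))
# -> ('/_ah/queue/deferred', 'DeadlineExceededError (~30s)')
```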
Appid: steprep

> We also get lots of ongoing 10-second timeouts. Using the logs, it
> seems to me that our user-facing requests are below the 1000ms
> threshold, so I don't know why we get so many 10-second timeouts
> (i.e., the "Request was aborted after waiting too long to attempt to
> service your request. This may happen sporadically when the App Engine
> serving cluster is under unexpectedly high or uneven load. If you see
> this message frequently, please contact the App Engine team." error).
>
> Our QPS varies throughout the day from 1 to around 12, though we have
> jobs that introduce short spikes of 50-60.
>
> j
>
> On Sep 23, 10:39 am, Eli Jones <[email protected]> wrote:
> > My app is extremely lightweight.
> >
> > The average CPU usage for a cold start falls in these ranges:
> >
> > CPU_MS: 1030 - 1230
> > API_CPU_MS: 408
> >
> > The rough outline of the code that runs is:
> >
> > 1. Pull credentials from memcache and decrypt.
> > 2. Connect to the gdata service, and get some data.
> > 3. Create 4 new entities (of the same Model type) containing this
> > fetched data and append them to a list.
> > 4. Stick those entities in memcache as protobufs.
> > 5. Create a new entity (for a different Model) and append it to the
> > list used in Step 3.
> > 6. db.put(list) to the datastore (or keep trying until success).
> > 7. If necessary, add a task to run at the next 5-minute mark.
> >
> > So, as long as memcache is working and the encrypted credentials are
> > there, there are no Gets from the datastore. There is just the Put
> > that is putting 4 entities of one model type and 1 entity of another
> > model type all at once.
> >
> > timezone from pytz is used for correct timezone calculations at the
> > start of the day, to make sure that the initial task starts at the
> > correct time. (At this point I could move the "from pytz import
> > timezone" down to just the function that calculates the initial start
> > delay, but I avoid making changes to this code unless it's really
> > necessary; timezone was used in more than one place in the past, so
> > it was imported at the top of the file.)
> >
> > On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google)
> > <[email protected]> wrote:
> > > It sounds like these are impacting Python developers - I haven't
> > > heard any Java app developers chime in yet.
> > >
> > > What's the ballpark of the QPS you guys are doing? <1qps, 10qps,
> > > 100qps, etc?
> > >
> > > --
> > > Ikai Lan
> > > Developer Programs Engineer, Google App Engine
> > > Blogger: http://googleappengine.blogspot.com
> > > Reddit: http://www.reddit.com/r/appengine
> > > Twitter: http://twitter.com/app_engine
> > >
> > > On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <[email protected]> wrote:
> > >> Also, a cold start for this task can take as little as 960 ms
> > >> (where it shows the log detail: "This request caused a new process
> > >> to be started for your application...").
> > >>
> > >> And, the task is doing a URLFetch, so I'm sure that is taking up a
> > >> chunk of the running time.
> > >>
> > >> One of my imports is pytz (no django or anything like that
> > >> though), so that could also be of interest.
> > >>
> > >> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <[email protected]> wrote:
> > >>> My app is in Python. The only thing it does between 6:30 AM PST
> > >>> and 1:00 PM PST is run a single chained task. The task just runs
> > >>> once every 5 minutes. (There are no other backend processes or
> > >>> users.)
> > >>>
> > >>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced
> > >>> the 10-second, 500 error (with 0 CPU_MS used), then the task
> > >>> successfully retried at 11:02:55 AM PST.
> > >>>
> > >>> There was another one (not mentioned in your list of times for
> > >>> yesterday) at 7:45 AM PST. Failed after 10 seconds with a 500
> > >>> error and 0 CPU_MS used.
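As a side note, step 7 of Eli's outline above (re-enqueueing the chained task at the next 5-minute mark) reduces to a small delay calculation. Here is a framework-free sketch; the helper name and the assumption that marks align to the top of the hour are mine:

```python
from datetime import datetime

def seconds_until_next_mark(now, interval_minutes=5):
    """Seconds from `now` until the next clean 5-minute boundary
    (:00, :05, :10, ...); if exactly on a mark, pick the next one."""
    interval = interval_minutes * 60
    elapsed = now.minute * 60 + now.second
    remainder = elapsed % interval
    return interval - remainder if remainder else interval

# A task retried at 11:02:55 would schedule its successor for 11:05:00:
print(seconds_until_next_mark(datetime(2010, 9, 23, 11, 2, 55)))  # -> 125
```

On App Engine the result would typically be passed as the task's countdown when the next task in the chain is enqueued.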
> > >>> The task successfully retried at 7:47 AM PST (for some reason,
> > >>> the tasks in my app wait 120 seconds for the first retry).
> > >>>
> > >>> For some reason, the log doesn't contain the message that usually
> > >>> goes with this error (there is no detailed log message):
> > >>>
> > >>> "Request was aborted after waiting too long to attempt to service
> > >>> your request. This may happen sporadically when the App Engine
> > >>> serving cluster is under unexpectedly high or uneven load. If you
> > >>> see this message frequently, please contact the App Engine team."
> > >>>
> > >>> But, it has the exact same fingerprint. (It just seems that the
> > >>> logs are trimming the detailed messages faster than usual.)
> > >>>
> > >>> This chained task runs 79 times each day (this would be 82
> > >>> requests counting these 3 errors), so these errors accounted for
> > >>> 3.66% of the requests during the day yesterday.
> > >>>
> > >>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google)
> > >>> <[email protected]> wrote:
> > >>>> Hey guys,
> > >>>>
> > >>>> We've been tracking latency spikes and intermittent batches of
> > >>>> aborted requests over the last 24-hour period. There were at
> > >>>> least 3 periods of spikes yesterday:
> > >>>>
> > >>>> ~1100 Pacific (moderate, brief period of high)
> > >>>> ~1300 Pacific (small)
> > >>>> ~1530 Pacific (small)
> > >>>>
> > >>>> Community reports seem to indicate that these are related to
> > >>>> loading requests. If you are being affected, can you report your
> > >>>> environment, language, and whether these times correlate with
> > >>>> your experiences in this thread? Are there periods that this
> > >>>> particular report did not cover? Please provide measurements,
> > >>>> times, and application IDs.
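As a quick sanity check on Eli's 3.66% figure above (all numbers taken from his message):

```python
# 79 scheduled runs per day; 3 failed attempts mean 82 total requests,
# since each failure is itself a logged request before the retry.
scheduled_runs = 79
failed_attempts = 3
total_requests = scheduled_runs + failed_attempts
error_rate = 100.0 * failed_attempts / total_requests
print(round(error_rate, 2))  # -> 3.66
```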
> > >>>> --
> > >>>> Ikai Lan
> > >>>> Developer Programs Engineer, Google App Engine
> > >>>> Blogger: http://googleappengine.blogspot.com
> > >>>> Reddit: http://www.reddit.com/r/appengine
> > >>>> Twitter: http://twitter.com/app_engine

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
