Hey everyone,

We just had our regular backend catch-up. It was not recorded but I did
take some stream-of-consciousness notes on what we discussed:

* train 74 was cut on time!

* but the deploy doc was empty!
  * a db migration was not mentioned
  * some config changes weren't mentioned
  * we should all try to remember to keep the deploy doc up to date as we go

* oauth token purging
  * on the back burner at the moment, last worked on a few weeks ago
  * currently purged up to november 2015, which contains twice as many
tokens as other months for some reason
  * were 500s returned on requests from profile server?
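To keep the purge from locking huge swathes of the token table in one go, it could run as a loop of small deletes, something like the sketch below. The table name, column name, and batch size are all made up for illustration, and sqlite3 stands in for the real store; this is not the actual purge code.

```python
# Hedged sketch: purge old OAuth tokens in small batches so no single
# DELETE holds row locks for a long time. Schema here is invented.
import sqlite3

def purge_expired_tokens(conn, cutoff, batch_size=1000):
    """Delete tokens created before `cutoff`, batch_size rows at a time."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM tokens WHERE rowid IN ("
            "  SELECT rowid FROM tokens WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # short transactions keep replication lag down
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
```

The point of the LIMITed subquery is that each transaction touches at most `batch_size` rows, so replication can keep up between batches.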

* verification reminders disabled
  * no smoking gun except replication couldn't keep up
  * replication thread spending most time in system-locked state
  * there was a transaction that had millions of rows locked
    * stored procedures make it difficult to see exactly what the problem is
    * once that transaction unwinds, everything returns to a good state
  * it's recommended not to use temporary tables with replication, is that
the issue?
  * or could it be something to do with select for update?
  * do we need a consensus protocol for competing processes?
  * we don't actually need verification reminders running on all instances
anyway
    * only needs to be 1 query every 30 seconds
    * with replication we're making a query every second
    * no need to compete
  * what are the release mechanics if we limit it to 1 instance?
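One way to get "only one instance runs the reminder query" without a full consensus protocol is a short database lease: each instance tries to grab a lease row, the winner runs the query, everyone else skips that tick. The sketch below is an assumption, not our code; the table, column names, and 30-second lease length are invented, and sqlite3 stands in for the real database.

```python
# Hedged sketch of a DB-backed lease so only one instance sends
# verification reminders at a time. All names here are invented.
import sqlite3
import time

LEASE_SECONDS = 30

def try_acquire_lease(conn, owner, now=None):
    """Return True if `owner` now holds the reminder lease."""
    now = time.time() if now is None else now
    # Create the lease row if it has never existed.
    conn.execute(
        "INSERT OR IGNORE INTO leases (name, owner, expires_at) "
        "VALUES ('verification_reminders', ?, ?)",
        (owner, now + LEASE_SECONDS),
    )
    # Take over only if the lease expired, or renew if we already hold it.
    conn.execute(
        "UPDATE leases SET owner = ?, expires_at = ? "
        "WHERE name = 'verification_reminders' "
        "AND (expires_at < ? OR owner = ?)",
        (owner, now + LEASE_SECONDS, now, owner),
    )
    conn.commit()
    row = conn.execute(
        "SELECT owner FROM leases WHERE name = 'verification_reminders'"
    ).fetchone()
    return row[0] == owner
```

Release mechanics fall out for free: if the leaseholder is shut down during a deploy, its lease simply expires and another instance picks up within 30 seconds.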

* teamcity permissions
  * deployed a box to try out teamcity 10 with github integration
  * it wants write permissions to the github org, including private repos
  * means the teamcity box would need serious hardening
  * what if there's a zero-day exploit in teamcity?
  * vlad's going to email teamcity

* sentry
  * deployed with train 73 content server
  * where are all the errors?
    * only two errors reported, one from git clone, one from git fetch
    * "<path>/experiments already exists and is not empty"
  * if things go wrong it should show up there

* profile server
  * docker stack is slow, seeing 4-10 times higher latency than non-docker
    * seconds of latency
  * cpu is only a little bit higher
  * also docker stack never returns a 304 for /v1/profile
    * usually there are around 4% or 5% 304s
    * absolute dead zero inside docker
    * how the heck can docker affect this?
    * etag-related?
    * maybe a significant clue of where the slowness comes from
  * try disabling all logging so there is no disk i/o, will that fix the
latency?
    * log data is handled differently in docker
    * different centos version, different log flow, uses systemd and
journald
  * one of the devs to try running in docker locally to debug the 304 issue
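For whoever picks up the 304 debugging: a known-good conditional-request round trip to compare against might help spot where the ETag gets lost (e.g. stripped by a proxy layer in the docker stack). This is a self-contained toy, not FxA code; the ETag value and `/v1/profile` behaviour are simulated.

```python
# Hedged sketch: a minimal server that honours If-None-Match, plus a
# client helper for making conditional requests. Compare the docker
# stack's responses against this flow to see where 304s disappear.
import http.server
import threading
import urllib.error
import urllib.request

ETAG = '"abc123"'  # invented value for the demo

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)  # client's cached copy is still valid
            self.end_headers()
        else:
            body = b'{"ok": true}'
            self.send_response(200)
            self.send_header("ETag", ETAG)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def fetch(url, etag=None):
    """GET `url`, optionally conditionally; return (status, etag)."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.headers.get("ETag")
    except urllib.error.HTTPError as e:  # urllib raises on 304
        return e.code, e.headers.get("ETag")
```

If the dockerised profile server returns 200 with no ETag header at all, that would point at the response headers being dropped somewhere, which could also explain some of the extra latency (full bodies instead of empty 304s).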

* profile server avatars
  * only one avatar per user
  * drop some tables
  * content server should have no more code for get-avatars api
  * if the endpoint is not being used elsewhere (it shouldn't be), we
should get rid of it

And that was it.

Cheerio,
Pb
_______________________________________________
Dev-fxacct mailing list
[email protected]
https://mail.mozilla.org/listinfo/dev-fxacct
