On 4/05/2015 15:23, Andrew Chilton wrote:
> I'm currently looking at FxA user metrics this quarter and have a few
> thoughts around some ideas. Let me kick off by just having a brief
> outline so you get an idea of what's needed.
> 
>       "All services attached to FxA should report metrics on user
> activity in a consistent, privacy-respecting manner.  We should have
> dashboards that allow us to measure success and monitor for problems."

Thanks for kicking this off, Andy.  Getting better metrics is a big part
of our vision for a successful Q2.  You and I have chatted a bit about
this IRL, but I'll add some more notes and context below for the benefit
of discussion on the list.

> This may or may not be the exact goal we're reaching for, but it's
> currently a good statement to aim for. Other questions we might like to
> ask and answer in the future are:
> 
> * how many users signed up this month?
> * how many users used a particular service (e.g. Hello, Sync)?
> * how many users logged in with a mobile device?
> * (perhaps other questions we don't yet know about)

One of the tricky-but-important questions we need to answer is:

* how many users accessed more than one FxA service this month?

It's worth calling this one out explicitly, because this is why we need
to somehow correlate user activity across services.

> To do this we need to figure out how to make this happen. By logging
> 'user events' we should be able to take advantage of the regular data
> pipeline running through Heka/ElasticSearch/Kibana and/or any other
> Reporting/MapReduce plus custom dashboards as aimed for by the Data
> Pipeline v2 [1]. Therefore this email is mainly concentrating on what we
> do at the edges, i.e. our application servers.
>  
> We would love to correlate users across services so we need something
> which allows us to do this. Of course the uid of a user is the obvious
> answer but one that would raise some privacy questions.

Right, so the simplest thing would be for each service to just emit a
bunch of JSON log entries like this:

  log.info({
      service: "hello",
      uid: "ABCDEF123456",
      event: "call",
      timestamp: 1430802399476,
  })

  log.info({
      service: "readinglist",
      uid: "ABCDEF123456",
      event: "save_item",
      timestamp: 1430803546071,
  })

Heka could slurp this up and send them off for processing/aggregation in
the same way that we currently do for the existing FxA
monthly-active-users count.

Would it be OK for us to just go ahead and ship it this way?

To me it seems a little creepy.  If these events are being stored
somewhere, you could potentially build up a pretty nice picture of an
individual user's activity by analyzing their stream of events.
Accidentally leaking metrics in this form would be a pretty big deal.

Can we do it in a more privacy-conscious manner?

> Perhaps we could
> post-process these in the data pipeline into something else, or we can
> log something locally which we could use to correlate that same user to
> another service (but not back to the user him/herself). The idea of a
> Metrics ID has been raised which is a one-way mapping from uid to
> Metrics ID (am leaving out any implementation details for now).

We have a tiny bit of prior art here, in the monthly-active-users
counting for sync:

  https://bugzilla.mozilla.org/show_bug.cgi?id=1136014

For this, we wound up emitting metrics events that look like:

  log.info({
      uid: HMAC_SHA256(<secret key>, <uid>),
      timestamp: 1430802399476,
      ...other sync-specific metrics...
  })

In other words, we use HMAC to derive an opaque "metrics id" from the
account uid.  This lets us count unique users of the service, but makes
it harder to correlate the logs with a particular user record from FxA.
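
As a rough illustration, the derivation could look something like the
following in Node.js.  This is only a sketch: the helper name
deriveMetricsId, the example key, and the hex encoding are assumptions
made for illustration, not what the sync code actually does.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive an opaque metrics id from an account uid.
// `secretKey` is assumed to be a key held by the service, not by the
// people who read the metrics.
function deriveMetricsId(secretKey, uid) {
  return crypto.createHmac("sha256", secretKey).update(uid).digest("hex");
}

// Example key and uid are placeholders for illustration only.
const metricsId = deriveMetricsId("example-secret-key", "ABCDEF123456");

// Emit the event with the derived id in place of the raw uid.
console.log(JSON.stringify({ uid: metricsId, timestamp: Date.now() }));
```

The same uid and key always give the same metrics id, which is what
makes unique-user counting possible without logging the raw uid.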

If all the services used the same technique, we could do cross-service
activity correlation.
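
To make that concrete, here is a minimal sketch of the kind of
aggregation this would allow.  It assumes each event carries a
"service" field and a "uid" field holding the shared opaque id; the
function name and the sample ids are made up for illustration, and a
real version would run in the data pipeline rather than in-process.

```javascript
// Hypothetical aggregation: count the ids that appear in events from
// more than one service.
function countMultiServiceUsers(events) {
  const servicesById = new Map();
  for (const { uid, service } of events) {
    if (!servicesById.has(uid)) {
      servicesById.set(uid, new Set());
    }
    servicesById.get(uid).add(service);
  }
  let count = 0;
  for (const services of servicesById.values()) {
    if (services.size > 1) {
      count += 1;
    }
  }
  return count;
}

// Sample events with placeholder ids standing in for derived metrics ids.
const events = [
  { service: "hello", uid: "id-1", event: "call" },
  { service: "readinglist", uid: "id-1", event: "save_item" },
  { service: "hello", uid: "id-2", event: "call" },
  { service: "hello", uid: "id-2", event: "call" },
];

console.log(countMultiServiceUsers(events)); // prints 1
```

Note that this only works if every service derives the id identically
from the same uid.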

I'd be interested in people's thoughts on the usefulness of this
obfuscation.

> Of course, all services would need to know how to make that MetricsID if it
> was logged at the edge, but if the uid was post-processed in the data
> pipeline this could be done centrally.

Yep.  If every service is able to do the uid -> metrics-id mapping at
will, then does it really gain us anything?


I'd love for people to weigh in with their gut reactions here, even if
you don't have any comments on the technical details.

We will of course have to be in compliance with Mozilla's terms, privacy
policy, etc. when collecting all these metrics.  But IMHO saying "we're
compliant with the posted ToS!" is not much help if what we're doing
just feels wrong to people.

So how can we make the gathering of these metrics feel as
privacy-sensitive, as safe, as *right* as possible?


  Cheers,

    Ryan
_______________________________________________
Dev-fxacct mailing list
[email protected]
https://mail.mozilla.org/listinfo/dev-fxacct
