Hi,

(Sorry, this email got so long that I'll answer the others separately.)

On Sat, 23 Mar 2019 at 11:26, Arjun Salyan wrote:
> On Sat, Mar 23, 2019 at 3:15 PM Mojca Miklavec wrote:
>>
>> I would use the first definition: number of users currently having the
>> port installed. It might be pretty common to have to reinstall the
>> same port multiple times (maybe just for debugging / development
>> reasons) and we don't want to count the port developer 20 times. If
>> the user uninstalled the port, it's equivalent to me as never having
>> it installed in the first place.
>
> Thanks. But in that case what would be considered as number of installations
> in a particular month? Suppose, the first weekly submission contains port P
> in active_ports, but during second submission(in the same month), the port is
> uninstalled.
>
> One way would be to have it consider the number of users having it in active
> ports on the last day of the month or on 15th.

Short answer: I would consider the port as installed by a particular
user if it was reported as installed at least once in that month (if it
was installed during the first report, then uninstalled, count it as
installed; it will not be counted next month anyway if the user just
made a mistake / changed their mind).

Long answer: I would say that there is no single correct answer (I'll
try to give a few examples below), but I find it quite important not to
do any "lossy data import" at the time of importing the statistics.
Non-lossy import allows you to change the representation of data (what
to show and how) at any given point in the future.

The existing statistics page discards a lot of information at the time
of import. For example: it just counts the overall number of users per
macOS version, which turned out to be a completely useless piece of
information if it's not correlated with time. We want to know how many
users of 10.8 we have today, not counting the users who have migrated
since.
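
To make the point about timestamps concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table name `submissions`, its columns and the sample rows are made up for illustration; they are not the actual MacPorts schema. The sketch keeps every submission as a raw row and derives "users per OS version in a given month" from each user's latest report in that month, so a user who has migrated is no longer counted under the old version.

```python
import sqlite3

# Hypothetical raw table: one row per statistics submission.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (uuid TEXT, submitted_at TEXT, os_version TEXT)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?)",
    [
        ("user-a", "2019-02-03", "10.8"),
        ("user-a", "2019-03-10", "10.14"),  # user-a migrated in March
        ("user-b", "2019-02-05", "10.8"),
        ("user-b", "2019-03-12", "10.8"),
        ("user-c", "2019-03-01", "10.13"),
        ("user-c", "2019-03-20", "10.14"),  # user-c upgraded within March
    ],
)


def users_per_os(conn, month):
    """Count users per OS version, based on each user's last report in `month` (YYYY-MM)."""
    rows = conn.execute(
        """
        SELECT s.os_version, COUNT(DISTINCT s.uuid)
        FROM submissions AS s
        JOIN (
            SELECT uuid, MAX(submitted_at) AS last_seen
            FROM submissions
            WHERE substr(submitted_at, 1, 7) = ?
            GROUP BY uuid
        ) AS latest
          ON latest.uuid = s.uuid AND latest.last_seen = s.submitted_at
        GROUP BY s.os_version
        """,
        (month,),
    ).fetchall()
    return dict(rows)


print(users_per_os(conn, "2019-02"))
print(users_per_os(conn, "2019-03"))
```

Because the timestamp is kept, the same table can answer the question for any month; a table holding only one (uuid, os_version) pair per user could not.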

A big mistake we made in the early days of GSOC is that we didn't try to
deploy the solutions early enough (this was properly deployed only long
after the GSOC was over), so the student only ever worked with made-up
data and nobody ever noticed that this would be a problem. But even if
we put that late deployment aside ... if the data hadn't been lost
during the statistics submission, we could still recalculate historical
data and change the representation to the exact form in which we want it
now (after months or years of experience and feedback).

If we still had raw data in the form of (uuid, timestamp, os_version) we
could still experiment with various data representations and draw the
desired graphs. Now we only keep (uuid, os_version) in the database.
Granted, from the second representation it's much easier to draw the
graph than from the first one, but the first one carries a lot more
information. With proper database indexing and some non-trivial SQL
queries you could easily draw "any graph you want" from the first table.

Ideally the database should contain only raw data, and then some views
to assist with further statistics. Certain pages could be cached, so
that the database would not need to recalculate the same data over and
over again even when the underlying data didn't change at all. Only if
we ran into serious performance issues would I start doing some
pre-calculations and storing them back in the database, maybe run
nightly, hourly or so.

Here are some examples of why I don't see a single correct answer to
your initial question. Let's assume that you know absolutely everything
about every MacPorts installation (exact timestamp of when each port was
installed or uninstalled, exact timestamp of MacPorts installations /
upgrades / removals ...) and you want to know the answer to "How many
users have port Foo installed on each OS version in March 2019?"

1.) Assume I have it installed on a computer in the office, but I was on
    vacation or a business trip all March, so the computer was not even
    online to submit its monthly statistics. Does that computer count?
    It won't count now as it would not submit the statistics, but it
    could count if you knew everything about that computer. If you
    recorded the event when I installed the port and didn't see any
    uninstallation / deactivation events since, you could still count
    it as active (maybe). Well, you could argue that I didn't use that
    computer for a month anyway, so it has every right not to be
    counted, which is a fair argument, but ...

2.) I also have that port on my laptop and I used it actively during
    that time. But since I was travelling, I hardly ever had access to
    the internet from the laptop (as good as never), so there would be
    no statistics sent either.

3.) I have that port on my old laptop which I haven't turned on for the
    last few months (but the software is still there). Even if you knew
    everything about the history of MacPorts installations on that
    laptop: would you count that port? Probably not, you cannot even
    know whether that computer ended up in recycling in the meantime.
    Then I open it again next month, the installation is still there,
    ports are reported as present. You could potentially interpolate the
    missing months and count the port as present in those months as well
    (you probably don't want to actually do that, I'm just providing
    some corner-case examples).

4.) You may know that the user installed the port on the 5th of March,
    uninstalled it again five days later, then installed it again on the
    25th. I assume you could in theory count this as
    "days_installed / days_in_month" (or seconds_installed /
    seconds_in_month), but that would be overdoing it; I would say that
    if the user reported the port as installed at least once in that
    month, count it as installed.
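
The "reported at least once in that month" rule can be sketched as follows. This is only an illustration: the tuple layout (uuid, date, active ports), the function names and the sample data are assumptions, not the real submission format. Collecting uuids in a set per (month, port) means a user who reports the same port several times in a month is still counted once.

```python
from collections import defaultdict

# Hypothetical weekly submissions: (uuid, date as "YYYY-MM-DD", active ports).
submissions = [
    ("user-a", "2019-03-02", {"foo", "bar"}),
    ("user-a", "2019-03-09", {"bar"}),  # foo uninstalled mid-month
    ("user-b", "2019-03-05", {"foo"}),
    ("user-b", "2019-03-12", {"foo"}),  # same port reported again
    ("user-b", "2019-03-19", {"foo"}),
    ("user-a", "2019-04-06", {"bar"}),
]


def monthly_installs(submissions):
    """Map (month, port) to the set of uuids that reported the port at least once."""
    seen = defaultdict(set)
    for uuid, date, ports in submissions:
        month = date[:7]  # "YYYY-MM"
        for port in ports:
            seen[(month, port)].add(uuid)
    return seen


def count_installs(submissions, port, month):
    """Number of distinct users who reported `port` at least once in `month`."""
    return len(monthly_installs(submissions).get((month, port), set()))


print(count_installs(submissions, "foo", "2019-03"))
print(count_installs(submissions, "foo", "2019-04"))
```

Here user-a still counts for foo in March despite uninstalling it, but drops out in April, and user-b's three reports count once.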

The only thing that you really need to be careful about is not to count
a certain port as installed 10 times in case one user upgraded that port
9 times.

Additional points to bear in mind (not with a high priority):

- This requires modification of base, but we might want to add
  statistics submission at each port install / uninstall / activate /
  deactivate command. Not something to implement right now, but maybe
  something to keep in the back of your mind when designing the database
  representation.
- I'm not sure how the current submission works; would statistics even
  be submitted if I'm offline at that one time in the week when I was
  supposed to send the statistics?

Mojca