* M. Zhou <lu...@debian.org> [2023-12-27 19:00]:
Thanks for sharing the figure. The data seems correlated with the
number of new Debian accounts. See the figure below:
Python Code for this figure:
```
# modified from ChatGPT.
# XXX: members.csv is copy-pasted from https://nm.debian.org/members/
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('members.csv', sep='\t')
df = df[df['Since'] != '(unknown)'] # filter out invalid data
df['Since'] = pd.to_datetime(df['Since'])
df['Year'] = df['Since'].dt.year
account_counts = df['Year'].value_counts().sort_index()
smoothed_counts = account_counts.rolling(window=3).mean()
plt.figure(figsize=(10, 6))
plt.bar(account_counts.index, account_counts.values, color='skyblue')
plt.plot(smoothed_counts.index, smoothed_counts.values, color='orange',
         label='Smoothed (Window=3)')
plt.xlabel('Year')
plt.ylabel('Number of Accounts Created')
plt.title('Number of Accounts Created Each Year')
plt.legend()
plt.savefig('nm-year.png')
```
Thanks for the code and the figure. Indeed, the trend is confirmed by
fitting a linear model count ~ year to the new members list. The
coefficient is -1.39 members/year, which is significantly different from
zero (F[1,22] = 11.8, p < 0.01). Even when we take out the data from year
2001, which could be interpreted as an outlier, the trend is still
significant, with a drop of 0.98 members/year (F[1,21] = 8.48, p < 0.01).
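For anyone who wants to try this kind of fit in Python, here is a minimal sketch of an ordinary least-squares fit of count ~ year together with its F statistic. It is not the code used for the numbers above. The function name fit_linear_trend and the example counts are hypothetical, made up for illustration only.

```python
def fit_linear_trend(years, counts):
    """Least-squares fit of counts ~ year.

    Returns (slope, intercept, F statistic, residual degrees of freedom).
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are computed in centered form for numerical stability.
    ss_resid = sum((y - (mean_y + slope * (x - mean_x))) ** 2
                   for x, y in zip(years, counts))
    ss_model = slope * sxy  # sum of squares explained by the regression
    df_resid = n - 2
    f_stat = ss_model / (ss_resid / df_resid)
    return slope, intercept, f_stat, df_resid


# Hypothetical counts, NOT the real nm.debian.org data.
example_years = [2000, 2001, 2002, 2003, 2004]
example_counts = [10, 9, 7, 6, 3]
slope, intercept, f_stat, df_resid = fit_linear_trend(example_years, example_counts)
print(f"slope = {slope:.2f} members/year, F[1,{df_resid}] = {f_stat:.1f}")
```

The p-value would then come from the upper tail of the F distribution with (1, n - 2) degrees of freedom, for instance scipy.stats.f.sf(f_stat, 1, df_resid) if SciPy is installed.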
Best,
Rafael Laboissière
P.S.1: The correct way to do the analysis above is to use a
generalized linear model, treating the counts as draws from a Poisson
distribution (or, perhaps, allowing for overdispersion). I will
eventually add this to my code in Git.
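As a rough illustration of what such a model involves, the sketch below fits a Poisson regression with a log link, log(E[count]) = a + b * (year - mean year), using Newton's method in plain Python. It is only a sketch under stated assumptions, not the analysis mentioned above. The function name fit_poisson_trend and the example counts are hypothetical. In practice a statistics package would normally be used instead.

```python
import math


def fit_poisson_trend(years, counts, iterations=25):
    """Poisson regression with log link, fitted by Newton's method.

    Model: log(E[count]) = a + b * (year - mean year).
    Returns (a, b, Pearson dispersion estimate).
    """
    mean_year = sum(years) / len(years)
    xs = [year - mean_year for year in years]  # centered predictor
    a = math.log(sum(counts) / len(counts))    # start from the overall mean rate
    b = 0.0
    for _ in range(iterations):
        mu = [math.exp(a + b * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g_a = sum(y - m for y, m in zip(counts, mu))
        g_b = sum(x * (y - m) for x, y, m in zip(xs, counts, mu))
        # Fisher information matrix (2x2, symmetric).
        i_aa = sum(mu)
        i_ab = sum(x * m for x, m in zip(xs, mu))
        i_bb = sum(x * x * m for x, m in zip(xs, mu))
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: solve the 2x2 system I * delta = g.
        a += (i_bb * g_a - i_ab * g_b) / det
        b += (i_aa * g_b - i_ab * g_a) / det
    mu = [math.exp(a + b * x) for x in xs]  # fitted means at the final estimates
    pearson = sum((y - m) ** 2 / m for y, m in zip(counts, mu))
    dispersion = pearson / (len(years) - 2)
    return a, b, dispersion


# Hypothetical counts, NOT the real nm.debian.org data.
example_years = list(range(2000, 2008))
example_counts = [30, 27, 29, 22, 24, 19, 20, 16]
a, b, dispersion = fit_poisson_trend(example_years, example_counts)
print(f"relative change per year: {math.exp(b) - 1:.1%}")
print(f"Pearson dispersion: {dispersion:.2f}")
```

Here exp(b) - 1 is the estimated relative change in the expected count per year. A dispersion estimate well above 1 would suggest overdispersion, in which case a quasi-Poisson or negative binomial model may be a better fit.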
P.S.2: In your Python code, it is possible to get the data frame directly
from the web page, without copying and pasting. Just replace the line:

    df = pd.read_csv('members.csv', sep='\t')

by:

    df = pd.read_html("https://nm.debian.org/members/")[0]

I am wondering whether ChatGPT could have figured this out…