On 12/28/23 10:34, Rafael Laboissière wrote:
* M. Zhou <lu...@debian.org> [2023-12-27 19:00]:
Thanks for the code and the figure. Indeed, the trend is confirmed by
fitting a linear model count ~ year to the new members list. The
coefficient is -1.39 member/year, which is significantly different
from zero (F[1,22] = 11.8, p < 0.01). Even when we remove the data from
year 2001, which could be considered an outlier, the trend remains
significant, with a drop of 0.98 member/year (F[1,21] = 8.48, p < 0.01).
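(For reference, a fit like this can be sketched with scipy; the yearly counts below are placeholders for illustration, not the real figures from nm.debian.org.)

```python
import numpy as np
from scipy import stats

# Placeholder yearly counts of new Debian members (NOT the real data).
years = np.arange(2000, 2024)
counts = np.array([40, 55, 38, 35, 33, 30, 31, 28, 27, 25,
                   26, 24, 22, 23, 20, 19, 18, 17, 16, 15,
                   14, 13, 12, 11])

# Ordinary least-squares fit of count ~ year; for a simple regression,
# the F-test on the slope is equivalent to the t-test reported here.
res = stats.linregress(years, counts)
print(f"slope: {res.slope:.2f} members/year, p = {res.pvalue:.3g}")
```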
I thought about using population-statistics models, so that we could
estimate the DD birth rate and the DD retire/leave rate, as well as make
predictions. But since the descendants of DDs are not naturally new DDs,
the typical population models are unlikely to work well. The birth of a
DD is more like a mutation, in a sense.
Anyway, we do not need sophisticated mathematical models to draw the
conclusion that Debian is an aging community. And yet, we don't seem to
have a good way to reshape the curve using Debian's funds -- this is one
of the key problems behind the data.
P.S.1: The correct way to do the analysis above is by using a
generalized linear model, with the count data from a Poisson
distribution (or, perhaps, by considering overdispersed data). I will
eventually add this to my code in Git.
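(A minimal sketch of such a Poisson GLM, fit by Newton's method so it needs only numpy; the counts are again placeholders, and in practice statsmodels' `sm.GLM(..., family=sm.families.Poisson())` would be the convenient route.)

```python
import numpy as np

# Placeholder yearly counts (NOT the real nm.debian.org data).
years = np.arange(2000, 2024).astype(float)
counts = np.array([40, 55, 38, 35, 33, 30, 31, 28, 27, 25,
                   26, 24, 22, 23, 20, 19, 18, 17, 16, 15,
                   14, 13, 12, 11], dtype=float)

# Design matrix: intercept + centered year (centering keeps the
# intercept well-scaled under the log link).
X = np.column_stack([np.ones_like(years), years - years.mean()])

# Newton-Raphson on the Poisson log-likelihood with log link:
# log E[count] = b0 + b1 * (year - mean).
beta = np.array([np.log(counts.mean()), 0.0])
for _ in range(50):
    mu = np.exp(X @ beta)              # expected counts under current fit
    grad = X.T @ (counts - mu)         # score vector
    hess = X.T @ (X * mu[:, None])     # Fisher information (W = mu)
    beta += np.linalg.solve(hess, grad)

# On the log scale, the slope is a multiplicative yearly rate change.
print(f"expected count changes by a factor of {np.exp(beta[1]):.3f} per year")
```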
Why not integrate them into nm.debian.org when they are ready?
P.S.2: In your Python code, it is possible to get the data frame
directly from the web page, without copying and pasting. Just replace the
line:
df = pd.read_csv('members.csv', sep='\t')
by:
df = pd.read_html("https://nm.debian.org/members/")[0]
I am wondering whether ChatGPT could have figured this out…
I just specified the CSV input format based on what I had copied. It
produces well-formatted code with detailed documentation most of the
time. I deleted a lot from its output to keep the snippet short.
I have to clarify one thing to avoid giving you a wrong impression about
large language models. In fact, the performance of an LLM (such as
ChatGPT) varies greatly depending on the prompt and the context people
provide to it. Exploring this in-context learning capability is still
one of the cutting-edge research topics. With current LLMs, the answers
they give for boilerplate code like plotting (matplotlib) and simple
statistics (pandas) are nearly flawless.