Hey all again, and thanks for your thoughts Andrius and Andreas!
On 2021-03-03 09:36, Andreas Tille wrote:
Hi Andrius,
On 2021-03-03 08:54, Andrius Merkys wrote:
Dear Matus,
On 2021-03-02 19:56, Matus Kalas wrote:
I'd suggest hearing from the folks who have done most of the work
of manually including those IDs, and letting them approve/decide.
Absolutely!
Steffen et al., your opinions on this matter?
I can imagine that for purely practical reasons in the process of the
manual curation, it might make sense to allow explicitly:
- Name: OMICtools
  Entry: N/A (Meaning: I have checked and there was no record)
- Name: bio.tools
  Entry: "" (Meaning: I or someone else should check this out;
  or perhaps: I checked but wasn't conclusive yet)
The latter might be useful for contributors who aren't used to all
those IDs, to make them more visible (including where the gaps are).
But on the other hand, if those are well present in an
upstream/metadata template and very clear in the documentation of
upstream/metadata, then it is not necessary and I'd then tend to like
your suggestion, Andrius.
To me, three flavors of "unknown" look like overkill. Most of the
metadata in Debian does not even have two flavors of "unknown": a
missing Bug-Submit field in d/u/metadata, Homepage in d/control, or
Upstream-Contact in d/copyright means that this piece of information
is either nonexistent or simply not entered (for example, due to lack
of time). Thus I am not sure whether the added value is worth the
infrastructure/effort here. But again, this is solely my opinion,
certainly not aimed at reflecting those of the people who enter and
use the data in d/u/metadata.
I wrote the UDD importer for the metadata files and thus look at the
data as a "consumer" of the provided information. From this side those
different meanings of unknown are all turned into "ignore this value".
So in this respect, differentiating between those unknowns is helpful
mainly for those who edit the metadata files. Flagging something as
"I was here and have checked" is probably kind of helpful. However, it
may well be that some registry will include that specific software
later, so re-checking makes sense.
For this reason I was recommending not to make those simple things too
complex, since making it complex just drains time from the people who
are working on it with no visible effect for the users.
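A minimal sketch of the consumer-side view described above, in which every flavor of "unknown" collapses to "ignore this value". This is not the actual UDD importer code; the function names, the set of placeholder spellings, and the ID "example-tool" are assumptions made only for illustration.

```python
from typing import Optional

# Assumed placeholder spellings; the thread mentions "N/A", "NA" and "".
UNKNOWN_MARKERS = {"", "n/a", "na"}


def normalize_entry(value: Optional[str]) -> Optional[str]:
    """Return a usable registry ID, or None for any flavor of 'unknown'."""
    if value is None:
        return None
    cleaned = str(value).strip()
    if cleaned.lower() in UNKNOWN_MARKERS:
        return None
    return cleaned


def usable_registry_ids(registries: list) -> dict:
    """Map registry name to ID, dropping entries a consumer would ignore."""
    result = {}
    for item in registries:
        entry = normalize_entry(item.get("Entry"))
        if entry is not None:
            result[item["Name"]] = entry
    return result


# Hypothetical data shaped like the Name/Entry pairs discussed in the thread.
registries = [
    {"Name": "OMICtools", "Entry": "N/A"},
    {"Name": "bio.tools", "Entry": "example-tool"},
    {"Name": "SciCrunch"},
]
print(usable_registry_ids(registries))
```

Under these assumptions, a missing entry, an empty string, and "N/A" are indistinguishable to the consumer, which is why the distinction matters only to the people editing the files.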
If the three-flavors option is preferred, I would also suggest adding
date fields for each entry to signal at which point in time the
registry was inspected.
As I wrote above, a later addition of some software to some registry
can spoil the different meanings of unknown. This could be cured by
such a date field, but I don't think its value outweighs the time it
drains from the people maintaining that extra field. Thus I do not
think we should do this.
We definitely don't need a date; git blame does that, also in the form
of the Blame button in Salsa, and without any possibility of
inconsistency.
Thanks a lot for your work on this
Andreas.
--
http://fam-tille.de
Best,
Andrius
There is one closely related issue, which we just briefly touched upon
with Steffen and Hervé in a telcon: What to do with those "NA" packages
that are missing in e.g. bio.tools?
The registration in bio.tools (and surely also SciCrunch) could be
automated, but there are at least a couple of things needing human
curation:
- Which src packages represent one tool (often e.g. libs | language
bindings form separate Debian pkgs). How to mark this and where? Is
there an existing Debian mechanism? Or do we need to abuse the
d/u/metadata "Entry" for that, before they're added? (3rd or 4th flavour
of info then 😀 ; btw. git branches could help here 😉 ; and not in google
spreadsheet perhaps 😜 as it has to be machine-readable)
- Choosing an available, reasonable biotoolsID and tool name. Ideally
the tool name and biotoolsID are identical, with the ID in all lower
case and spaces removed/replaced.
- Any other things needing human curation?
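The naming convention in the second point could be sketched roughly as follows. This only derives a candidate ID from a tool name by lower-casing it and removing or replacing spaces, as described above; the function name, the `space_replacement` parameter, and the name "Example Tool" are hypothetical, and the sketch does not check whether the ID is available in bio.tools, which still needs a human or a registry lookup.

```python
import re


def candidate_biotools_id(tool_name: str, space_replacement: str = "") -> str:
    """Lower-case the tool name and remove or replace runs of whitespace."""
    # Collapse any run of whitespace to a single space first, so that
    # "Example  Tool" and "Example Tool" yield the same candidate.
    collapsed = re.sub(r"\s+", " ", tool_name.strip())
    return collapsed.lower().replace(" ", space_replacement)


# Hypothetical tool name, with spaces removed and with spaces replaced.
print(candidate_biotools_id("Example Tool"))
print(candidate_biotools_id("Example Tool", "-"))
```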
Thank you all, I'm very happy seeing this progressing!
Matus
P.S.: Could you please leave all the contents in when replying to the
thread, so that others can reply to previously mentioned points without
having to read every single email in the thread and possibly breaking
linearity of it? I agree that it's not ecological to broadcast the same
text all around the globe again and again, but there are other solutions
than emails that handle that without compromising. Many thanks!