Hi Steffen,

On 2021-03-03 19:58, Steffen Möller wrote:
>
> On 03.03.21 at 17:39, Matus Kalas wrote:
>> Hey all again, and thanks for your thoughts Andrius and Andreas!
>>
>> On 2021-03-03 09:36, Andreas Tille wrote:
>>> Hi Andrius,
>>>
>>> On 2021-03-03 08:54, Andrius Merkys wrote:
>>>> Dear Matus,
>>>>
>>>> On 2021-03-02 19:56, Matus Kalas wrote:
>>>>> I'd suggest hearing from the folks who have done most of the work
>>>>> with manually including those IDs, and letting them approve/decide.
>>>>
>>>> Absolutely!
>>
>> Steffen et al., your opinions on this matter?
>
> Sorry for being late on this.
>
> So, "NA" indeed means something like "hey, I checked but this was not
> found". This information should not be lost.
>
> An empty entry, as if from a template, does not have the same meaning.
> Whether NA (which is how R expects it and I found it likely to be easier
> to parse) or N/A - I would not bother to make all these changes and would
> just leave it. Indeed, on the Excel sheet I am using N/A.
>
> As it happens, we had a quick exchange of thoughts on Zoom today and I
> tend to think that the general idea is that these NAs have to disappear,
> i.e. add these entries to bio.tools.

Thank you for confirming the distinction between an empty value and "NA".

>>>>> I can imagine that for purely practical reasons in the process of the
>>>>> manual curation, it might make sense to allow explicitly:
>>>>>   - Name: OMICtools
>>>>>     Entry: N/A   (Meaning: I have checked and there was no record)
>>>>>   - Name: bio.tools
>>>>>     Entry: ""    (Meaning: I or someone else should check this out;
>>>>>                   or perhaps: I checked but wasn't conclusive yet)
>>>>>
>>>>> The latter might be useful for contributors who aren't used to all
>>>>> those IDs, to make them more visible (including where the gaps are).
>>>>> But on the other hand, if those are well present in an
>>>>> upstream/metadata template and very clear in the documentation of
>>>>> upstream/metadata, then it is not necessary and I'd then tend to
>>>>> like your suggestion Andrius.
>>>>
>>>> To me, three flavors of "unknown" look like overkill. Most of the
>>>> metadata in Debian does not even have the two flavors of "unknown":
>>>> missing Bug-Submit field in d/u/metadata, Homepage in d/control and
>>>> Upstream-Contact in d/copyright means that this piece of information is
>>>> either nonexistent or simply not entered (for example, due to the lack
>>>> of time). Thus I am not sure whether the added value is worth the
>>>> infrastructure/effort here. But again, this is solely my opinion,
>>>> certainly not aimed at reflecting those of the people who enter and use
>>>> the data in d/u/metadata.
>
> Hm. I see the following:
>
>  * empty - nobody cared, yet
>  * "N/A" or "NA" or "<N/A>" or "<NA>" (the latter two I would prefer but
>    do not really care, may be too difficult in YAML since < is a special
>    character) - checked but not found
>  * "<rejected>" - bio.tools decided against referencing that package. We
>    are likely to see a few of these in the near future.

Just a suggestion: maybe a "Status" field could be of use here?
If more special values of "Entry" are about to be introduced, it is better
to use a separate field to make this more machine-readable. Suggested
values for "Status":

 * "confirmed" (default) - an entry in the registry is confirmed, and its
   ID is stored in the "Entry" field;
 * "not-found" - the registry was checked for a match, but none was found
   at that point in time (here a timestamp field could be of value);
 * "rejected" - the registry explicitly rejected an attempt to register
   the package;
 * "pending" - the package has been submitted to the registry, no
   response yet;
 * ...

>>> <all easy for Andreas>
>>>>
>>>> If three flavors option would be preferred, I would also suggest adding
>>>> date fields for each entry to signal at which point in time the
>>>> registry was inspected.
>>>
>>> As I wrote above later addition of some software to some registry can
>>> spoil the different meanings of unknown. This could be cured by such a
>>> date field but I don't think it is of any better value than draining
>>> time from people maintaining that extra field. Thus I do not think we
>>> should do this.
>>
>> We definitely don't need a date, git blame does that. Also in the form
>> of the Blame button in Salsa. Without a possibility for inconsistency.
>
> This may be material for another paper: Means to synchronize between
> volunteer databases.
>
>  * Provenance is accepted
>  * data transfer status - this is not yet happening in routine but this
>    is what we are doing here.
>
> @Andrius - If I do not need to be involved and if no information is
> lost, then I promise to be very happy with whatever you come up with,
> whatever this may be. The chance to have a reference named "NA", though,
> especially with all caps, that is darn close to zero and I wish you
> would invest/sink your valuable time into something else.

I do not want to interfere with the current practice nor cause loss of
valuable data.
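
To illustrate the "Status" suggestion above, here is a minimal Python sketch of how a consumer of d/u/metadata might map the special "Entry" values used today onto the proposed "Status" values. Everything in it is an assumption made for illustration: the function name status_of, the label "unchecked" for an empty entry, and the example identifier are hypothetical and are not part of DEP 12 or of any existing tool.

```python
# Hypothetical helper: maps the special "Entry" values discussed in this
# thread onto the proposed "Status" values. Not part of DEP 12.

NOT_FOUND_MARKERS = {"NA", "N/A", "<NA>", "<N/A>"}
REJECTED_MARKERS = {"<rejected>"}


def status_of(registry):
    """Return a status string for one registry record (a dict)."""
    # An explicit "Status" field, if present, takes precedence.
    if "Status" in registry:
        return registry["Status"]
    entry = registry.get("Entry")
    # Empty or missing entry: nobody has checked yet.
    # "unchecked" is a label invented for this sketch.
    if entry is None or str(entry).strip() == "":
        return "unchecked"
    entry = str(entry).strip()
    if entry in NOT_FOUND_MARKERS:
        return "not-found"
    if entry in REJECTED_MARKERS:
        return "rejected"
    # Anything else is taken to be a real registry ID.
    return "confirmed"


registries = [
    {"Name": "OMICtools", "Entry": "N/A"},
    {"Name": "bio.tools", "Entry": ""},
    {"Name": "bio.tools", "Entry": "example-tool"},  # made-up ID
    {"Name": "bio.tools", "Status": "pending"},
]
for record in registries:
    print(record["Name"], status_of(record))
```

With an explicit field, a registry ID that happens to be spelled "NA" would no longer collide with the marker, which is the machine-readability concern raised above.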
From the fact that the "NA" special value is not mentioned in DEP 12, I
assumed it had the same meaning as an empty value - thanks for confirming
I was wrong. I am fine with leaving it that way - as you say, the chance
of a registry entry having that name may be small. However, I know
nothing about the naming conventions of the registries, and my experience
with structured data makes me uneasy about special values. In any case
such values should be described, and I volunteer to update DEP 12 to
reflect the current usage.

> Best,
>
> Steffen
>
>
>>> --
>>> http://fam-tille.de
>>>>
>>>> Best,
>>>> Andrius
>>
>> There is one closely related issue, which we just briefly touched upon
>> with Steffen and Hervé in a telcon: What to do with those "NA"
>> packages that are missing in e.g. bio.tools?
>>
>> The registration in bio.tools (and surely also SciCrunch) could be
>> automated, but there are at least a couple of things needing human
>> curation:
>>
>> - Which src packages represent one tool (often e.g. libs | language
>>   bindings form separate Debian pkgs). How to mark this and where? Is
>>   there an existing Debian mechanism? Or do we need to abuse the
>>   d/u/metadata "Entry" for that, before they're added? (3rd or 4th
>>   flavour of info then 😀 ; btw. git branches could help here 😉 ; and
>>   not in google spreadsheet perhaps 😜 as it has to be machine-readable)
>>
>> - Choosing an available, reasonable biotoolsID and tool name.
>>   Ideally tool name and biotoolsID are identical, with the ID in all
>>   lower case and spaces removed/replaced.
>>
>> - Any other things needing human curation?
>>
>>
>>
>> Thank you all, I'm very happy seeing this progressing!
>> Matus
>>
>>
>> P.S.: Could you please leave all the contents in when replying to the
>> thread, so that others can reply to previously mentioned points
>> without having to read every single email in the thread and possibly
>> breaking linearity of it?
>> I agree that it's not ecological to
>> broadcast the same text all around the globe again and again, but
>> there are other solutions than emails that handle that without
>> compromising. Many thanks!
>>
>

Best wishes,
Andrius