Two hints. The NER from Solr isn’t very good, and the relatedness function is iffy at best.
I would take a different approach. Get the NER data as you have it now, use shingles to build a better-formed, complete index using stop words, and then use the MLT (MoreLikeThis) mechanism to see whether the results improve. If they do, great; if not, it’s just an idea.

> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>
> Hello all,
> I'm experimenting with the SKG features available through the json.facet API
> in Solr 8.11 to discover semantic relations between medical texts
> pre-annotated with NER algorithms.
> I store the NER annotations, annotation ids, spans, etc. in separate Solr
> fields to keep the text clean.
>
> The first results look promising, but I found a behaviour that surprises me.
> To give a bit of context, I'm looking for covid-related papers with a
> standard query (q parameter).
> Then I set my foreground query to be a set of keywords, combined with OR,
> related to mitochondria, and the background query is set to *.
>
> Then the json.facet parameters are like
>
> "json.facet": {
>   "gene": {
>     "type": "terms",
>     "field": "abstracts_gene_pubtator_annotation_ids",
>     "sort": { "r1": "desc" },
>     "limit": 3,
>     "facet": {
>       "r1": "relatedness($fore,$back)"
>     }
>   }
> }
>
> This should give the genes stored in abstracts_gene_pubtator_annotation_ids
> that are more likely to occur in mitochondrial papers.
> Running a test query gives me this surprising result
>
> ...
> "gene": {
>   "buckets": [
>     {
>       "val": "3091",
>       "count": 1,
>       "rtitles1": {
>         "relatedness": 0.55649,
>         "foreground_popularity": 0,
>         "background_popularity": 0.00018
>       }
>     },
> ...
>
> or, for a similar query, even bigger relatedness values
>
> ...
> "buckets": [
>   {
>     "val": "MESH:D028361",
>     "count": 1,
>     "rabstract_conclusions0": {
>       "relatedness": 0.91506,
>       "foreground_popularity": 5e-05,
>       "background_popularity": 5e-05
>     },
> ...
>
> But if I recall the z-score formula
>
>        countFG("3091") - totalFG * probBG
>   ------------------------------------------------
>       sqrt( totalFG * (1-probBG)*probBG )
>
> and set countFG("3091") to 1, this means that the relatedness should be
> negative (or at most 0) if totalFG * probBG >= 1, while here I find a
> clearly positive relatedness.
> Maybe this can be controlled with min_popularity, but I don't understand how
> to use it in conjunction with type=terms and
> field=abstracts_gene_pubtator_annotation_ids
>
> Can you please tell me the correct syntax, and whether my reasoning is
> correct?
> Thank you
> Danilo
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1, 38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
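
For what it's worth, the sign argument in the quoted message can be checked numerically. The sketch below is only an illustration: it implements the z-score exactly as written in the quoted message, which may not be the full computation Solr performs for relatedness. The function name `z_score` and the foreground sizes are hypothetical; only `prob_bg = 0.00018` is taken from the first bucket's background_popularity.

```python
import math


def z_score(count_fg: int, total_fg: int, prob_bg: float) -> float:
    """z-score as written in the quoted message (an assumption about what
    relatedness is based on, not a copy of Solr's implementation)."""
    expected = total_fg * prob_bg  # expected foreground count under the background rate
    std_dev = math.sqrt(total_fg * (1.0 - prob_bg) * prob_bg)
    if std_dev == 0.0:
        raise ValueError("need total_fg > 0 and 0 < prob_bg < 1")
    return (count_fg - expected) / std_dev


if __name__ == "__main__":
    prob_bg = 0.00018  # background_popularity of bucket "3091" in the quoted result
    # Hypothetical foreground sizes, chosen to fall on both sides of total_fg * prob_bg = 1.
    for total_fg in (100, 1000, 20000):
        z = z_score(1, total_fg, prob_bg)
        print(f"total_fg={total_fg:>6}  total_fg*prob_bg={total_fg * prob_bg:.4f}  z={z:+.3f}")
```

With count_fg fixed at 1, the score is positive whenever total_fg * prob_bg is below 1 and negative once it goes above 1. So a positive value for a bucket with count 1 would be consistent with the formula if the foreground set is small relative to 1 / prob_bg; whether that is the case here depends on the actual foreground size, which the quoted output does not show.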