Hi Davide, I assume that "abstracts_gene_pubtator_annotation_ids" just contains untokenized IDs, so I don't think tokenization, shingles, etc. are the issue here. What we want is for each ID to be a single term in the index.
If you want to debug the relatedness calculation, take a look at org.apache.solr.search.facet.RelatednessAgg#computeRelatedness.
The formula you mentioned is OK, but I would recommend remote debugging Solr and putting some breakpoints there to investigate if something doesn't look right.
Let me know!
--------------------------
*Alessandro Benedetti*
CEO @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Wed, 22 Jun 2022 at 08:37, Danilo Tomasoni <tomas...@cosbi.eu> wrote:

> Hello Dave, first of all thank you for your answer.
>
> I need to clarify that I've used separate (and quite good) NER algorithms offline and the results were imported into Solr.
>
> Unfortunately the approach you suggest, using the MoreLikeThis functionality, is not suitable for my needs, since I need to discover statistically significant relations between NER entities, while MLT will give me NER entities "similar" to the ones I'm looking for, as far as I understand.
>
> Does anyone know why the relatedness is high even if the foreground (and even background) popularity is 0?
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)
> Piazza Manifattura 1, 38068 Rovereto (TN), Italy
> tomas...@cosbi.eu <https://webmail.cosbi.eu/owa/redir.aspx?C=VNXi3_8-qSZTBi-FPvMwmwSB3IhCOjY8nuCBIfcNIs_5SgD-zNPWCA..&URL=mailto%3acalabro%40cosbi.eu>
> http://www.cosbi.eu <https://webmail.cosbi.eu/owa/redir.aspx?C=CkilyF54_imtLHzZqF1gCGvmYXjsnf4bzGynd8OXm__5SgD-zNPWCA..&URL=http%3a%2f%2fwww.cosbi.eu%2f>
>
> As for the European General Data Protection Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data, we inform you that all the data we possess are object of treatment in the respect of the normative provided for by the cited GDPR.
> It is your right to be informed on which of your data are used and how; you may ask for their correction, cancellation or you may oppose to their use by written request sent by recorded delivery to The Microsoft Research – University of Trento Centre for Computational and Systems Biology Scarl, Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
> Please don't print this e-mail unless you really need to
> ________________________________
> From: Dave <hastings.recurs...@gmail.com>
> Sent: Tuesday, 21 June 2022 19:51
> To: users@solr.apache.org <users@solr.apache.org>
> Subject: Re: Semantic Knowledge Graph theoric question
>
> Two hints. The NER from Solr isn't very good, and the relatedness function is iffy at best.
>
> I would take a different approach. Get the NER data as you have it now and use shingles to build a better-formed, complete index using stop words, then use the MLT mechanism to see if it's better. If it is, great; if not, it's just an idea.
>
>
> > On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
> >
> > Hello all,
> > I'm experimenting with the SKG features available through the json.facet API in Solr 8.11 to discover semantic relations in medical text pre-annotated with NER algorithms.
> > I store the NER annotations, annotation IDs, spans, etc. in separate Solr fields, to keep the text clean.
> >
> > The first results look promising, but I found a behaviour that surprises me.
> > To give a bit of context, I'm looking for COVID-related papers with a standard query (q parameter).
> > Then I set my foreground query to a set of keywords, combined with OR, related to the mitochondria, and the background query is set to *.
> >
> > Then the json.facet parameters are like
> >
> > "json.facet": {
> >   "gene": {
> >     "type": "terms",
> >     "field": "abstracts_gene_pubtator_annotation_ids",
> >     "sort": { "r1": "desc" },
> >     "limit": 3,
> >     "facet": {
> >       "r1": "relatedness($fore,$back)"
> >     }
> >   }
> > }
> >
> > This should return the genes stored in abstracts_gene_pubtator_annotation_ids that are more likely to occur in mitochondrial papers.
> > Running a test query gives me this surprising result
> >
> > ...
> > "gene": {
> >   "buckets": [
> >     {
> >       "val": "3091",
> >       "count": 1,
> >       "rtitles1": {
> >         "relatedness": 0.55649,
> >         "foreground_popularity": 0,
> >         "background_popularity": 0.00018
> >       }
> >     },
> > ...
> > or, for a similar query, even bigger relatedness values
> > ...
> > "buckets": [
> >   {
> >     "val": "MESH:D028361",
> >     "count": 1,
> >     "rabstract_conclusions0": {
> >       "relatedness": 0.91506,
> >       "foreground_popularity": 5e-05,
> >       "background_popularity": 5e-05
> >     },
> >
> > ...
> >
> > But if I recall the z-score formula
> >
> >       countFG("3091") - totalFG * probBG
> > ------------------------------------------------
> >      sqrt( totalFG * (1-probBG)*probBG )
> >
> > and set countFG("3091") to 1, the relatedness should be negative (or at most 0) whenever totalFG * probBG >= 1, while here I find a clearly positive relatedness.
> > Maybe this can be controlled with min_popularity, but I don't understand how to use it in conjunction with type=terms and field=abstracts_gene_pubtator_annotation_ids.
> >
> > Could you please tell me the correct syntax, and whether my reasoning is correct?
> > Thank you
> > Danilo
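
The z-score reasoning in the original question can be checked numerically. What follows is a minimal Python sketch that implements only the formula as written in the thread; it is not Solr's RelatednessAgg code, and it does not reproduce whatever post-processing Solr applies to obtain the reported relatedness value. The function name z_score and the foreground sizes (1,000 and 10,000 documents) are hypothetical, since the thread does not report totalFG. The background probability 0.00018 is the background_popularity of bucket "3091", used here as probBG, which is itself an assumption.

```python
import math


def z_score(count_fg: int, total_fg: int, prob_bg: float) -> float:
    """z-score as written in the thread:

    (countFG - totalFG * probBG) / sqrt(totalFG * (1 - probBG) * probBG)
    """
    # Foreground count expected if the term occurred at the background rate.
    expected = total_fg * prob_bg
    std_dev = math.sqrt(total_fg * (1.0 - prob_bg) * prob_bg)
    if std_dev == 0.0:
        raise ValueError("z-score is undefined when the denominator is zero")
    return (count_fg - expected) / std_dev


# background_popularity of bucket "3091", treated as probBG (assumption).
PROB_BG = 0.00018

# Hypothetical foreground sizes; the thread does not report totalFG.
small_fg = z_score(count_fg=1, total_fg=1_000, prob_bg=PROB_BG)
large_fg = z_score(count_fg=1, total_fg=10_000, prob_bg=PROB_BG)

print(f"totalFG=1000  -> z = {small_fg:.3f}")   # totalFG * probBG < 1, so z > 0
print(f"totalFG=10000 -> z = {large_fg:.3f}")   # totalFG * probBG > 1, so z < 0
```

With a single foreground hit, the sign of the z-score flips once totalFG * probBG exceeds 1, which matches the reasoning in the question. A positive relatedness for count 1 is therefore consistent with this formula only if the foreground set is small relative to 1 / probBG.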