Thank you very much, Michael, for your answer. Below is the extra information you asked for, and a sample result.

QUERY INFORMATION

  query           = covid
  back query      = *:*
  fore query      = mitochondria
  sample gene ids = "57506" / "54205"

  facet code:
    "json.facet": "{'titles_gene': {'type': 'terms', 'field': 'titles_gene_pubtator_annotation_ids', 'limit': 10, 'sort': {'rtitles0': 'desc'}, 'facet': {'rtitles0': 'relatedness($fore,$back)'}}

CARDINALITY

  q                     = 157120
  back                  = 157120
  fore                  = 321385
  fore & back           = 343
  "57506" & back        = 8
  "57506" & fore & back = 5
  "54205" & back        = 5
  "54205" & fore & back = 5

RESULTS

  titles_gene
    "val": "57506",
    "count": 8,
    "rtitles0": {
      "relatedness": 0.69182,
      "foreground_popularity": 1e-05,
      "background_popularity": 1e-05
    }

  abstracts_gene
    "val": "54205",
    "count": 5,
    "rabstracts0": {
      "relatedness": 0.94975,
      "foreground_popularity": 0.00025,
      "background_popularity": 0.00043
    }

Here it looks like foreground_popularity and background_popularity are the same for titles_gene (I expected 5/343 and 8/157120 instead), but the relatedness of 0.69182 (it should range between -1 and 1) suggests to me that "57506" is strongly "characteristic" of the fore corpus with respect to the back corpus (meaning that it occurs more often in the fore than in the back, which is a superset of the fore).

I would like to ask:

1. Is my interpretation of relatedness correct?
2. Why do foreground_popularity and background_popularity have these values?
3. How should I change my json.facet query to require a min_popularity? Would this fix the strange relatedness values?

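To make the arithmetic behind questions 1 and 2 concrete, here is a minimal Python sketch that plugs the counts listed above for gene "57506" into the z-score formula quoted further down in this thread. It only reproduces that formula and the two ratios I expected; it is not Solr's implementation. The relatedness Solr reports is bounded between -1 and 1, so it is not the raw z-score, and the sketch does not try to reproduce that mapping. The function name `z_score` and the variable names are my own.

```python
import math


def z_score(count_fg, total_fg, count_bg, total_bg):
    """Raw z-score of a term's foreground count against its background rate.

    Follows the formula quoted in this thread:
        (countFG - totalFG * probBG) / sqrt(totalFG * (1 - probBG) * probBG)
    """
    prob_bg = count_bg / total_bg
    denom = math.sqrt(total_fg * (1 - prob_bg) * prob_bg)
    if denom == 0:
        # probBG is 0 or 1 (or the foreground is empty): z is undefined.
        raise ValueError("z-score undefined for these counts")
    return (count_fg - total_fg * prob_bg) / denom


# Counts reported above for gene "57506".
count_fg, total_fg = 5, 343      # "57506" & fore & back, fore & back
count_bg, total_bg = 8, 157120   # "57506" & back, back

# The ratios I expected to see as popularities.
expected_fg_ratio = count_fg / total_fg
expected_bg_ratio = count_bg / total_bg

z = z_score(count_fg, total_fg, count_bg, total_bg)

print(f"expected fg ratio: {expected_fg_ratio:.5f}")
print(f"expected bg ratio: {expected_bg_ratio:.5f}")
print(f"raw z-score:       {z:.2f}")
```

With these counts the expected number of foreground hits (totalFG * probBG) is far below 1, so 5 observed hits give a large positive z-score. That is consistent with a positive relatedness, but it also shows how few documents the score rests on.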
Thank you,
D

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)
Piazza Manifattura 1, 38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu

As for the European General Data Protection Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data, we inform you that all the data we possess are object of treatment in the respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may ask for their correction, cancellation or you may oppose to their use by written request sent by recorded delivery to The Microsoft Research – University of Trento Centre for Computational and Systems Biology Scarl, Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
Please don't print this e-mail unless you really need to
________________________________
From: Michael Gibney <mich...@michaelgibney.net>
Sent: Tuesday, 28 June 2022 14:50
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: [suspected SPAM] Re: Semantic Knowledge Graph theoric question

[CAUTION: EXTERNAL SENDER]
[Please check correspondence between Sender Display Name and Sender Email Address before clicking on any link or opening attachments]

It's hard to give a concrete answer without knowing the actual counts involved, but iiuc significantTerms and relatedness are basically equivalent (happy to be corrected here if I'm wrong).

> the relatedness function is iffy at best

? -- not sure what is meant by this.
It's a function, and afaict based on information available it's likely to be working as intended (and most likely equivalent to what you'd get for an analogous config of significantTerms streaming expression?).

If you're getting a hit count of 1 returned for any of these terms, then you're probably dealing with numbers low enough that you're going to mostly be getting noise -- indeed, that's the purpose of the min_popularity setting: to screen out results that don't have enough instances (e.g., 1) to calculate a meaningful correlation.

Of course feel free to experiment with the other suggestions above, but unless you also screen out low-frequency terms (i.e., with min_popularity) I'd be surprised if shingles would have a helpful impact on your use case.

Ultimately though, to reiterate: I think it's going to be hard to provide helpful feedback unless you're able to provide a sense of the integer counts involved (background set size, and intersection size (term of interest & foreground_set; term of interest & background_set)). It might also be helpful to have a sense of the overall cardinality and types of values in the field in question.

On Tue, Jun 28, 2022 at 8:17 AM Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>
> Thank you very much Alessandro.
> I will look into the code.
>
> ________________________________
> From: Alessandro Benedetti <a.benede...@sease.io>
> Sent: Monday, 27 June 2022 16:17
> To: users@solr.apache.org <users@solr.apache.org>
> Subject: [suspected SPAM] Re: Semantic Knowledge Graph theoric question
>
> Hi Davide,
> I assume that "abstracts_gene_pubtator_annotation_ids" just contains
> un-tokenized IDs, so I don't think this is a matter of tokenization,
> shingles etc.
> What we want is a single ID to be a single term in the index.
>
> If you want to debug the relatedness calculation, take a look here:
> org.apache.solr.search.facet.RelatednessAgg#computeRelatedness
> The formula you mentioned is ok, but I would recommend remote
> debugging Solr and putting some breakpoints there to investigate if
> something doesn't look right.
>
> Let me know!
>
> --------------------------
> *Alessandro Benedetti*
> CEO @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 22 Jun 2022 at 08:37, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>
> > Hello Dave, first of all thank you for your answer.
> >
> > I need to clarify that I've used separate (and quite good) NER algorithms
> > offline and the results were imported to solr.
> >
> > Unfortunately the approach that you suggest using the morelikethis
> > functionality is not suitable for my needs, since I need to discover
> > statistically significant relations between NER entities, while MLT will
> > give me NER entities "similar" to the ones I'm looking for, as far as I
> > understand.
> >
> > Does anyone know why the relatedness is high even if the foreground (and
> > even background) popularity is 0?
> >
> > ________________________________
> > From: Dave <hastings.recurs...@gmail.com>
> > Sent: Tuesday, 21 June 2022 19:51
> > To: users@solr.apache.org <users@solr.apache.org>
> > Subject: Re: Semantic Knowledge Graph theoric question
> >
> > Two hints. The NER from Solr isn't very good, and the relatedness function
> > is iffy at best.
> >
> > I would take a different approach. Get the NER data as you have it now and
> > use shingles to make a better-formed complete index using stop words, then
> > use the MLT mechanism to see if it's better.
> > If it is, great; if not, it's just an idea.
> >
> > > On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
> > >
> > > Hello all,
> > > I'm experimenting with the SKG features available through the json.facet
> > > API in solr 8.11 to discover semantic relations between medical texts
> > > pre-annotated with NER algorithms.
> > > I store the NER annotations, annotation id, span etc. in separate solr
> > > fields, to keep the text clean.
> > >
> > > The first results look promising, but I found a behaviour that surprises
> > > me.
> > > To give a bit of context, I'm looking for covid-related papers with a
> > > standard query (q parameter).
> > > Then I set my foreground query to be a set of keywords in OR related to
> > > the mitochondria, and the background query is set to *.
> > >
> > > Then the json.facet parameters are like
> > >
> > > "json.facet": {
> > >     "gene": {
> > >         "type": "terms",
> > >         "field": "abstracts_gene_pubtator_annotation_ids",
> > >         "sort": { "r1": "desc" },
> > >         "limit": 3,
> > >         "facet": {
> > >             "r1": "relatedness($fore,$back)"
> > >         }
> > >     }
> > > }
> > >
> > > This should give the genes stored in abstracts_gene_pubtator_annotation_ids
> > > that are more likely to occur in mitochondrial papers.
> > > Running a test query gives me this surprising result
> > >
> > > ...
> > > "gene": {
> > >     "buckets": [
> > >         {
> > >             "val": "3091",
> > >             "count": 1,
> > >             "rtitles1": {
> > >                 "relatedness": 0.55649,
> > >                 "foreground_popularity": 0,
> > >                 "background_popularity": 0.00018
> > >             }
> > >         },
> > > ...
> > >
> > > or, for a similar query, even bigger relatedness values
> > >
> > > ...
> > > "buckets": [
> > >     {
> > >         "val": "MESH:D028361",
> > >         "count": 1,
> > >         "rabstract_conclusions0": {
> > >             "relatedness": 0.91506,
> > >             "foreground_popularity": 5e-05,
> > >             "background_popularity": 5e-05
> > >         }
> > >     },
> > > ...
> > >
> > > But if I recall the z-score formula
> > >
> > >          countFG("3091") - totalFG * probBG
> > >     ------------------------------------------------
> > >          sqrt( totalFG * (1-probBG)*probBG )
> > >
> > > and set countFG("3091") to 1, this means that the relatedness should be
> > > negative (or at most 0) if totalFG * probBG >= 1, while here I find a quite
> > > positive relatedness.
> > > Maybe this can be controlled with min_popularity, but I don't understand
> > > how to use it in conjunction with type=terms and
> > > field=abstracts_gene_pubtator_annotation_ids
> > >
> > > Can you please tell me the correct syntax, and if my reasoning is
> > > correct?
> > > Thank you
> > > Danilo
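
On the min_popularity syntax asked about in the original message: the sketch below builds the same terms facet in Python, but writes the relatedness stat in the extended form that, as far as I can tell from the Solr Reference Guide, accepts options ("type": "func" plus a "min_popularity" key). Treat that form as an assumption to check against the documentation for your Solr version. The helper name `build_gene_facet` and the threshold 0.0001 are purely illustrative and were not suggested by anyone in the thread.

```python
import json


def build_gene_facet(field, min_popularity, limit=3):
    """Build a json.facet dict for a terms facet sorted by relatedness.

    The relatedness stat uses the extended (object) form so that a
    min_popularity option can be attached to it (assumed syntax).
    """
    relatedness_stat = {
        "type": "func",
        "func": "relatedness($fore,$back)",
        # Illustrative threshold: terms below it are meant to be screened out.
        "min_popularity": min_popularity,
    }
    return {
        "gene": {
            "type": "terms",
            "field": field,
            "sort": {"r1": "desc"},
            "limit": limit,
            "facet": {"r1": relatedness_stat},
        }
    }


facet = build_gene_facet("abstracts_gene_pubtator_annotation_ids", 0.0001)

# Serialized form, suitable for the json.facet request parameter.
json_facet_param = json.dumps(facet)
print(json.dumps(facet, indent=2))
```

The `$fore` and `$back` references still have to be supplied as separate request parameters, as in the original query. Whether a given threshold removes the count=1 buckets depends on how the popularity values are computed for the corpus in question.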