If you really want to have fun, build that index using the significant phrases plus the NER annotations and boost accordingly. I'm about 90% certain that if you do it well, you'll hit the mark. Amhik
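The shingle recipe from the quoted message below (three-word phrases, no stop words, a minimum existence of about 4) could be prototyped offline along these lines. This is only a rough sketch under stated assumptions: the function name `frequent_shingles`, the tiny `STOP_WORDS` set, and the regex tokenizer are all made up for illustration, "no stop words" is read as "drop stop words before shingling", and "minimum existence" is read as a raw occurrence count. It filters by frequency only and does not do any foreground/background significance scoring.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; swap in your own.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def frequent_shingles(docs, size=3, min_count=4, stop_words=STOP_WORDS):
    """Count word shingles of `size` tokens and keep those seen >= min_count times."""
    counts = Counter()
    for doc in docs:
        # Lowercase, split on non-alphanumerics, and drop stop words (assumption).
        tokens = [t for t in re.findall(r"[a-z0-9]+", doc.lower())
                  if t not in stop_words]
        # zip over shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(size))))
    return {" ".join(s): n for s, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = ["loss of the mitochondrial membrane potential"] * 4
    sample.append("an unrelated sentence about something else")
    for phrase, n in sorted(frequent_shingles(sample).items()):
        print(n, phrase)
```

The surviving phrases could then be indexed into their own field next to the NER annotations and boosted at query time, which is how I read the suggestion above.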
> On Jun 22, 2022, at 10:08 AM, Dave <hastings.recurs...@gmail.com> wrote:
>
> This is the right answer. I could go more in depth, but if you get the
> significant phrases rather than terms, using shingles, you will have better
> luck with a word length of three, no stop words, and a minimum existence of
> about 4. It’s a fun experiment.
>
>> On Jun 22, 2022, at 9:11 AM, Joel Bernstein <joels...@gmail.com> wrote:
>>
>> For an experiment you can test out the significantTerms Streaming
>> Expression, which uses the foreground count and background count to score
>> terms.
>>
>> https://solr.apache.org/guide/8_9/search-sample.html#significantterms
>> https://solr.apache.org/guide/8_9/stream-source-reference.html#significantterms-parameters
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>> On Wed, Jun 22, 2022 at 2:37 AM Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>>>
>>> Hello Dave, first of all thank you for your answer.
>>>
>>> I need to clarify that I've used separate (and quite good) NER algorithms
>>> offline and the results were imported into Solr.
>>>
>>> Unfortunately the approach that you suggest, using the MoreLikeThis
>>> functionality, is not suitable for my needs, since I need to discover
>>> statistically significant relations between NER entities, while MLT will
>>> give me NER entities "similar" to the ones I'm looking for, as far as I
>>> understand.
>>>
>>> Does anyone know why the relatedness is high even if the foreground (and
>>> even background) popularity is 0?
>>>
>>> Danilo Tomasoni
>>>
>>> Fondazione The Microsoft Research - University of Trento Centre for
>>> Computational and Systems Biology (COSBI)
>>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy
>>> tomas...@cosbi.eu
>>> http://www.cosbi.eu
>>>
>>> As for the European General Data Protection Regulation 2016/679 on the
>>> protection of natural persons with regard to the processing of personal
>>> data, we inform you that all the data we possess are object of treatment in
>>> the respect of the normative provided for by the cited GDPR.
>>> It is your right to be informed on which of your data are used and how;
>>> you may ask for their correction, cancellation or you may oppose to their
>>> use by written request sent by recorded delivery to The Microsoft Research
>>> – University of Trento Centre for Computational and Systems Biology Scarl,
>>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
>>> Please don't print this e-mail unless you really need to
>>> ________________________________
>>> From: Dave <hastings.recurs...@gmail.com>
>>> Sent: Tuesday, June 21, 2022 19:51
>>> To: users@solr.apache.org <users@solr.apache.org>
>>> Subject: Re: Semantic Knowledge Graph theoric question
>>>
>>> Two hints. The NER from Solr isn’t very good, and the relatedness function
>>> is iffy at best.
>>>
>>> I would take a different approach. Get the NER data as you have it now and
>>> use shingles to make a better-formed, complete index using stop words, then
>>> use the MLT mechanism to see if it’s better.
>>> If it is, great; if not, it’s just an idea.
>>>
>>>> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>>>>
>>>> Hello all,
>>>> I'm experimenting with the SKG features available through the json.facet
>>>> API in Solr 8.11 to discover semantic relations between medical texts
>>>> pre-annotated with NER algorithms.
>>>> I store the NER annotations, annotation IDs, spans, etc. in separate Solr
>>>> fields, to keep the text clean.
>>>>
>>>> The first results look promising, but I found a behaviour that surprises
>>>> me.
>>>> To give a bit of context, I'm looking for covid-related papers with a
>>>> standard query (q parameter).
>>>> Then I set my foreground query to be a set of keywords in OR related to
>>>> the mitochondria, and the background query is set to *.
>>>>
>>>> Then the json.facet parameters are like
>>>>
>>>> "json.facet": {
>>>>   "gene": {
>>>>     "type": "terms",
>>>>     "field": "abstracts_gene_pubtator_annotation_ids",
>>>>     "sort": { "r1": "desc" },
>>>>     "limit": 3,
>>>>     "facet": {
>>>>       "r1": "relatedness($fore,$back)"
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> This should give the genes stored in abstracts_gene_pubtator_annotation_ids
>>>> that are more likely to occur in mitochondrial papers.
>>>> Running a test query gives me this surprising result
>>>>
>>>> ...
>>>> "gene": {
>>>>   "buckets": [
>>>>     {
>>>>       "val": "3091",
>>>>       "count": 1,
>>>>       "rtitles1": {
>>>>         "relatedness": 0.55649,
>>>>         "foreground_popularity": 0,
>>>>         "background_popularity": 0.00018
>>>>       }
>>>>     },
>>>> ...
>>>> or, for a similar query, even bigger relatedness values
>>>> ...
>>>> "buckets": [
>>>>   {
>>>>     "val": "MESH:D028361",
>>>>     "count": 1,
>>>>     "rabstract_conclusions0": {
>>>>       "relatedness": 0.91506,
>>>>       "foreground_popularity": 5e-05,
>>>>       "background_popularity": 5e-05
>>>>     },
>>>>
>>>> ...
>>>>
>>>> But if I recall the z-score formula
>>>>
>>>>        countFG("3091") - totalFG * probBG
>>>>     -----------------------------------------
>>>>      sqrt( totalFG * (1 - probBG) * probBG )
>>>>
>>>> and set countFG("3091") to 1, this means that the relatedness should be
>>>> negative (or at most 0) if totalFG * probBG >= 1, while here I find a quite
>>>> positive relatedness.
>>>> Maybe this can be controlled with min_popularity, but I don't understand
>>>> how to use it in conjunction with type=terms and
>>>> field=abstracts_gene_pubtator_annotation_ids
>>>>
>>>> Can you please tell me the correct syntax, and if my reasoning is
>>>> correct?
>>>> Thank you
>>>> Danilo
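For anyone who wants to poke at the numbers in the original question, here is a small, hedged sketch covering two things. First, it evaluates the z-score exactly as written in the quoted formula, so you can see where the sign flips for countFG = 1. The foreground sizes (1,000 and 10,000) are invented for illustration, since the thread never states totalFG, and treating the reported `background_popularity` (0.00018) as probBG is an assumption. Second, it builds a json.facet body that passes `min_popularity` through the expanded `"type": "func"` form of the relatedness facet; that shape is what I remember from the JSON Facet API page of the Solr Reference Guide, so please verify it against the guide for your Solr version. The helper names `z_score` and `relatedness_facet` are hypothetical.

```python
import json
import math


def z_score(count_fg: int, total_fg: int, prob_bg: float) -> float:
    """z-score as written in the quoted formula (not Solr's final relatedness)."""
    variance = total_fg * prob_bg * (1.0 - prob_bg)
    if variance <= 0.0:
        raise ValueError("undefined when totalFG * probBG * (1 - probBG) is 0")
    return (count_fg - total_fg * prob_bg) / math.sqrt(variance)


def relatedness_facet(field: str, min_popularity: float, limit: int = 3) -> dict:
    """Terms facet sorted by relatedness, with min_popularity set.

    Assumption: the expanded {"type": "func", ...} form is how the
    option is passed; check the Solr Reference Guide before relying on it.
    """
    return {
        "gene": {
            "type": "terms",
            "field": field,
            "sort": {"r1": "desc"},
            "limit": limit,
            "facet": {
                "r1": {
                    "type": "func",
                    "func": "relatedness($fore,$back)",
                    "min_popularity": min_popularity,
                }
            },
        }
    }


if __name__ == "__main__":
    prob_bg = 0.00018  # background_popularity reported for bucket "3091"
    for total_fg in (1_000, 10_000):  # hypothetical foreground sizes
        print(total_fg, round(z_score(1, total_fg, prob_bg), 3))
    facet = relatedness_facet("abstracts_gene_pubtator_annotation_ids", 0.001)
    print(json.dumps(facet, indent=2))
```

With countFG fixed at 1, the z-score is positive while totalFG * probBG stays below 1 and turns negative once it goes above 1, which matches the reasoning in the quoted message. The sketch says nothing about how Solr maps the z-score onto the reported relatedness value.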