If you really want to have fun, build that index using the significant
phrases plus the NER output and boost accordingly. I'm about 90% certain that
if you do it well, you'll hit the mark. Amhik

> On Jun 22, 2022, at 10:08 AM, Dave <hastings.recurs...@gmail.com> wrote:
> 
> This is the right answer.  I could go into more depth, but if you extract
> significant phrases rather than terms, using shingles, you will have better
> luck with a shingle length of three words, no stop words, and a minimum
> occurrence count of about 4. It's a fun experiment.
> 
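The shingle settings quoted above can be tried offline before touching the Solr schema. This is a minimal sketch in plain Python, not Solr's analysis chain, and it rests on assumptions: "no stop words" is read as "drop any shingle that contains a stop word", "minimum existence of about 4" is read as "appears in at least 4 documents", and `STOP_WORDS`, `shingles` and `frequent_phrases` are names made up for illustration. It only finds frequent phrases; scoring them for significance against a background set is a separate step.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def shingles(text, size=3):
    """Return the set of `size`-word shingles in `text` containing no stop word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = set()
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        if not any(tok in STOP_WORDS for tok in window):
            result.add(" ".join(window))
    return result


def frequent_phrases(docs, size=3, min_docs=4):
    """Count the documents each shingle occurs in; keep those in >= min_docs."""
    doc_freq = Counter()
    for doc in docs:
        # shingles() returns a set, so each document counts a phrase once.
        doc_freq.update(shingles(doc, size))
    return {phrase: n for phrase, n in doc_freq.items() if n >= min_docs}
```

Calling `frequent_phrases(list_of_abstracts)` returns a dict mapping each surviving three-word phrase to its document count, which could then be indexed in its own field and boosted.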
>> On Jun 22, 2022, at 9:11 AM, Joel Bernstein <joels...@gmail.com> wrote:
>> 
>> For an experiment you can test out the significantTerms Streaming
>> Expression, which uses the foreground count and background count to score
>> terms.
>> 
>> https://solr.apache.org/guide/8_9/search-sample.html#significantterms
>> https://solr.apache.org/guide/8_9/stream-source-reference.html#significantterms-parameters
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>> 
>> 
>>>> On Wed, Jun 22, 2022 at 2:37 AM Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>>> 
>>> Hello Dave, first of all thank you for your answer.
>>> 
>>> I need to clarify that I ran separate (and quite good) NER algorithms
>>> offline and imported the results into Solr.
>>> 
>>> Unfortunately, the MoreLikeThis approach you suggest is not suitable for
>>> my needs: I need to discover statistically significant relations between
>>> NER entities, while MLT, as far as I understand, will give me NER entities
>>> "similar" to the ones I'm looking for.
>>> 
>>> Does anyone know why the relatedness is high even when the foreground (and
>>> even background) popularity is 0?
>>> 
>>> Danilo Tomasoni
>>> 
>>> Fondazione The Microsoft Research - University of Trento Centre for
>>> Computational and Systems Biology (COSBI)
>>> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
>>> tomas...@cosbi.eu
>>> http://www.cosbi.eu
>>> 
>>> As for the European General Data Protection Regulation 2016/679 on the
>>> protection of natural persons with regard to the processing of personal
>>> data, we inform you that all the data we possess are object of treatment in
>>> the respect of the normative provided for by the cited GDPR.
>>> It is your right to be informed on which of your data are used and how;
>>> you may ask for their correction, cancellation or you may oppose to their
>>> use by written request sent by recorded delivery to The Microsoft Research
>>> – University of Trento Centre for Computational and Systems Biology Scarl,
>>> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
>>> Please don't print this e-mail unless you really need to
>>> ________________________________
>>> From: Dave <hastings.recurs...@gmail.com>
>>> Sent: Tuesday, June 21, 2022 19:51
>>> To: users@solr.apache.org <users@solr.apache.org>
>>> Subject: Re: Semantic Knowledge Graph theoric question
>>> 
>>> 
>>> 
>>> Two hints. The NER from Solr isn't very good, and the relatedness function
>>> is iffy at best.
>>> 
>>> I would take a different approach. Take the NER data as you have it now and
>>> use shingles to build a better-formed, complete index using stop words, then
>>> use the MLT mechanism to see if it's better. If it is, great; if not, it's
>>> just an idea.
>>> 
>>> 
>>>>> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>>>> 
>>>> Hello all,
>>>> I'm experimenting with the SKG features available through the json.facet
>>>> API in Solr 8.11 to discover semantic relations in medical text
>>>> pre-annotated with NER algorithms.
>>>> I store the NER annotations, annotation IDs, spans, etc. in separate Solr
>>>> fields to keep the text clean.
>>>> 
>>>> The first results look promising, but I found a behaviour that surprises
>>>> me.
>>>> To give a bit of context, I'm looking for COVID-related papers with a
>>>> standard query (the q parameter).
>>>> Then I set my foreground query to a set of mitochondria-related keywords
>>>> joined with OR, and the background query to *.
>>>> 
>>>> The json.facet parameters look like this:
>>>> 
>>>> "json.facet": {
>>>>  "gene":{
>>>>    "type": "terms",
>>>>    "field": "abstracts_gene_pubtator_annotation_ids",
>>>>    "sort": { "r1": "desc" },
>>>>    "limit": 3,
>>>>    "facet": {
>>>>      "r1" : "relatedness($fore,$back)"
>>>>      }
>>>>    }
>>>>  }
>>>> This should return the genes stored in
>>>> abstracts_gene_pubtator_annotation_ids that are most likely to occur in
>>>> mitochondrial papers.
>>>> Running a test query gives me this surprising result:
>>>> 
>>>> ...
>>>>      "gene": {
>>>>        "buckets": [
>>>>          {
>>>>            "val": "3091",
>>>>            "count": 1,
>>>>            "rtitles1": {
>>>>              "relatedness": 0.55649,
>>>>              "foreground_popularity": 0,
>>>>              "background_popularity": 0.00018
>>>>            }
>>>>          },
>>>> ...
>>>> Or, for a similar query, even larger relatedness values:
>>>> ...
>>>>  "buckets": [
>>>>    {
>>>>      "val": "MESH:D028361",
>>>>      "count": 1,
>>>>      "rabstract_conclusions0": {
>>>>        "relatedness": 0.91506,
>>>>        "foreground_popularity": 5e-05,
>>>>        "background_popularity": 5e-05
>>>>      },
>>>> 
>>>> ...
>>>> 
>>>> But if I recall the z-score formula correctly:
>>>> 
>>>> countFG("3091") - totalFG * probBG
>>>> ------------------------------------------------
>>>> sqrt( totalFG * (1-probBG)*probBG )
>>>> 
>>>> With countFG("3091") set to 1, the relatedness should be zero or negative
>>>> whenever totalFG * probBG >= 1, yet here I see a clearly positive
>>>> relatedness.
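The sign argument above can be checked numerically. This is a minimal sketch in Python of the z-score exactly as written in the quoted formula; it does not reproduce anything Solr may apply on top of the z-score before reporting `relatedness`. Only `probBG = 0.00018` comes from the quoted response. The two `totalFG` values are made up for illustration, since the thread never states the foreground size.

```python
import math


def z_score(count_fg, total_fg, prob_bg):
    """z-score as written in the quoted formula (not Solr's reported value)."""
    expected = total_fg * prob_bg
    std_dev = math.sqrt(total_fg * (1 - prob_bg) * prob_bg)
    return (count_fg - expected) / std_dev


PROB_BG = 0.00018  # background_popularity from the quoted response

# Hypothetical foreground sizes; the thread does not state totalFG.
small_fg = z_score(1, 1_000, PROB_BG)   # expected count 0.18, below 1
large_fg = z_score(1, 10_000, PROB_BG)  # expected count 1.8, above 1

print(small_fg, large_fg)
```

With a single foreground hit the score is positive only while totalFG * probBG < 1, which at this background probability means a foreground set smaller than roughly 5,555 documents. So a positive score for a count of 1 is consistent with the formula if the foreground set is small enough.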
>>>> Maybe this can be controlled with min_popularity, but I don't understand
>>>> how to use it together with type=terms and
>>>> field=abstracts_gene_pubtator_annotation_ids.
>>>> 
>>>> Can you please tell me the correct syntax, and whether my reasoning is
>>>> correct?
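On the syntax question: as far as I recall from the Solr Reference Guide's section on relatedness() options, the function can be written in an expanded form, with `"type": "func"`, the function string under `"func"`, and `min_popularity` as a sibling key. Please verify this against the guide for your Solr version before relying on it. The sketch below only builds and prints the facet body in Python; the helper name `skg_facet` and the threshold `0.0001` are arbitrary choices for illustration, not values from this thread.

```python
import json


def skg_facet(field, min_popularity, limit=3):
    """Build a terms facet whose relatedness stat carries a min_popularity option."""
    return {
        "gene": {
            "type": "terms",
            "field": field,
            "sort": {"r1": "desc"},
            "limit": limit,
            "facet": {
                # Expanded form of "r1": "relatedness($fore,$back)"
                "r1": {
                    "type": "func",
                    "func": "relatedness($fore,$back)",
                    "min_popularity": min_popularity,  # arbitrary example value
                }
            },
        }
    }


facet = skg_facet("abstracts_gene_pubtator_annotation_ids", min_popularity=0.0001)
print(json.dumps(facet, indent=2))
```

The printed JSON is what would go under `json.facet`, alongside the same `fore` and `back` request parameters as in the original query.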
>>>> Thank you
>>>> Danilo
>>>> 
>>> 
