Hello Dave, first of all thank you for your answer.

I need to clarify that I used separate (and quite good) NER algorithms 
offline and imported the results into Solr.

Unfortunately, the approach you suggest, using the MoreLikeThis 
functionality, does not suit my needs: I need to discover 
statistically significant relations between NER entities, while MLT will, as far 
as I understand, give me NER entities "similar" to the ones I'm looking for.

Does anyone know why the relatedness is high even when the foreground (and even 
background) popularity is 0?
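
For reference, the z-score formula from my original message (quoted below) can be evaluated directly. This is only a minimal sketch of that formula: the function name z_score and the total_fg values are made up for illustration, and the probability 0.00018 is borrowed from the background_popularity of the first bucket on the assumption that it plays the role of probBG. The value Solr reports as relatedness may be a normalized version of this quantity, which the sketch does not model.

```python
import math


def z_score(count_fg: int, total_fg: int, prob_bg: float) -> float:
    """z-score as written in the quoted message:
    (countFG - totalFG * probBG) / sqrt(totalFG * (1 - probBG) * probBG)
    """
    expected = total_fg * prob_bg
    std_dev = math.sqrt(total_fg * (1.0 - prob_bg) * prob_bg)
    if std_dev == 0.0:
        # prob_bg of exactly 0 or 1 (or an empty foreground) makes the
        # denominator vanish, so the score is undefined.
        raise ValueError("z-score is undefined when the denominator is 0")
    return (count_fg - expected) / std_dev


if __name__ == "__main__":
    prob_bg = 0.00018  # assumed stand-in for probBG (illustrative only)
    # Hypothetical foreground sizes: the sign flips once total_fg * prob_bg > 1.
    for total_fg in (1_000, 10_000, 100_000):
        print(total_fg, z_score(1, total_fg, prob_bg))
```

With countFG fixed at 1, the sign of the result depends only on whether totalFG * probBG is below or above 1, which is the reasoning I am trying to check.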

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu

As for the European General Data Protection Regulation 2016/679 on the 
protection of natural persons with regard to the processing of personal data, 
we inform you that all the data we possess are object of treatment in the 
respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may 
ask for their correction, cancellation or you may oppose to their use by 
written request sent by recorded delivery to The Microsoft Research – 
University of Trento Centre for Computational and Systems Biology Scarl, Piazza 
Manifattura 1, 38068 Rovereto (TN), Italy.
Please don't print this e-mail unless you really need to
________________________________
From: Dave <hastings.recurs...@gmail.com>
Sent: Tuesday, June 21, 2022 19:51
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Semantic Knowledge Graph theoric question


Two hints: the NER from Solr isn’t very good, and the relatedness function is 
iffy at best.

I would take a different approach. Take the NER data as you have it now and use 
shingles to build a better-formed, complete index using stop words, then use the 
MLT mechanism to see if it’s better. If it is, great; if not, it’s just an idea.


> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
>
> Hello all,
> I'm experimenting with the SKG features available through the json.facet API in 
> Solr 8.11 to discover semantic relations in medical text pre-annotated 
> with NER algorithms.
> I store the NER annotations, annotation IDs, spans, etc. in separate Solr fields 
> to keep the text clean.
>
> The first results look promising, but I found a behaviour that surprises me.
> To give a bit of context, I'm looking for COVID-related papers with a standard 
> query (q parameter).
> Then I set my foreground query to a set of keywords, ORed together, related to 
> mitochondria, and the background query is set to *.
>
> The json.facet parameters then look like this:
>
> "json.facet": {
>    "gene":{
>      "type": "terms",
>      "field": "abstracts_gene_pubtator_annotation_ids",
>      "sort": { "r1": "desc" },
>      "limit": 3,
>      "facet": {
>        "r1" : "relatedness($fore,$back)"
>        }
>      }
>    }
> This should return the genes stored in abstracts_gene_pubtator_annotation_ids that 
> are more likely to occur in mitochondrial papers.
> Running a test query gives me this surprising result:
>
> ...
>        "gene": {
>          "buckets": [
>            {
>              "val": "3091",
>              "count": 1,
>              "rtitles1": {
>                "relatedness": 0.55649,
>                "foreground_popularity": 0,
>                "background_popularity": 0.00018
>              }
>            },
> ...
> or, for a similar query, even bigger relatedness values:
> ...
>    "buckets": [
>      {
>        "val": "MESH:D028361",
>        "count": 1,
>        "rabstract_conclusions0": {
>          "relatedness": 0.91506,
>          "foreground_popularity": 5e-05,
>          "background_popularity": 5e-05
>        },
>
> ...
>
> But if I recall the z-score formula
>
> countFG("3091") - totalFG * probBG
> ------------------------------------------------
> sqrt( totalFG * (1-probBG)*probBG )
>
> and set countFG("3091") to 1, this means that the relatedness should be 
> negative (or at most 0) if totalFG * probBG >= 1, while here I find a clearly 
> positive relatedness.
> Maybe this can be controlled with min_popularity, but I don't understand how 
> to use it in conjunction with type=terms and 
> field=abstracts_gene_pubtator_annotation_ids.
>
> Could you please tell me the correct syntax, and whether my reasoning is correct?
> Thank you
> Danilo
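
On the min_popularity syntax asked about in the quoted message: as far as I understand the relatedness options in the Solr Reference Guide, the function can be written in an expanded object form with "type": "func", which leaves room for a "min_popularity" key next to it. The sketch below only builds that facet as a Python dict and prints it as JSON; the helper name build_gene_facet and the threshold value are hypothetical, and the exact option names should be checked against the 8.11 documentation before relying on them.

```python
import json


def build_gene_facet(min_popularity: float) -> dict:
    """Build the terms facet from the quoted message, with relatedness
    written in expanded form so that min_popularity can be attached.
    (Assumed syntax; verify against the Solr Reference Guide.)"""
    return {
        "gene": {
            "type": "terms",
            "field": "abstracts_gene_pubtator_annotation_ids",
            "sort": {"r1": "desc"},
            "limit": 3,
            "facet": {
                "r1": {
                    "type": "func",
                    "func": "relatedness($fore,$back)",
                    "min_popularity": min_popularity,
                }
            },
        }
    }


if __name__ == "__main__":
    # 0.0001 is an arbitrary illustrative threshold, not a recommendation.
    print(json.dumps(build_gene_facet(0.0001), indent=2))
```

The printed object is meant to be passed as the json.facet parameter in place of the original one, with $fore and $back defined as before.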
