Two hints: the NER from Solr isn't very good, and the relatedness function is 
iffy at best. 

I would take a different approach. Keep the NER data as you have it now and use 
shingles to build a better-formed, complete index using stop words, then use the 
MLT (MoreLikeThis) mechanism to see if the results are better. If they are, 
great; if not, it's just an idea.
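
To make the shingle idea concrete, here is a minimal Python sketch of what word 
shingles are. It only illustrates the concept: in Solr the shingles would be 
produced at index time by a shingle token filter in the field's analysis chain. 
The function name, the sample text, and the shingle sizes are made up for 
illustration.

```python
def shingles(tokens, min_size=2, max_size=3):
    """Return word n-grams (shingles) of min_size..max_size tokens, space-joined."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


# Hypothetical sample text; the stop words ("of", "the") are kept in the shingles.
tokens = "dysfunction of the mitochondria".split()
for s in shingles(tokens):
    print(s)
```

Indexing phrases like these alongside the NER fields gives MLT multi-word terms 
to compare, instead of isolated single words.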


> On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
> 
> Hello all,
> I'm experimenting with the SKG features available through the json.facet API 
> in Solr 8.11 to discover semantic relations between medical text 
> pre-annotated with NER algorithms.
> I store the NER annotations, annotation IDs, spans, etc. in separate Solr 
> fields to keep the text clean.
> 
> The first results look promising, but I found a behaviour that surprises me.
> To give a bit of context, I'm looking for covid-related papers with a 
> standard query (q parameter).
> Then I set my foreground query to a set of keywords related to mitochondria, 
> combined with OR, and the background query to *.
> 
> The json.facet parameters then look like this:
> 
> "json.facet": {
>    "gene": {
>      "type": "terms",
>      "field": "abstracts_gene_pubtator_annotation_ids",
>      "sort": { "r1": "desc" },
>      "limit": 3,
>      "facet": {
>        "r1": "relatedness($fore,$back)"
>      }
>    }
> }
> This should return the genes stored in abstracts_gene_pubtator_annotation_ids 
> that are more likely to occur in mitochondrial papers.
> Running a test query gives me this surprising result:
> 
> ...
>        "gene": {
>          "buckets": [
>            {
>              "val": "3091",
>              "count": 1,
>              "rtitles1": {
>                "relatedness": 0.55649,
>                "foreground_popularity": 0,
>                "background_popularity": 0.00018
>              }
>            },
> ...
> or, for a similar query, even larger relatedness values:
> ...
>    "buckets": [
>      {
>        "val": "MESH:D028361",
>        "count": 1,
>        "rabstract_conclusions0": {
>          "relatedness": 0.91506,
>          "foreground_popularity": 5e-05,
>          "background_popularity": 5e-05
>        },
> 
> ...
> 
> But if I recall the z-score formula
> 
> countFG("3091") - totalFG * probBG
> ------------------------------------------------
> sqrt( totalFG * (1-probBG)*probBG )
> 
> and set countFG("3091") to 1, this means that the relatedness should be 
> negative (or at most 0) if totalFG * probBG >= 1, while here I find a 
> clearly positive relatedness.
> Maybe this can be controlled with min_popularity, but I don't understand how 
> to use it in conjunction with type=terms and 
> field=abstracts_gene_pubtator_annotation_ids
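
The sign argument in the quoted formula can be checked numerically. The sketch 
below evaluates that formula exactly as written, taking probBG = 0.00018 from 
the background_popularity of the first result. The two totalFG values are 
invented, since the real foreground size is not given in the message, and the 
sketch says nothing about any further scaling Solr may apply before reporting 
relatedness.

```python
import math


def z_score(count_fg, total_fg, prob_bg):
    """z-score as written in the quoted message (not Solr's final relatedness)."""
    expected = total_fg * prob_bg
    return (count_fg - expected) / math.sqrt(total_fg * (1 - prob_bg) * prob_bg)


prob_bg = 0.00018  # background_popularity reported for "3091"

# Hypothetical foreground sizes: expected count below 1, then above 1.
for total_fg in (100, 10_000):
    z = z_score(1, total_fg, prob_bg)
    print(f"totalFG={total_fg}: totalFG*probBG={total_fg * prob_bg:.3f}, z={z:.3f}")
```

With countFG = 1, z is positive whenever totalFG * probBG < 1 and negative once 
it exceeds 1, so the sign depends on how large the foreground set actually is.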
> 
> Can you please tell me the correct syntax, and if my reasoning is correct?
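
On the syntax question: the Solr reference guide describes passing options such 
as min_popularity by writing relatedness() in its extended form, an object with 
"type": "func", instead of a plain string. A minimal sketch that builds such a 
json.facet value in Python follows. The 0.0001 threshold is a placeholder, not 
a recommendation, and $fore / $back are assumed to be defined as request 
parameters as in the original query.

```python
import json

# Extended form of the stat: an object with "type": "func" instead of a string.
facet = {
    "gene": {
        "type": "terms",
        "field": "abstracts_gene_pubtator_annotation_ids",
        "sort": {"r1": "desc"},
        "limit": 3,
        "facet": {
            "r1": {
                "type": "func",
                "func": "relatedness($fore,$back)",
                "min_popularity": 0.0001,  # placeholder threshold, tune for your data
            }
        },
    }
}

# Value to send as the json.facet request parameter.
print(json.dumps(facet, indent=2))
```

The terms facet itself (type, field, sort, limit) stays as it was; only the 
"r1" entry changes shape.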
> Thank you
> Danilo
> 
> Danilo Tomasoni
> 
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu