Re: Delete by Id in solr cloud

2022-06-28 Thread Radu Gheorghe
Hi Satya,

I didn't try it, but does it work if you add "shards=shard1,shard2..." to
the request?

Worst case scenario, if you have the address of each shard (you can get it
from Zookeeper), you can run the delete command N times, one hitting each
shard address.
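
For illustration only (I haven't tested this either, and the host, collection
name and ids below are made up), the raw request for the first idea would look
something like:

curl 'http://localhost:8983/solr/mycollection/update?shards=shard1,shard2&commit=true' \
  -H 'Content-Type: application/json' \
  -d '{"delete": ["displayid-1", "displayid-2"]}'

For the worst-case approach, you would post the same JSON body once per shard,
to each shard's core URL as listed in ZooKeeper / CLUSTERSTATUS, e.g.
http://node1:8983/solr/mycollection_shard1_replica_n1/update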

Best regards,
Radu
--
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
Sematext Cloud - Full Stack Observability
http://sematext.com/


On Tue, Jun 28, 2022 at 7:55 AM Satya Nand 
wrote:

> Hi,
>
> I have an 8 shards collection, where I am using *compositeId* routing
> with *router.field
> *(a field named parentglUsrId). The unique Id of the collection is a
> different field *displayid*.
>
> I am trying a delete-by-id operation where I pass a list of displayids to
> delete. I observed that no documents are being deleted. When I checked the
> logs, I found that the deletion request for an Id may not go to the correct
> shard, and instead hits some other shard that is not hosting this Id. This is
> probably because Solr tries to find the shard based on the hash of displayid,
> but my sharding is done on the basis of parentglUsrId.
>
>
> Is there anything I am missing? It seems like a simple operation. What do I
> need to do to broadcast a delete-by-id request to all shards so the relevant
> id can be deleted on each shard?
>


Re: Delete by Id in solr cloud

2022-06-28 Thread Satya Nand
Hi Radu,

I am using SolrJ for executing the query. I couldn't find any function that
accepts additional parameters like routing, shards, SolrParams, etc.

I also tried deleteByQuery instead of deleteById, but it is very slow.

https://solr.apache.org/docs/8_1_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
deleteById(String collection, List<String> ids, int commitWithinMs)




On Tue, Jun 28, 2022 at 12:58 PM Radu Gheorghe 
wrote:

> Hi Satya,
>
> I didn't try it, but does it work if you add "shards=shard1,shard2..." to
> the request?
>
> Worst case scenario, if you have the address of each shard (you can get it
> from Zookeeper), you can run the delete command N times, one hitting each
> shard address.
>
> Best regards,
> Radu
> --
> Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
> Sematext Cloud - Full Stack Observability
> http://sematext.com/
>
>
> On Tue, Jun 28, 2022 at 7:55 AM Satya Nand  .invalid>
> wrote:
>
> > Hi,
> >
> > I have an 8 shards collection, where I am using *compositeId* routing
> > with *router.field
> > *(a field named parentglUsrId). The unique Id of the collection is a
> > different field *displayid*.
> >
> > I am trying a delete by id operation where I pass a list of displayids to
> > delete. I observed that no documents are being deleted. when I checked
> the
> > logs I found that the deletion request for an Id might not go to the
> > correct shard and perform a request on some other shard that was not
> > hosting this Id. This might be due to solr trying to find the shard based
> > on the hash of displayid but my sharding is done on the basis of
> > parentglUsrId.
> >
> >
> > is there anything I am missing? Because it seems like a simple operation.
> > what do I need to do to broadcast a delete by id request to all shards so
> > relevant id can be deleted on each shard?
> >
>


RE: Delete by Id in solr cloud

2022-06-28 Thread Peter Lancaster
Hi Satya,

I think you would need to use an HttpSolrClient that uses the URL of the shard
where the record exists.
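
A minimal sketch of that approach (the shard core URL and ids below are
placeholders, not from your cluster):

import java.util.Arrays;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardDelete {
  public static void main(String[] args) throws Exception {
    // Point the client at the core hosting the shard; the real URL can be
    // read from ZooKeeper or the Collections API CLUSTERSTATUS.
    try (HttpSolrClient shardClient = new HttpSolrClient.Builder(
        "http://node1:8983/solr/mycollection_shard1_replica_n1").build()) {
      // Delete directly on that core, committing within 10 seconds.
      shardClient.deleteById(Arrays.asList("displayid-1", "displayid-2"), 10000);
    }
  }
}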

Regards,
Peter.

-Original Message-
From: Satya Nand 
Sent: 28 June 2022 10:43
To: users@solr.apache.org
Subject: Re: Delete by Id in solr cloud


Hi Radu,

I am using solrj for executing the query. I couldn't find any function with 
accepts additional parameters like routing, shards, solr Params etc.

I also tried delete by query instead of deleteById, But it is very slow.

https://solr.apache.org/docs/8_1_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
deleteById(String collection, List<String> ids, int commitWithinMs)




On Tue, Jun 28, 2022 at 12:58 PM Radu Gheorghe 
wrote:

> Hi Satya,
>
> I didn't try it, but does it work if you add "shards=shard1,shard2..."
> to the request?
>
> Worst case scenario, if you have the address of each shard (you can
> get it from Zookeeper), you can run the delete command N times, one
> hitting each shard address.
>
> Best regards,
> Radu
> --
> Elasticsearch/OpenSearch & Solr Consulting, Production Support &
> Training Sematext Cloud - Full Stack Observability
> http://sematext.com/
>
>
> On Tue, Jun 28, 2022 at 7:55 AM Satya Nand  .invalid>
> wrote:
>
> > Hi,
> >
> > I have an 8 shards collection, where I am using *compositeId*
> > routing with *router.field *(a field named parentglUsrId). The
> > unique Id of the collection is a different field *displayid*.
> >
> > I am trying a delete by id operation where I pass a list of
> > displayids to delete. I observed that no documents are being
> > deleted. when I checked
> the
> > logs I found that the deletion request for an Id might not go to the
> > correct shard and perform a request on some other shard that was not
> > hosting this Id. This might be due to solr trying to find the shard
> > based on the hash of displayid but my sharding is done on the basis
> > of parentglUsrId.
> >
> >
> > is there anything I am missing? Because it seems like a simple operation.
> > what do I need to do to broadcast a delete by id request to all
> > shards so re

Re: Delete by Id in solr cloud

2022-06-28 Thread Satya Nand
Thanks, Peter,
I am checking that, also UpdateRequest class seems to have methods that
take routes as input. I will see if it helps.
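
If it does, this is roughly what I am going to try (ZooKeeper host, collection
name and the route values below are placeholders, not verified yet):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class RoutedDelete {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
      UpdateRequest req = new UpdateRequest();
      // The second argument is the route, i.e. the router.field value
      // (parentglUsrId) of the document, so the delete should be forwarded
      // to the shard that actually owns that displayid.
      req.deleteById("displayid-1", "parent-42");
      req.deleteById("displayid-2", "parent-99");
      req.setCommitWithin(10000);
      req.process(client, "mycollection");
    }
  }
}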

On Tue, Jun 28, 2022 at 3:19 PM Peter Lancaster <
peter.lancas...@findmypast.com> wrote:

> Hi Satya,
>
> I think you would need to use a HttpSolrClient that uses the url of the
> shard where the record exists.
>
> Regards,
> Peter.
>
> -Original Message-
> From: Satya Nand 
> Sent: 28 June 2022 10:43
> To: users@solr.apache.org
> Subject: Re: Delete by Id in solr cloud
>
>
> Hi Radu,
>
> I am using solrj for executing the query. I couldn't find any function
> with accepts additional parameters like routing, shards, solr Params etc.
>
> I also tried delete by query instead of deleteById, But it is very slow.
>
>
> https://solr.apache.org/docs/8_1_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
> deleteById(String collection, List<String> ids, int commitWithinMs)
>
>
>
>
> On Tue, Jun 28, 2022 at 12:58 PM Radu Gheorghe  >
> wrote:
>
> > Hi Satya,
> >
> > I didn't try it, but does it work if you add "shards=shard1,shard2..."
> > to the request?
> >
> > Worst case scenario, if you have the address of each shard (you can
> > get it from Zookeeper), you can run the delete command N times, one
> > hitting each shard address.
> >
> > Best regards,
> > Radu
> > --
> > Elasticsearch/OpenSearch & Solr Consulting, Production Support &
> > Training Sematext Cloud - Full Stack Observability
> > http://sematext.com/
> >
> >
> > On Tue, Jun 28, 2022 at 7:55 AM Satya Nand  > .invalid>
> > wrote:
> >
> > > Hi,
> > >
> > > I have an 8 shards collection, where I am using *compositeId*
> > > routing with *router.field *(a field named parentglUsrId). The
> > > unique Id of the collection is a different field *displayid*.
> > >
> > > I am trying a delete by id operation where I pass a list of
> > > displayids to delete. I observed that no documents are being
> > > deleted. when I checked
> > the
> > > logs I found that the deletion request for an Id might not go to the
> > > correct 

R: [suspected SPAM] Re: Semantic Knowledge Graph theoric question

2022-06-28 Thread Danilo Tomasoni
Thank you very much Alessandro.
I will look into the code.


Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu

As for the European General Data Protection Regulation 2016/679 on the 
protection of natural persons with regard to the processing of personal data, 
we inform you that all the data we possess are object of treatment in the 
respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may 
ask for their correction, cancellation or you may oppose to their use by 
written request sent by recorded delivery to The Microsoft Research – 
University of Trento Centre for Computational and Systems Biology Scarl, Piazza 
Manifattura 1, 38068 Rovereto (TN), Italy.
P Please don't print this e-mail unless you really need to

Da: Alessandro Benedetti 
Inviato: lunedì 27 giugno 2022 16:17
A: users@solr.apache.org 
Oggetto: [suspected SPAM] Re: Semantic Knowledge Graph theoric question


Hi Davide,
I assume that "abstracts_gene_pubtator_annotation_ids" just contains
un-tokenized ids, so I don't think tokenization, shingles, etc. are the issue here.
What we want is a single ID to be a single term in the index.

If you want to debug the relatedness calculation, take a look here :
org.apache.solr.search.facet.RelatednessAgg#computeRelatedness
The formula you mentioned is ok, but I would recommend remote
debugging Solr and putting some breakpoints there to investigate if
something doesn't look right.

Let me know!

--
*Alessandro Benedetti*
CEO @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io 
LinkedIn  | Twitter
 | Youtube
 | Github



On Wed, 22 Jun 2022 at 08:37, Danilo Tomasoni  wrote:

> Hello Dave, first of all thank you for your answer.
>
> I need to clarify that I've used separate (and quite good) NER  algorithms
> offline and the results were imported to solr.
>
> Unfortunately the approach that you suggest using the morelikethis
> functionality is not suitable for my needs since I need to discover
> statistically significative relations between NER entities, while MLT will
> give me NER entities "similar" to the ones I'm looking for, as far as I
> understand.
>
> Anyone knows why the relatedness is high even if the foreground (and even
> background) popularity is 0?
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu<
> https://webmail.cosbi.eu/owa/redir.aspx?C=VNXi3_8-qSZTBi-FPvMwmwSB3IhCOjY8nuCBIfcNIs_5SgD-zNPWCA..&URL=mailto%3acalabro%40cosbi.eu
> >
> http://www.cosbi.eu<
> https://webmail.cosbi.eu/owa/redir.aspx?C=CkilyF54_imtLHzZqF1gCGvmYXjsnf4bzGynd8OXm__5SgD-zNPWCA..&URL=http%3a%2f%2fwww.cosbi.eu%2f
> >
>
> As for the European General Data Protection Regulation 2016/679 on the
> protection of natural persons with regard to the processing of personal
> data, we inform you that all the data we possess are object of treatment in
> the respect of the normative provided for by the cited GDPR.
> It is your right to be informed on which of your data are used and how;
> you may ask for their correction, cancellation or you may oppose to their
> use by written request sent by recorded delivery to The Microsoft Research
> – University of Trento Centre for Computational and Systems Biology Scarl,
> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
> P Please don't print this e-mail unless you really need to
> 
> Da: Dave 
> Inviato: martedì 21 giugno 2022 19:51
> A: users@solr.apache.org 
> Oggetto: Re: Semantic Knowledge Graph theoric question
>
>
> Two hints. The ner from solr isn’t very good, and the relatedness function
> is iffy at best.
>
> I would take a different approach. Get the ner data as you have it now and
> use shingles t

Re: [suspected SPAM] Re: Semantic Knowledge Graph theoric question

2022-06-28 Thread Michael Gibney
It's hard to give a concrete answer without knowing the actual counts
involved, but iiuc significantTerms and relatedness are basically
equivalent (happy to be corrected here if I'm wrong).

> the relatedness function is iffy at best

? -- not sure what is meant by this. It's a function, and afaict based
on information available it's likely to be working as intended (and
most likely equivalent to what you'd get for an analogous config of
significantTerms streaming expression?). If you're getting a hit count
of 1 returned for any of these terms, then you're probably dealing
with numbers low enough that you're going to mostly be getting noise
-- indeed, that's the purpose of the min_popularity setting: to screen
out results that don't have enough instances (e.g., 1) to calculate a
meaningful correlation.

Of course feel free to experiment with the other suggestions above,
but unless you also screen out low-frequency terms (i.e., with
min_popularity) I'd be surprised if shingles would have a helpful
impact on your use case.

Ultimately though, to reiterate: I think it's going to be hard to
provide helpful feedback unless you're able to provide a sense of the
integer counts involved (background set size, and intersection size
(term of interest & foreground_set; term of interest &
background_set). It might also be helpful to have a sense of the
overall cardinality and types of values in the field in question.

On Tue, Jun 28, 2022 at 8:17 AM Danilo Tomasoni  wrote:
>
> Thank you very much Alessandro.
> I will look into the code.
>
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
> As for the European General Data Protection Regulation 2016/679 on the 
> protection of natural persons with regard to the processing of personal data, 
> we inform you that all the data we possess are object of treatment in the 
> respect of the normative provided for by the cited GDPR.
> It is your right to be informed on which of your data are used and how; you 
> may ask for their correction, cancellation or you may oppose to their use by 
> written request sent by recorded delivery to The Microsoft Research – 
> University of Trento Centre for Computational and Systems Biology Scarl, 
> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
> P Please don't print this e-mail unless you really need to
> 
> Da: Alessandro Benedetti 
> Inviato: lunedì 27 giugno 2022 16:17
> A: users@solr.apache.org 
> Oggetto: [suspected SPAM] Re: Semantic Knowledge Graph theoric question
>
>
> Hi Davide,
> I assume that "abstracts_gene_pubtator_annotation_ids" just contains
> un-tokenized id, so I don't think there is a matter of tokenization,
> shingles etc.
> What we want is a single ID to be a single term in the index.
>
> If you want to debug the relatedness calculation, take a look here :
> org.apache.solr.search.facet.RelatednessAgg#computeRelatedness
> The formula you mentioned is ok, but I would recommend remote
> debugging Solr and putting some breakpoints there to investigate if
> something doesn't look right.
>
> Let me know!
>
> --
> *Alessandro Benedetti*
> CEO @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io 
> LinkedIn  | Twitter
>  | Youtube
>  | Github
> 
>
>
> On Wed, 22 Jun 2022 at 08:37, Danilo Tomasoni  wrote:
>
> > Hello Dave, first of all thank you for your answer.
> >
> > I need to clarify that I've used separate (and quite good) NER  algorithms
> > offline and the results were imported to solr.
> >
> > Unfortunately the approach that you suggest using the morelikethis
> > functionality is not suitable for my needs since I need to discover
> > statistically significative relations between NER entities, while MLT will
> > give me NER entities "similar" to the ones I'm looking for, as far as I
> > understand.
> >
> > Anyone knows why the relatedness is high even if the foreground (and even
> > background) popularity is 0?
> >
> > Danilo Tomasoni
> >
> > Fondazione The Microsoft Research - University of Trento Centre for
> > Computational 

R: [suspected SPAM] Re: Semantic Knowledge Graph theoric question

2022-06-28 Thread Danilo Tomasoni

Thank you very much Michael for your answer.
Below the extra information you asked for, and a sample result

QUERY INFORMATION
query=covid
back query = *:*
fore query = mitochondria
sample gene id ="57506" / "54205"

facet code:
"json.facet": "{'titles_gene': {'type': 'terms', 'field': 
'titles_gene_pubtator_annotation_ids', 'limit': 10, 'sort': {'rtitles0': 
'desc'}, 'facet': {'rtitles0': 'relatedness($fore,$back)'}}

CARDINALITY
q=157120
back=157120
fore=321385
fore & back=343

"57506" & back=8
"57506" & fore & back=5

"54205" & back=5
"54205" & fore & back=5

RESULTS
titles_gene
"val": "57506",
"count": 8,
"rtitles0": {
  "relatedness": 0.69182,
  "foreground_popularity": 1e-05,
  "background_popularity": 1e-05
}
abstracts_gene
"val": "54205",
"count": 5,
"rabstracts0": {
  "relatedness": 0.94975,
  "foreground_popularity": 0.00025,
  "background_popularity": 0.00043
}



Here it looks like the foreground_popularity and background_popularity are the
same for titles_gene (but I expected 5/343 and 8/157120 instead), while the
relatedness of 0.69182 (it should range between -1 and 1) tells me that "57506"
is strongly "characteristic" of the fore corpus with respect to the back corpus
(meaning that it occurs more in the fore than in the back, which is a superset
of the fore).


I would like to ask:

  1.  Is my interpretation of relatedness correct?
  2.  Why are foreground_popularity and background_popularity like this?
  3.  How should I change my json.facet query to require a min_popularity?
Should this fix the strange relatedness values?
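
From what I can tell from the JSON Facet API documentation, relatedness can
also be written in an extended map form that accepts min_popularity directly.
Is something like this (untested on my side, threshold value picked
arbitrarily) the right direction?

"json.facet": "{'titles_gene': {'type': 'terms', 'field':
'titles_gene_pubtator_annotation_ids', 'limit': 10, 'sort': {'rtitles0':
'desc'}, 'facet': {'rtitles0': {'type': 'func', 'func':
'relatedness($fore,$back)', 'min_popularity': 0.0005}}}}"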

thank you
D



Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu

As for the European General Data Protection Regulation 2016/679 on the 
protection of natural persons with regard to the processing of personal data, 
we inform you that all the data we possess are object of treatment in the 
respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may 
ask for their correction, cancellation or you may oppose to their use by 
written request sent by recorded delivery to The Microsoft Research – 
University of Trento Centre for Computational and Systems Biology Scarl, Piazza 
Manifattura 1, 38068 Rovereto (TN), Italy.
P Please don't print this e-mail unless you really need to

Da: Michael Gibney 
Inviato: martedì 28 giugno 2022 14:50
A: users@solr.apache.org 
Oggetto: Re: [suspected SPAM] Re: Semantic Knowledge Graph theoric question


It's hard to give a concrete answer without knowing the actual counts
involved, but iiuc significantTerms and relatedness are basically
equivalent (happy to be corrected here if I'm wrong).

> the relatedness function is iffy at best

? -- not sure what is meant by this. It's a function, and afaict based
on information available it's likely to be working as intended (and
most likely equivalent to what you'd get for an analogous config of
significantTerms streaming expression?). If you're getting a hit count
of 1 returned for any of these terms, then you're probably dealing
with numbers low enough that you're going to mostly be getting noise
-- indeed, that's the purpose of the min_popularity setting: to screen
out results that don't have enough instances (e.g., 1) to calculate a
meaningful correlation.

Of course feel free to experiment with the other suggestions above,
but unless you also screen out low-frequency terms (i.e., with
min_popularity) I'd be surprised if shingles would have a helpful
impact on your use case.

Ultimately though, to reiterate: I think it's going to be hard to
provide helpful feedback unless you're able to provide a sense of the
integer counts involved (background set size, and intersection size
(term of interest & foreground_set; term of interest &
background_set). It might also be helpful to have a sense of the
overall cardinality and types of values in the field in question.

On Tue, Jun 28, 2022 at 8:17 AM Danilo Tomasoni  wrote:
>
> Thank you very much Alessandro.
> I will look into the code.
>
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu

Currency field type with payloads

2022-06-28 Thread Geren White
Hello all,

I'm wondering if anyone has run into a scenario where they need a currency
field type with conversion support but also regional pricing? Right now we
are using the currency field type so we can support items in multiple
currencies and convert the price at query time. We'd like to add support
for multiple price types or regional pricing so we were looking at
leveraging payloads. Since payloads are supported through a filter on
TextField, and the currency field type doesn't support analyzers/filters, this
doesn't seem possible.
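
For context, the payload route we were considering is the usual
delimited-payload setup on a text field, roughly like this (the field and type
names below are just for illustration):

<fieldType name="payloaded_price" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>
<field name="regional_price" type="payloaded_price" indexed="true" stored="true"/>

That would let us index something like "US|19.99 EU|17.99" and read the values
back with the payload() function, but it loses the currency conversion that the
currency field type gives us.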

Anyone run into this and have suggestions?

-- 
*Geren White | Senior Director, Engineering*
*(e)* ge...@1stdibs.com


SE fetch function

2022-06-28 Thread Kojo
I am upgrading from Solr 6.6 to 9.0

I have some streaming expressions working fine on 6.6, where I fetch some
fields to enrich the gatherNodes function result-set.

Below is a much simpler example, where I am trying to fetch some
fields:

fetch(my_collection,
search(my_collection, qt="/export", q=*:*, fl="numero_processo",
sort="numero_processo asc", fq=(any_fq_exact:"any" AND
django_ct:"django_ct"), )
,fl="fomento_status_facet, data_inicio_ano ",
on="numero_processo=numero_processo",)


It was supposed to fetch the fields fl="fomento_status_facet,
data_inicio_ano", but the result-set doesn't carry these fields:

{
  "result-set": {
"docs": [
  {
"numero_processo": "05/51689-2"
  },
  {
"numero_processo": "13/07276-1"
  },
  {
"numero_processo": "13/07296-2"
  },
  {


Any hint?


Thank you!


Re: SE fetch function

2022-06-28 Thread Joel Bernstein
There were changes in the behavior of Solr's local param syntax which
affected the fetch expression. Check to see if there is a default defType
set in the solrconfig for the /select handler. For fetch to work, the
qparser in the /select handler needs to be the lucene qparser, which is the
default if not overridden in the solrconfig. See this ticket for details:
https://issues.apache.org/jira/browse/SOLR-11501.
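
In other words, if your 9.0 solrconfig.xml contains something like the snippet
below (illustrative only, not taken from your config), that defType default is
what breaks fetch:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">10</str>
    <!-- remove this override so fetch's internal lookups use the lucene qparser -->
    <str name="defType">edismax</str>
  </lst>
</requestHandler>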


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Jun 28, 2022 at 7:59 PM Kojo  wrote:

> I am upgrading from Solr 6.6 to 9.0
>
> I have some SE working fine on 6.6, where I fetch some fields to enrich the
> gatherNodes fucntion result-set.
>
> Bellow is a much more simple example, where I am trying to fetch some
> fields:
>
> fetch(my_collection,
> search(my_collection, qt="/export", q=*:*, fl="numero_processo",
> sort="numero_processo asc", fq=(any_fq_exact:"any" AND
> django_ct:"django_ct"), )
> ,fl="fomento_status_facet, data_inicio_ano ",
> on="numero_processo=numero_processo",)
>
>
> It was supposed to fetch the fields fl="fomento_status_facet,
> data_inicio_ano", but the result-set does't carry this fields:
>
> {
>   "result-set": {
> "docs": [
>   {
> "numero_processo": "05/51689-2"
>   },
>   {
> "numero_processo": "13/07276-1"
>   },
>   {
> "numero_processo": "13/07296-2"
>   },
>   {
>
>
> Any hint?
>
>
> Thank you!
>


using childFilter to restrict "child" docs by "grandchild" information

2022-06-28 Thread Noah Torp-Smith
To explain my question, first some domain background. We have a search engine 
where users can search for materials they can borrow at their local library.

Our top level documents are *works*. An example of a work could be "Harry 
Potter and the Philosopher's Stone". Examples of information stored at this 
level could be the title, the author of the work, and a genre.

At the second level, we have *manifestations* (we call these "pids"). It might 
be that a work exists as a physical book, an ebook, as an audiobook on CDs, an 
online audiobook, and there might be several editions of a book. Information 
stored at this level includes material type, year of publication, contributors 
(can be narrators, artists that have illustrated in a particular edition).

At the third level, we have *instances*. This includes information about the 
physical books: in which libraries they are located, which department, even 
down to locations within departments, and whether they are currently on loan 
or on the shelf.

Each document has a `doc_type` (which is either work, pid, or instance), works 
have a list of pids, and pids have a list of instances associated with them.

Our job is to formulate solr queries on behalf of users that belong to their 
local library, so that they can search for materials that are available to them. 
Given a query, we want to return works, along with the manifestations that 
match the query. A query can specify restrictions at all three levels; you 
might be interested in the (physical) book from last year written by Jussi 
Adler-Olsen, and it should be available at the local branch of the community 
library.

The way we find the appropriate works is pretty much in place. We use the 
`/query` endpoint of solr, and we formulate a json object where

* the `query` field contains the restrictions at the work level, something like 
`work.creator:'Jussi Adler-Olsen'`.
* To restrict to works where manifestations/pids apply to the restrictions at 
that level, we use a "parent which" construction in the `filter` part of the 
solr query. Something like `{!parent 
which='doc_type:work'}(pid.material_type:book AND  pid.year:(2021))`.
* To restrict to works where we can find a physical copy at the local library, 
we add another element to the `filter`. Something like `{!parent 
which='doc_type:work'}(instance.agency:94 AND 
instance.status:\"onShelf\")`, where 94 is the id of the local library.

That seems to work well. We get the works we are interested in. The question I 
have is, how do I restrict the manifestations we return? We use the field list 
and a `childFilter` to restrict manifestations, something like this: `"fields": 
"work.workid work.title work.creator, pids, id, pid.year, pid.material_type 
[child childFilter='pid.material_type:bog' limit=-1]"`. That part of the 
filtering also seems to work OK, but we get all the manifestations that match, 
from all libraries. We want to restrict to those manifestations, where the 
local library has a copy.

In other words, (I guess) we need to formulate a restriction in the `[child 
childFilter=...]` part of the field list, restricting the second-level 
documents on information stored at the third level. I am not sure how to do 
that. Can anyone help?

Thanks a lot in advance, and best regards.

/Noah


--

Noah Torp-Smith (n...@dbc.dk)


SOLR TRA Collection Question on recovery

2022-06-28 Thread Nikhilesh Jannu
Dear Users,

We are using a Solr TRA (time-routed alias) collection for capturing logs. We
are writing the logs to Solr using the REST API in batches of 100, and we are
using a soft commit interval of 15000 and a hard commit interval of 6.
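
For reference, this kind of commit setup in solrconfig.xml typically looks like
the snippet below (the values here are indicative only, not copied from our
config):

<autoCommit>
  <!-- hard commit: flushes to disk and rolls over the transaction log,
       without opening a new searcher -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit: makes documents visible to searches -->
  <maxTime>15000</maxTime>
</autoSoftCommit>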

Solr Version : 8.11.1.

When we restart the Solr node in the cloud, the current day's collection
goes into recovery mode and we see the following logs. It takes a long time
for the recovery process to complete. Not sure how to avoid it. Any
suggestions?

Sample of the logs below.

2022-06-29 06:03:27.500 INFO
 (recoveryExecutor-67-thread-1-processing-n:10.0.42.157:8983_solr
x:logs__TRA__2022-06-29_shard1_replica_n1 c:logs__TRA__2022-06-29 s:shard1
r:core_node2) [c:logs__TRA__2022-06-29 s:shard1 r:core_node2
x:logs__TRA__2022-06-29_shard1_replica_n1] o.a.s.u.UpdateLog log replay
status
tlog{file=/var/solr/data/logs__TRA__2022-06-29_shard1_replica_n1/data/tlog/tlog.243
refcount=3} active=false starting pos=0 current pos=1119002110 current
size=3287152529 % read=34.0

Regards,
Nikhilesh Jannu


Re: using childFilter to restrict "child" docs by "grandchild" information

2022-06-28 Thread Mikhail Khludnev
Hello, Noah.
Could it be something like
[child childFilter=$pidfilter limit=-1]&pidfilter=+pid.material_type:bog
+instance.agency:94 +instance.status:onShelf
?
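
i.e., keeping the fields list from your message and moving the whole condition
into a request parameter, roughly (untested sketch):

"fields": "work.workid work.title work.creator, pids, id, pid.year, pid.material_type [child childFilter=$pidfilter limit=-1]",
"params": {
  "pidfilter": "+pid.material_type:bog +instance.agency:94 +instance.status:onShelf"
}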

On Wed, Jun 29, 2022 at 8:57 AM Noah Torp-Smith  wrote:

> To explain my question, first some domain background. We have a search
> engine where users can search for materials they can borrow at their local
> library.
>
> Our top level documents are *works*. An example of a work could be "Harry
> Potter and the Philosopher's Stone". Examples of information stored at this
> level could be the title, the author of the work, and a genre.
>
> At the second level, we have *manifestations" (we call these "pids"). It
> might be that a work exists as a physical book, an ebook, as an audiobook
> on CDs, an online audiobook, and there might be several editions of a book.
> Information stored at this level includes material type, year of
> publication, contributors (can be narrators, artists that have illustrated
> in a particular edition).
>
> At the third level, we have *instances*. This includes information about
> the physical books, and in which libraries they are located, which
> department, and even down to locations within departments, if they are
> currently on loan, on the shelf.
>
> Each document has a `doc_type` (which is either work, pid, or instance),
> works have a list of pids, and pids have a list of instances associated
> with them.
>
> Our job is to formulate solr queries on behalf of users that belong to
> their local library, so that they can search for materials that is
> available to them. Given a query, we want to return works, along with the
> manifestations that match the query. A query can specify restrictions at
> all three levels; you might be interested in the (physical) book from last
> year written by Jussi Adler-Olsen, and it should be available at the local
> branch of the community library.
>
> The way we find the appropriate works is pretty much in place. We use the
> `/query` endpoint of solr, and we formulate a json object where
>
> * the `query` field contains the restrictions at the work level, something
> like `work.creator:'Jussi Adler-Olsen'`.
> * To restrict to works where manifestations/pids apply to the restrictions
> at that level, we use a "parent which" construction in the `filter` part of
> the solr query. Something like `{!parent
> which='doc_type:work'}(pid.material_type:book AND  pid.year:(2021))`.
> * To restrict to works where we can find a physical copy at the local
> library, we add another element to the `filter`. Something like `{!parent
> which='doc_type:work'}(instance.agency:94 AND
> instance.status:\"onShelf\")`, where 94 is the id of the local library.
>
> That seems to work well. We get the works we are interested in. The
> question I have is, how do I restrict the manifestations we return? We use
> the field list and a `childFilter` to restrict manifestations, something
> like this: `"fields": "work.workid work.title work.creator, pids, id,
> pid.year, pid.material_type [child childFilter='pid.material_type:bog'
> limit=-1]"`. That part of the filtering also seems to work OK, but we get
> all the manifestations that match, from all libraries. We want to restrict
> to those manifestations, where the local library has a copy.
>
> In other words, (I guess) we need to formulate a restriction in the
> `[child childFilter=...]` part of the field list, restricting the
> second-level documents on information stored at the third level. I am not
> sure how to do that. Can anyone help?
>
> Thanks a lot in advance, and best regards.
>
> /Noah
>
>
> --
>
> Noah Torp-Smith (n...@dbc.dk)
>


-- 
Sincerely yours
Mikhail Khludnev