Re: How to run Solr on two servers for redundancy

2022-03-14 Thread Sam Lee
On 2022/03/13 21:33:55 Dave wrote:
> You’re on the right idea, in my opinion. Three identical “slave”
> servers with one “master” ...

Thank you for the suggestion. I have a few questions:

* Are you suggesting to use standalone Solr instead of SolrCloud?
* Why does this setup require 4 servers (1 master + 3 slaves)?
  Note that I only have two servers (+ 1 low-spec server).

> ... with an nginx server on each one “slave” with the servers
> augmented.

* How does Nginx come into the picture? What is it used for?

> N1 has a 2—>s3 n2 has n3->s2 n3 has s3->s1 and all three finally fall
> to master.  You can get 5 9’s like this.
>
> Pro tip keep all action on one until it falls, and never use over 31
> GB heap size
>
> This is just a trial and error and complete success option and no need
> of complications with zk
> -Dave

Your idea appears to be a promising one. It's just that I don't
completely understand it yet.

Thank you.


Re: How to run Solr on two servers for redundancy

2022-03-14 Thread Sam Lee
On 2022/03/13 22:22:48 Shawn Heisey wrote:
> Zookeeper has fairly low system requirements compared to Solr, so using
> a third machine with lower specs to just run the tie-breaker ZK is a
> good way to go.
>
> Note that you'll only have full redundancy at the client level with that
> setup if your client is ZK-aware.  The only Solr client I know about
> that's ZK aware is the Java client, which is part of Solr itself as well
> as being a standalone client.

Thank you for bringing this potential issue to my attention.

By "standalone client", do you mean that I could use SolrJ on a separate
server where no Solr instance is running? i.e. use the client to
remotely connect to SolrCloud.

By the way, the most popular Python client, pysolr, seems to support
SolrCloud mode. [1]

> For full redundancy with HTTP-only clients you'll need a virtual IP
> address that can be shared among the servers, and have a load balancer
> listening on the virtual IP.  Setting that up is done with software
> other than Solr and ZK, so it's not on-topic for this mailing list. 
> Depending on the capabilities of the third server, it could be the
> primary for load-balancing as well as the third machine for ZK. 
> That's what I would do with limited resources.

I think I will stick to ZooKeeper-aware clients if I choose to go the
SolrCloud route. Using the SolrJ "CloudSolrClient" looks like a much
simpler solution than setting up all the infrastructure required for
achieving high availability with HTTP-only clients.


  [1]: https://github.com/django-haystack/pysolr
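For reference, the HTTP-only load-balancing setup Shawn describes usually boils down to something like this minimal nginx sketch (hostnames and ports are hypothetical; the shared virtual IP, e.g. via keepalived, is a separate piece). The `backup` directive also matches Dave's earlier advice to keep all traffic on one node until it falls over:

```nginx
upstream solr_backends {
    # All traffic goes to solr1; solr2 only takes over if solr1 fails.
    server solr1.example.com:8983;
    server solr2.example.com:8983 backup;
}

server {
    listen 8983;
    location / {
        proxy_pass http://solr_backends;
    }
}
```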


RE: Question regarding the MoreLikeThis features

2022-03-14 Thread Marco D'Ambra
Hi Tim,

thank you very much for the answer, full of useful advice.
I will try to put into practice what you told me to improve the output of the 
calls.
Regarding the specific problem on the existence of a specific parameter to 
restrict the corpus of documents that are analyzed for the return of similar 
contents, I must admit that I have not yet figured out how to proceed.

Thank you very much and have a nice day,

Marco

-Original Message-
From: Tim Casey  
Sent: giovedì 10 marzo 2022 19:51
To: users@solr.apache.org
Subject: Re: Question regarding the MoreLikeThis features

Marco,

Finding 'similar' documents will end up being weighted by document length.
I would recommend, at the point of indexing, also indexing an ordered token set 
of the first 256, 1024 up to around 5k tokens (depending on document lengths).  
What this does is allow a vector to vector normalized comparison.  You could 
then query for similar possible documents directly and build a normalized 
vector with respect to the query document.

Normalizing schemes in something like an inverted index will tend to weight the 
lower token count documents over higher token count documents.  So the above is 
an attempt to get at a normalized and comparable view between documents 
independent of size.  Next you end up normalizing by the inverse of a 
commonality.  That is, a more common token is weighted lower than a less 
common token.  (I would also discount tokens which have a raw frequency below 
5.)  At the point you have a normalized vector, you can use that to find 
similarities weighted by more meaningful tokens.
similarities weighted by more meaningful tokens.
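Tim's scheme can be sketched in a few lines of Python (a toy illustration, not Solr code; the inverse-commonality weight, the raw-frequency cutoff, and cosine similarity on unit-length vectors are the pieces named above):

```python
import math
from collections import Counter

def weights(corpus, min_freq=5):
    """Inverse-commonality weight per token; discard tokens rarer than min_freq."""
    raw = Counter(tok for doc in corpus for tok in doc)
    df = Counter(tok for doc in corpus for tok in set(doc))
    n = len(corpus)
    return {t: math.log(1 + n / df[t]) for t, c in raw.items() if c >= min_freq}

def vectorize(tokens, w):
    """Weighted token counts, normalized to unit length (size-independent)."""
    counts = Counter(t for t in tokens if t in w)
    vec = {t: c * w[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    """Dot product of two unit vectors: 1.0 = identical token profile."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())
```

Because every vector is unit length, a long document no longer outweighs a short one, and common tokens contribute less than distinctive ones.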

tim

On Thu, Mar 10, 2022 at 9:18 AM Marco D'Ambra  wrote:

> Hi all,
> This is my first time writing to this mailing list and I would like to 
> thank you in advance for your attention.
> I am writing because I am having problems using the "MoreLikeThis"
> features.
> I am working in a Solr cluster (version 8.11.1) consisting of multiple 
> nodes, each of which contains multiple shards.
>
> It is a quite big cluster and data is sharded using implicit routing 
> and documents are distributed by date on monthly shards.
>
> Here are the fields that I'm using:
>
>   *   UniqueReference: the unique reference of a document
>   *   DocumentDate: the date of a document (in the standard Solr format)
>   *   DataType: the data type of the document (let's say that can be A or
> B)
>   *   Content: the content of a document (a string)
> Here is what my managed schema looks like ...
> <field name="UniqueReference" ... required="true" />
>
> <field name="DocumentDate" ... required="true" />
>
> <field name="DataType" ... required="true" />
>
> <field name="Content" ... required="false" />
> ...
>
>
> The task that I want to perform is the following:
> Given the unique reference of a document of type A, I want to find the 
> documents of data type B and in a fixed time interval, that have the 
> most similar content.
> Here the first questions:
>
>   1.  Which is the best solr request to perform this task?
>   2.  Is there a parameter that allows me to restrict the corpus of 
> documents that are analyzed for the return of similar contents? It 
> should be noted that this corpus of documents may not contain the 
> initial document from which I am starting.
>
> Initially I thought about using the "mlt" endpoint, but since there 
> was no parameter in the documentation that would allow me to select 
> the shard on which to direct the query (I absolutely need it, 
> otherwise I risk putting a strain on my cluster), I opted to use the 
> "select" endpoint, with the "mlt" parameter set to true, and the 
> "shards" parameter.
> Those are the parameters that I am using:
>
>   *   q: "UniqueReference:doc_id"
>   *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]
> AND DataType:B) OR (UniqueReference:doc_id)"
>   *   mlt: true
>   *   mlt.fl: "Content"
>   *   shards: "shard_202201"
> I realize that the "fq" parameter is used in a bizarre way. In theory 
> it should be aimed at the documents of the main query (in my case the 
> source document). It is an attempt to solve problem (2) (which didn't 
> work, actually).
> Anyway, my doubts are not limited to this. What really surprises me is 
> the structure of the response that Solr returns to me.
> The content of response looks like this:
> {
> "response" : {
> "docs" : [],
> ...
> }
> "moreLikeThis" : ...
> }
> The weird stuff appears in the "moreLikeThis" part. Sometimes Solr is 
> returning me a list, other times a dictionary. Repeating the same call 
> several times, both possibilities occur, apparently 
> without a logical pattern, and I have not been able to understand why.
> And to be precise, in both cases the documents contained in the answer 
> are not necessarily of data type B, as requested by me with the "fq" 
> parameter.
> In the "dictionary" case, there is only one key, which is the 
> UniqueReference of the source document and the corresponding value are 
> similar documents.
> In the "list" case, the second element contains the req

Re: How to run Solr on two servers for redundancy

2022-03-14 Thread Eric Pugh
Let me propose a slightly different approach ;-)

Since you don’t need SolrCloud for scaling, but rather for 
redundancy, I like to set things up so that my indexer just sends the 
updates to TWO SEPARATE single-server Solr nodes.  This is great for a number 
of reasons:

1) Green/Blue deployments.   I can upgrade one Solr and leave the other alone.
2) I can A/B test by deploying new relevance configs to one Solr and then 
compare results to the other.
3) If I am in the cloud, I can put one Solr on AWS and the other on GCP 
or another cloud provider.
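The dual-write idea can be sketched as a small fan-out in the indexer (a toy illustration; the endpoint names and the injected `send` function are hypothetical stand-ins for whatever client posts the update to each Solr node):

```python
def fanout_update(doc, endpoints, send):
    """Send the same update to every independent Solr node.

    `send(endpoint, doc)` should raise on failure.  Returns the list of
    endpoints that failed so the indexer can retry or alert later; one
    healthy node is enough to keep serving queries in the meantime.
    """
    failed = []
    for ep in endpoints:
        try:
            send(ep, doc)
        except Exception:
            failed.append(ep)
    return failed
```

With two endpoints, losing one node degrades redundancy but never blocks indexing; the returned list is what you replay once the dead node comes back.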



Eric


> On Mar 14, 2022, at 1:28 AM, Sam Lee  wrote:
> 
> On 2022/03/13 22:22:48 Shawn Heisey wrote:
>> Zookeeper has fairly low system requirements compared to Solr, so using
>> a third machine with lower specs to just run the tie-breaker ZK is a
>> good way to go.
>> 
>> Note that you'll only have full redundancy at the client level with that
>> setup if your client is ZK-aware.  The only Solr client I know about
>> that's ZK aware is the Java client, which is part of Solr itself as well
>> as being a standalone client.
> 
> Thank you for bringing this potential issue to my attention.
> 
> By "standalone client", do you mean that I could use SolrJ on a separate
> server where no Solr instance is running? i.e. use the client to
> remotely connect to SolrCloud.
> 
> By the way, the most popular Python client, pysolr, seems to support
> SolrCloud mode. [1]
> 
>> For full redundancy with HTTP-only clients you'll need a virtual IP
>> address that can be shared among the servers, and have a load balancer
>> listening on the virtual IP.  Setting that up is done with software
>> other than Solr and ZK, so it's not on-topic for this mailing list. 
>> Depending on the capabilities of the third server, it could be the
>> primary for load-balancing as well as the third machine for ZK. 
>> That's what I would do with limited resources.
> 
> I think I will stick to ZooKeeper-aware clients if I choose to go the
> SolrCloud route. Using the SolrJ "CloudSolrClient" looks like a much
> simpler solution than setting up all the infrastructure required for
> achieving high availability with HTTP-only clients.
> 
> 
>  [1]: https://github.com/django-haystack/pysolr

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed




Re: Joining 2 collections in SolrCloud setup

2022-03-14 Thread Mikhail Khludnev
Hello,  Venkat

Join is not designed to return both sides, only the "to" side. You can
either copy collection2 docs into collection1 docs (which is ridiculous) or
apply
https://solr.apache.org/guide/6_6/transforming-result-documents.html#TransformingResultDocuments-_subquery_
. However, it applies to returned rows only and is really slow.
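A minimal sketch of what that [subquery] transformer could look like against the two collections from the question (untested; field and parameter names follow the question, and the transformer's usual cross-collection constraints — a single-shard "to" collection and a generic core name — still apply):

```text
POST /solr/collection1/query
{
  "query": "*:*",
  "fields": "*,extra:[subquery fromIndex=collection2]",
  "params": {
    "extra.q": "{!term f=id v=$row.id}",
    "extra.fl": "stock,price",
    "extra.rows": 1
  }
}
```

Each returned collection1 document would then carry a nested `extra` result holding the matching collection2 document, rather than a flat attribute merge.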

On Fri, Mar 11, 2022 at 6:35 PM Venkateswarlu Bommineni 
wrote:

> Hello All,
>
> I am sending one more email with more details.
>
> Solr is started in SlorCloud mode.
>
> I have 2 collections.
>
> collection1:
> {
> "id":"123",
> "name":"name",
> "description":"description"
> }
>
> collection2:
> {
> "id":"123",
> "stock":"inStock",
> "price":40
> }
>
> I am writing the joining as below and executing that query on
> *collection1*.
>
> {!join method="crossCollection" fromIndex="*collection2*" from="id" to="id"
> v="*:*"}
>
> I am getting the results but only with the data from *collection1*
>
> *Current Result:*
> {
> "id":"123",
> "name":"name",
> "description":"description"
> }
>
> Question: Is there any way we can get the data from both the collections ?
> *Expected Results:*
> {
> "id":"123",
> "name":"name",
> "description":"description",
> "stock":"inStock",
> "price":40
> }
>
> Thanks,
> Venkat.
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Solr Collections Join

2022-03-14 Thread Mikhail Khludnev
Hi, Venkat.
No way. Sorry.

On Fri, Mar 11, 2022 at 4:59 PM Venkateswarlu Bommineni 
wrote:

> Yes, it is solrcloud setup.
>
> On Fri, Mar 11, 2022 at 12:57 AM Srijan  wrote:
>
> > Is this a SolrCloud setup?
> >
> > On Thu, Mar 10, 2022, 22:25 Venkateswarlu Bommineni 
> > wrote:
> >
> > > Hello All,
> > >
> > > I have a requirement to join 2 collections and get fields from both the
> > > collections.
> > >
> > > I have got the join query as below, when i run below join query I am
> > > getting the fields of Collection1 only.
> > >
> > > is There any way I can get the fields from collection2 as well ?
> > >
> > > Running below query on Collection1.
> > > {!join method="crossCollection" fromIndex="collection2" from="id"
> to="id"
> > > v="*:*"}
> > >
> > >
> > > Any help here is much appreciated !!
> > >
> > > Thanks,
> > > Venkat.
> > >
> >
>


-- 
Sincerely yours
Mikhail Khludnev


Re: How to run Solr on two servers for redundancy

2022-03-14 Thread Shawn Heisey

On 3/13/22 23:28, Sam Lee wrote:

By "standalone client", do you mean that I could use SolrJ on a separate
server where no Solr instance is running? i.e. use the client to
remotely connect to SolrCloud.


SolrJ is an inherent part of Solr.  But it is also a complete library by 
itself, which Java programmers can use to add Solr support to their 
programs.



By the way, the most popular Python client, pysolr, seems to support
SolrCloud mode. [1]


Be aware that all clients other than SolrJ are third-party -- not 
produced or maintained by the Solr project.  The pysolr client may be 
the most popular Python client ... I couldn't say because it was made by 
somebody else, not this project.  Nice that they support zookeeper 
connections ... I wasn't aware of that.


Thanks,
Shawn



Re: Sorting nested document

2022-03-14 Thread Mikhail Khludnev
Hello,
May
https://solr.apache.org/guide/8_0/function-queries.html#childfield-field-function
work in your case?

On Tue, Mar 8, 2022 at 3:11 PM Pranaya Behera 
wrote:

> I have the documents as parent-child relationship:
> https://gist.github.com/shadow-fox/fb525a7efe66622230e61d6253b6cfa9
>
> How to sort the parent (type_s:product)s based on the grandchildren
> (type_s:vendor) field value ?
>
> parent = type_s:product; children = type_s:sku; grandchildren =
> type_s:storage and type_s:vendor. Example parent document with all the
> children:
>
> product id=10 BRAND_s=Nike
>   sku id=11 COLOR_s=Red XL
>     storage id=13 STATE_s=CA QTY_i=10
>     storage id=14 STATE_s=NY QTY_i=0
>     vendor id=15 NAME_s=Bob PRICE_i=20
>     vendor id=16 NAME_s=Alice PRICE_i=22
>   sku id=12 COLOR_s=Blue XL
>     storage id=17 STATE_s=CA QTY_i=0
>     storage id=18 STATE_s=NY QTY_i=100
>     vendor id=19 NAME_s=Bob PRICE_i=25
>     vendor id=20 NAME_s=Alice PRICE_i=28
>
> The query:
>
> {!parent which=type_s:product} +COLOR_s:Blue +{!parent which=type_s:sku
> v='+QTY_i:[10 TO *] +STATE_s:CA'}
>
> The result:
>
> |[ { "id": "21", "type_s": "product", "BRAND_s": "Nike", "_version_":
> 1726713699507372000 }, { "id": "32", "type_s": "product", "BRAND_s":
> "Puma", "_version_": 1726713699562946600 } ] |
>
> I want the results sorted, descending, by the PRICE_i field of the
> matched type_s:sku document's child documents with type_s:vendor AND
> NAME_s:Alice.
>
> I have tried:
>
> {!parent which=type_s:product score=max v='+type_s:sku
> +{!func}PRICE_i'} desc
>
> {!parent which=type_s:product score=max v='{!parent which=type_s:sku
> v='type_s:vendor AND NAME_s:Alice'}+{!func}PRICE_i' asc
>
> However it gives "error in sort".
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Question regarding the MoreLikeThis features

2022-03-14 Thread Tim Casey
Hi,

> Regarding the specific problem on the existence of a specific parameter
to restrict the corpus of documents that are analyzed for the return of
similar contents

If you can get this to be a query, and one which might be ordered in a
useful way, then you are very likely to see what you need in the top 500
results.  This would be enough for most usage.
The 'likely' would need to be computed and measured as you produce results.


In any event, to restrict the corpus you build a query bit set and use that
as a filter.  This is fairly easy to code so you can see the results and
give yourself a way to experiment on what you would do, before deciding
how/what to do any one particular way.
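The "query bit set used as a filter" idea can be sketched like this (pure Python, not Solr internals; the candidate scores are a hypothetical stand-in for whatever similarity ranking you compute):

```python
def build_filter_bitset(corpus, predicate):
    """One bit per document id: set if the doc is in the restricted corpus."""
    bits = 0
    for doc_id, doc in enumerate(corpus):
        if predicate(doc):
            bits |= 1 << doc_id
    return bits

def similar_within(scored_candidates, bits):
    """Keep (doc_id, score) candidates whose bit is set, best first."""
    hits = [(d, s) for d, s in scored_candidates if bits >> d & 1]
    return sorted(hits, key=lambda pair: -pair[1])
```

The bitset is built once per restriction (e.g. "DataType:B in this date range") and then intersected cheaply with any candidate list, which is what makes it easy to experiment with different restrictions.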

Or, you directly query and allow Solr to do the needed computations within
each shard.  At this point, I would defer to people who are more versed in
Solr specifics for this kind of computation.

On Mon, Mar 14, 2022 at 12:56 AM Marco D'Ambra  wrote:

> Hi Tim,
>
> thank you very much for the answer, full of useful advice.
> I will try to put into practice what you told me to improve the output of
> the calls.
> Regarding the specific problem on the existence of a specific parameter to
> restrict the corpus of documents that are analyzed for the return of
> similar contents, I must admit that I have not yet figured out how to
> proceed.
>
> Thank you very much and have a nice day,
>
> Marco
>
> -Original Message-
> From: Tim Casey 
> Sent: giovedì 10 marzo 2022 19:51
> To: users@solr.apache.org
> Subject: Re: Question regarding the MoreLikeThis features
>
> Marco,
>
> Finding 'similar' documents will end up being weighted by document length.
> I would recommend, at the point of indexing, also indexing an ordered
> token set of the first 256, 1024 up to around 5k tokens (depending on
> document lengths).  What this does is allow a vector to vector normalized
> comparison.  You could then query for similar possible documents directly
> and build a normalized vector with respect to the query document.
>
> Normalizing schemes in something like an inverted index will tend to
> weight the lower token count documents over higher token count documents.
> So the above is an attempt to get at a normalized and comparable view
> between documents independent of size.  Next you end up normalizing by the
> inverse of a commonality.  That is, a more common token is weighted lower
> than a less common token.  (I would also discount tokens which have a raw
> frequency below 5.)  At the point you have a normalized vector, you can use
> that to find similarities weighted by more meaningful tokens.
>
> tim
>
> On Thu, Mar 10, 2022 at 9:18 AM Marco D'Ambra  wrote:
>
> > Hi all,
> > This is my first time writing to this mailing list and I would like to
> > thank you in advance for your attention.
> > I am writing because I am having problems using the "MoreLikeThis"
> > features.
> > I am working in a Solr cluster (version 8.11.1) consisting of multiple
> > nodes, each of which contains multiple shards.
> >
> > It is a quite big cluster and data is sharded using implicit routing
> > and documents are distributed by date on monthly shards.
> >
> > Here are the fields that I'm using:
> >
> >   *   UniqueReference: the unique reference of a document
> >   *   DocumentDate: the date of a document (in the standard Solr format)
> >   *   DataType: the data type of the document (let's say that can be A or
> > B)
> >   *   Content: the content of a document (a string)
> > Here is what my managed schema looks like ...
> > <field name="UniqueReference" ... required="true" />
> >
> > <field name="DocumentDate" ... required="true" />
> >
> > <field name="DataType" ... required="true" />
> >
> > <field name="Content" ... required="false" />
> > ...
> >
> >
> > The task that I want to perform is the following:
> > Given the unique reference of a document of type A, I want to find the
> > documents of data type B and in a fixed time interval, that have the
> > most similar content.
> > Here the first questions:
> >
> >   1.  Which is the best solr request to perform this task?
> >   2.  Is there a parameter that allows me to restrict the corpus of
> > documents that are analyzed for the return of similar contents? It
> > should be noted that this corpus of documents may not contain the
> > initial document from which I am starting.
> >
> > Initially I thought about using the "mlt" endpoint, but since there
> > was no parameter in the documentation that would allow me to select
> > the shard on which to direct the query (I absolutely need it,
> > otherwise I risk putting a strain on my cluster), I opted to use the
> > "select" endpoint, with the "mlt" parameter set to true, and the
> > "shards" parameter.
> > Those are the parameters that I am using:
> >
> >   *   q: "UniqueReference:doc_id"
> >   *   fq: "(DocumentDate:[2022-01-22T00:00:00Z TO 2022-01-26T00:00:00Z]
> > AND DataType:B) OR (UniqueReference:doc_id)"
> >   *   mlt: true
> >   *   mlt.fl: "Content"
> >   *   shards: "shard_202201"
> > I realize that the "fq" parameter is used in a bizarre way. In

Re: Solr Collections Join

2022-03-14 Thread Norbert Kutasi
hi Venkat,

We are on Solr 8.5.2, and it is actually possible to join two or more
collections and get fields from them, similar to what a LEFT OUTER JOIN does
in an RDBMS. It will create nested documents rather than attribute merges,
which might not be what you are looking for.

If you have already tried the *subquery* document transformer:
https://solr.apache.org/guide/8_5/transforming-result-documents.html#subquery
you can skip the rest of my message :)

In your question you first referred to {!join}, which is actually a query
parser allowing you to filter Collection A based on documents in
Collection B.
It's quite well explained here:
https://solr.apache.org/guide/8_5/other-parsers.html#join-query-parser

We rely on this as well as part of our Access Control List implementation.

Here is a JSON POST sample against another collection, e.g. Clients, where
we enforce that users listed in a collection called ACL only get to
retrieve Clients of their own teams.
/Clients/query
{
  "query" : "*:*",
  "filter" : ["{!join from=teamId fromIndex=ACL to=teamId}userId:ABC123"],
  "fields" : "*",
  "sort"   : "id desc",
  "limit"  : "10"
}

But this is not what you have really asked about.

When we came across the requirement of cross collection data retrieval, we
tried to implement it on SOLR side rather than in our custom API layer.

Let's take 3 Collections: Clients, Managers, Products and how we can
use *subquery
*transformer.

Constraints:

   - A core of the collection on the "to" side(s), i.e. B in A->B, has to
   exist on the node; that collection must have a single shard with a
   replica on every Solr node where the "from" collection is queried. At
   least this is what we experienced on 8.5.
   - The name referring to Collection B in the subquery actually
   corresponds to the core name, not the collection name. Core names vary
   replica by replica in SolrCloud, e.g. Managers_shard1_replica_n1,
   Managers_shard1_replica_n2, etc. You have to assign them a generic name.


Here we want to retrieve Clients and their Managers and Product Details
that Clients purchased.

Clients
Managers
Products

/Clients/query
{
  "query": "*:*",
  "filter": ["region:NA"],
  "fields": "*,managerDetails:[subquery fromIndex=Managers],productDetails:[subquery fromIndex=Products]",
  "limit": 100,
  "offset": 0,
  "sort": "id asc",
  "params": {
    "managerDetails.fl": "*",
    "managerDetails.q": "*",
    "managerDetails.fq": "{!term f=managerList v=$row.managerId}",
    "managerDetails.rows": 10,
    "productDetails.fl": "*",
    "productDetails.q": "*",
    "productDetails.fq": "{!term f=productPurchasedList v=$row.productId}",
    "productDetails.rows": 10
  }
}

In order to make this work you need to rename the Managers and Products
cores to generic names.

https://solr.apache.org/guide/8_5/coreadmin-api.html#coreadmin-rename
admin/cores?action=RENAME&core=Managers_shard1_replica_n1&other=Managers
admin/cores?action=RENAME&core=Products_shard1_replica_n1&other=Products

The Managers and Products cores have to exist on the node that you hit with
/Clients/query

It's also possible to create deeply nested documents, where the 3rd level
may capture Products that the Managers are (in general) responsible for.

Clients
Managers
Products

/Clients/query
{
  "query": "*:*",
  "filter": ["region:NA"],
  "fields": "*,managerDetails:[subquery fromIndex=Managers]",
  "limit": 100,
  "offset": 0,
  "sort": "id asc",
  "params": {
    "managerDetails.fl": "*,productDetails:[subquery fromIndex=Products]",
    "managerDetails.q": "*",
    "managerDetails.fq": "{!term f=managerList v=$row.managerId}",
    "managerDetails.rows": 10,
    "managerDetails.productDetails.fl": "*",
    "managerDetails.productDetails.q": "*",
    "managerDetails.productDetails.fq": "{!term f=productIdExpertise v=$row.productId}",
    "managerDetails.productDetails.rows": 10
  }
}

Using "fl" you can narrow down the attributes wanted to bring in from other
collections.

Subquery works for you even within a single collection to generate
arbitrary document hierarchies.

Using the example above, when a single collection hosts Customers, Products
and Managers type documents, you can omit the fromIndex part.

The reason we decided to employ *subquery* was to provide the utmost
flexibility, including hosting these Collections as separate endpoints and
avoiding modelling constraints, like a parent/child hierarchy, early on.
We started out as a single collection of 60 million documents. After
noticing some scalability issues when returning a high number of objects
(like 2,000 Clients), we learnt that breaking them into discrete ones would
provide better performance.

Regards,
Norbert

On Mon, 14 Mar 2022 at 14:06, Mikhail Khludnev  wrote:

> Hi, Venkat.
> No way. Sorry.
>
> On Fri, Mar 11, 2022 at 4:59 PM Venkateswarlu Bommineni 
> wrote:
>
> > Yes, it is solrcloud setup.
> >
> > On Fri, Mar 11, 2022 at 12:57 AM Srijan  wrote:
> >
> > > Is this a SolrCloud setup?
> > >
> > > On Thu, Mar 10, 2022, 22:25 Venkateswarlu Bommineni 
> > > wrote:
> > >
> > > > Hello All,
> > > >
> > > > I have a requirement to join 2 co