Help with stopwords filter

2022-04-04 Thread Arif Shaon
Hello list,

I am trying the following two queries, which should return the same results.
However, the first contains the stop word "is" and as a result it's returning
0 results. So it seems to me that the stopword filter is not working as
expected. Could someone please look at the debug reports of the two queries
and advise on what I am doing wrong? Any help would be appreciated.

Query 1:

"rawquerystring":"thim day is gone",
  "querystring":"thim day is gone",
  "parsedquery":"(+(+DisjunctionMaxQuery(((i18n_content_ar:thim)^3.0 |
(i18n_content_en:thim)^3.0 | (i18n_label_ar:thim)^5.0 |
(i18n_label_en:thim)^5.0 | (shelf_mark:thim)^80.0 |
(content:thim)^0.04)~0.01) +DisjunctionMaxQuery(((i18n_content_ar:day)^3.0
| (i18n_content_en:day)^3.0 | (i18n_label_ar:day)^5.0 |
(i18n_label_en:day)^5.0 | (shelf_mark:day)^80.0 | (content:day)^0.04)~0.01)
+DisjunctionMaxQuery(((i18n_content_ar:is)^3.0 | (i18n_label_ar:is)^5.0 |
(shelf_mark:is)^80.0)~0.01)
+DisjunctionMaxQuery(((i18n_content_ar:gone)^3.0 |
(i18n_content_en:gone)^3.0 | (i18n_label_ar:gone)^5.0 |
(i18n_label_en:gone)^5.0 | (shelf_mark:gone)^80.0 |
(content:gone)^0.04)~0.01)) (+DisjunctionMaxQuery(((content:\"thim day ?
gone\"~10)^2.0)~0.01)) (+record_type:logical^15.0)
(+record_type:essay^17.0))/no_coord",
  "parsedquery_toString":"+(+((i18n_content_ar:thim)^3.0 |
(i18n_content_en:thim)^3.0 | (i18n_label_ar:thim)^5.0 |
(i18n_label_en:thim)^5.0 | (shelf_mark:thim)^80.0 |
(content:thim)^0.04)~0.01 +((i18n_content_ar:day)^3.0 |
(i18n_content_en:day)^3.0 | (i18n_label_ar:day)^5.0 |
(i18n_label_en:day)^5.0 | (shelf_mark:day)^80.0 | (content:day)^0.04)~0.01
+((i18n_content_ar:is)^3.0 | (i18n_label_ar:is)^5.0 |
(shelf_mark:is)^80.0)~0.01 +((i18n_content_ar:gone)^3.0 |
(i18n_content_en:gone)^3.0 | (i18n_label_ar:gone)^5.0 |
(i18n_label_en:gone)^5.0 | (shelf_mark:gone)^80.0 |
(content:gone)^0.04)~0.01) (+((content:\"thim day ? gone\"~10)^2.0)~0.01)
(+(record_type:logical)^15.0) (+(record_type:essay)^17.0)",
  "facet-debug":{
 "elapse":0,


Query 2:

"rawquerystring":"thim day gone",
  "querystring":"thim day gone",
  "parsedquery":"(+(+DisjunctionMaxQuery(((i18n_content_ar:thim)^3.0 |
(i18n_content_en:thim)^3.0 | (i18n_label_ar:thim)^5.0 |
(i18n_label_en:thim)^5.0 | (shelf_mark:thim)^80.0 |
(content:thim)^0.04)~0.01) +DisjunctionMaxQuery(((i18n_content_ar:day)^3.0
| (i18n_content_en:day)^3.0 | (i18n_label_ar:day)^5.0 |
(i18n_label_en:day)^5.0 | (shelf_mark:day)^80.0 | (content:day)^0.04)~0.01)
+DisjunctionMaxQuery(((i18n_content_ar:gone)^3.0 |
(i18n_content_en:gone)^3.0 | (i18n_label_ar:gone)^5.0 |
(i18n_label_en:gone)^5.0 | (shelf_mark:gone)^80.0 |
(content:gone)^0.04)~0.01)) (+DisjunctionMaxQuery(((content:\"thim day
gone\"~10)^2.0)~0.01)) (+record_type:logical^15.0)
(+record_type:essay^17.0))/no_coord",
  "parsedquery_toString":"+(+((i18n_content_ar:thim)^3.0 |
(i18n_content_en:thim)^3.0 | (i18n_label_ar:thim)^5.0 |
(i18n_label_en:thim)^5.0 | (shelf_mark:thim)^80.0 |
(content:thim)^0.04)~0.01 +((i18n_content_ar:day)^3.0 |
(i18n_content_en:day)^3.0 | (i18n_label_ar:day)^5.0 |
(i18n_label_en:day)^5.0 | (shelf_mark:day)^80.0 | (content:day)^0.04)~0.01
+((i18n_content_ar:gone)^3.0 | (i18n_content_en:gone)^3.0 |
(i18n_label_ar:gone)^5.0 | (i18n_label_en:gone)^5.0 |
(shelf_mark:gone)^80.0 | (content:gone)^0.04)~0.01) (+((content:\"thim day
gone\"~10)^2.0)~0.01) (+(record_type:logical)^15.0)
(+(record_type:essay)^17.0)",
  "facet-debug":{
 "elapse":1,

Many thanks in advance.

Best
Arif


Solr as a dedicated data store?

2022-04-04 Thread Srijan
Hi All,

I am working on designing a Solr based enterprise search solution. One
requirement I have is to track crawled data from various different data
sources with metadata like crawled date, indexing status and so on. I am
looking into using Solr itself as my data store and not adding a separate
database to my stack. Has anyone used Solr as a dedicated data store? How
did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
DB - can someone here share some insight into how Fusion is using this
'DB'? My store will need to track millions of objects and be able to handle
parallel adds/updates. Do you think Solr is a good tool for this or am I
better off depending on a database service?

Thanks a bunch.


Re: Solr as a dedicated data store?

2022-04-04 Thread Dave
NO. I know it's tempting, but Solr is a search engine, not a database. You should
at any point be able to destroy the search index and rebuild it from the
database. Almost any RDBMS can do what you want, or you can go the NoSQL MongoDB
route, which is becoming popular, but never use a search engine this way. You
could use it as an intermediate data store for queries and speed, but that's not
its purpose.



Re: Solr as a dedicated data store?

2022-04-04 Thread matthew sporleder
Agreed. We get messages on this list pretty regularly about data locked in old
versions of Solr with no good way out.

Even if reindexing takes a week on a big cluster, is hard to do, and means
un-glaciering stuff from S3, etc., make sure you can do it!



Re: Solr as a dedicated data store?

2022-04-04 Thread Dominique Bejean
Hi,

A best practice for performance and resource usage is to store, index, and/or
enable docValues for only the data required by your search features.
However, in order to implement or modify new or existing features in an
index, you will need to reindex all the data in that index.

I propose 2 solutions:

   - The first one is to store the full original JSON data in the _src_
   field of the index.

   
https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default


   - The second, and in my opinion the best, solution is to store the JSON
   data in an intermediate, feature-neutral data store, such as a simple file
   system or, better, a MongoDB database. This will allow you to use your
   data in several indexes (one index for search, one index for suggesters,
   ...) without duplicating data in the _src_ field of each index. A UUID in
   each index will allow you to fetch the full JSON object from MongoDB.
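A sketch of the first option, per the linked guide: the custom JSON update
handler's srcField parameter stores the raw JSON alongside the mapped fields.
The collection name and document below are made up for illustration; note the
guide says srcField can only be used with the default split=/ and that the
field must exist in the schema:

```shell
# Hypothetical collection "mycol" on a local Solr instance.
# srcField=_src_ keeps the entire raw JSON document in the _src_ field
# in addition to the individually mapped fields.
curl 'http://localhost:8983/solr/mycol/update/json/docs?srcField=_src_&commit=true' \
  -H 'Content-Type: application/json' \
  -d '{"id": "doc1", "title": "example title", "body": "example content"}'
```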


Obviously, a key point is the backup strategy for your data store, depending
on the solution you choose: either the Solr indexes, the file system, or the
MongoDB database.

Dominique

Re: Help with stopwords filter

2022-04-04 Thread Dominique Bejean
Hi,

Are you sure "is" is defined as a stopword at both index and query time in
your analyzers?
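For reference, a minimal sketch of what that looks like in a schema.xml /
managed-schema field type (the field type name and stopword file are only
examples, not taken from the poster's schema) — the StopFilterFactory has to
appear in both the index-time and the query-time analyzer chain:

```xml
<fieldType name="text_en_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drops "is", "the", ... listed in stopwords.txt at index time -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- the same stopword list must also be applied at query time -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the filter is missing from the query-time chain (or from one of the other
fields the query expands to), the stopword survives as a required clause.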

Dominique



Is there anything other than NSSM or AlwaysUp that will create a windows service or similar for Solr?

2022-04-04 Thread Heller, George A III CTR (USA)
Does anyone know if there is anything other than NSSM or AlwaysUp that will
create a Windows service or similar for Solr, so that it will automatically
start when the server is rebooted?

 

NSSM failed our vulnerability scan, and AlwaysUp may work, but I'm not sure
they can do the paperwork to approve the license cost for AlwaysUp. I would
like to see if there are more alternatives out there.




Re: Is there anything other than NSSM or AlwaysUp that will create a windows service or similar for Solr?

2022-04-04 Thread Eric Pugh
I think any service tool that your org will accept should work; it's just a
basic Java app. Nothing special about how Solr starts would require a
specific Windows service app.
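For what it's worth, one more open-source option in the same space is WinSW,
which wraps an arbitrary executable as a Windows service driven by an XML
config file. A minimal sketch — the install path and port below are
assumptions for illustration, not from this thread:

```xml
<!-- WinSW service definition, e.g. saved as solr-service.xml
     next to a renamed WinSW executable, then installed with
     "solr-service.exe install" -->
<service>
  <id>solr</id>
  <name>Apache Solr</name>
  <description>Runs Apache Solr as a Windows service.</description>
  <executable>C:\solr\bin\solr.cmd</executable>
  <!-- -f keeps Solr in the foreground so the wrapper can manage it -->
  <arguments>start -f -p 8983</arguments>
  <onfailure action="restart"/>
</service>
```

Whether WinSW passes your vulnerability scan is of course another question.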




___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com | My Free/Busy
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed

This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless of whether
attachments are marked as such.



Re: Is there anything other than NSSM or AlwaysUp that will create a windows service or similar for Solr?

2022-04-04 Thread Walter Underwood
It has been ages since I ran a Windows server, but we just used the built-in 
service manager. That worked fine for a couple of products I worked on, a 
search engine and a database.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Is there anything other than NSSM or AlwaysUp that will create a windows service or similar for Solr?

2022-04-04 Thread dmitri maziuk

On 2022-04-04 1:41 PM, Eric Pugh wrote:
> I think any service tool that your org will accept should work, it’s just a
> basic Java app….  Nothing special about how Solr starts that would require a
> specific windows service app.


It's Windows, not Solr, that requires a specific app.

Everyone I know is using NSSM; if the OP's IT security scan is failing it,
there's a good chance the problem is their IT security. There is no
software fix for that.


Dima



Solr as a service: how to monitor users, query metrics?

2022-04-04 Thread Victoria Stuart (VictoriasJourney.com)

Hello: I am a self-taught programmer / developer, so my apologies if the 
following questions seem amateurish.

I have a Solr-based document search / knowledge engine, currently running on my
home local server / web server (Arch Linux).

I'd like to offer this as a pay-as-you-go / metered SaaS.  I'll probably use 
Stripe.com to manage / bill clients.

My questions relate to advice / recommendations for the following items 
(especially at this time items 3, 4 below).

1. Deployment of Apache Solr on a VPS.

2. Securing that product. Other unanticipated issues (CORS? ...).

3. Management of clients (Solr users): account setup, management, ...

4. With the intention of providing a metered SaaS, how can I monitor per-user 
Solr queries (API calls) / metrics?

I am fairly comfortable with Solr at home (solrconfig.xml, schema.xml, ...),
but I have never deployed a software service.

As mentioned, my home OS is Arch Linux. I regard myself as a Linux superuser,
a moderately capable Python programmer, a moderately capable webdev
(HTML/JavaScript/CSS), and a LAMP novice.



Re: Solr as a dedicated data store?

2022-04-04 Thread Shawn Heisey

On 4/4/2022 5:52 AM, Srijan wrote:
> I am working on designing a Solr based enterprise search solution. One
> requirement I have is to track crawled data from various different data
> sources with metadata like crawled date, indexing status and so on. I am
> looking into using Solr itself as my data store and not adding a separate
> database to my stack. Has anyone used Solr as a dedicated data store? How
> did it compare to an RDBMS?


As you've been told, Solr is NOT a database.  It is most definitely not 
equivalent in any way to an RDBMS.  If you want the kinds of things an 
RDBMS is good for, you should use an RDBMS, not Solr.


Handling ever-changing search requirements in Solr is typically going to 
require the kinds of schema changes that need a full reindex.  So you 
probably wouldn't be able to use the same Solr index for your data 
storage as you do for searching anyway.


If you're going to need to set up two Solr installs to handle your 
needs, you should probably NOT use Solr for the storage role.  Use 
something that has been tested and hardened against data loss. Solr does 
do its best to never lose data, but guaranteed data durability is not 
one of its design goals.  The changes that would be required to make 
that guarantee would most likely have an extremely adverse effect on 
search performance.


Solr's core functionality has always been search.  Search is what it's 
good at, and that's what will be optimized in future versions ... not 
any kind of database functionality.


Thanks,
Shawn



Re: Solr as a dedicated data store?

2022-04-04 Thread Tim Casey
Srijan,

Comments off the top of my head, so buyer beware.

Almost always you want to be able to reindex your data from a 'source'.
That makes indexes a poor fit as a data store or a source of truth. The
reasons for this vary: indexes age out data because there is frequently a
weight toward more recent items, indexes need to be rebuilt for new info or
to fix issues introduced during indexing/processing, and the list goes on.

I built an index-backed POJO store in Lucene a *long* time ago. It is
doable to hydrate a stored object into a language-level object, such as a
Java object instance. It is fairly straightforward to map from a
'common' type of data model into an index as a data model. But the query
expectations and so on are not quite the same. It is not that far off, but
again, this is not the primary focus of an inverted index. The primary
focus is to take unstructured language data and return results in a
hopefully well-ordered list.

So, the first thing you might do is treat the different sources of data as
different clusters with a different topology. You might stripe the data
less and run more nodes than you otherwise would, because you will do less
indexing with it than you would with a normal index. Once you decide to
separate out the data, you have to contend with two different indexes
holding references to the same 'documents', with some id to tie them
together, and you lose the ability to do any form of in-index join using
document ids. If you keep all the data in the same index, you might end up
in a situation where the common answer is "reindex" and you would not know
what to do about the metadata.

I strongly suspect what you want is to maintain the metadata within the
index and use it as you would alongside the documents. As you spider, keep
the info about the document with the document contents. I cannot think of a
reason to keep all of the data in a kinda weird separate space. If you want
to be more sophisticated, you can build an ETL pipeline that takes documents
and forms indexable units, and store the indexable units for reindexing.
This is usually pretty quick and separates out the crawling, ETL, and
indexing/query pieces, for all that means. It is more complicated, but a
bit more standard in how people think about it.

tim


