Re: Lucene File Fingerprint & ML

2021-04-30 Thread Alessandro Benedetti
In addition:
*1) Machine Learning*
It is possible to integrate Learning To Rank models for Solr reranking:
https://solr.apache.org/guide/8_8/learning-to-rank.html
https://www.techatbloomberg.com/blog/bloomberg-integrated-learning-rank-apache-solr/

https://sease.io/tag/learning-to-rank


It is possible to use Lucene/Solr as classifiers:
https://sease.io/2015/07/lucene-document-classification.html


*2) Desktop Application*
Lucene is a Java library.
Apache Solr is a REST server now (originally it was a web application).

Cheers
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Fri, 30 Apr 2021 at 01:34, Furkan KAMACI wrote:

> Hi Ecem,
>
> 1.a) What do you mean by File Fingerprint? If that is the hash of a file, you
> should check here:
>
> https://solr.apache.org/guide/8_8/de-duplication.html
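For a quick sense of what that guide describes: de-duplication computes a signature (fingerprint) per document in an update processor chain in solrconfig.xml. A sketch, with illustrative field names (`name,content` and the chain name are assumptions, not from the original mail), following the SignatureUpdateProcessorFactory approach in that guide:

```xml
<!-- Hypothetical solrconfig.xml fragment; field names are illustrative. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that receives the computed fingerprint -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <!-- Fields the signature is computed over -->
    <str name="fields">name,content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```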
>
> 1.b) Solr has limited Machine Learning capabilities. You
> can check the list here:
>
> https://solr.apache.org/guide/8_8/machine-learning.html
>
> 2) Solr/Lucene are libraries, not desktop applications. You can
> use them as dependencies in your desktop applications.
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Tue, Apr 20, 2021 at 4:42 PM ECEM YAMAN <16008117...@ogr.bozok.edu.tr>
> wrote:
>
> > Hi,
> > I just started working on Solr and Lucene. I have some questions on my
> > mind, can you help me?
> >
> > - How to apply File Fingerprint and Machine learning applications to
> > lucene?
> > - What are the dependencies in the Solr and Lucene desktop application?
> >
> > Also I would appreciate if you recommend a book.
> >
> > Best regards,
> > Ecem
> >
>


What is the most effective way to boost according to a distribution?

2021-04-30 Thread Taisuke Miyazaki
What is the most efficient way to boost a field with possible values
ranging from 0 to 5000, scoring it according to its distribution?

Hi,

For example, suppose the range of values has the following distribution:
25th percentile: 100
50th percentile: 1000
75th percentile: 2000
100th percentile (max): 5000

Then, I want to sort them by score as follows:
0 ~ 100: 1 point
100 ~ 1000: 2 points
1000 ~ 2000: 3 points
2000 ~ 5000: 4 points

In this example, I've divided it into 4 parts, but in reality, I want to
divide it into 100 parts and score them on a 100-point scoring scale.

The current idea is to force the score using the bf and bq parameters of
eDisMax.

Also, although I haven't tried it yet, I think it would be faster to
implement and use something like a staircase (step) function, as it would
reduce the number of function calls and make it easier to cache.
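As an illustration (not from the original mail): the staircase idea can be prototyped offline before wiring it into bf/bq. The breakpoints below are the 4-bucket example values from above; in practice there would be 99 of them for a 100-point scale.

```python
import bisect

# Percentile breakpoints computed offline from the field's distribution;
# these are the 25th/50th/75th percentile values from the example above.
breakpoints = [100, 1000, 2000]

def step_score(value):
    # Staircase function: count how many breakpoints lie strictly below
    # the value, plus one, giving a bucket score of 1..len(breakpoints)+1.
    return bisect.bisect_left(breakpoints, value) + 1

print(step_score(50))    # 1 (0 ~ 100 bucket)
print(step_score(1500))  # 3 (1000 ~ 2000 bucket)
print(step_score(5000))  # 4 (2000 ~ 5000 bucket)
```

Since the breakpoints are sorted, each lookup is a single binary search, which keeps the per-document cost low even with 100 buckets.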


I am trying to find out if it is possible to perform the above calculations
on multiple fields and eventually add them together to achieve different
searches for different individuals.

Thanks.

Translated with www.DeepL.com/Translator (free version)


Re: Solr 8.6 Indexing Issue

2021-04-30 Thread Anuj Bhargava
Just added the following line and it started indexing again.

*useSSL="false"*
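For context (an editorial sketch, not from the original mail): useSSL is a MySQL Connector/J connection property, so in a DIH setup it can be passed on the JDBC URL of the data source. All names below are placeholders:

```xml
<!-- Hypothetical DIH data-config.xml fragment; driver/url/credentials
     are illustrative. Newer MySQL JDBC drivers may default to requiring
     SSL, which can break previously working imports. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?useSSL=false"
            user="solr"
            password="***"/>
```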




On Thu, 29 Apr 2021 at 21:40, Charlie Hull wrote:

> I meant can you run it on the database directly, without Solr, and what
> happens?
>
> Best
>
> Charlie
>
> On 29/04/2021 14:00, Anuj Bhargava wrote:
> > { "responseHeader":{ "status":0, "QTime":11, "params":{ "q":"*:*", "_": "1619701169621"}},
> > "response":{"numFound":179573,"start":0,"numFoundExact": true,"docs":[ { "country":["AU"], "date_c":"2019-03-14T18:30:00Z",
> >
> > On Thu, 29 Apr 2021 at 17:18, Charlie Hull <
> ch...@opensourceconnections.com>
> > wrote:
> >
> >> What happens if you run exactly that SELECT query on your source
> database?
> >>
> >> Charlie
> >>
> >> On 29/04/2021 12:08, Anuj Bhargava wrote:
> >>> Ever since installing 8.6 a couple of months ago, the indexing was
> >> working
> >>> fine. All of a sudden getting the following error -
> >>>
> >>> 4/29/2021, 12:01:59 PM
> >>> ERROR false
> >>> DocBuilder
> >>> Exception while processing: hotels document : SolrInputDocument(fields:
> >>> []):org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable
> >>> to execute query: SELECT * FROM hotels WHERE country = 'IN' OR country
> >>> ='PK' OR country ='BD' OR country ='AF' OR country ='NP' OR country
> ='LK'
> >>> OR country ='MV' OR country ='BT' Processing Document # 1
> >>>
> >>> 4/29/2021, 12:01:59 PM
> >>> ERROR false
> >>> DataImporter
> >>> Full Import failed:java.lang.RuntimeException:
> >> java.lang.RuntimeException:
> >>> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
> to
> >>> execute query: SELECT * FROM hotels WHERE country = 'IN' OR country
> ='PK'
> >>> OR country ='BD' OR country ='AF' OR country ='NP' OR country ='LK' OR
> >>> country ='MV' OR country ='BT' Processing Document # 1
> >>>
> >>> Please help
> >>>
> >> --
> >> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> >> Founding member of The Search Network
> >> and co-author of Searching the Enterprise
> >> tel/fax: +44 (0)8700 118334
> >> mobile: +44 (0)7767 825828
> >>
>
>


Solr 8.6.2 - single query on 2 cores ?

2021-04-30 Thread Anuj Bhargava
I have 2 cores '*live*' and '*archive*' with exactly the same fields.

I want to query on a unique id - '*posting_id*'. First it should check
*live* and, if not found, then *archive*, and show the results.

The following is doing a search on *live* and not on *archive*
http://xxx:8983/solr/live/select?q=*:*&fq=posting_id:41009261&indent=true&shards=archive

The following gives an error -
http://xxx.yyy.zzz.aaa:8983/solr/live/select?q=*:*&fq=posting_id:41009261&indent=true&shards=xxx.yyy.zzz.aaa:8983/solr/archive



<response>
  <lst name="responseHeader">
    <int name="status">401</int>
    <int name="QTime">10</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="shards">xxx.yyy.zzz.aaa:8983/solr/archive</str>
      <str name="indent">true</str>
      <str name="fq">posting_id:41009261</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <lst name="error">
    <lst name="metadata">
      <str name="error-class">org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException</str>
      <str name="root-error-class">org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException</str>
    </lst>
    <str name="msg">Error from server at null: Expected mime type application/octet-stream but got text/html. Error 401 Unauthorized - HTTP ERROR 401 Unauthorized; URI: /solr/archive/select; STATUS: 401; MESSAGE: Unauthorized; SERVLET: default</str>
    <int name="code">401</int>
  </lst>
</response>

How can I do a single query on 2 cores?

Have added the following in solr.in.sh - SOLR_OPTS="$SOLR_OPTS
-Dsolr.disable.shardsWhitelist=true"


Re: join with big 2nd collection

2021-04-30 Thread Jens Viebig
Tried something today which seems promising.

I have put all documents from both cores in the same collection, sharded the 
collection to 8 shards, routing the documents so all documents with the same 
contentId_s end up on the same shard. 
To distinguish between document types we used a string field with an identifier
(doctype_s:col1 / doctype_s:col2).
(Btw, what would be the best data type for a doc identifier that is fast to
filter on?)
The join inside the same core seems a) much more efficient and b) to work
with a sharded index.
We are currently still running this on a single instance and have reasonable
response times (~1 sec), which would be OK for us and a big improvement over
the old state.
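For reference, a same-collection join of the kind described might look like this (field values are hypothetical, following the style of the Sample queries in the original mail below). Without fromIndex, the join stays inside the local core, which, with documents co-routed by contentId_s, is presumably why it also works per-shard:

```text
q={!join from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:someval'}
fq=doctype_s:col1
fq={!collapse field=contentId_s min=timecode_f}
```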

- Why is the join that much faster? Is it because of the sharding, or also
because of the same core?
- How can we expect this to scale when adding more documents (probably while
adding Solr instances/shards/replicas on additional servers)?
Doubling/tripling/... the amount of docs.
- Would you expect query times to improve with additional servers and Solr
instances?
- What would be the best data type for a doc identifier that is fast to filter
on, to distinguish between different document types in the same collection?

What I don't like about this solution is that we lose the possibility to
completely reindex a "document type". For example, collection1 was pretty
fast to completely reindex and its schema changes more often, while
collection2 is "index once / delete after x days" and is heavy to reindex.

Best Regards
Jens

-----Original Message-----
From: Jens Viebig
Sent: Wednesday, 28 April 2021 19:12
To: users@solr.apache.org
Subject: join with big 2nd collection

Hi List,

We have a join performance issue and are not sure in which direction we should
look to solve it.
We currently only have a single-node setup.

We have 2 collections on which we do join queries, joined by a "primary key"
string field contentId_s. Each dataset for a single contentId_s consists of
multiple timecode-based documents in both indexes, which makes this a
many-to-many query.

collection1 - contains generic metadata and timecode based content (think 
timecode based comments)
Documents: 382.872
Unique contentId_s: 16715
~ 160MB size
single shard

collection2 - contains timecode-based GPS data (GPS position, field of
view... timecodes are not related to timecodes in collection1, so flattening
the structure would blow up the number of documents to incredible numbers):
Documents: 695.887.875
Unique contentId_s: 10199
~ 300 GB size
single shard

Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with 64 GB
with not much improvement) and a 1 TB SSD for the index.

In our use case there is lots of indexing/deletion traffic on both indexes and
only a few queries fired against the server.

We are constantly indexing new content and deleting old documents. This was
already getting problematic with HDDs, so we switched to SSDs; indexing
speed is fine for now (we might also need to scale this up in the future to
allow more throughput).

But search speed suffers when we need to join with the big collection2 (taking
up to 30 sec for the query to succeed). We had some success experimenting with
score join queries when the collection2 result only returns a few unique ids,
but we can't predict that this is always the case, and if a lot of documents
are hit in collection2, performance is 10x worse than with the original
normal join.

Sample queries look like this (simplified, but more complex queries are not 
much slower):

Sample1:
query: coll1field:someval OR {!join from=contentId_s to=contentId_s
fromIndex=collection2 v='coll2field:someval'}
filter: {!collapse field=contentId_s min=timecode_f}

Sample 2:
query: coll1field:someval
filter: {!join from=contentId_s to=contentId_s fromIndex=collection2
v='coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}


I experimented with running the query on collection2 alone first, only to get
the numdocs (collapsing on contentId_s), to see how many results we get so we
could choose the right join query; but with many hits in collection2 this
almost takes the same time as doing the join, so slow queries would get even
slower.

Caches also don't seem to help much, since almost every query fired is
different and the index mostly changes between requests anyway.

We are open to anything: adding nodes/hardware/shards, changing the index
structure...
Currently we don't know how to get around the big join.

Any advice on which direction we should look?


Re: [VOTE] Solr Operator Logo

2021-04-30 Thread Christine Poerschke (BLOOMBERG/ LONDON)
Thanks Mona for submitting the entries and thanks Houston for organising the 
vote!

(binding)

logo vote: L-1 L-3 L-4
icon vote: I-4

(notes)

L-1 as first choice: for the 'o' in Solr, the Kubernetes logo but in Solr color.
L-3 as second choice, with the 'o' in Solr still having much in common with the
Kubernetes logo and being in Solr color.
L-4 as third choice, in case there are constraints with the other choices (in
https://solr.apache.org/theme/images/identity/Solr_Styleguide.pdf there is
guidance on Solr logo use; I don't know if similar exists for the Kubernetes
logo use, though someone has probably already considered that).

I-4 as first choice, and I especially like how the Kubernetes logo is rotated
so that the spokes(?) align with the rays(?) of the Solr logo.

L-2 and I-1/I-2/I-3 are not on my shortlist; I like the orange color gradient,
but it is the color scheme of the previous Solr logo and visually rather
different from the current Solr logo.

From: users@solr.apache.org At: 04/27/21 22:00:36 To: users@solr.apache.org,
d...@solr.apache.org
Subject: [VOTE] Solr Operator Logo

Hello Solr users & devs,

The Solr Operator is the first subproject under Apache Solr and thus needs
a distinguishing logo.

We have multiple options to choose from for both the Solr Operator logo and
icon. These designs have been based off of the Solr and Kubernetes logos:

   - https://solr.apache.org/logos-and-assets.html
   - https://github.com/kubernetes/kubernetes/tree/master/logo

The winning logo and icon will be finalized with dark & light options, in
SVG format after they are selected. So please don't factor blurry lines
into your vote.

*Please read the following rules carefully* before submitting your vote.

*Who can vote?*

Anyone is welcome to cast a vote in support of their favorite
submission(s). Note that only Solr PMC members' votes are binding. If you
are a Solr PMC member, please indicate with your vote that the vote is
binding, to ease collection of votes. In tallying the votes, I will attempt
to verify only those marked as binding.


*How do I vote?*
Votes can be cast simply by replying to this email. It is a ranked-choice
vote [rank-choice-voting]. Multiple selections may be made, where the order
of preference must be specified. If an entry gets more than half the votes,
it is the winner. Otherwise, the entry with the lowest number of votes is
removed, and the votes are retallied, taking into account the next
preferred entry for those whose first entry was removed. This process
repeats until there is a winner.
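The elimination-and-retally procedure described above is instant-runoff voting. As an editorial illustration (the ballots below are made up, not actual votes from this thread):

```python
from collections import Counter

def instant_runoff(ballots):
    # Each ballot is a list of entry ids, highest preference first.
    ballots = [list(b) for b in ballots if b]
    while True:
        # Tally the current first choices.
        counts = Counter(b[0] for b in ballots)
        total = sum(counts.values())
        leader, votes = counts.most_common(1)[0]
        if votes * 2 > total:
            return leader  # more than half the votes: winner
        # Otherwise eliminate the entry with the fewest first-choice votes
        # and retally using those voters' next preferences.
        loser = min(counts, key=counts.get)
        ballots = [[e for e in b if e != loser] for b in ballots]
        ballots = [b for b in ballots if b]  # drop exhausted ballots

# Made-up ballots: L-4 is eliminated first, its voter's next choice is
# L-3, and L-3 then holds 3 of 5 votes.
print(instant_runoff([["L-1", "L-3"], ["L-3", "L-1"], ["L-1"],
                      ["L-4", "L-3"], ["L-3"]]))  # L-3
```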

The entries are broken up by type, then entry. The entry identifiers are
first either "L" for logo or "I" for icon, followed by an id for that
entry. You can vote for multiple entries, with your highest preference
sorted first.

Please vote for the logo and icon separately.

(binding)
logo vote: L-1, L-3, L-4
icon vote: I-3, I-2

*Entries*

All entries are submitted by Mona Chang.

The entries are as follows:

Logos:

[L-1]
https://user-images.githubusercontent.com/64094885/116154447-49f14100-a6ae-11eb-9d05-b4fa7aeb94e6.png
[L-2]
https://user-images.githubusercontent.com/64094885/116311387-d19f8400-a770-11eb-96da-65891a6df5aa.png
L-2 is merely an alternative color for L-1.
[L-3]
https://user-images.githubusercontent.com/64094885/116153995-bf104680-a6ad-11eb-8fee-c29d4612a86c.png
[L-4]
https://user-images.githubusercontent.com/64094885/116154426-452c8d00-a6ae-11eb-8dc2-080d7b5e04af.png

Icons:

All icons are similar, with different colors and gradients used.

[I-1]
https://user-images.githubusercontent.com/64094885/114226789-ad9f0e80-9939-11eb-97d6-945bf79b8ce7.png
[I-2]
https://user-images.githubusercontent.com/64094885/116154026-c899ae80-a6ad-11eb-8e47-900adfde9654.png
[I-3]
https://user-images.githubusercontent.com/64094885/116311474-e9770800-a770-11eb-9402-74e6c1263290.png
[I-4]
https://user-images.githubusercontent.com/64094885/116154021-c6cfeb00-a6ad-11eb-83c6-384d66500f49.png

Please vote for one logo and one icon from the above choices. This vote
will close about one week from today, Tue, May 4, 2021 at 11:59PM.

- Houston Putman

This vote is based off of the Lucene logo contest, run by Ryan Ernst.




Re: Solr 8.6 Indexing Issue

2021-04-30 Thread Shawn Heisey

On 4/29/2021 5:08 AM, Anuj Bhargava wrote:

Ever since installing 8.6 a couple of months ago, the indexing was working
fine. All of a sudden getting the following error -


Can you use a file sharing site to provide us with links to the following?

* The solrconfig.xml file for the problem index.
* The dataimport handler config file that is referenced in solrconfig.xml.
* A copy of the solr.log file that contains the moment of failure.
* A count of how many rows in MySQL match that SELECT statement.

For the last one, you can simply replace "*" in the query with "COUNT(*)"
and run it manually, as Charlie suggested.
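In other words (the IN (...) form below is equivalent to the OR chain in the error message, just shorter):

```sql
-- Run directly against MySQL, without Solr:
SELECT COUNT(*) FROM hotels
WHERE country IN ('IN', 'PK', 'BD', 'AF', 'NP', 'LK', 'MV', 'BT');
```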


The DIH config probably contains a password.  Feel free to redact 
sensitive data from the file, but do not change it in any other way.


The reason that I have asked you to use a file sharing site is because 
the mailing list tends to eat all attachments.  That is a very 
unreliable way to send us files.


Thanks,
Shawn


Re: SecureRandom algorithm 'NativePRNG' is in use

2021-04-30 Thread lamine lamine
 I struggled on this for two days (using Intellij/gradle).
If you're using Gradle, you will need to pass the option through a test task 
(in build.gradle)

test {
    systemProperty 'test.solr.allowed.securerandom', 'NativePRNG'
}
Lamine


On Thursday, 29 April 2021 at 17:49:00 UTC-5, Chris Hostetter wrote:
 
 
: > I intermittently face this issue sometimes while running the unit tests.

How exactly are you running the tests? ant? IDE? ... It's very strange 
that this would be an intermittent problem.

Can you please post the actual log details from the test so we can see the 
INFO & WARN level logging from assertNonBlockingRandomGeneratorAvailable() 
(just before this assertion would fail) ... I'm very curious what
your java.security.egd value is (and where/why/how it's getting set).

: One more thing,  -Dtest.solr.allowed.securerandom=NativePRNG doesn't seem
: to help and I haven't tried the other option yet.

If test.solr.allowed.securerandom is being set properly (so that the
forked test JVM is getting it) then that assertion can't even be run (but
a different assertion vets that what you specify is what your JVM is
using) ... one thing that may not be well explained in the docs is that
when running tests from ant, you need to use '-Dargs=...' in order to pass
"extra" arguments to the forked test JVMs...

ant test -Dtestcase=SampleTest 
-Dargs='-Dtest.solr.allowed.securerandom=BogusPRNG'
...
  [junit4]  2> 1420 INFO  (SUITE-SampleTest-seed#[DDDB05C007992358]-worker) [   
 ] o.a.s.SolrTestCaseJ4 SecureRandom sanity checks: 
test.solr.allowed.securerandom=BogusPRNG & java.security.egd=file:/dev/./urandom
...
  [junit4]    > Throwable #1: org.junit.ComparisonFailure: Algorithm specified 
using test.solr.allowed.securerandom system property does not match actual 
algorithm expected:<[Bogus]PRNG> but was:<[SHA1]PRNG>





-Hoss
http://www.lucidworks.com/
  

RE: Solr 8.6.2 - single query on 2 cores ?

2021-04-30 Thread ufuk yılmaz
Hi Anuj,

First solution that comes to my mind is using streaming expressions. 

leftOuterJoin can do this:

https://solr.apache.org/guide/8_6/stream-decorator-reference.html#leftouterjoin

Example:

leftOuterJoin(
  search(archive, q="postings_id:123", qt="/select", fl="postings_id",
    sort="postings_id asc"),
  search(live, q="postings_id:123", qt="/select", fl="postings_id",
    sort="postings_id asc"),
  on="postings_id"
)

If a document with postings_id:123 is found on live, it is used; otherwise the
document from the archive collection is returned, because documents from the
"right" stream (live) overwrite values onto the "left" (archive). Add other
required fields to the fl parameter.

Hope it helps

Sent from Mail for Windows 10

From: Anuj Bhargava
Sent: 30 April 2021 16:25
To: solr-u...@lucene.apache.org
Subject: Solr 8.6.2 - single query on 2 cores ?
