About Using Hadoop in SolrCloud

2023-02-23 Thread Zara Parst
Hi,

I have read in many places about using Hadoop with SolrCloud, and I am trying
to find the reason to use Hadoop in place of a local file system. Can someone
briefly explain why to use Hadoop with SolrCloud, when Solr is just using
Hadoop for indexing and for storing logs? Is there any compelling
reason to do that?

Does Hadoop have any advantage over the local file system with Solr, since
I can run in cloud mode with the index stored on the local file system and
still use shards and replicas?  So my question is what advantage Hadoop will
give me: does Hadoop index faster, does it take less space to store the
index, is its distributed file system better at things like sharding and
replication? Or does it take backups automatically?

Please answer this question in as much detail as possible.


Re: About Using Hadoop in SolrCloud

2023-02-23 Thread Eric Pugh
I am replying, but just to the users mailing list, as it’s not appropriate for 
dev@.

I think the short answer is that if you are already super into the Hadoop
ecosystem, then you already have strong reasons why, and you can already answer
all of the questions you listed ;-).  You then look at Solr on Hadoop as “hey, it
works with what I am already doing at my enterprise”.

If you aren’t already in the Hadoop ecosystem, then there isn’t any special
Solr-specific reason to go this way, and indeed many reasons NOT to.  Hadoop
isn’t for the faint of heart….

Not an answer per se…. 


___
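Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>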


This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Solr copyfield error

2023-02-23 Thread Paul Ryder
Hi All,

Getting errors on Solr such as "copyfield dest: 'fred_str' is not an explicit 
field and doesn't match a dynamic field"

The solrconfig has an entry:

<lst name="copyField">
  <str name="dest">*_str</str>
  <int name="maxChars">256</int>
</lst>

This seems to be related to field type guessing.

The schema does not contain a dynamic field entry for *_str (it only has *_s
for the string type).

So if Solr copies fields to its guessed type and also to *_str (according to the
docs), then shouldn't the dynamic field *_str be in the schema?

OR

Should I edit the copyField entry "*_str" to be "*_s" instead?

Thanks, Paul
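
One way to approach the first option (adding the missing dynamic field) is through Solr's Schema API, which accepts an `add-dynamic-field` command as JSON. The sketch below only builds that JSON payload; it does not contact a server. The attribute values and the `strings` field type are assumptions modelled on the stock `_default` configset and may not match a Sitecore-provided schema, so check that the field type exists before using it.

```python
import json


def add_dynamic_field_command(pattern="*_str", field_type="strings"):
    """Build a Schema API 'add-dynamic-field' command for the given pattern.

    Assumption: `field_type` names a field type that already exists in the
    target schema. The remaining attributes are illustrative defaults.
    """
    return {
        "add-dynamic-field": {
            "name": pattern,
            "type": field_type,
            "stored": False,
            "indexed": False,
            "docValues": True,
        }
    }


if __name__ == "__main__":
    # The payload would be POSTed to <solr-base-url>/<collection>/schema
    # (placeholder path; adjust host, port and collection to your setup).
    print(json.dumps(add_dynamic_field_command(), indent=2))
```

With a dynamic field matching `*_str` in place, a copyField destination such as `fred_str` has something to resolve against, which is what the error message complains about.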




RE: Solr copyfield error

2023-02-23 Thread Paul Ryder
Should have said, this is Solr 8.1.1 running under Sitecore managed_schema... 






Re: About Using Hadoop in SolrCloud

2023-02-23 Thread Zara Parst
I think I was looking for someone to tell me: hey, don't do something fancy
unless you must. I am somewhat at ease now. I will leave Hadoop for
some other project.



Re: Return field aliasing for id field does not work

2023-02-23 Thread Mikhail Khludnev
Hello, Rajani.
I built and launched a fresh [main] branch. fl aliasing works there; here's
a screenshot as proof:
https://pasteboard.co/KNFAyGPzBKFP.png


On Thu, Feb 23, 2023 at 5:06 AM Rajani Maski  wrote:

> Hi,
>
>  Solr 9.x, aliasing "id" field in return fields list with a different field
> value does not work the same as 7.x.  fl=unique_key:id,id:some_field. This
> returns an empty docs array in the response. Any alternative?
>
> Thanks,
> rajani
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
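
For anyone wanting to reproduce the report, the request Rajani describes can be assembled as in the sketch below. It only builds the URL for a `/select` request with the aliased `fl` list; the base URL and collection name are placeholders, not values from this thread.

```python
from urllib.parse import urlencode, parse_qs


def build_select_url(base_url, fl_aliases, q="*:*"):
    """Build a /select URL whose fl list uses alias:field pairs.

    fl_aliases is a list of (alias, source_field) tuples, e.g.
    [("unique_key", "id"), ("id", "some_field")].
    """
    fl = ",".join(f"{alias}:{field}" for alias, field in fl_aliases)
    return f"{base_url}/select?{urlencode({'q': q, 'fl': fl})}"


if __name__ == "__main__":
    # Placeholder host and collection name.
    url = build_select_url(
        "http://localhost:8983/solr/mycollection",
        [("unique_key", "id"), ("id", "some_field")],
    )
    print(url)
```

Running the same URL against a 7.x node, a 9.x node and a build of main makes it easier to compare the `docs` arrays side by side.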


Re: About Using Hadoop in SolrCloud

2023-02-23 Thread Eric Pugh
;-).






Re: About Using Hadoop in SolrCloud

2023-02-23 Thread David Smiley
I agree with Eric, but wish to add one point: separating compute from
storage can give you better redundancy (HDFS or S3 will do it better, maybe
cheaper), better elasticity (Solr nodes become stateless, so it is easy to add
more nodes), and perhaps better cost.  In exchange you sacrifice some indexing
performance and a bit of query performance.  Admittedly I don't have real
experience here, but this is my thinking.  The most annoying thing about Solr's
HDFS support is that SolrCloud's replication is redundant with the replication
already done at the storage layer, which adds cost inefficiency.  There is
potential for improvements there.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
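
The redundancy point can be made concrete with a little arithmetic. The sketch below assumes each SolrCloud replica writes its own copy of the index to HDFS and that HDFS then replicates every block by its own replication factor (3 is the HDFS default). The index size, shard count and replica count are made-up illustrative numbers.

```python
def total_storage_gb(index_gb_per_shard, shards, solr_replicas, storage_replication=1):
    """Raw storage consumed when both layers keep their own copies.

    storage_replication=1 models plain local disks (no extra copies made
    by the storage layer).
    """
    return index_gb_per_shard * shards * solr_replicas * storage_replication


if __name__ == "__main__":
    # Hypothetical cluster: 4 shards of 50 GB, 2 Solr replicas per shard.
    local = total_storage_gb(50, shards=4, solr_replicas=2)
    hdfs = total_storage_gb(50, shards=4, solr_replicas=2, storage_replication=3)
    print(f"local disks: {local} GB, HDFS: {hdfs} GB")
```

Under these assumptions the same 200 GB of logical index occupies 400 GB on local disks but 1200 GB on HDFS, because the two replication mechanisms multiply rather than share copies.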




using dense vector search with Solr

2023-02-23 Thread Till Kinstler

Hi,

I've been playing with "neural" / dense vector search in Solr 9 a bit 
and find it very promising.
Currently I am calculating the vectors outside of Solr at indexing and 
search time with a bunch of scripts using NLP models (text in, vectors 
out...). Especially at search time, that's not exactly a handy solution, 
because every client application would have to do this (or some sort of 
proxy application between client applications and Solr, that would 
manipulate requests (search terms out, vector in) on their way to Solr). 
That's ok for my very basic prototype, but nothing else.
How are others solving this? Are there any best practices? Or even plans 
to make Solr talk directly to ML models?
In Solr's traditional logic, I would imagine something like an analyzer
that does the "dense vector creation" at indexing and search time. It
would have to use an ML model, pass data/searches in, get vectors out and
put them into a DenseVectorField, just as traditional analyzers work.
Could the model be a configurable ONNX model?
Is someone working on something like this? (I only found some related 
comments in https://github.com/apache/solr/pull/1213)


Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, http://www.gbv.de/
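
The proxy idea described above (search terms in, vector out) could look roughly like the sketch below. It rewrites the `q` parameter of an incoming request into Solr 9's `{!knn}` query parser syntax. The `embed` function is a toy stand-in and not a real model; the field name `vector`, the dimension 4 and the function names are all assumptions made for illustration.

```python
def embed(text, dim=4):
    """Toy placeholder for a real embedding model (an assumption, not NLP).

    Buckets character codes into `dim` slots and L2-normalises the result,
    purely so the example is deterministic and self-contained.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def knn_query(field, vector, top_k=10):
    """Format a vector as a Solr {!knn} query string."""
    values = ", ".join(f"{v:.6f}" for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


def rewrite_request(params, field="vector", top_k=10):
    """Proxy step: swap the user's text query for a knn query, keep the rest."""
    rewritten = dict(params)
    rewritten["q"] = knn_query(field, embed(params["q"]), top_k)
    return rewritten


if __name__ == "__main__":
    print(rewrite_request({"q": "open source search", "rows": "5"}))
```

In a real deployment `embed` would call the same model used at indexing time, and its output dimension would have to match the `vectorDimension` configured on the DenseVectorField.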


Re: using dense vector search with Solr

2023-02-23 Thread Derek C
Hi Till & all,

We are using KNN search in a SolrCloud 9.0.0 cluster.  We had played around
with Spotify's Annoy but, apart from the obvious number-of-documents vs
memory-size issues, the really big thing about KNN in Solr is that we can mix
queries.  (I'm still not 100% sure how this really works, because we have to
really push topK when mixing KNN queries with other variables, but it does
mostly work.)  We can't really go wild on this yet because it's just too CPU
intensive right now (we've got enough RAM, 64GB, in every EC2 instance to cache
the entire 2.5M documents, which must be helping).  The embeddings I'm holding
in the dense vector fields are 512 in size, so I'm thinking about trying to
reduce them to something like 64, maybe using PCA (I'm a bit afraid of losing
accuracy, but maybe the reduction will make it feasible to use KNN for more
mainstream searches).  Our embeddings are inferred entirely outside of Solr and
then included in the Solr record.

Derek
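
To put rough numbers on the 512 vs 64 dimension question, the sketch below estimates the raw size of the stored vectors alone. It assumes 4-byte float values and ignores the HNSW graph and every other index structure, so treat the results as a lower bound rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_bytes(num_docs, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed for the vector values only (no graph, no other fields)."""
    return num_docs * dims * bytes_per_value


if __name__ == "__main__":
    docs = 2_500_000
    for dims in (512, 64):
        gb = raw_vector_bytes(docs, dims) / 1e9
        print(f"{dims} dims: {gb:.2f} GB")
```

For 2.5M documents this works out to about 5.12 GB at 512 dimensions and 0.64 GB at 64, an 8x reduction in vector data, and distance computations per comparison shrink by the same factor.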



-- 
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 443 8285
Skype: dconnrt
Email: de...@hssl.ie


*Disclaimer:* This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please delete it
(if you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of
this information is strictly prohibited).
*Warning*: Although HSSL have taken reasonable precautions to ensure no
viruses are present in this email, HSSL cannot accept responsibility for
any loss or damage arising from the use of this email or attachments.
For the Environment, please only print this email if necessary.


Re: About Using Hadoop in SolrCloud

2023-02-23 Thread Zara Parst
David, you make a good point. Is it true that we can keep indexes in S3? I mean
the index in use, not the backup.
