hl.method=original

2023-11-11 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
In Solr 9.2.1, is there a way to tweak solrconfig.xml so that 
hl.method=original is (once again) the default?


Re: hl.method=original

2023-11-11 Thread Eric Pugh
Wouldn’t using an invariant definition work?  
https://solr.apache.org/guide/solr/latest/configuration-guide/requesthandlers-searchcomponents.html#invariants
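A minimal sketch of that approach in solrconfig.xml (the handler name is illustrative; only the invariants block carries the suggestion):

```
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- Force the original highlighter regardless of request parameters -->
    <str name="hl.method">original</str>
  </lst>
</requestHandler>
```

Note that an invariant overrides even an explicit hl.method on the request; if you only want a default that clients can still override, the same `<str>` could live under `<lst name="defaults">` instead.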



> On Nov 11, 2023, at 12:18 PM, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
> 
> In Solr 9.2.1, is there a way to tweak solrconfig.xml so that 
> hl.method=original is (once again) the default?

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 





How to do fastest loading and indexing

2023-11-11 Thread Vince McMahon
Hi,

I have a CSV file with 200 fields and 100 million rows of historical and
latest data.

The current processing is taking 20+ hours.

The schema is like:

...
...

In terms of hardware, I have 3 identical servers.  One of them is used to
load this CSV to create a core.

What is the fastest way to load and index this large and wide CSV file?  It
is taking too long, 20+ hours, now.


Re: How to do fastest loading and indexing

2023-11-11 Thread Benedict Holland
Are you using COPY FROM?

On Sat, Nov 11, 2023, 2:33 PM Vince McMahon wrote:

> What is the fastest way to load and index this large and wide CSV file?
> It is taking too long, 20+ hours, now.


Re: How to do fastest loading and indexing

2023-11-11 Thread Vince McMahon
I'm not querying with catch_all at the moment, but other developers may.

I am new.  Mind sharing how it matters, especially how it makes loading
and indexing fast?



On Sat, Nov 11, 2023, 3:05 PM Benedict Holland wrote:

> Are you using copy from?



Re: How to do fastest loading and indexing

2023-11-11 Thread Benedict Holland
It really depends on how you are loading data.  If you are going line by
line, it's going to be very slow.  You should load datasets like this
with a COPY FROM.  If you have any issues with your CSV file, though, it's
going to be a problem.  Things like commas without quotes tend to be
the most common problem I deal with.  You could also split it up into
smaller files for processing, so if one fails you have a record of
where to pick it up.

Check out this

https://www.postgresql.org/docs/current/sql-copy.html

Thanks,
Ben
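Benedict's splitting suggestion can be sketched in Python; the file names and chunk size below are illustrative, and the tiny demo file stands in for the real 100-million-row CSV:

```python
import csv
import tempfile
from pathlib import Path

def split_csv(src: Path, out_dir: Path, rows_per_chunk: int) -> list[Path]:
    """Split src into smaller CSV files, repeating the header in each chunk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with src.open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        out, writer, n = None, None, 0
        for row in reader:
            if writer is None or n == rows_per_chunk:
                # Start a new chunk file, repeating the header row
                if out:
                    out.close()
                path = out_dir / f"chunk_{len(chunks):05d}.csv"
                out = path.open("w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunks.append(path)
                n = 0
            writer.writerow(row)
            n += 1
        if out:
            out.close()
    return chunks

# Tiny demo standing in for the large file: 10 rows, 4 per chunk.
work = Path(tempfile.mkdtemp())
src = work / "big.csv"
with src.open("w", newline="") as fh:
    w = csv.writer(fh)
    w.writerow(["id", "name"])
    w.writerows([i, f"row{i}"] for i in range(10))
chunks = split_csv(src, work / "chunks", rows_per_chunk=4)
print(len(chunks))  # 3 chunk files: 4 + 4 + 2 rows
```

Because each chunk carries its own header, any one of them can be re-sent on its own if a load fails partway through.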

On Sat, Nov 11, 2023, 3:55 PM Vince McMahon wrote:

> I am new.  Mind sharing how it matters, esp. how it makes loading n idx
> fast?



Re: How to do fastest loading and indexing

2023-11-11 Thread Benedict Holland
Oh, also, this matters because COPY FROM is a batch job that streams
your data into the table.  It's extremely fast.  Your indexes are not the
problem; they are extremely efficient.  The problem is likely how you are
loading the data.

So actually, how are you loading the data?

Thanks,
Ben

On Sat, Nov 11, 2023, 3:55 PM Vince McMahon wrote:

> I am new.  Mind sharing how it matters, esp. how it makes loading n idx
> fast?



Re: How to do fastest loading and indexing

2023-11-11 Thread Shawn Heisey

On 11/11/2023 12:32, Vince McMahon wrote:

What is the fastest way to load and index this large and wide CSV file?  It
is taking too long, 20+ hours, now.


I am assuming here that you are sending the CSV data directly to Solr 
and letting Solr parse it into documents.  If that is incorrect, please 
fully describe your indexing software.


How many total documents are being indexed in those 20 hours?

How many threads do you have indexing simultaneously?  How many CSV 
lines are you sending in each batch?


When I was maintaining large-ish Solr installs, I was doing the indexing 
single-threaded and it would do about 1000 docs per second.  Indexing 
with multiple threads is the secret to making Solr index quickly.


Thanks,
Shawn
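Shawn's point, batching the input and indexing from several threads at once, can be sketched like this; the batch size and worker count are illustrative, and the HTTP call to Solr is stubbed out so the sketch stands alone (a real indexer would POST each batch to the core's update handler):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def post_batch(batch):
    # A real indexer would POST the batch to Solr's update handler here,
    # e.g. with an HTTP client against http://localhost:8983/solr/<core>/update.
    # Stubbed out so the sketch runs without a server.
    return len(batch)

docs = [{"id": str(i)} for i in range(10_000)]
with ThreadPoolExecutor(max_workers=8) as pool:
    indexed = sum(pool.map(post_batch, batches(docs, 1000)))
print(indexed)  # 10000
```

The executor keeps eight batches in flight at a time, which is the multi-threading Shawn describes; tuning the batch size and worker count against the actual Solr instance is left to measurement.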



Re: How to do fastest loading and indexing

2023-11-11 Thread Vince McMahon
From the Solr UI, how can I tell how many threads are set for indexing?

On Sat, Nov 11, 2023, 5:31 PM Shawn Heisey wrote:

> Indexing with multiple threads is the secret to making Solr index quickly.



Re: How to do fastest loading and indexing

2023-11-11 Thread Vince McMahon
Benedict,

Thanks for your replies.

I am trying to load into a Solr core and index there, not Postgres.

Would you happen to know the fastest way to load and index into a Solr core?

Thanks.


On Sat, Nov 11, 2023, 4:41 PM Benedict Holland wrote:

> So actually, how are you loading the data?



Re: How to do fastest loading and indexing

2023-11-11 Thread Vince McMahon
Shawn,

Thanks for helping me out.   Solr documentation has a lot of bells and
whistles and I am overwhelmed.

The total number of documents is 200 million: each line of the CSV will
be a document, and there are 200 million lines.

I have two options for loading and indexing.

The current way of getting data is an API call like:

https://.../200mmCsvCore/dataimport?
  command=full-import
  &clean=true
  &commit=true
  &optimize=true
  &wt=json
  &indent=true
  &verbose=false
  &debug=false

I am thinking of CSV because another remote location also wants to use Solr,
and my gut feeling is that fetching a single large CSV file over the
network will keep data consistent across the two places.

I didn't think about the parsing of the CSV file with double quotes and
delimiters.  Would a JSON file be faster?

I am not aware of a way to split the 200-million-line CSV into batches of
loads.  Would smaller batches be faster?  Could you give me an example of
how to split?

From the Solr UI, how can I tell how many threads are set for indexing?


is it in mill-sec or seconds

2023-11-11 Thread Vince McMahon
Hi,

I would like to find out the unit of QUERY./dataImport.totalTime.  Is
it in milliseconds or seconds?


