> In other words, using the following will load and will index:

Indeed.
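For the splitting question further down the thread, here is a minimal sketch of the split-and-post-in-parallel approach. It assumes GNU coreutils (tail, split, xargs), an input file big.csv with a header row, and placeholder field names id,title,author; none of these values come from the thread, so adjust them to your schema. The header and fieldnames parameters are from the CSV update handler documentation linked below.

# Drop the header row and split into chunks of 1000 lines each
# (produces chunk_aa, chunk_ab, ...).
tail -n +2 big.csv | split -l 1000 - chunk_

# POST the chunks to the CSV update handler, 4 requests at a time.
# header=false because the chunks carry no header line; fieldnames
# supplies the column names instead (placeholders here).
ls chunk_* | xargs -P 4 -I {} curl -s \
  'http://localhost:8983/solr/my_collection/update?header=false&fieldnames=id,title,author' \
  --data-binary @{} -H 'Content-type:application/csv'

# Commit once at the end rather than once per chunk.
curl 'http://localhost:8983/solr/my_collection/update?commit=true'

Committing once at the end keeps the per-chunk requests cheap; committing on every request would mean hundreds of thousands of commits for a 200-million-row file.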
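On the double-quote and delimiter concern raised below: the CSV update handler accepts explicit separator and encapsulator parameters. The values shown here are its documented defaults (a URL-encoded comma and double quote), spelled out only to make them visible; override them for unusual files.

curl 'http://localhost:8983/solr/my_collection/update?commit=true&separator=%2C&encapsulator=%22' \
  --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'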
On Sun, Nov 12, 2023 at 6:53 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> Hi, Mikhail.
>
> I am very encouraged by your reply.
>
> I will split the csv into smaller ones and give this a try.
>
> https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
>
> Could you confirm my understanding of Solr's terminology? Does "Solr
> indexing" refer to both the loading and the indexing operations? In other
> words, using the following will load and will index:
>
> curl 'http://localhost:8983/solr/my_collection/update?commit=true'
> --data-binary @example/exampledocs/books.csv -H
> 'Content-type:application/csv'
>
> Right?
>
> Thanks.
>
>
> On Sun, Nov 12, 2023 at 8:20 AM Mikhail Khludnev <m...@apache.org> wrote:
>
> > .../200mmCsvCore/dataimport?
> > refers to
> > https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
> > which was extracted into https://github.com/SearchScale/dataimporthandler
> > It's usually slow. A symptom of inefficient indexing is low CPU
> > utilization; you can check the load average while indexing runs.
> > There are two approaches to reaching higher throughput:
> >
> > 1. send data in bulk (whereas /dataimport writes documents one by one)
> > 2. index in parallel (whereas /dataimport runs a single thread)
> >
> > Perhaps you can put /dataimport aside, break the csv file into smaller
> > pages of about 1k rows each, and invoke
> > https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
> > in a few parallel threads; that should give better utilisation.
> > Another option is to build an app using the concurrent update client
> > from SolrJ:
> > https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients
> >
> >
> > On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <
> > sippingonesandze...@gmail.com> wrote:
> >
> > > Shawn,
> > >
> > > Thanks for helping me out. The Solr documentation has a lot of bells
> > > and whistles, and I am overwhelmed.
> > >
> > > The total number of documents is 200 million. Each line of the csv
> > > will be a document; there are 200 million lines.
> > >
> > > I have two options for load-and-index.
> > >
> > > The current way of getting data is an API call like https://
> > > .../200mmCsvCore/dataimport?
> > > command="full-import"
> > > &clean=true
> > > &commit=true
> > > &optimize=true
> > > &wt=json
> > > &indent=true
> > > &verbose=false
> > > &debug=false
> > >
> > > I am thinking of csv because another remote location also wants to
> > > use Solr, and my gut feeling is that fetching a single large csv file
> > > over the network will keep the data consistent across the two places.
> > >
> > > I didn't think about the parsing of the csv file with double quotes
> > > and delimiters. Would a json file be faster?
> > >
> > > I am not aware of a way to split the 200-million-line CSV into
> > > batches of loads. Will smaller batches be faster? Could you give me
> > > an example of how to split?
> > >
> > > From the Solr UI, how can I tell how many threads are set for
> > > indexing?
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev

--
Sincerely yours
Mikhail Khludnev