Hi, Mikhail.

I am very encouraged by your reply.

I will split the CSV into smaller files and give this a try.
https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
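
Roughly, this is what I have in mind for splitting and posting in parallel (a
sketch only, assuming GNU coreutils split and bash; the file name, collection
name, chunk size, and worker count are placeholders):

# Split the big CSV into ~1000-row chunks (this assumes no header row;
# otherwise each chunk would need the header prepended, or the CSV update
# handler's header/fieldnames parameters set accordingly).
split -l 1000 --additional-suffix=.csv 200mm.csv chunk_

# Post the chunks with 4 parallel curl workers, then commit once at the end.
ls chunk_*.csv | xargs -P 4 -I {} \
  curl 'http://localhost:8983/solr/my_collection/update' \
       --data-binary @{} -H 'Content-type:application/csv'

curl 'http://localhost:8983/solr/my_collection/update?commit=true'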

Could you confirm my understanding of Solr's terminology?  Does "Solr indexing"
refer to both the loading and the indexing operations?  In other words, the
following will both load and index:

curl 'http://localhost:8983/solr/my_collection/update?commit=true' \
  --data-binary @example/exampledocs/books.csv \
  -H 'Content-type:application/csv'

Right?
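
Afterwards, I assume I can sanity-check the load by asking Solr for the
document count (the collection name is again just my example):

curl 'http://localhost:8983/solr/my_collection/select?q=*:*&rows=0'

The numFound value in the response should then match the number of CSV rows
loaded.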

Thanks.


On Sun, Nov 12, 2023 at 8:20 AM Mikhail Khludnev <m...@apache.org> wrote:

> > .../200mmCsvCore/dataimport?
> refers to
>
> https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
> which was extracted into https://github.com/SearchScale/dataimporthandler
> It's usually slow. A symptom of this inefficiency is low CPU utilization;
> you can check the load average during indexing.
> There are two approaches to reaching higher throughput:
>
>    1. send data in batches (whereas /dataimport writes documents one by one)
>    2. index in parallel (whereas /dataimport runs a single thread)
>
> Perhaps you can put /dataimport aside, break the CSV into smaller pages of
> about 1k rows each, and invoke
>
> https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
> in a few parallel threads; it should provide better utilisation.
> Another option is to build an app using the concurrent update client in SolrJ:
>
> https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients
>
>
> On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <
> sippingonesandze...@gmail.com>
> wrote:
>
> > Shawn,
> >
> > Thanks for helping me out.   The Solr documentation has a lot of bells and
> > whistles, and I am overwhelmed.
> >
> > The total number of documents is 200 million.  Each line of the CSV will
> > be a document; there are 200 million lines.
> >
> > I have two options for load-and-index.
> >
> > The current way of getting data is using an API like https://
> > .../200mmCsvCore/dataimport?
> > command="full-import"
> > &clean=true
> > &commit=true
> > &optimize=true
> > &wt=json
> > &indent=true
> > &verbose=false
> > &debug=false
> >
> > I am thinking of CSV because another remote location also wants to use
> > Solr, and my gut feeling is that fetching a single large CSV file over the
> > network will keep data consistent across the two places.
> >
> > I hadn't thought about parsing the CSV file with double quotes and
> > delimiters.  Will a JSON file be faster?
> >
> > I am not aware of a way to split the 200-million-line CSV into batches for
> > loading.  Will smaller batches be faster?  Could you give me an example of
> > how to split it?
> >
> > From the Solr UI, how can I tell how many threads are set for indexing?
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
