> .../200mmCsvCore/dataimport?
refers to the Data Import Handler:
https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
which has since been extracted into https://github.com/SearchScale/dataimporthandler
It's usually slow. A symptom of this inefficiency is low CPU utilization;
you can check the load average during indexing.
There are two approaches for reaching higher throughput:

   1. send data in batches (whereas /dataimport writes documents one by one)
   2. index in parallel (whereas /dataimport runs in a single thread)

Perhaps you can set /dataimport aside, split the CSV file into smaller chunks
of about 1k rows each, and post them to
https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
from a few parallel threads; that should give better utilization.
Another option is to build an app with SolrJ using the concurrent update client:
https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients
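
A rough sketch of the first suggestion (split the CSV into chunks of about 1k
rows and post them from a few threads) might look like the following. It is
only an illustration under some assumptions: the host and port are
placeholders, the file is UTF-8 and comma-delimited with a header row, and the
names csv_chunks, post_chunk and index_file are made up for this example.
Rows are re-parsed with Python's csv module so that quoted fields containing
commas or newlines are not cut in half.

```python
import csv
import io
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder values (assumptions): adjust host, port and core to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/200mmCsvCore/update"
ROWS_PER_CHUNK = 1000
WORKERS = 4


def csv_chunks(rows, rows_per_chunk=ROWS_PER_CHUNK):
    """Yield CSV text chunks; each chunk repeats the header row."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    while True:
        batch = list(itertools.islice(rows, rows_per_chunk))
        if not batch:
            return
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(batch)
        yield buf.getvalue()


def post_chunk(body, url=SOLR_UPDATE_URL):
    """POST one CSV chunk to the update handler (no commit per chunk)."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/csv; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_file(path, workers=WORKERS):
    """Stream the file and post chunks from a small thread pool."""
    sent = 0
    with open(path, newline="", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = csv_chunks(csv.reader(f))
        while True:
            # Take a bounded window so the whole file is never held in memory.
            window = list(itertools.islice(chunks, workers * 4))
            if not window:
                break
            # list() waits for the window and re-raises any HTTP error.
            sent += len(list(pool.map(post_chunk, window)))
    return sent
```

Calling index_file("/path/to/file.csv") would post the chunks; a single commit
afterwards (for example a request to the same update URL with commit=true) is
probably preferable to committing on every chunk.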


On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> Shawn,
>
> Thanks for helping me out.   Solr documentation has a lot of bells and
> whistles and I am overwhelmed.
>
> The total number of documents is 200 million.  Each line of the csv will
> be a document.  There are 200 million lines.
>
> I have the 2 options on load-n-index
>
> The current way of getting data is using an API call like https://
> .../200mmCsvCore/dataimport?
> command="full-import"
> &clean=true
> &commit=true
> &optimize=true
> &wt=json
> &indent=true
> &verbose=false
> &debug=false
>
> I am thinking of csv because another remote location also wants to use Solr
> and my gut feeling is that fetching a large single csv file over the
> network will keep data consistent across the two places.
>
> I didn't think about the parsing of the csv file with double quotes and
> delimiter.  Will json file be faster?
>
> I am not aware of a way to split the 200-million-line CSV into batches
> for loading.  Will smaller batches be faster?  Could you give me an
> example of how to split?
>
> From the Solr UI, how can I tell how many threads are set for indexing?
>


-- 
Sincerely yours
Mikhail Khludnev
