> .../200mmCsvCore/dataimport?

This refers to the Data Import Handler
https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
which was extracted into https://github.com/SearchScale/dataimporthandler

It's usually slow. A symptom of inefficient indexing is low CPU utilization; you can check the load average during indexing. There are two approaches for reaching higher throughput:
1. send data in bulk (whereas /dataimport writes documents one by one)
2. index in parallel (whereas /dataimport runs in a single thread)

Perhaps you can put /dataimport aside, break the CSV file into smaller pages of about 1k rows each, and invoke
https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
in a few parallel threads; that should provide better utilisation.

Another option is to build an app using the concurrent update client from SolrJ
https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients

On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <sippingonesandze...@gmail.com> wrote:

> Shawn,
>
> Thanks for helping me out. Solr documentation has a lot of bells and
> whistles and I am overwhelmed.
>
> The total number of documents is 200 millions. Each line of the csv will
> be a document. There are 200 million lines.
>
> I have the 2 options on load-n-index
>
> The current way of getting data is using API liked https://
> .../200mmCsvCore/dataimport?
> command="full-import"
> &clean=true
> &commit=true
> &optimize=true
> &wt=json
> &indent=true
> &verbose=false
> &debug=false
>
> I am thinking of csv because another remote location also wants to use Solr
> and my gut feeling is that fetching a large single csv file over the
> network will keep data consistent across the two places.
>
> I didn't think about the parsing of the csv file with double quotes and
> delimiter. Will json file be faster?
>
> I am not aware of a way to split the 200 million lines CSV to batches of
> loads. Will smaller batches be faster? Could you give me an example of
> how to split?
>
> From the Solr UI, how can I tell the number of threads are set for indexing
> ?
>

--
Sincerely yours
Mikhail Khludnev
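
The page-splitting suggestion in the reply above could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not a tested loader: the host and port (localhost:8983) are placeholders because the real host is elided in the thread, and the helper names csv_pages, post_page and index_file are made up for this example. Rows are parsed with the csv module rather than split on newlines, so quoted fields containing delimiters or line breaks stay intact, and the header row is repeated on every page so that each page can be posted on its own.

```python
import csv
import io
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from threading import BoundedSemaphore

# Assumption: placeholder host and port; only the core name comes from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/200mmCsvCore/update"
PAGE_ROWS = 1000  # "about 1k rows in each"
THREADS = 4       # "a few parallel threads"


def csv_pages(rows, page_rows=PAGE_ROWS):
    """Yield CSV text pages; every page repeats the header row."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    while True:
        page = list(itertools.islice(rows, page_rows))
        if not page:
            return
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(page)
        yield buf.getvalue()


def post_page(page, url=SOLR_UPDATE_URL):
    """POST one page to the CSV update handler, without committing."""
    request = urllib.request.Request(
        url,
        data=page.encode("utf-8"),
        headers={"Content-Type": "application/csv; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_file(path, send=post_page, threads=THREADS):
    """Send all pages of the file through a thread pool.

    Returns (number of pages submitted, list of exceptions raised by send).
    """
    slots = BoundedSemaphore(threads * 2)  # caps pages held in memory at once
    failures = []

    def done(future):
        slots.release()
        if future.exception() is not None:
            failures.append(future.exception())

    submitted = 0
    with open(path, newline="", encoding="utf-8") as f, \
            ThreadPoolExecutor(max_workers=threads) as pool:
        for page in csv_pages(csv.reader(f)):
            slots.acquire()  # wait until a worker has capacity
            pool.submit(send, page).add_done_callback(done)
            submitted += 1
    return submitted, failures
```

One would call index_file with the path of the CSV file and then issue a single commit once at the end (for example a request to the same update URL with commit=true), instead of committing on every page. The thread count and page size are starting points to tune while watching the load average.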