Hi, Mikhail. I am very encouraged by your reply.
I will split the csv into smaller ones and give this a try.
https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates

Could you confirm my understanding of Solr's terminology? Does "indexing" in
Solr refer to both the loading and the indexing operations? In other words,
the following will both load and index:

curl 'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'

Right? Thanks.

On Sun, Nov 12, 2023 at 8:20 AM Mikhail Khludnev <m...@apache.org> wrote:

> .../200mmCsvCore/dataimport? refers to
> https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
> which was extracted into https://github.com/SearchScale/dataimporthandler
> It's usually slow. The symptom of inefficient indexing is low CPU
> utilization; you can check the load average during indexing.
> There are two approaches for reaching higher throughput:
>
> 1. send data in bulks (whereas /dataimport writes data one by one)
> 2. index in parallel (whereas /dataimport runs a single thread)
>
> Perhaps you can put /dataimport aside, break the csv file into smaller
> pages of about 1k rows each, and invoke
> https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
> in a few parallel threads; that should provide better utilisation.
> Another option is to build an app using the concurrent update client in
> SolrJ:
> https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients
>
> On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <
> sippingonesandze...@gmail.com> wrote:
>
> > Shawn,
> >
> > Thanks for helping me out. The Solr documentation has a lot of bells
> > and whistles, and I am overwhelmed.
> >
> > The total number of documents is 200 million. Each line of the csv will
> > be a document; there are 200 million lines.
> >
> > I have two options for load-and-index.
> >
> > The current way of getting data is using an API like https://
> > .../200mmCsvCore/dataimport?
> > command="full-import"
> > &clean=true
> > &commit=true
> > &optimize=true
> > &wt=json
> > &indent=true
> > &verbose=false
> > &debug=false
> >
> > I am thinking of csv because another remote location also wants to use
> > Solr, and my gut feeling is that fetching a single large csv file over
> > the network will keep the data consistent across the two places.
> >
> > I didn't think about the parsing of the csv file with double quotes and
> > delimiters. Would a json file be faster?
> >
> > I am not aware of a way to split the 200 million line CSV into smaller
> > batches of loads. Would smaller batches be faster? Could you give me an
> > example of how to split?
> >
> > From the Solr UI, how can I tell how many threads are set for indexing?
>
> --
> Sincerely yours
> Mikhail Khludnev
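
For reference, a rough shell sketch of the split-and-parallel-upload approach
Mikhail describes above. It assumes GNU coreutils (split, xargs), a CSV whose
first line is a header row, and the same localhost:8983 / my_collection
endpoint as the curl example earlier in the thread; the file names (big.csv,
chunk_*) are illustrative only, not taken from the thread.

# 1. Set the header line aside, then split the body into ~1k-row chunks.
head -n 1 big.csv > header.csv
tail -n +2 big.csv | split -l 1000 -d --additional-suffix=.csv - chunk_

# 2. Re-attach the header to each chunk so the CSV handler can map columns
#    to fields (the handler expects field names on the first line by default).
for f in chunk_*.csv; do
  cat header.csv "$f" > "with_header_$f"
done

# 3. Upload the chunks in parallel; -P 4 runs four uploads at a time and can
#    be raised toward the number of CPU cores available to Solr.
ls with_header_chunk_*.csv | xargs -P 4 -I {} \
  curl -s 'http://localhost:8983/solr/my_collection/update' \
       --data-binary @{} -H 'Content-type:application/csv'

# 4. Commit once at the end instead of once per chunk.
curl 'http://localhost:8983/solr/my_collection/update' \
     --data-binary '<commit/>' -H 'Content-type:text/xml'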