> In other words, using the following will load and will index:

Indeed.
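For the splitting question further down the thread, here is a minimal sketch of the split-and-post-in-parallel approach. It assumes GNU coreutils (tail, split, xargs), an input file big.csv with a header row, and placeholder field names id,title,author; none of these values come from the thread, so adjust them to your schema. The header and fieldnames parameters are from the CSV update handler documentation linked below.

# Drop the header row and split into chunks of 1000 lines each
# (produces chunk_aa, chunk_ab, ...).
tail -n +2 big.csv | split -l 1000 - chunk_

# POST the chunks to the CSV update handler, 4 requests at a time.
# header=false because the chunks carry no header line; fieldnames
# supplies the column names instead (placeholders here).
ls chunk_* | xargs -P 4 -I {} curl -s \
  'http://localhost:8983/solr/my_collection/update?header=false&fieldnames=id,title,author' \
  --data-binary @{} -H 'Content-type:application/csv'

# Commit once at the end rather than once per chunk.
curl 'http://localhost:8983/solr/my_collection/update?commit=true'

Committing once at the end keeps the per-chunk requests cheap; committing on every request would mean hundreds of thousands of commits for a 200-million-row file.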
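On the double-quote and delimiter concern raised below: the CSV update handler accepts explicit separator and encapsulator parameters. The values shown here are its documented defaults (a URL-encoded comma and double quote), spelled out only to make them visible; override them for unusual files.

curl 'http://localhost:8983/solr/my_collection/update?commit=true&separator=%2C&encapsulator=%22' \
  --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'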
On Sun, Nov 12, 2023 at 6:53 PM Vince McMahon <sippingonesandze...@gmail.com>
wrote:

> Hi, Mikhail.
>
> I am very encouraged by your reply.
>
> I will split the csv into smaller ones and give this a try.
>
> https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
>
> Could you confirm my understanding of Solr's terminology? Does "Solr
> indexing" refer to both the loading and the indexing operations? In other
> words, using the following will load and will index:
>
> curl 'http://localhost:8983/solr/my_collection/update?commit=true'
> --data-binary @example/exampledocs/books.csv -H
> 'Content-type:application/csv'
>
> Right?
>
> Thanks.
>
>
> On Sun, Nov 12, 2023 at 8:20 AM Mikhail Khludnev <m...@apache.org> wrote:
>
> > .../200mmCsvCore/dataimport?
> > refers to
> > https://solr.apache.org/guide/8_6/uploading-structured-data-store-data-with-the-data-import-handler.html
> > which was extracted into https://github.com/SearchScale/dataimporthandler
> > It's usually slow. A symptom of inefficient indexing is low CPU
> > utilization; you can check the load average while indexing runs.
> > There are two approaches to reaching higher throughput:
> >
> > 1. send data in bulk (whereas /dataimport writes documents one by one)
> > 2. index in parallel (whereas /dataimport runs a single thread)
> >
> > Perhaps you can put /dataimport aside, break the csv file into smaller
> > pages of about 1k rows each, and invoke
> > https://solr.apache.org/guide/7_1/uploading-data-with-index-handlers.html#csv-formatted-index-updates
> > in a few parallel threads; that should give better utilisation.
> > Another option is to build an app using the concurrent update client
> > from SolrJ:
> > https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html#types-of-solrclients
> >
> >
> > On Sun, Nov 12, 2023 at 4:23 AM Vince McMahon <
> > sippingonesandze...@gmail.com> wrote:
> >
> > > Shawn,
> > >
> > > Thanks for helping me out. The Solr documentation has a lot of bells
> > > and whistles, and I am overwhelmed.
> > >
> > > The total number of documents is 200 million. Each line of the csv
> > > will be a document; there are 200 million lines.
> > >
> > > I have two options for load-and-index.
> > >
> > > The current way of getting data is an API call like https://
> > > .../200mmCsvCore/dataimport?
> > > command="full-import"
> > > &clean=true
> > > &commit=true
> > > &optimize=true
> > > &wt=json
> > > &indent=true
> > > &verbose=false
> > > &debug=false
> > >
> > > I am thinking of csv because another remote location also wants to
> > > use Solr, and my gut feeling is that fetching a single large csv file
> > > over the network will keep the data consistent across the two places.
> > >
> > > I didn't think about the parsing of the csv file with double quotes
> > > and delimiters. Would a json file be faster?
> > >
> > > I am not aware of a way to split the 200-million-line CSV into
> > > batches of loads. Will smaller batches be faster? Could you give me
> > > an example of how to split?
> > >
> > > From the Solr UI, how can I tell how many threads are set for
> > > indexing?
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev

--
Sincerely yours
Mikhail Khludnev