70 million can be a lot or a little. Doc count is not even half the story. How much storage space do these documents occupy in the database? Is the text tweet-sized, multi-megabyte CLOBs, or links to files on a file store that need to be fetched and parsed (or OCR'd, or converted from audio/video to transcripts)? IoT-type docs with very minimal text can be indexed much faster than 50-page PDF documents. With very large clusters and an indexing system distributing work across a Spark cluster I've seen rates as high as 1.3M docs/sec... and 70M would be trivial for that system (they had hundreds of billions). But text documents typically index much, much slower than that, especially if the text must be extracted from dirty formats such as PDF or Word data, if complex custom analysis is involved, or if additional fetching of files or data to merge into the doc is required.
As for the two formats: if you are indexing with Java code, choose javabin. If you are using a non-Java language, use JSON. The rare case for JSON from Java would be if your data were already in JSON format... then it depends on whether Solr is the bottleneck (do the work on the indexers and use javabin so Solr has less parsing to do) or your indexing machines are the bottleneck (use JSON so your indexers don't have to do the conversion). Like many things in search, "it depends" :) There's a rough sketch of the javabin route at the bottom of this mail.

On Thu, Sep 29, 2022 at 4:07 AM Shankar R <iamrav...@gmail.com> wrote:
> Hi,
>  We are having nearly 70-80 millions of data which need to be indexed in
> solr 8.6.1.
>  We want to choose between Java BInary format or direct JSON format.
>  Our source data is DBMS which is a structured data.
>
> Regards
> Ravi

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
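P.S. In case it helps, here is a minimal, untested sketch of the javabin route. SolrJ speaks the javabin wire format by default, so just using a SolrJ client gets you there. The collection name "mydata", the localhost URL, the field names, and the batch size are all placeholders to adapt to your setup:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JavabinIndexer {
  public static void main(String[] args) throws Exception {
    // SolrJ serializes updates as javabin by default, so neither the
    // indexer nor Solr pays for JSON serialization/parsing.
    try (SolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mydata").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 10000; i++) { // stand-in for rows from your DBMS query
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title_s", "row " + i);
        batch.add(doc);
        if (batch.size() == 500) {      // send in batches, never one doc at a time
          solr.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();                    // or rely on autoCommit in solrconfig.xml
    }
  }
}

The JSON route is just the same documents POSTed to the /update/json/docs handler from whatever language your indexer is written in; either way, batching matters far more for throughput than the choice of format.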