70 million can be a lot or a little; doc count is not even half the
story. How much storage space do these documents occupy in the
database? Is the text tweet-sized, or multi-megabyte CLOBs, or links
to files on a file store that need to be fetched and parsed (or OCR'd,
or converted from audio/video to transcripts)? IoT-type docs with very
minimal text can be indexed much faster than 50-page PDF documents.
With very large clusters and an indexing system distributing work
across a Spark cluster, I've seen rates as high as 1.3M docs/sec, and
70M would be trivial for that system (they had hundreds of billions).
But text documents are typically much, much slower than that,
especially if the text must be extracted from dirty formats such as
PDF or Word, if complex custom analysis is involved, or if additional
files or data must be fetched and merged into each doc.
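
For illustration, a minimal sketch of that Spark-to-Solr pattern might
look like the following. Everything here (the dataset, field names,
collection name, URL, and batch size) is a made-up placeholder, not
the actual system described above:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSolrIndexer {
    // Index a Dataset in parallel: each Spark partition opens its own
    // SolrJ client and sends batched updates, so throughput scales
    // with the number of executors doing the work.
    static void index(Dataset<Row> docs) {
        docs.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            try (Http2SolrClient solr = new Http2SolrClient.Builder(
                    "http://solr-host:8983/solr").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                while (rows.hasNext()) {
                    Row row = rows.next();
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", row.getAs("id"));
                    doc.addField("text_t", row.getAs("text"));
                    batch.add(doc);
                    if (batch.size() >= 1000) { // batch to cut round trips
                        solr.add("mycollection", batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    solr.add("mycollection", batch);
                }
            }
        });
    }
}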

As for the two formats: if you are indexing with Java code, choose the
Java binary format (javabin). If you are using a non-Java language,
use JSON. The rare case for JSON from Java is when your data is
already in JSON format; then it depends on where the bottleneck is. If
Solr is limiting you, do the work on the indexers and use javabin so
Solr has less parsing to do; if your indexing machines are limiting
you, use JSON so your indexers don't have to do the conversion. Like
many things in search, "it depends" :)
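
If it helps, a bare-bones SolrJ sketch is below; SolrJ sends updates
as javabin by default, so plain SolrJ code already gets the
cheaper-to-parse wire format. The URL, collection, and field names are
made-up placeholders:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JavabinIndexing {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new Http2SolrClient.Builder(
                "http://localhost:8983/solr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "row-1");    // values would come from your DBMS
            doc.addField("title_t", "hello");
            solr.add("mycollection", doc);  // serialized as javabin on the wire
            solr.commit("mycollection");
        }
    }
}

If the data is already JSON, the alternative is to POST it straight to
Solr's /update/json/docs endpoint and let Solr do the parsing.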

On Thu, Sep 29, 2022 at 4:07 AM Shankar R <iamrav...@gmail.com> wrote:

> Hi,
>  We have nearly 70-80 million records that need to be indexed in
> Solr 8.6.1.
>  We want to choose between the Java binary format and direct JSON format.
>  Our source data is a DBMS, i.e., structured data.
>
> Regards
> Ravi
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
