> On Nov 12, 2023, at 9:16 AM, Vince McMahon <sippingonesandze...@gmail.com> 
> wrote:
> 
> So, if I split the single csv into two and use two programs, each sending
> one of the splits, Solr will handle the parallel loading with multiple
> threads.  I don't have to make changes to Solr, right?


Yes, that's correct.
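
If it helps, here's a rough sketch of the splitting step in Python. The file 
names, the number of pieces, and the assumption that no quoted field contains 
an embedded newline are all mine, not anything Solr requires:

    def split_csv(src_path, n_pieces=2):
        """Split src_path into n_pieces CSV files, repeating the header in each."""
        out_paths = [f"{src_path}.part{i}.csv" for i in range(n_pieces)]
        with open(src_path, newline="") as src:
            header = src.readline()
            outs = [open(p, "w", newline="") for p in out_paths]
            try:
                for out in outs:
                    out.write(header)
                # Deal the data rows out round-robin so the pieces end up
                # roughly equal in size, without holding the file in memory.
                for i, line in enumerate(src):
                    outs[i % n_pieces].write(line)
            finally:
                for out in outs:
                    out.close()
        return out_paths

    # e.g. split_csv("big.csv", 2) -> ["big.csv.part0.csv", "big.csv.part1.csv"]

Each piece then gets its own sender, and Solr handles the concurrent requests 
on its side.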

We were loading 40M records in about 8 hours through the DIH (the 
DataImportHandler). That's about 5M records per hour, which is roughly what 
you are getting (100M records in 20 hours).

When the DIH was removed from core Solr, it gave us the impetus to switch over 
to the update handlers. Switching to the update handler let us run multiple 
importers at a time. Now, if we run 10 importers simultaneously, each importing 
about 4M records, we can load those 40M records in about 90 minutes. That's 
about 25M rows per hour. Note that 10 importers didn't speed things up 10x; 
the speedup was about 5x.
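
For what it's worth, the multi-importer setup is conceptually just N clients 
posting to the update handler at the same time. A minimal sketch in Python, 
assuming Solr at localhost:8983 and a collection named "mycollection" (both 
placeholders you'd adjust), might look like this:

    import concurrent.futures
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"  # assumed URL

    def post_csv(path):
        # Stream one CSV piece to the update handler; hold the commit until the end.
        with open(path, "rb") as f:
            resp = requests.post(
                SOLR_UPDATE,
                data=f,
                headers={"Content-Type": "text/csv"},
                params={"commit": "false"},
                timeout=3600,
            )
        resp.raise_for_status()
        return path

    pieces = [f"big.csv.part{i}.csv" for i in range(10)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for done in pool.map(post_csv, pieces):
            print("loaded", done)

    # One commit after all the pieces are in.
    requests.post(SOLR_UPDATE, json={"commit": {}}, timeout=600).raise_for_status()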

I don't know what kind of speed target you're trying to hit. If you're hoping 
to do 100M rows in 30 minutes, that may not be possible. It may be that down 
the road, after experimenting with different levels of concurrency, JVM tuning, 
and whatnot, you find that the best you can do is 100M rows in, say, 3 hours, 
and you'll have to be OK with that. Or your boss may have to be OK with that. 
There's a joke that if you tell a programmer they have to run a mile in 3 
minutes, the programmer will start putting on their running shoes without ever 
asking, "Is what I'm being asked to do even possible?"

If you're trying to speed up a process, you're going to need to run a lot of 
tests and track a lot of numbers. Try it with 5 indexers and see what kind of 
throughput you get. Then try it with 10 and see what happens. Measure, measure, 
measure.
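
Even a tiny harness like this (Python; run_load here is just a stand-in for 
whatever launches your importers) is enough to keep the numbers honest:

    import time

    def measure(run_load, total_rows, workers):
        # Time one bulk load at a given concurrency level and report rows/hour.
        start = time.monotonic()
        run_load(workers)
        elapsed = time.monotonic() - start
        per_hour = total_rows / (elapsed / 3600)
        print(f"{workers} workers: {elapsed:.0f}s, ~{per_hour:,.0f} rows/hour")
        return per_hour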

Also, the best way to make things go faster is to do less work. Are all the 
fields you're creating necessary? Can you turn some of them into non-indexed 
fields? Do you really have to do all 100M records every time? What if only 20M 
of those records change each time? Maybe you write some code that determines 
which 20M rows need to be updated, and only index those. You'll immediately get 
a 5x speedup because you're only doing 1/5th the work.

For example, sometimes we have to do a bulk load, and I have a program that 
queries each record in the Oracle database against what is indexed in Solr and 
compares them. The records that differ get dumped into a file, and that's the 
file that gets loaded. If it takes 20 minutes to run that process but I find I 
only need to load 10% of the data, that's a win.
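
I can't share the real program, but its shape is roughly this (Python; the 
row_hash_s field, the id-in-column-0 assumption, and the Solr URL are 
placeholders for whatever your schema actually uses):

    import csv
    import hashlib
    import requests

    SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"  # assumed URL

    def row_hash(row):
        return hashlib.md5("|".join(row).encode("utf-8")).hexdigest()

    def stored_hashes(ids):
        # Fetch the hash stored in Solr for a batch of ids.
        # Assumes plain ids that don't need Lucene query escaping.
        resp = requests.get(SOLR_SELECT, params={
            "q": "id:(" + " OR ".join(ids) + ")",
            "fl": "id,row_hash_s",
            "rows": len(ids),
        })
        resp.raise_for_status()
        return {d["id"]: d.get("row_hash_s") for d in resp.json()["response"]["docs"]}

    def _flush(rows, writer):
        known = stored_hashes([r[0] for r in rows])    # column 0 assumed to be the id
        for r in rows:
            if known.get(r[0]) != row_hash(r):
                writer.writerow(r)

    def write_delta(src_path, delta_path, batch=500):
        # Compare each source row's hash to what Solr has; keep only rows that differ.
        with open(src_path, newline="") as src, open(delta_path, "w", newline="") as out:
            reader, writer = csv.reader(src), csv.writer(out)
            writer.writerow(next(reader))          # copy the header
            buf = []
            for row in reader:
                buf.append(row)
                if len(buf) == batch:
                    _flush(buf, writer)
                    buf = []
            if buf:
                _flush(buf, writer)

The delta file that write_delta produces is then the only thing that gets 
loaded.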

An excellent book that I'm currently reading is "How To Make Things Faster", 
and it's filled with all sorts of tips and lessons about exactly this kind of 
thing: https://www.amazon.com/How-Make-Things-Faster-Performance/dp/1098147065

Finally, somewhere you asked if JSON would be faster than CSV to load. I have 
not measured, but I am certain that the bottleneck in the indexing process is 
not in the parsing of the input data.  So, no, CSV vs. JSON doesn't matter.

Andy
