> On Nov 12, 2023, at 9:16 AM, Vince McMahon <sippingonesandze...@gmail.com>
> wrote:
>
> So, if I split the single CSV into two and use two programs, each sending
> one of the splits, Solr will handle the parallel loading with multiple
> threads. I don't have to make changes to Solr, right?
Yes, that's correct.
We were loading 40M records in about 8 hours through the DIH. That's about 5M
records per hour, which is roughly what you are getting (100M records in 20
hours).
When the DIH was removed from core Solr, it gave us the impetus to switch over
to the update handler, which let us run multiple importers at a time. Now,
running 10 importers simultaneously, each importing about 4M records, we can
load those 40M records in about 90 minutes. That's about 25M rows per hour.
Note that 10 importers didn't speed things up 10x; it sped things up about 5x.
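Boiled down to a sketch, one of those importers is basically just a POST of a
CSV file to the update handler. This isn't our actual code; the URL, collection
name, and file names are placeholders you'd swap for your own:

    # Rough sketch of the importer side: post pre-split CSV files to the Solr
    # update handler in parallel.  The URL, collection name, and file names
    # are placeholders; adjust for your own setup.
    import concurrent.futures
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"

    def post_csv(path):
        # The update handler takes CSV directly if you set the Content-Type.
        # commit=false here; one commit is issued after all files are in.
        with open(path, "rb") as f:
            r = requests.post(
                SOLR_UPDATE,
                params={"commit": "false"},
                headers={"Content-Type": "text/csv"},
                data=f,
            )
        r.raise_for_status()
        return path

    files = ["part-00.csv", "part-01.csv"]  # however many splits you made

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(files)) as pool:
        for done in pool.map(post_csv, files):
            print("loaded", done)

    # Single commit once everything has been sent.
    requests.get(SOLR_UPDATE, params={"commit": "true"}).raise_for_status()

The main thing to notice is that each importer sends commit=false and one
commit happens at the end, so the parallel importers don't step on each other
with commits.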
I don't know what kind of speed target you're trying to hit. If you're hoping
to do 100M rows in 30 minutes, that may not be possible. It may be that down
the road, after experimenting with different levels of concurrency, JVM tuning,
and whatnot, you find that the best you can do is 100M rows in, say, 3 hours,
and you'll have to be OK with that. Or your boss may have to be OK with that.
There's a joke that says "If you tell a programmer they have to run a mile in
3 minutes, the programmer will start putting on their running shoes," without
ever stopping to ask whether what they're being asked to do is even possible.
If you're trying to speed up a process, you're going to need to run a lot of
tests and track a lot of numbers. Try it with 5 indexers and see what kind of
throughput you get. Then try it with 10 and see what happens. Measure, measure,
measure.
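Even something as simple as timing each run and logging rows per hour is enough
to compare configurations. A rough sketch (run_importers is just a stand-in for
whatever actually kicks off your load):

    # Back-of-the-envelope throughput tracking.  run_importers() below is a
    # placeholder for whatever actually runs your load; swap in your own.
    import time

    def run_importers(workers):
        pass  # placeholder for your real load

    def timed_load(total_rows, workers):
        start = time.time()
        run_importers(workers)
        hours = (time.time() - start) / 3600.0
        print(f"{workers} indexers: {total_rows:,} rows in {hours:.2f} h"
              f" = {total_rows / hours:,.0f} rows/hour")

    timed_load(40_000_000, workers=5)
    timed_load(40_000_000, workers=10)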
Also, the best way to make things go faster is to do less work. Are all the
fields you're creating necessary? Can you turn some of them into non-indexed
fields? Do you really have to do all 100M records every time? What if only 20M
of those records change each time? Maybe you write some code that determines
which 20M rows need to be updated, and only index those. You'll immediately get
a 5x speedup because you're only doing 1/5th the work.
For example, sometimes we have to do a bulk load, and I have a program that
queries each record in the Oracle database against what is indexed in Solr and
compares them. The records that differ get dumped into a file, and that's the
file that gets loaded. If it takes 20 minutes to run that process, but I find I
only need to load 10% of the data, then that's a win.
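Conceptually that comparison step looks something like the sketch below. It is
not our actual program (the Solr URL, the field list, and the source_rows()
stand-in are all placeholders), but the shape of the idea is there: fetch the
indexed doc, compare, and only write out the rows that differ:

    # Sketch of the "only load what changed" idea.  The Solr URL, the field
    # list, and source_rows() are placeholders, not our real program.
    import csv
    import requests

    SOLR_GET = "http://localhost:8983/solr/mycollection/get"
    FIELDS = ["id", "title", "status"]   # whatever fields you compare on

    def source_rows():
        # Stand-in for the Oracle query; yield dicts keyed by field name.
        yield {"id": "1", "title": "example", "status": "active"}

    def solr_doc(doc_id):
        # Real-time get returns the currently indexed doc, or nothing.
        r = requests.get(SOLR_GET, params={"id": doc_id})
        r.raise_for_status()
        return r.json().get("doc")

    def changed(row):
        doc = solr_doc(row["id"])
        if doc is None:
            return True
        return any(str(row[f]) != str(doc.get(f, "")) for f in FIELDS)

    with open("delta.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for row in source_rows():
            if changed(row):
                writer.writerow({f: row[f] for f in FIELDS})

    # delta.csv is then what gets posted to /update instead of the full dump.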
An excellent book that I'm currently reading is "How To Make Things Faster" and
it's filled with all sorts of tips and lessons about things like this:
https://www.amazon.com/How-Make-Things-Faster-Performance/dp/1098147065
Finally, somewhere you asked if JSON would be faster than CSV to load. I have
not measured, but I am certain that the bottleneck in the indexing process is
not in the parsing of the input data. So, no, CSV vs. JSON doesn't matter.
Andy