I’d add that committing less frequently, and in particular not issuing a
commit with every update request, would speed things up if you don’t need
to search the data while it is loading. This applies to both soft and hard
commits.
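For instance, here is a minimal sketch of what that could look like in
Python with the requests library. The URL, the collection name
"mycollection", and the batching are assumptions for illustration, not
anything from this thread:

    # Sketch: send batches with commit=false and let Solr decide when to
    # flush, based on <autoCommit>/<autoSoftCommit> in solrconfig.xml.
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"

    def index_batch(docs):
        # commit=false: don't force a commit on every request.
        resp = requests.post(
            SOLR_UPDATE_URL,
            params={"commit": "false"},
            json=docs,          # a list of {"id": ..., field: ...} dicts
            timeout=60,
        )
        resp.raise_for_status()

    # After the whole load finishes, commit once:
    # requests.post(SOLR_UPDATE_URL, json={"commit": {}})

The commit cadence then lives in solrconfig.xml, where you can tune it
without touching the loader.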
~ufuk yilmaz

> On 12 Nov 2023, at 20:51, Andy Lester <a...@petdance.com> wrote:
>
>> On Nov 12, 2023, at 9:16 AM, Vince McMahon
>> <sippingonesandze...@gmail.com> wrote:
>>
>> So, if I split the single CSV into two and use two programs, each
>> sending one of the splits, Solr will handle the parallel loading with
>> multiple threads. I don't have to make changes to Solr, right?
>
> Yes, that's correct.
>
> We were loading 40M records in about 8 hours through the DIH. That's
> about 5M records per hour, which is roughly what you are getting (100M
> records in 20 hours).
>
> When the DIH was removed from core Solr, it gave us the impetus to
> switch over to the update handlers. Switching to the update handler let
> us run multiple importers at a time. Now, if I run 10 importers
> simultaneously, importing about 4M records each, we can load those 40M
> records in about 90 minutes. That's about 25M rows per hour. Note that
> 10 importers didn't speed things up 10x; it sped them up about 5x.
>
> I don't know what kind of speed target you're trying to hit. If you're
> hoping to do 100M rows in 30 minutes, that may not be possible. It may
> be that down the road, after experimenting with different levels of
> concurrency and JVM tuning and whatnot, you find that the best you can
> do is 100M rows in, say, 3 hours, and you'll have to be OK with that.
> Or your boss may have to be OK with that. There's a joke that says "If
> you tell a programmer they have to run a mile in 3 minutes, the
> programmer will start putting on his running shoes" without stopping to
> ask "Is what I'm being asked to do even possible?"
>
> If you're trying to speed up a process, you're going to need to run a
> lot of tests and track a lot of numbers. Try it with 5 indexers and see
> what kind of throughput you get. Then try it with 10 and see what
> happens. Measure, measure, measure.
>
> Also, the best way to make things go faster is to do less work. Are all
> the fields you're creating necessary? Can you turn some of them into
> non-indexed fields? Do you really have to do all 100M records every
> time? What if only 20M of those records change each time? Maybe you
> write some code that determines which 20M rows need to be updated, and
> only index those. You'll immediately get a 5x speedup because you're
> only doing 1/5th the work.
>
> For example, sometimes we have to do a bulk load, and I have a program
> that queries each record in the Oracle database against what is indexed
> in Solr and compares them. The records that differ get dumped into a
> file, and that's the file that gets loaded. If it takes 20 minutes to
> run that process but I find I only need to load 10% of the data, then
> that's a win.
>
> An excellent book that I'm currently reading is "How To Make Things
> Faster", and it's filled with all sorts of tips and lessons about
> things like this:
> https://www.amazon.com/How-Make-Things-Faster-Performance/dp/1098147065
>
> Finally, somewhere you asked if JSON would be faster than CSV to load.
> I have not measured, but I am certain that the bottleneck in the
> indexing process is not in the parsing of the input data. So, no, CSV
> vs. JSON doesn't matter.
>
> Andy
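A rough sketch of the multiple-importers setup Andy describes, again in
Python. The split file names, the worker count, and the Solr URL are
placeholder assumptions; the point is to vary max_workers across runs and
measure, as he suggests:

    # Sketch: N concurrent importers, each streaming one CSV split to
    # Solr's update handler. Try 5, then 10 workers, and track numbers.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    SOLR_URL = "http://localhost:8983/solr/mycollection/update"

    def load_split(path):
        with open(path, "rb") as f:
            resp = requests.post(
                SOLR_URL,
                params={"commit": "false"},
                data=f,                                # stream the file
                headers={"Content-Type": "text/csv"},
                timeout=3600,
            )
        resp.raise_for_status()
        return path, resp.elapsed.total_seconds()      # rough timing

    splits = [f"records_part{i:02d}.csv" for i in range(10)]
    with ThreadPoolExecutor(max_workers=10) as pool:
        for path, secs in pool.map(load_split, splits):
            print(f"{path}: {secs:.1f}s")

    # One commit at the end of the whole load:
    requests.post(SOLR_URL, json={"commit": {}})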
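And a sketch of the "only index what changed" idea. Andy's program
compares each Oracle record against Solr field by field; this variant
instead stores a content hash in an assumed content_hash_s field and
compares hashes, trading a schema addition for a cheaper comparison. The
field names and the shape of the database rows are assumptions:

    # Sketch: re-index only rows whose content hash differs from the
    # hash previously stored in Solr. Assumes an "id" key per row and a
    # stored content_hash_s field in the Solr schema.
    import hashlib
    import requests

    SOLR_BASE = "http://localhost:8983/solr/mycollection"

    def row_hash(row):
        # Stable hash over the row's values in a fixed field order.
        joined = "|".join(str(row[k]) for k in sorted(row))
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def fetch_indexed_hashes():
        # id -> stored hash for everything in the index. At 100M docs
        # you would page through this with cursorMark instead.
        resp = requests.get(
            f"{SOLR_BASE}/select",
            params={"q": "*:*", "fl": "id,content_hash_s",
                    "rows": 1000000},
            timeout=600,
        )
        resp.raise_for_status()
        docs = resp.json()["response"]["docs"]
        return {d["id"]: d.get("content_hash_s") for d in docs}

    def changed_rows(rows):
        indexed = fetch_indexed_hashes()
        for row in rows:
            if indexed.get(str(row["id"])) != row_hash(row):
                yield row  # new or changed: only these get re-indexed

Only the yielded rows go to the loader, so a run where 20% of the data
changed does roughly 20% of the indexing work.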