Hi, Luca. Thanks for getting back to me so quickly. In answer to your questions:
1 & 2. Yes, those numbers are from using the "remote" protocol with 3 servers on 3 different boxes.

3. Yes, default configuration. Apart from adding an index for ACCOUNTS, I did nothing further.

4. Good question. With real data, we expect it to be as you suggest: some nodes with the majority of the payments (e.g., supermarkets). For the test data, however, payments were assigned randomly and should therefore be uniformly distributed.

Regarding your suggestions:

1. Good idea. I'm OoO today as I am ill; I'll try tomorrow.

2. Yes, I tried plocal minutes after posting (d'oh!) and saw a good improvement. It started about 3 times faster and was faster still (about 10 times faster) when I checked this morning on a job that had run overnight. However, even at about 7k transactions per second, a billion edges is still going to take about 40 hours. So I ask myself: is there any way I can make it faster still? I assume that when I start the servers up in distributed mode once more, the data will then be distributed across all nodes in the cluster?

3. I'll return to concurrent, remote inserts when this job has finished. Hopefully, a smaller batch size will mean there is no degradation in performance either...

FYI: with a somewhat unscientific approach, I was polling the server JVM with jstack and saw only a single thread doing all the work; it *seemed* to spend a lot of its time in ODirtyManager on collection manipulation.

I appreciate that performance tuning is an empirical science, but do you have any opinion as to which would probably be faster: single-threaded plocal or multithreaded remote?

Regards,

Phillip

On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry wrote:
>
> Hi, guys.
>
> I'm conducting a proof-of-concept for a large bank (Luca, we had a
> 'phone conf on August 5...) and I'm trying to bulk insert a humongous
> amount of data: 1 million vertices and 1 billion edges.
>
> Firstly, I'm impressed by how easy it was to configure a cluster.
> However, the performance of batch inserting is bad (and seems to get
> considerably worse as I add more data). It starts at about 2k
> vertices-and-edges per second and deteriorates to about 500/second after
> only about 3 million edges have been added. This takes ~30 minutes.
> Needless to say, 1 billion payments (edges) will take over a week at
> this rate.
>
> This is a show-stopper for us.
>
> My data model is simply payments between accounts, and I store it in
> one large file. It's just 3 fields and looks like:
>
> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>
> In the test data I generated, I had 1 million accounts and 1 billion
> payments randomly distributed between pairs of accounts.
>
> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT
> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account
> number (a string).
>
> We're using OrientDB 2.2.7.
>
> My batch size is 5k and I am using the "remote" protocol to connect to
> our cluster.
>
> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but
> without SSDs. I wrote the importing code myself but did nothing
> 'clever' (I think) and used the Graph API. This client code has been
> given lots of memory and, using jstat, I can see it is not excessively
> GCing.
>
> So, my questions are:
>
> 1. What kind of performance can I realistically expect, and can I
> improve what I have at the moment?
>
> 2. What kind of degradation should I expect as the graph grows?
>
> Thanks, guys.
>
> Phillip
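P.S. For concreteness, the shape of the multithreaded loader I have in mind for suggestion 3 is below. It's a JDK-only sketch: the `committed` counter stands in for the real per-thread OrientDB `addVertex`/`addEdge`/`commit` work, and the class name, worker count, and batch size are placeholders I made up, not tuned recommendations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a "many workers, small batches" bulk loader.
public class BatchLoader {
    static final int BATCH_SIZE = 500;   // much smaller than my original 5k
    static final int WORKERS = 8;

    static long load(int rows) throws InterruptedException {
        AtomicLong committed = new AtomicLong();
        BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (int w = 0; w < WORKERS; w++) {
            pool.submit(() -> {
                List<String[]> batch = new ArrayList<>(BATCH_SIZE);
                try {
                    String[] row;
                    // Drain until the queue has been quiet for a second.
                    while ((row = queue.poll(1, TimeUnit.SECONDS)) != null) {
                        batch.add(row);
                        if (batch.size() == BATCH_SIZE) {
                            // Real code: per-thread graph.commit() here.
                            committed.addAndGet(batch.size());
                            batch.clear();
                        }
                    }
                } catch (InterruptedException ignored) {
                }
                if (!batch.isEmpty()) committed.addAndGet(batch.size()); // flush tail
            });
        }
        // Producer: in real life, read FROM_ACCOUNT TO_ACCOUNT AMOUNT
        // lines from the big payments file.
        for (int i = 0; i < rows; i++) {
            queue.put(new String[]{"A" + (i % 1000), "A" + ((i + 7) % 1000), "9.99"});
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return committed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("committed=" + load(10_000));
    }
}
```

The idea is that each worker keeps its own connection and transaction, and the smaller batch size hopefully keeps the per-transaction collections (the ODirtyManager work I saw in jstack) small.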
