Re: Scaling Pipelines & Workflows

Diego Mainou Tue, 18 Oct 2022 04:51:59 -0700

Sorry 

Big thumb issue...


Regarding: "Additionally, it copies a document to the very same directory." 
I've seen endless issues with that behaviour, particularly in cloud-based 
solutions. 

Try copying to a different folder. 

Diego Mainou, Project Delivery Manager 
M. +61 415 152 091 
E. diego.mai...@bizcubed.com.au 
www.bizcubed.com.au 


From: "Hans Van Akelyen" <hans.van.akel...@gmail.com> 
To: "users" <users@hop.apache.org> 
Sent: Tuesday, 18 October, 2022 9:21:09 PM 
Subject: Re: Scaling Pipelines & Workflows 

Hi Jochen, 

Hop Server runs in a JVM. You can edit the hop-server.sh script to pass extra 
options to the JVM, for example to allocate more memory (the default is 2048 MB). 

Parallelisation and scaling: each Hop transform runs in its own thread, 
consuming records on its input side and placing them on its output side after 
processing. You can increase the number of instances/threads of a transform by 
clicking on it and changing the “number of copies”. One thing to keep in mind: 
if you add more copies to, for example, a table input, it will execute the 
query “x” times and return the same rows x times, unless you add logic to your 
query to distribute the data over these instances (for example, 
${Internal.Transform.CopyNr} combined with a mod function on an ID column). 
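As a rough illustration of the mod-based split described above, here is a small Python sketch (not Hop code). It assumes a hypothetical table named source_table with a numeric id column, 3 copies, and copy numbers that start at 0. Each simulated copy runs the same query with its own copy number, the way ${Internal.Transform.CopyNr} would be substituted per copy.

```python
import sqlite3

# Assumption: 3 copies, numbered 0..2. In Hop the copy number would come
# from ${Internal.Transform.CopyNr}; here it is passed as a query parameter.
NUM_COPIES = 3

# Hypothetical source table with ids 1..10, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)


def rows_for_copy(copy_nr: int) -> list[int]:
    """Return the ids one copy reads when filtering on id modulo NUM_COPIES."""
    cur = conn.execute(
        "SELECT id FROM source_table WHERE id % ? = ? ORDER BY id",
        (NUM_COPIES, copy_nr),
    )
    return [row[0] for row in cur]


# Each copy gets a disjoint slice; together the slices cover every row once.
slices = [rows_for_copy(n) for n in range(NUM_COPIES)]
all_ids = sorted(i for s in slices for i in s)
print(slices)
```

Without the WHERE clause every copy would return all ten rows, which is the duplication mentioned above.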

What we usually see in the field is that CPU is not the bottleneck of 
pipelines; usually IO is the limiting factor. 
When you look at the status of a pipeline in the Hop Server UI, there are 
indications of where the bottleneck is: a field shows the input/output 
buffers of each transform. The transform that has the maximum number of rows 
(default: 10000) on input and 0 on output is your bottleneck. If you see no 
data piling up in the buffers, the pipeline is processing the data as fast as 
it receives it (your database can’t feed rows any faster). 
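The buffer rule above can be sketched in Python. This is only an illustration: the TransformStatus structure, the transform names and the 90%/10% thresholds are assumptions made for the example, not something Hop Server exposes in this form.

```python
from dataclasses import dataclass

MAX_BUFFER_ROWS = 10000  # default buffer size mentioned above


@dataclass
class TransformStatus:
    # Hypothetical snapshot of one transform's buffers, as read off the UI.
    name: str
    input_buffer: int
    output_buffer: int


def find_bottlenecks(statuses, full_ratio=0.9, empty_ratio=0.1):
    """Return names of transforms whose input is (nearly) full and whose
    output is (nearly) empty. The ratios are illustrative thresholds."""
    return [
        s.name
        for s in statuses
        if s.input_buffer >= full_ratio * MAX_BUFFER_ROWS
        and s.output_buffer <= empty_ratio * MAX_BUFFER_ROWS
    ]


# Made-up snapshot: the table input fills its output buffer, while the XML
# transform cannot drain its input fast enough.
snapshot = [
    TransformStatus("Table input", 0, 10000),
    TransformStatus("Build XML", 10000, 0),
    TransformStatus("Write file", 0, 0),
]
print(find_bottlenecks(snapshot))
```

If the function returns an empty list, no transform is backing up, which matches the case where the source simply cannot deliver rows faster.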

It might be that the pipeline can’t go faster because the DB does not deliver 
records any faster, or that the XML writer can’t write faster to disk than it 
does. 

When dealing with performance issues: 
- The transform metrics will show you which transform is the culprit 
- Look at memory/CPU usage (as you are already doing) 
- Increase the number of copies/threads, but be mindful of the implications, as 
the rows will be split over multiple instances (input, output, sorting, grouping) 
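
To make the last point concrete, here is a small Python sketch of what happens when a grouping step runs in several copies. It assumes rows are handed out round-robin to 2 copies; the data and the group_sum helper are made up for the illustration.

```python
from collections import defaultdict


def group_sum(rows):
    """Sum amounts per key, like a simple group-by."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("a", 5)]

# Assumption: rows are distributed round-robin over the copies.
NUM_COPIES = 2
per_copy = [rows[n::NUM_COPIES] for n in range(NUM_COPIES)]

# Each copy only sees its own slice, so it produces partial totals per key.
partials = [group_sum(chunk) for chunk in per_copy]

# A second grouping over the partial results restores the correct totals.
merged = group_sum(item for p in partials for item in p.items())
print(partials, merged)
```

Both copies emit a row for "a" and a row for "b", so the output of a multi-copy grouping needs a final merge step (or the rows need to be partitioned by key first) to match the single-copy result.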

Hope this helps, 
Hans 



On 18 October 2022 at 11:39:21, Jochen Gatternig (jochen.gatter...@adebo.ch) 
wrote: 




Dear all 



Are there options/parameters in the Hop server that allow parallelization and 
scaling of the processing? 

We tested it with a pipeline that reads data from a source table, creates XMLs 
and writes them to a filesystem. Additionally, it copies a document to the 
very same directory. 

Our server has 8 cores (VM). 



When running it as a single job, the system caps at 400-450% CPU load. 

We then modified the where-clause to run 2-4 jobs separately. However, each 
job seems to be capped at 100-150% CPU load. 



Any idea how to increase performance? 



Regards 

Jochen 



Best regards 



Jochen Gatternig 

Head of Advisory 

Phone +41 76 431 00 94 

jochen.gatter...@adebo.ch
