Sorry, big thumb issue...
Here: "Additionally, it copies a document to the very same directory."

I've seen endless issues with that behaviour, particularly in cloud-based solutions. Try copying to a different folder.

Diego Mainou
Project Delivery Manager
M. +61 415 152 091
E. diego.mai...@bizcubed.com.au
https://www.bizcubed.com.au/

From: "Hans Van Akelyen" <hans.van.akel...@gmail.com>
To: "users" <users@hop.apache.org>
Sent: Tuesday, 18 October, 2022 9:21:09 PM
Subject: Re: Scaling Pipelines & Workflows

Hi Jochen,

Hop Server runs in a JVM. You can edit the hop-server.sh script to add extra options to the JVM, for example to allocate more memory (the default is 2048MB).

Parallelisation and scaling: each Hop transform creates its own thread, consumes records on its input side and places them on its output side after processing. You can increase the number of instances/threads of a transform by clicking on it and changing the "number of copies". One thing to keep in mind: if you add more copies to a table input, for example, it will execute the query "x" times and return the same rows x times, unless you add logic to your query to distribute the data over these multiple instances (you can use ${Internal.Transform.CopyNr} and a mod function on an ID column, for example).

What we usually see in the field is that CPU is not the bottleneck of pipelines; usually IO is the limiting factor. When you look at the status of a pipeline via the UI in Hop Server, there are indications of where the bottleneck is: there is a field containing the input/output buffers of each transform. The transform that has max rows (default: 10000) on input and 0 on output is your bottleneck. If you see no data pile-up in the buffers, the pipeline is processing the data just as fast as it receives it (your database can't feed rows faster than it does).
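
To make the buffer rule above concrete, here is a minimal Python sketch. It does not call Hop Server; the transform names and buffer counts are made-up sample data, and the TransformBuffers type is a hypothetical helper. It only encodes the rule of thumb "input buffer full, output buffer empty".

```python
from dataclasses import dataclass

MAX_BUFFER_ROWS = 10000  # default buffer size mentioned in the message above


@dataclass
class TransformBuffers:
    name: str         # hypothetical transform name
    input_rows: int   # rows waiting in the input buffer
    output_rows: int  # rows waiting in the output buffer


def find_bottlenecks(transforms, max_rows=MAX_BUFFER_ROWS):
    """Return the names of transforms with a full input buffer and an empty output buffer."""
    return [
        t.name
        for t in transforms
        if t.input_rows >= max_rows and t.output_rows == 0
    ]


# Made-up snapshot of the buffer columns, typed in by hand for illustration.
snapshot = [
    TransformBuffers("Table input", 0, 10000),
    TransformBuffers("Build XML", 10000, 0),
    TransformBuffers("Write file", 0, 0),
]

print(find_bottlenecks(snapshot))
```

With this sample data the only transform flagged is "Build XML": rows pile up in front of it and nothing is waiting behind it.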
It might be that the pipeline can't go faster because the DB does not deliver records any faster, or because the XML writer can't write to disk any faster.

When dealing with performance issues:
- The transform metrics will show you which transform is the culprit
- Look at memory/CPU usage (as you are already doing)
- Increase the number of copies/threads, but be mindful of the implications, as the rows will be split over multiple instances (input, output, sorting, grouping)

Hope this helps,
Hans

On 18 October 2022 at 11:39:21, Jochen Gatternig (jochen.gatter...@adebo.ch) wrote:

Dear all

Are there options/parameters in the Hop server that allow parallelization and scaling of the processing?

We tested it with a pipeline configuration which reads data from a source table, creates XMLs and writes them to a filesystem. Additionally, it copies a document to the very same directory.

Our server has 8 cores (VM). When running it with a single job, the system caps at 400-450% CPU load. We then thought to modify the where-clause and run 2-4 jobs separately. However, each job seems to be capped at 100-150% CPU load.

Any idea how to increase performance?

Regards
Jochen

Best regards
Jochen Gatternig
Head of Advisory
Phone +41 76 431 00 94
jochen.gatter...@adebo.ch
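
For reference, the row-distribution idea from Hans's reply (a mod on an ID column compared against ${Internal.Transform.CopyNr}) can be sketched in plain Python. This is only a simulation of the filter, not Hop code. It assumes copy numbers start at 0 and that the query contains a condition along the lines of MOD(id, 3) = ${Internal.Transform.CopyNr}; the exact mod syntax depends on the database, and the id column and the copy count of 3 are hypothetical.

```python
def rows_for_copy(row_ids, copy_nr, copies):
    """Rows one copy would read with a filter like MOD(id, copies) = copy_nr."""
    return [row_id for row_id in row_ids if row_id % copies == copy_nr]


row_ids = list(range(1, 11))  # made-up ID column values 1..10
copies = 3                    # hypothetical "number of copies" on the table input

# Without a filter, every copy runs the same query and reads every row.
rows_read_unfiltered = copies * len(row_ids)

# With the mod filter, each copy reads a disjoint slice of the table.
parts = [rows_for_copy(row_ids, copy_nr, copies) for copy_nr in range(copies)]

for copy_nr, part in enumerate(parts):
    print(copy_nr, part)
print("rows read without filter:", rows_read_unfiltered)
print("rows read with filter:", sum(len(part) for part in parts))
```

Each ID lands in exactly one slice, so the copies together read the table once instead of three times. The same condition could also serve as the modified where-clause when running several separate jobs, as Jochen describes.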