Hi Violeta,
Can you share your connector code with us? The plan looks quite
complicated given the relatively simple query, so there might be some
optimization potential. But before we dive deeper: I see a `Map(to:
Row)` in the plan, which indicates that you might be working with a
legacy sink connector.
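Just to illustrate what I mean (I don't know your actual code, so the
sink name, column types, and path below are only placeholders): since
Flink 1.11 a CSV sink can also be declared through DDL with the unified
filesystem connector, roughly like

# sketch only, assuming a TableEnvironment t_env; schema and path are placeholders
t_env.execute_sql("""
    CREATE TABLE csv_sink (
        first_name STRING,
        last_name STRING,
        avg_ta DOUBLE,
        avg_income DOUBLE,
        spending DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'file:///tmp/output',
        'format' = 'csv'
    )
""")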
Did you try to run the same pipeline with only the outer join (without
the aggregation) or with only the aggregation (without the outer join)?
It would be interesting to find out where the bottleneck is.
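For example (again just a sketch, assuming your TableEnvironment is
called t_env; write the results to CSV the same way as for the full
query and compare the runtimes):

# 1) outer join only, no aggregation
join_only = t_env.sql_query("""
    SELECT c.first_name, c.last_name, t.transaction_amount
    FROM transactions t
    LEFT JOIN customers c ON t.customer_id = c.customer_id
""")

# 2) aggregation only, no join
agg_only = t_env.sql_query("""
    SELECT customer_id, AVG(transaction_amount) AS avg_ta
    FROM transactions
    GROUP BY customer_id
""")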
It might also make sense to increase Flink's managed memory, which is
used for sorting. The JVM heap size looks pretty tiny as well.
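In a standalone/Docker setup that would go into flink-conf.yaml, for
example (the values below are only examples and need to be tuned to
what your container actually has):

# flink-conf.yaml (example values only)
taskmanager.memory.process.size: 4096m
# share of Flink memory reserved as managed memory (sorting, hash joins)
taskmanager.memory.managed.fraction: 0.5
jobmanager.memory.process.size: 1600m

If I remember correctly, the official Docker image also picks these up
from the FLINK_PROPERTIES environment variable.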
Regards,
Timo
On 08.09.20 08:30, Violeta Milanović wrote:
Hi all, I'm currently testing PyFlink against PySpark for batch jobs.
The query I'm using is:

select
  max(c.first_name),
  max(c.last_name),
  avg(transaction_amount) as avg_ta,
  avg(salary + bonus) as avg_income,
  avg(salary + bonus) - avg(transaction_amount) as spending
from transactions t
left join customers c on t.customer_id = c.customer_id
group by c.customer_id
It's a really simple one, and I have 1 million rows in customers and 5
million in transactions. My output table should contain around 690k
rows. But I'm facing an issue: when I sink those 690k rows into CSV, it
is too slow compared to PySpark. It's not a problem when the sink table
is one row, or even 100k rows; those executions are much faster than
PySpark's. For this query and sink, PySpark's execution time is 50
seconds, while PyFlink takes around 120 seconds. I'm running both in
Docker containers with default settings, I increased the parallelism
for PyFlink to 6, and this is the execution with 6 CPUs. I'm new to
Flink's memory configuration; as I understood it, I should increase
taskmanager.memory.process.size and jobmanager.memory.process.size,
which I tried, but it didn't help.
My configuration is the default, created from the official Docker image:
jobmanager.heap.size: 1024m
taskmanager.memory.size: 1568m
CPU cores: 6
physical memory: 11.7 GB
JVM heap size: 512m
Flink managed memory: 512m
This is how my execution plan looks:
[attached image: 1.png (execution plan screenshot)]
Thanks!
Violeta