[jira] [Created] (FLINK-27729) Support constructing StartCursor and StopCursor from MessageId

2022-05-21 Thread Dian Fu (Jira)
Dian Fu created FLINK-27729:
---

 Summary: Support constructing StartCursor and StopCursor from MessageId
 Key: FLINK-27729
 URL: https://issues.apache.org/jira/browse/FLINK-27729
 Project: Flink
  Issue Type: Improvement
  Components: API / Python, Connectors / Pulsar
Reporter: Dian Fu
 Fix For: 1.16.0


Currently, StartCursor.fromMessageId and StopCursor.fromMessageId are not yet
supported in the Python Pulsar connector. I think we could leverage the [pulsar
Python library|https://pulsar.apache.org/api/python/#pulsar.MessageId] to
implement these methods.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27730) Kafka connector document code sink has an error

2022-05-21 Thread liuwei (Jira)
liuwei created FLINK-27730:
--

 Summary: Kafka connector document code sink has an error
 Key: FLINK-27730
 URL: https://issues.apache.org/jira/browse/FLINK-27730
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.14.4
 Environment: Flink 1.14.4
Reporter: liuwei
 Fix For: 1.14.4
 Attachments: kafka-sink.png

The sample code for the Kafka sink in the documentation contains an incorrect API call.

 





Request for Review: FLINK-27507 and FLINK-27509

2022-05-21 Thread Shubham Bansal
Hi Everyone,

I am not sure whom to reach out to for reviews of these changesets, so I am
putting this on the mailing list.

I have raised the review for
FLINK-27507 - https://github.com/apache/flink-playgrounds/pull/30
FLINK-27509 - https://github.com/apache/flink-playgrounds/pull/29

I would appreciate it if somebody could review these changesets, as I want to
make similar changes for the 1.15 version afterwards.
Let me know if you need more information about these PRs; I would be
happy to connect with you and explain the changes.

Thanks,
Shubham Bansal


Contributing to Apache Flink

2022-05-21 Thread Shubham Bansal
Hi Everyone,

I have been looking at some starter-tagged Jira issues over the last couple of weeks.

https://issues.apache.org/jira/browse/FLINK-27506
https://issues.apache.org/jira/browse/FLINK-27508
https://issues.apache.org/jira/browse/FLINK-27507
https://issues.apache.org/jira/browse/FLINK-27509

One of the changes is checked in, and the others are in review.
I want to be more involved in the code contribution process and would be
happy to look at non-starter issues that can give me more insight into the
codebase, with the aim of eventually moving to core components.

My background: I have worked on backend APIs, infrastructure, cloud storage,
and system security for the past 6 years.

I skimmed through the components on Jira and found the following
very interesting:

Runtime / Coordination
Runtime / Queryable State
Runtime / State Backends
Runtime / Task
Stateful functions
Table SQL / Planner
Table SQL / Runtime
Table Store
FileSystems
Library / CEP
Library / Graph Processing
Runtime / Checkpointing

I would be happy to start somewhere in any of the above areas or anything
which you think would be more suitable.

Please let me know.

Thanks,
Shubham


Re: Flatmap node at 100%

2022-05-21 Thread Zain Haider Nemati
Hi Yuxia,
I did increase the parallelism to 16, but that is causing memory overflow
issues. Task manager heap memory collapses at a certain point after the
job has been running.
I'm attaching the metrics. The flatmap parses JSON records and converts them to
comma-separated strings. Could you suggest how to optimize it?

[image: image.png]

On Fri, May 20, 2022 at 2:39 PM yuxia  wrote:

> Hi, I think you can increase the parallelism of the flat map operator. For a
> SQL job, you can refer to the doc[1] to set the parallelism. For a DataStream job,
> you can set the parallelism in your code.
>
>
> Also, if possible, you can try to optimize your code in the flatmap node.
>
> [1]:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-exec-resource-default-parallelism
>
> Best regards,
> Yuxia
>
> --
> *From: *"Zain Haider Nemati"
> *To: *"User", "dev"
> *Sent: *Friday, May 20, 2022, 4:51:14 PM
> *Subject: *Flatmap node at 100%
>
> Hi,
> I'm seeing this behaviour in my Flink job. What can I do to remove this
> bottleneck?
>
> [image: image.png]
>
>


Re: Job Logs - Yarn Application Mode

2022-05-21 Thread Zain Haider Nemati
Hi,
Thanks for your response, folks. Is it possible to access logs via the Flink UI
in YARN application mode, similar to how we can access them in standalone
mode?

On Fri, May 20, 2022 at 11:06 AM Shengkai Fang  wrote:

> Thanks for Biao's explanation.
>
> Best,
> Shengkai
>
> Biao Geng wrote on Fri, May 20, 2022 at 11:16:
>
>> Hi there,
>> @Zain, Weihua's suggestion should be able to fulfill the request to check
>> JM logs. If you do want to use YARN cli for running Flink applications, it
>> is possible to check JM's log with the YARN command like:
>> *yarn logs -applicationId application_xxx_yyy -am -1 -logFiles
>> jobmanager.log*
>> For TM log, command would be like:
>> * yarn logs -applicationId  -containerId 
>> -logFiles taskmanager.log*
>> Note, it is not super easy to find the container id of a TM. One
>> workaround would be to check the JM's log first and get the container id for
>> the TM from that. You can also learn more about the details of the above
>> commands from *yarn logs -help*
>>
>> @Shengkai, yes, you are right: the actual JM address is managed by YARN.
>> To access the JM launched by YARN, users need to access YARN web ui to find
>> the YARN application by applicationId and then click 'application master
>> url' of that application to be redirected to Flink web ui.
>>
>> Best,
>> Biao Geng
>>
>> Shengkai Fang wrote on Fri, May 20, 2022 at 10:59:
>>
>>> Hi.
>>>
>>> I am not familiar with the YARN application mode. Since the job
>>> manager is started when the job is submitted, how can users know the address
>>> of the JM? Do we need to look up the YARN UI to search for the submitted job
>>> by the JobID?
>>>
>>> Best,
>>> Shengkai
>>>
>>> Weihua Hu wrote on Fri, May 20, 2022 at 10:23:
>>>
>>>> Hi,
>>>> You can get the logs from the Flink Web UI if the job is running.
>>>> Best,
>>>> Weihua
>>>>
>>>> On May 19, 2022, at 10:56 PM, Zain Haider Nemati wrote:
>>>>
>>>>> Hey All,
>>>>> How can I check logs for my job when it is running in application mode
>>>>> via YARN?





Task Manager Heap Memory Overflowing

2022-05-21 Thread Zain Haider Nemati
Hi,
I'm running a job with a Kafka source pulling in millions of records with
parallelism 16, and I'm seeing heap memory on the task manager overflow after
the job has run for some time, no matter how much I increase the memory
allocation. Can someone help me out with this?

Job Stats and memory configs:

[image: image.png]

[image: image.png]


Json Deserialize in DataStream API with array length not fixed

2022-05-21 Thread Zain Haider Nemati
Hi Folks,
I have data coming in this format:

{
  "data": {
    "oid__id": "61de4f26f01131783f162453",
    "array_coordinates": "[ { \"speed\" : \"xxx\", \"accuracy\" :
\"xxx\", \"bearing\" : \"xxx\", \"altitude\" : \"xxx\", \"longitude\" :
\"xxx\", \"latitude\" : \"xxx\", \"dateTimeStamp\" : \"xxx\", \"_id\" : {
\"$oid\" : \"xxx\" } }, { \"speed\" : \"xxx\", \"isFromMockProvider\" :
\"false\", \"accuracy\" : \"xxx\", \"bearing\" : \"xxx\", \"altitude\" :
\"xxx\", \"longitude\" : \"xxx\", \"latitude\" : \"xxx\", \"dateTimeStamp\"
: \"xxx\", \"_id\" : { \"$oid\" : \"xxx\" } }]",
    "batchId": "xxx",
    "agentId": "xxx",
    "routeKey": "40042-12-01-2022",
    "__v": 0
  },
  "metadata": {
    "timestamp": "2022-05-02T18:49:52.619827Z",
    "record-type": "data",
    "operation": "load",
    "partition-key-type": "primary-key",
    "schema-name": "xxx",
    "table-name": "xxx"
  }
}

The length of the array_coordinates array is not fixed in the source.
Is there any way to define a JSON deserializer for this? If so, I would really
appreciate some help with this.
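
One possible approach, sketched here in plain Python as an assumption (the thread does not say which language or Flink API the job uses): in the sample above, array_coordinates arrives as a JSON-encoded string rather than as a nested array, so the outer document can be parsed first and that string parsed a second time. The second parse yields a list of whatever length the record carries, so no fixed array size has to be declared. The helper name `parse_record` and the sample values are hypothetical; only the field names come from the sample.

```python
import json


def parse_record(raw: str) -> dict:
    """Parse one record whose data.array_coordinates field is a JSON string.

    Hypothetical helper for illustration: field names follow the sample
    record in this thread, everything else is an assumption.
    """
    record = json.loads(raw)
    data = record["data"]
    # The array is double-encoded: the field holds a string that itself
    # contains JSON, so it needs a second json.loads. The result is a
    # Python list of any length. A missing or empty field becomes [].
    encoded = data.get("array_coordinates") or "[]"
    data["array_coordinates"] = json.loads(encoded)
    return record


if __name__ == "__main__":
    # Made-up sample with two coordinates of different shapes.
    coordinates = [
        {"speed": "1.0", "latitude": "10.0", "longitude": "20.0"},
        {"speed": "2.0", "latitude": "11.0", "longitude": "21.0",
         "isFromMockProvider": "false"},
    ]
    raw = json.dumps({
        "data": {
            "oid__id": "61de4f26f01131783f162453",
            "array_coordinates": json.dumps(coordinates),
        },
        "metadata": {"record-type": "data", "operation": "load"},
    })
    parsed = parse_record(raw)
    for point in parsed["data"]["array_coordinates"]:
        print(point["speed"], point["latitude"], point["longitude"])
```

Because each element is parsed into a dictionary, optional keys such as isFromMockProvider can be present in some elements and absent in others without any change to the parsing logic.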


Re: Flatmap node at 100%

2022-05-21 Thread JasonLee
Hi


Your picture is not visible here. You could upload it to an image hosting
service. Also, can you send the memory stack information?


Best
JasonLee


 Replied Message 
| From | Zain Haider Nemati |
| Date | 05/22/2022 13:41 |
| To | yuxia |
| Cc | User, dev |
| Subject | Re: Flatmap node at 100% |
Hi Yuxia,
I did increase the parallelism to 16, but that is causing memory overflow
issues. Task manager heap memory collapses at a certain point after the job
has been running.
I'm attaching the metrics. The flatmap parses JSON records and converts them to
comma-separated strings. Could you suggest how to optimize it?






On Fri, May 20, 2022 at 2:39 PM yuxia  wrote:

Hi, I think you can increase the parallelism of the flat map operator. For a SQL
job, you can refer to the doc[1] to set the parallelism. For a DataStream job, you can
set the parallelism in your code.

Also, if possible, you can try to optimize your code in the flatmap node.


[1]: 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-exec-resource-default-parallelism


Best regards,
Yuxia


From: "Zain Haider Nemati"
To: "User", "dev"
Sent: Friday, May 20, 2022, 4:51:14 PM
Subject: Flatmap node at 100%



Hi, 

I'm seeing this behaviour in my Flink job. What can I do to remove this
bottleneck?





[jira] [Created] (FLINK-27731) Cannot build documentation with Hugo docker image

2022-05-21 Thread Xintong Song (Jira)
Xintong Song created FLINK-27731:


 Summary: Cannot build documentation with Hugo docker image
 Key: FLINK-27731
 URL: https://issues.apache.org/jira/browse/FLINK-27731
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.16.0
Reporter: Xintong Song


Flink provides 2 ways for building the documentation: 1) using a Hugo docker 
image, and 2) using a local Hugo installation.

Currently, 1) is broken because the `setup_docs.sh` script requires a local Hugo
installation.

This was introduced in FLINK-27394.


