Hybrid Source with Parquet Files from GCS + KafkaSource

2021-12-09 Thread Meghajit Mazumdar
Hello, we have a requirement as follows: we want to stream events from two sources: Parquet files stored in a GCS bucket, and a Kafka topic. With the release of Hybrid Source in Flink 1.14, we were able to construct a Hybrid Source which produces events from two sources: a FileSource which reads data…
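
A minimal sketch of what such a wiring might look like on the Flink 1.14 APIs (not taken from the thread): both sources must produce the same record type, so plain text lines stand in for the Parquet format here to keep it short, and the path, servers and topic name are invented. A Parquet BulkFormat-based FileSource slots in the same way.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.source.hybrid.HybridSource;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineFormat;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class GcsThenKafkaSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Bounded "history" source; a Parquet BulkFormat-based FileSource would go here instead.
            FileSource<String> fileSource =
                    FileSource.forRecordStreamFormat(
                                    new TextLineFormat(), // renamed TextLineInputFormat in later releases
                                    new Path("gs://my-bucket/events/"))
                            .build();

            // Unbounded "live" source; it must produce the same record type as the file source.
            KafkaSource<String> kafkaSource =
                    KafkaSource.<String>builder()
                            .setBootstrapServers("kafka:9092")
                            .setTopics("events")
                            .setStartingOffsets(OffsetsInitializer.earliest())
                            .setValueOnlyDeserializer(new SimpleStringSchema())
                            .build();

            // Read the GCS files first, then switch over to the Kafka topic.
            HybridSource<String> hybridSource =
                    HybridSource.builder(fileSource)
                            .addSource(kafkaSource)
                            .build();

            env.fromSource(hybridSource, WatermarkStrategy.noWatermarks(), "gcs-then-kafka").print();
            env.execute("gcs-then-kafka-sketch");
        }
    }

HybridSource reads the bounded file source to completion and then switches to the Kafka source; the builder also has a factory variant through which the Kafka starting position can be derived from the end position of the first source instead of a fixed offset.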

Re: Hybrid Source with Parquet Files from GCS + KafkaSource

2021-12-14 Thread Meghajit Mazumdar
…BulkFormat, org.apache.flink.core.fs.Path… [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/gcs/ Regards, Roman

Converting parquet MessageType to flink RowType

2022-01-05 Thread Meghajit Mazumdar
Hello, we want to read and process Parquet files using a FileSource and the DataStream API. Currently, as referenced from the documentation…
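
For context, the shape the documentation describes is roughly the following (a hedged sketch, not copied from the thread; the column names, path and constructor signature are as of Flink 1.14 and may differ in other versions): a ParquetColumnarRowInputFormat is built from an explicitly supplied RowType and handed to FileSource.forBulkFileFormat.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.FileSourceSplit;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.data.RowData;
    import org.apache.flink.table.types.logical.BigIntType;
    import org.apache.flink.table.types.logical.LogicalType;
    import org.apache.flink.table.types.logical.RowType;
    import org.apache.flink.table.types.logical.VarCharType;
    import org.apache.hadoop.conf.Configuration;

    public class ParquetFileSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The projected schema has to be supplied up front: column names and logical types.
            RowType rowType = RowType.of(
                    new LogicalType[] {new BigIntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                    new String[] {"event_timestamp", "order_id"});

            // Constructor arguments as of Flink 1.14: Hadoop config, projected type,
            // batch size, UTC timestamps, case sensitivity.
            ParquetColumnarRowInputFormat<FileSourceSplit> format =
                    new ParquetColumnarRowInputFormat<>(new Configuration(), rowType, 500, false, true);

            FileSource<RowData> source =
                    FileSource.forBulkFileFormat(format, new Path("gs://my-bucket/parquet-events/")).build();

            DataStream<RowData> stream =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-file-source");

            stream.print();
            env.execute("parquet-file-source-sketch");
        }
    }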

Re: Converting parquet MessageType to flink RowType

2022-01-06 Thread Meghajit Mazumdar
…the question is about performance: because only the required columns should be read, the column names should be given by the user. The fieldTypes are required in case the given fields cannot be found in the parquet footer, for example because of a typo.

Re: Converting parquet MessageType to flink RowType

2022-01-06 Thread Meghajit Mazumdar
…are (mandatorily) required with the current implementation because, when the given fields cannot be found by the ParquetVectorizedInputFormat in the parquet footer, a type info is still needed to build the projected schema. Best regards, Jing
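
To avoid hand-writing the field names and types twice, one option (a sketch, not from the thread) is to read the Parquet footer up front with parquet-mr and derive the Flink RowType from the file's MessageType. Only a handful of primitive types are mapped below; logical, nested and repeated types would need more handling.

    import org.apache.flink.table.types.logical.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    import java.util.ArrayList;
    import java.util.List;

    public class ParquetFooterToRowType {

        // Derives a Flink RowType from the Parquet footer of the given file.
        public static RowType fromFooter(String file) throws Exception {
            MessageType schema;
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
                schema = reader.getFooter().getFileMetaData().getSchema();
            }

            List<String> names = new ArrayList<>();
            List<LogicalType> types = new ArrayList<>();
            for (Type field : schema.getFields()) {
                if (!field.isPrimitive()) {
                    throw new UnsupportedOperationException("Nested field not handled: " + field.getName());
                }
                names.add(field.getName());
                switch (field.asPrimitiveType().getPrimitiveTypeName()) {
                    case INT32:   types.add(new IntType()); break;
                    case INT64:   types.add(new BigIntType()); break;
                    case FLOAT:   types.add(new FloatType()); break;
                    case DOUBLE:  types.add(new DoubleType()); break;
                    case BOOLEAN: types.add(new BooleanType()); break;
                    case BINARY:  types.add(new VarCharType(VarCharType.MAX_LENGTH)); break;
                    default: throw new UnsupportedOperationException("Unmapped type: " + field);
                }
            }
            return RowType.of(types.toArray(new LogicalType[0]), names.toArray(new String[0]));
        }
    }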

RowType for complex types in Parquet File

2022-01-06 Thread Meghajit Mazumdar
Hello, Flink documentation mentions this…
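
As an illustration of the shape involved (invented field names, not from the thread): a nested Parquet group maps to a nested RowType, a repeated field to an ArrayType, and a key/value group to a MapType. Whether the vectorized Parquet reader in a given Flink version can actually read such columns is a separate question.

    import org.apache.flink.table.types.logical.*;

    import java.util.Arrays;

    public class ComplexRowTypeSketch {

        // Nested group, e.g. a location record inside each row.
        static final RowType PICKUP_LOCATION = new RowType(Arrays.asList(
                new RowType.RowField("latitude", new DoubleType()),
                new RowType.RowField("longitude", new DoubleType())));

        // Full row: a primitive column, the nested group, a repeated field and a map.
        static final RowType ROW_TYPE = RowType.of(
                new LogicalType[] {
                        new VarCharType(VarCharType.MAX_LENGTH),
                        PICKUP_LOCATION,
                        new ArrayType(new IntType()),
                        new MapType(new VarCharType(VarCharType.MAX_LENGTH), new BigIntType())
                },
                new String[] {"order_id", "pickup_location", "item_ids", "labels"});
    }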

Compatible alternative for ParquetInputFormat in Flink > 1.14.0

2022-01-10 Thread Meghajit Mazumdar
In flink-parquet_2.12 version 1.13.0, there used to be a class called org.apache.flink.formats.parquet.ParquetInputFormat. This class's constructor used to accept…

FileSource Usage

2022-01-19 Thread Meghajit Mazumdar
Hello, we are using FileSource to process Parquet files and have a few doubts about it. We would really appreciate it if somebody could help answer them: 1. For a given file, does FileSource read the contents inside it in order? …

Re: FileSource Usage

2022-01-20 Thread Meghajit Mazumdar
…[3] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ Best, Guowei

Watermarking with FileSource

2022-01-24 Thread Meghajit Mazumdar
I had a doubt regarding watermarking w.r.t. streaming with FileSource. I will really appreciate it if somebody can explain this behavior. Consider a filesystem with a root folder containing date-wise sub-folders such as D1, D2, … and so on. Each of these date folders further has 24 sub-folders…

Does FileSource download all remote files for generating splits

2022-01-27 Thread Meghajit Mazumdar
Hello, I had a question about the FileSource in Flink 1.14. Considering the FileSource is set to read from a remote GCS URL, I could read and understand that the FileEnumerator is actually…

Re: Does FileSource download all remote files for generating splits

2022-01-27 Thread Meghajit Mazumdar
…so that it can split the work across all task managers. Corresponding file readers, on the other hand, live in task managers and perform the actual reading work. They accept file splits assigned to them and read the contents corresponding to these splits.

Joining Flink tables with different watermark delay

2022-02-13 Thread Meghajit Mazumdar
Hello, we are creating two data streams in our Flink application. Both of them are then converted into two Tables. The first data stream has a watermark delay of 24 hours, while the second stream has a watermark delay of 60 minutes. The watermark used is of the BoundedOutOfOrderness strategy and uses a…
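
A minimal sketch of the two strategies being described (the event type and its epoch-millis timestamp field are invented; this is not the poster's code):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;

    import java.time.Duration;

    public class WatermarkSketch {

        // Invented event type carrying an epoch-millis event time.
        public static class BookingEvent {
            public long eventTimestampMillis;
        }

        // First stream: tolerate events up to 24 hours out of order.
        static WatermarkStrategy<BookingEvent> slowStrategy() {
            return WatermarkStrategy.<BookingEvent>forBoundedOutOfOrderness(Duration.ofHours(24))
                    .withTimestampAssigner((event, ts) -> event.eventTimestampMillis);
        }

        // Second stream: tolerate events up to 60 minutes out of order.
        static WatermarkStrategy<BookingEvent> fastStrategy() {
            return WatermarkStrategy.<BookingEvent>forBoundedOutOfOrderness(Duration.ofMinutes(60))
                    .withTimestampAssigner((event, ts) -> event.eventTimestampMillis);
        }
    }

When two such streams are joined, the join operator advances its own watermark with the minimum of its inputs' watermarks, so the 24-hour delay effectively dominates the joined output.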

Re: Joining Flink tables with different watermark delay

2022-02-15 Thread Meghajit Mazumdar
…window-agg/#windowing-tvfs> * You can directly use Window joins <https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/queries/window-join/> as well for your query, as they're meant exactly to cover your use case * Any particular reason…
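
For reference, the window-join shape the reply points at looks roughly like the following (a hedged sketch: the orders/payments tables, their columns and the event_time attribute are all invented, and both tables would need event-time attributes and watermarks defined):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class WindowJoinSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                    TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

            // Assumes "orders" and "payments" already exist with an event-time attribute
            // named event_time; joins rows that fall into the same one-hour tumbling window.
            tableEnv.executeSql(
                    "SELECT L.order_id, L.order_value, R.payment_status, L.window_start, L.window_end "
                            + "FROM ("
                            + "  SELECT * FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '1' HOUR))"
                            + ") L JOIN ("
                            + "  SELECT * FROM TABLE(TUMBLE(TABLE payments, DESCRIPTOR(event_time), INTERVAL '1' HOUR))"
                            + ") R "
                            + "ON L.window_start = R.window_start "
                            + "AND L.window_end = R.window_end "
                            + "AND L.order_id = R.order_id");
        }
    }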

FileSource SourceReader failure scenario

2022-05-31 Thread Meghajit Mazumdar
Hello, I had a question regarding the behaviour of FileSource and SourceReader in case of failures. Let me know if I missed something conceptually. We are running a Parquet FileSource. Let's say we supply the source with a directory path containing 5 files and the Flink job is configured…

Re: FileSource SourceReader failure scenario

2022-05-31 Thread Meghajit Mazumdar
…enumerator. A split won't be assigned or read twice under this pattern. Hope this is helpful! [1] https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/api/connector/source/SplitEnumerator.html Cheers, Qingsheng

Metrics for FileSource

2022-06-10 Thread Meghajit Mazumdar
Hello, we are working on a Flink project which uses FileSource to discover and read Parquet files from GCS (using Flink 1.14). As part of this, we wanted to implement some health metrics around the code. I wanted to know whether Flink gathers some metrics by itself around FileSource, e.g., the number…

Re: Metrics for FileSource

2022-06-12 Thread Meghajit Mazumdar
…the FileSource does not have the metrics you need. You can implement your own source, and register custom metrics via `SplitEnumeratorContext#metricGroup` and `SourceReaderContext#metricGroup`. Best, Lijie
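
A rough sketch of where such a metric would be registered (illustrative only: the built-in FileSource enumerator does not expose this hook, so a custom or copied enumerator would be needed, and the metric name is invented):

    import org.apache.flink.api.connector.source.SplitEnumeratorContext;
    import org.apache.flink.metrics.Counter;

    public class EnumeratorMetricsSketch {

        private final Counter filesDiscovered;

        public EnumeratorMetricsSketch(SplitEnumeratorContext<?> context) {
            // The enumerator runs on the JobManager; its metrics hang off the coordinator
            // metric group obtained from SplitEnumeratorContext#metricGroup.
            this.filesDiscovered = context.metricGroup().counter("filesDiscovered");
        }

        // Called by the (hypothetical) custom enumerator whenever it discovers a new file.
        public void onNewFileDiscovered() {
            filesDiscovered.inc();
        }
    }

The same idea applies on the reader side with SourceReaderContext#metricGroup, whose metrics are reported per task manager.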

Re: Metrics for FileSource

2022-06-19 Thread Meghajit Mazumdar
…Hi Meghajit, as far as I know, currently, the FileSource does not have the metrics you need. You can implement your own source, and register custom metrics via `SplitEnumeratorContext#metricGroup`…

Job uptime metric in Flink Operator managed cluster

2022-10-12 Thread Meghajit Mazumdar
Hello, I recently deployed a Flink Operator in Kubernetes and wrote a simple FlinkDeployment CRD to run it in application mode following this. I noticed that, even after I edited the CRD and marked the spec…

Re: Job uptime metric in Flink Operator managed cluster

2022-10-12 Thread Meghajit Mazumdar
…for suspended resources, but we currently use this as a way to guarantee more resiliency for the operator flow. Cheers, Gyula

Getting error stacktrace during job submission on Flink Operator

2022-11-04 Thread Meghajit Mazumdar
Hello folks, our team is currently running Flink clusters in standalone, session mode using Kubernetes (on Flink 1.14.3). We want to migrate towards a Flink Operator + application mode deployment setup (while continuing to use Flink 1.14.3). In the current setup, we upload a jar once and then…

Missing Flink Operator version

2023-01-10 Thread Meghajit Mazumdar
Hello team, it appears that Flink Operator 1.1.0 has been removed from the repo index https://downloads.apache.org/flink/. However, it is still mentioned as a stable release in the list of downloads here…

Imbalance in Garbage Collection for TaskManagers

2023-02-16 Thread Meghajit Mazumdar
Hello, we have a Flink session cluster deployment in Kubernetes with around 100 TaskManagers. It processes around 20-30 Kafka source jobs at the moment. The jobs all use the same jar and only differ in the SQL query used and other UDFs. We are using the official flink:1.14.3 image. We observed…

Re: Imbalance in Garbage Collection for TaskManagers

2023-02-16 Thread Meghajit Mazumdar
…check if the TaskManager with heavy GC is running more tasks than others. If so, we can enable "cluster.evenly-spread-out-slots=true" to balance tasks across all task managers. Best, Weihua
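
For reference, a sketch of where that option would typically go (assuming the session cluster is configured through flink-conf.yaml; it is a cluster-level option read at JobManager startup, so the cluster needs a restart to pick it up):

    # flink-conf.yaml
    cluster.evenly-spread-out-slots: true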