Re: Please help, I need to bootstrap keyed state into a stream

2020-08-10 Thread Marco Villalobos
First, thank you. I want to believe you, but I don't see how that is possible. All of the code is self-contained, and at the bottom of all the code, I print out the non-null values before I attempt to put them in the map state. All of the debug output before and after indicates that there is a null val

Re: Two Queries and a Kafka Topic

2020-08-10 Thread Marco Villalobos
d some more light for us here. > > Best regards > Theo > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html > > <https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html> > > From:

Re: Please help, I need to bootstrap keyed state into a stream

2020-08-10 Thread Marco Villalobos
Thank you. Your instruction was helpful in my solving this. You can read about my solution at https://github.com/minmay/flink-patterns/tree/master/bootstrap-keyed-state-into-stream > On Aug 10, 2020, at 4:

Re: Please help, I need to bootstrap keyed state into a stream

2020-08-10 Thread Marco Villalobos
I think there is a bug in Flink when running locally without a cluster. My code worked in a cluster, but failed when run locally. My code does not save null values in Map State. > On Aug 9, 2020, at 11:27 PM, Tzu-Li Tai wrote: > > Hi, > > For the NullPointerException, what seems to be happeni

Question about ParameterTool

2020-08-11 Thread Marco Villalobos
What are the dangers of not using the ParameterTool for parsing command line parameters? I have been using Picocli (https://picocli.info/). Will this be a mistake? Are there any side-effects that I should be aware of?

Re: Question about ParameterTool

2020-08-11 Thread Marco Villalobos
or global job > parameters: > https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/best_practices.html#register-the-parameters-globally > > On Tue, Aug 11, 2020 at 7:13 PM Marco Villalobos < > mvillalo...@kineteque.com> wrote: > >> What are the dan

Is there a way to start a timer without ever receiving an event?

2020-08-11 Thread Marco Villalobos
In the Stream API KeyedProcessFunction, is there a way to start a timer without ever receiving a stream event?

What async database library does the asyncio code example use?

2020-08-12 Thread Marco Villalobos
I would like to enrich my stream with database calls as documented at: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/operators/asyncio.html What async database library does

Re: What async database library does the asyncio code example use?

2020-08-12 Thread Marco Villalobos
So, I searched for an async DatabaseClient class, and I found r2dbc. Is that it? https://docs.spring.io/spring-data/r2dbc/docs/1.1.3.RELEASE/reference/html On Wed, Aug 12, 2020 at 9:31 AM Marco Villalobos wrote: > I would like to enrich my stream with database calls as documented

Re: Please help, I need to bootstrap keyed state into a stream

2020-08-12 Thread Marco Villalobos
t be at least 128. I've opened a ticket to track and > resolve this issue. > > Seth > > On Mon, Aug 10, 2020 at 6:38 PM Marco Villalobos <mailto:mvillalo...@kineteque.com>> wrote: > I think there is a bug in Flink when running locally without a cluster. > >

Re: What async database library does the asyncio code example use?

2020-08-13 Thread Marco Villalobos
Thank you! This was very helpful. Sincerely, Marco A. Villalobos > On Aug 13, 2020, at 1:24 PM, Arvid Heise wrote: > > Hi Marco, > > you don't need to use an async library; you could simply write your code in > async fashion. > > I'm trying to sketch the basic idea using any JDBC driver i

coordination of sinks

2020-08-14 Thread Marco Villalobos
Given a source that goes into a tumbling window with a process function that yields two side outputs, in addition to the main data stream, is it possible to coordinate the order of completion of sink 1, sink 2, and sink 3 as data leaves the tumbling window? source -> tumbling window -> process fun

Re: OpenJDK or Oracle JDK: 8 or 11?

2020-08-22 Thread Marco Villalobos
Hi Pankaj, I highly recommend that you use OpenJDK version 11, because each JDK upgrade brings performance improvements, and also because the Oracle JDK and OpenJDK are built from the same code base. The main difference between Oracle and OpenJDK is the branding and price. > On Aug 22, 2020, a

Can a flink-1.11.2 compiled job run on a flink-1.11.0 cluster?

2020-11-03 Thread Marco Villalobos
SITUATION Amazon EMR 6.1 supports Flink 1.11.0 and JDK 11. Docker Flink images for Flink 1.11.0 only support JDK 8. Much of the testing that I perform uses Docker and Docker Compose. It's much easier for me to develop with Docker with Flink 1.11.2 and JDK 11. QUESTION Would a flink

CLI help, documentation is confusing...

2020-11-09 Thread Marco Villalobos
The flink CLI documentation says that the -m option is to specify the job manager. but the examples are passing in an execution target. I am quite confused by this. ./bin/flink run -m yarn-cluster \ ./examples/batch/WordCount.jar \ --input hdfs:///
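To the -m confusion: `-m` expects a JobManager address, and `yarn-cluster` is a legacy special value. A hedged sketch of the two styles (the `-t`/`--target` flags exist from Flink 1.10 onward; check your version's CLI docs):

```shell
# -m expects the address (host:port) of a running cluster's JobManager:
./bin/flink run -m localhost:8081 ./examples/batch/WordCount.jar

# "-m yarn-cluster" is a legacy shortcut that deploys a per-job YARN cluster;
# newer CLIs express the same thing with -t / --target:
./bin/flink run -t yarn-per-job ./examples/batch/WordCount.jar
```

So the examples are not passing an execution target to `-m` in the new sense; they are using the old special value.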

Adding keyed state to test harness before calling process function.

2020-11-12 Thread Marco Villalobos
Hi, I would like to add keyed state to the test harness before calling the process function. I am using the OneInputStreamOperatorTestHarness. I can't find any examples online on how to do that, and I am struggling to figure this out. Can somebody please provide guidance? My test case has keyed s

Re: Adding keyed state to test harness before calling process function.

2020-11-12 Thread Marco Villalobos
> > On Fri, Nov 13, 2020 at 10:36 AM Marco Villalobos < > mvillalo...@kineteque.com> wrote: > >> Hi, >> >> I would like to adding keyed state to test harness before calling process >> function. >> >> I am using the OneInputStreamOperatorTestHar

Re: CLI help, documentation is confusing...

2020-11-13 Thread Marco Villalobos
ink > will pick up a session cluster or will create a per-job cluster. > This was always the case. > > For the -t and -e the change is that -e was deprecated (although still > active) in favour of -t. But it still has the same meaning. > > Finally on how to run Flink on EMR, I am not an

many questions (Kafka table KafkaDeserializationSchema support, recommended enrichment approach, preventing JDBC temporal dimension table N+1 queries, etc.)

2020-12-06 Thread Marco Villalobos
1. How can I create a Kafka table that can use headers and map them to columns? Currently, I am using KafkaDeserializationSchema to create a DataStream, and then I convert that DataStream into a Table. I would like to use a more direct approach. 2. What is the recommended way to enrich a kafka t

How does Flink cache values that also do not exist in the database?

2020-12-07 Thread Marco Villalobos
How does Flink cache values that also do not exist in the database? I would like to cache hits forever, but I would like to check items that do not exist in the database only every 15 minutes. Is there a way to set that up in the SQL / Table API? Also, is there a way to set that up in Keyed Sta
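As far as I know, the SQL lookup cache has a single TTL that applies to hits and misses alike, so a split policy like this usually ends up in keyed state or a hand-rolled cache. A minimal plain-Java sketch of the policy itself (class and method names are made up for illustration; the loader stands in for the database query):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: cache positive lookups forever, retry misses after a TTL. */
public class NegativeTtlCache<K, V> {
    private final Map<K, V> hits = new HashMap<>();        // positive entries, kept forever
    private final Map<K, Instant> misses = new HashMap<>(); // when each miss was last seen
    private final Duration missTtl;
    private final Function<K, Optional<V>> loader;          // e.g. a database query

    public NegativeTtlCache(Duration missTtl, Function<K, Optional<V>> loader) {
        this.missTtl = missTtl;
        this.loader = loader;
    }

    public Optional<V> get(K key, Instant now) {
        V hit = hits.get(key);
        if (hit != null) {
            return Optional.of(hit); // positive entries never expire
        }
        Instant missedAt = misses.get(key);
        if (missedAt != null && now.isBefore(missedAt.plus(missTtl))) {
            return Optional.empty(); // still within the negative-cache window, skip the DB
        }
        Optional<V> loaded = loader.apply(key);
        if (loaded.isPresent()) {
            hits.put(key, loaded.get());
            misses.remove(key);
        } else {
            misses.put(key, now); // remember the miss, retry after missTtl
        }
        return loaded;
    }
}
```

In a KeyedProcessFunction, the two maps would become ValueState per key, with `now` taken from the timer service.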

what are the valid dimensions for lookup.cache.ttl?

2020-12-08 Thread Marco Villalobos
In https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/jdbc.html there are no allowable dimensions specified for lookup.cache.ttl. Can somebody please provide a list of valid values and their meaning? I know 's' for seconds is supported. How do I specify minutes?

How can I optimize joins or cache misses in SQL api?

2020-12-08 Thread Marco Villalobos
Scenario: a Kafka stream enriched with tables in PostgreSQL. Let's pretend that postgres has organizations, departments, and persons tables, and we want to join the full name to the kafka table that has the person id. I also want to determine if the person id is missing. This requires a left
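For the enrichment itself, the JDBC connector can serve as a lookup (temporal) table; a hedged sketch with hypothetical table and column names, using LEFT JOIN so rows whose person id has no match survive with a NULL name:

```sql
-- kafka_events is assumed to declare a processing-time attribute proc_time;
-- persons is assumed to be a JDBC-backed dimension table.
SELECT k.person_id, p.full_name
FROM kafka_events AS k
LEFT JOIN persons FOR SYSTEM_TIME AS OF k.proc_time AS p
  ON k.person_id = p.id;
```

The lookup cache options on the JDBC table then control how often the database is actually hit.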

How long will keyed state exist if no TTL given?

2020-12-08 Thread Marco Villalobos
After reading https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/state.html It is unclear to me how long keyed state will exist if it has no TTL. Is it cached forever, unless explicitly cleared or overwritten? can somebody please explain to me? Thank you.

Re: what are the valid dimensions for lookup.cache.ttl?

2020-12-08 Thread Marco Villalobos
t;nanosecond" > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/api/java/org/apache/flink/util/TimeUtils.html#parseDuration-java.lang.String- > > Regards, > Roman > > > On Tue, Dec 8, 2020 at 4:04 PM Marco Villalobos > wrote: >> >> In
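The reply points at TimeUtils.parseDuration; as I understand it, that parser accepts suffixes such as ms, s, min, h, and d, plus long-form names like "minute". A sketch with the table definition elided (connector options only):

```sql
CREATE TABLE dim_table (
  -- columns elided
) WITH (
  'connector' = 'jdbc',
  -- connection options elided
  'lookup.cache.max-rows' = '1000',
  'lookup.cache.ttl' = '15min'  -- other accepted forms include '500ms', '10s', '1h', '2d'
);
```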

Re: How long will keyed state exist if no TTL given?

2020-12-08 Thread Marco Villalobos
Thank you for the clarification. On Tue, Dec 8, 2020 at 8:14 AM Khachatryan Roman wrote: > > Hi Marco, > > Yes, if TTL is not configured then the state will never expire (will stay > forever until deleted explicitly). > > Regards, > Roman > > > On Tue, Dec 8,

lookup cache clarification

2020-12-08 Thread Marco Villalobos
I set up the following lookup cache values: 'lookup.cache.max-rows' = '20' 'lookup.cache.ttl' = '1min' for a jdbc connector. This table currently only has about 2 records in it. However, since I set the TTL to 1 minute, I expected the job to query that table every minute. The documentat

relating tumbling windows to each other

2020-12-08 Thread Marco Villalobos
GIVEN two windows (ProcessWindowFunction), window A and window B, AND window A is a tumbling processing time window of 15 minutes AND 20 records entered window A, which performs its business logic. How can I assure that Window B will process exactly all the records that left window A within the

Re: relating tumbling windows to each other

2020-12-08 Thread Marco Villalobos
, Dec 8, 2020 at 1:29 PM Marco Villalobos wrote: > > GIVEN two windows (ProcessWindowFunction), window A, and window B, > AND window A is a tumbling processing time window of 15 minutes > AND 20 records entered window A, and performs its business logic. > > How can I assure

Re: lookup cache clarification

2020-12-09 Thread Marco Villalobos
correctly. > > Marco Villalobos <mailto:mvillalo...@kineteque.com>> wrote on Wed, Dec 9, 2020 at 4:23 AM: > I set up the following lookup cache values: > > 'lookup.cache.max-rows' = '20' > 'lookup.cache.ttl' = '1min' > > for a jdbc connect

How do I pass the aggregated results of an SQL TUMBLING Window as one set of data to a Process Function?

2020-12-10 Thread Marco Villalobos
I am sorry to ask this twice. I reworded my question though, and I never got an answer. I am trying to learn how to use the SQL api, but mix in the Streaming API where there is too much complex business logic. GIVEN two windows, window X, an SQL tumbling processing time window of 15 minutes, a

Re: How do I pass the aggregated results of an SQL TUMBLING Window as one set of data to a Process Function?

2020-12-11 Thread Marco Villalobos
Alright, maybe my example needs to be more concrete. How about this: In this example, I don't want to create two windows just to re-combine what was just aggregated in SQL. Is there a way to transform the aggregate results into one datastream object so that I don't have to aggregate again? // agg

UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
I am trying to use a User-defined Table Aggregate function directly in the SQL so that I can combine all the rows collected in a window into one object. GIVEN a User defined Table Aggregate function public class MyUDTAGG extends TableAggregateFunction { public PurchaseWindow createAcc

Re: UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
ar function called > `MyUDTAGG` in your example and cannot find one. > > Maybe the built-in function `COLLECT` or `LISTAGG`is what you are looking for? > > https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/functions/systemFunctions.html > > Regards, > Timo &

Re: UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
ve the implementation there and keep the SQL query > simple. But this is up to you. Consecutive windows are supported. > > Regards, > Timo > > > On 05.01.21 15:23, Marco Villalobos wrote: > > Hi Timo, > > > > Thank you for the quick response. > > >

question about timers

2021-01-19 Thread Marco Villalobos
If there are timers that have been checkpointed (we use rocksdb), and the system goes down, and then the system goes back up after the timers should have fired, do those timers that were skipped still fire, even though we are past that time? example: for example, if the current time is 1:00 p.m.

JDBC connection pools

2021-01-21 Thread Marco Villalobos
Currently, my jobs that require JDBC initialize a connection in the open method directly via JDBC driver. 1. What are the established best practices for this? 2. Is it better to use a connection pool that can validate the connection and reconnect? 3. Would each operator require its own connection

What causes a buffer pool exception? How can I mitigate it?

2021-01-25 Thread Marco Villalobos
Hi. What causes a buffer pool exception? How can I mitigate it? It is causing us plenty of problems right now. 2021-01-26 04:14:33,041 INFO org.apache.flink.streaming.api.functions.sink.filesystem.Buckets [] - Subtask 1 received completion notification for checkpoint with id=4. 2021-01-26 04:14:

memory tuning

2021-01-25 Thread Marco Villalobos
I have a flink job that collects and aggregates time-series data from many devices into one object (let's call it X) that was collected by a window. X contains time-series data, so it contains many String, Instant, HashMap, and another type (let's call it Y) objects. When I collect 4 X instances

Re: memory tuning

2021-01-26 Thread Marco Villalobos
26, 2021 at 12:34 AM Matthias Pohl wrote: > Hi Marco, > Could you share the preconfiguration logs? They are printed in the > beginning of the taskmanagers' logs and contain a summary of the used > memory configuration? > > Best, > Matthias > > On Tue, Jan 26,

Re: What causes a buffer pool exception? How can I mitigate it?

2021-01-26 Thread Marco Villalobos
s. Do > you have the option to upgrade to 1.11.3? > > Best, > > Arvid > > On Tue, Jan 26, 2021 at 6:35 AM Marco Villalobos < > mvillalo...@kineteque.com> wrote: > >> Hi. What causes a buffer pool exception? How can I mitigate it? It i

windows and triggers

2021-01-26 Thread Marco Villalobos
I wrote this simple test: .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) .trigger(PurgingTrigger.of(CountTrigger.of(5))) Thinking that if I send 2 elements of data, it would collect them after a minute. But that doesn't seem to be happening. Is my understanding of how windows and tri

Re: windows and triggers

2021-01-26 Thread Marco Villalobos
on the > progress of time but only by count. Right now, you have to write your own > custom trigger if you want to react based on both time and count. > On Tue, Jan 26, 2021 at 10:44 AM Marco Villalobos wrote: > I wrote this simple test: > > .window(TumblingProcessingTimeW

How do I backfill time series data?

2020-06-14 Thread Marco Villalobos
Hello Flink community. I need help. Thus far, Flink has proven very useful to me. I am using it for stream processing of time-series data. For the scope of this mailing list, let's say the time-series has the fields: name: String, value: double, and timestamp: Instant. I named the time series: t

Does Flink support reading files or CSV files from java.io.InputStream instead of file paths?

2020-06-15 Thread Marco Villalobos
Does Flink support reading files or CSV files from java.io.InputStream instead of file paths? I'd rather just store my file on the class path and load it with java.lang.ClassLoader#getResourceAsStream(String). If there is a way, I'd appreciate an example.
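The thread's answer is that Flink's file sources want paths, but for a small classpath resource you can read the stream yourself and hand the parsed rows to `env.fromCollection(...)`. A minimal sketch of the stream-parsing half (no CSV quoting support; the class name is made up):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CsvFromStream {
    /** Parses simple comma-separated lines (no quoting) from any InputStream. */
    static List<String[]> parse(InputStream in) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
                }
            }
        }
        return rows;
    }
}
```

The resulting list could then be fed to the job with something like `env.fromCollection(CsvFromStream.parse(getClass().getResourceAsStream("/data.csv")))`.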

Re: Does Flink support reading files or CSV files from java.io.InputStream instead of file paths?

2020-06-16 Thread Marco Villalobos
esigned mostly to read files from a > distributed filesystem, where paths are used to refer to those files. If you > read from files on the classpath you could just use plain old Java code and > won't need a distributed processing system such as Flink. > > Best, > Aljosc

Re: How do I backfill time series data?

2020-06-16 Thread Marco Villalobos
is question in a new thread with more information. Thank you. Sincerely, Marco Villalobos > On Jun 15, 2020, at 11:13 AM, Robert Metzger wrote: > > Hi Marco, > > I'm not 100% if I understood the problem. Let me repeat: You want a stream of > 15 minute averages

what is the "Flink" recommended way of assigning a backfill to an average on an event time keyed windowed stream?

2020-06-16 Thread Marco Villalobos
I need to compute averages on time series data upon a 15 minute tumbling event time window that is backfilled. The time series data is a Tuple3 of name: String, value: double, event_time: Timestamp (Instant). I need to compute the average value of the name time series on a tumbling window of 1

Re: Does Flink support reading files or CSV files from java.io.InputStream instead of file paths?

2020-06-17 Thread Marco Villalobos
to create a source from some data that you have already available. > > Does that help? > > Best, > Aljoscha > > On 16.06.20 15:35, Marco Villalobos wrote: >> Okay, it is not supported. >> I understand such a feature is not needed in production systems, but it >

Re: what is the "Flink" recommended way of assigning a backfill to an average on an event time keyed windowed stream?

2020-06-18 Thread Marco Villalobos
I came up with a solution for backfills. However, at this moment, I am not happy with my solution. I think there might be other facilities within Flink which allow me to implement a better more efficient or more scalable solution. In another post, rmetz...@apache.org suggested that I use a proce

Re: Session Window with Custom Trigger

2020-06-23 Thread Marco Villalobos
Hi Kristoff, > On Jun 23, 2020, at 6:52 AM, KristoffSC > wrote: > > Hi all, > I'm using Flink 1.9.2 and I would like to ask about my use case and approach > I've took to meet it. > > The use case: > I have a keyed stream, where I have to buffer messages with logic: > 1. Buffering should sta

Re: Date Deserialization probleme

2020-06-25 Thread Marco Villalobos
Hello Aisssa, SimpleDateFormat, and java.util.Date are obsolete since JDK 1.8. Also, it can quite dangerous to use a class like SimpleDateFormat in a distributed system because it is not thread-safe. I suggest you use https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.h
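To make the thread-safety point concrete: a single shared DateTimeFormatter is immutable and safe to reuse across parallel subtasks, unlike SimpleDateFormat. A tiny sketch (pattern and names chosen arbitrarily for illustration):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SafeDateParsing {
    // DateTimeFormatter is immutable and thread-safe, so one shared
    // static instance can be used from every operator instance.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    static LocalDate parse(String s) {
        return LocalDate.parse(s, FMT);
    }
}
```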

Question about Watermarks within a KeyedProcessFunction

2020-06-26 Thread Marco Villalobos
My source is a Kafka topic. I am using Event Time. I assign the event time with an AscendingTimestampExtractor. I noticed when debugging that in the KeyedProcessFunction, after my highest known event time of 2020-06-23T00:46:30.000Z, the processElement method had a watermark with an impossibl

Two Queries and a Kafka Topic

2020-08-04 Thread Marco Villalobos
Lets say that I have: SQL Query One from data in PostgreSQL (200K records). SQL Query Two from data in PostgreSQL (1000 records). and Kafka Topic One. Let's also say that main data from this Flink job arrives in Kafka Topic One. If I need SQL Query One and SQL Query Two to happen just one time,

Re: Two Queries and a Kafka Topic

2020-08-04 Thread Marco Villalobos
ement > > <https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/insert.html#run-an-insert-statement> > > >> On Aug 5, 2020, at 04:34, Marco Villalobos > <mailto:mvillalo...@kineteque.com>> wrote: >> >> Lets say that I have: >> >

Re: Two Queries and a Kafka Topic

2020-08-05 Thread Marco Villalobos
uent map steps. > > I think, option 3 is the easiest to be implemented while option 1 might be > the most elegant way in my opinion. > > Best regards > Theo > > From: "Marco Villalobos" > To: "Leonard Xu" > CC: "user" > Sent: Wed

Re: Two Queries and a Kafka Topic

2020-08-06 Thread Marco Villalobos
elf). You then keep > the data to be used for all subsequent map steps. > > I think, option 3 is the easiest to be implemented while option 1 might be > the most elegant way in my opinion. > > Best regards > Theo > > -- > *From: *"Marco Vil

State Processor API to boot strap keyed state for Stream Application.

2020-08-07 Thread Marco Villalobos
I have read the documentation and various blogs that state that it is possible to load data into a data-set and use that data to bootstrap a stream application. The documentation literally says this, "...you can read a batch of data from any store, preprocess it, and write the result to a savepoin

Please help, I need to bootstrap keyed state into a stream

2020-08-08 Thread Marco Villalobos
According to the documentation, and various blogs, it is possible to use the Batch Execution Environment to bootstrap state into a save point, and then load that state in a Stream Execution Environment. I am trying to use that feature. State Processor API documentation states that "you can read

stopping with save points

2021-01-27 Thread Marco Villalobos
When I try to stop with a savepoint, I usually get the error below. I have not been able to create a single save point. Please advise. I am using Flink 1.11.0 Draining job "ed51084378323a7d9fb1c4c97c2657df" with a savepoint. The progr

Flink and Amazon EMR

2021-01-27 Thread Marco Villalobos
Just curious, has anybody had success with Amazon EMR with RocksDB and checkpointing in S3? That's the configuration I am trying to setup, but my system is running more slowly than expected.

Re: What causes a buffer pool exception? How can I mitigate it?

2021-01-28 Thread Marco Villalobos
gt;> double-checked and saw that the buffer pool is only released on >> cancellation or shutdown. >> >> So I'm assuming that there is another issue (e.g., Kafka cluster not >> reachable) and there is a race condition while shutting down. It seems like >> the buff

presto s3p checkpoints and local stack

2021-01-28 Thread Marco Villalobos
. Any advice would be appreciated. -Marco Villalobos

Re: presto s3p checkpoints and local stack

2021-01-28 Thread Marco Villalobos
t; s3.path-style-access: true > s3.path.style.access: true (only one of them is needed, but I don't know > which, so please try out) > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/deployment/filesystems/s3.html#configure-access-credentials > > On Thu, Ja
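For reference, a hedged flink-conf.yaml sketch for pointing the presto (s3p) filesystem at LocalStack; the endpoint, bucket, and credentials are placeholders, and the exact key spellings should be verified against your Flink version's filesystem docs:

```yaml
s3.endpoint: http://localstack:4566
s3.path.style.access: true
s3.access-key: test
s3.secret-key: test
state.checkpoints.dir: s3p://my-bucket/checkpoints
```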

question on checkpointing

2021-01-28 Thread Marco Villalobos
Is it possible that checkpointing times out due to an operator taking too long? Also, does windowing affect the checkpoint barriers?

flink checkpoints adjustment strategy

2021-01-28 Thread Marco Villalobos
I am kind of stuck in determining how large a checkpoint interval should be. Is there a guide for that? If the timeout is 10 minutes and we time out, what is a good strategy for adjusting it? What is a good starting point for a checkpoint interval? How shall they be adjusted? We often see checkpoint
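There is no universal answer, but a common starting point is an interval several times larger than the observed end-to-end checkpoint duration, plus a minimum pause so the job makes progress between checkpoints. A hedged flink-conf.yaml sketch (the values are illustrative, not recommendations):

```yaml
execution.checkpointing.interval: 5min
execution.checkpointing.min-pause: 2min
execution.checkpointing.timeout: 10min
```

If checkpoints regularly approach the timeout, the usual move is to investigate back pressure and sync/async durations before simply raising the timeout.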

Re: flink checkpoints adjustment strategy

2021-01-29 Thread Marco Villalobos
o > busy ...), and async duration(too much io/network process ...), etc. > > Best, > Congxian > > > Marco Villalobos wrote on Fri, Jan 29, 2021 at 7:19 AM: > >> I am kind of stuck in determining how large a checkpoint interval should >> be. >> >> Is there a guide fo

Re: Flink and Amazon EMR

2021-02-01 Thread Marco Villalobos
s working. You would need to analyse what's working slower than > expected. Checkpointing times? (Async duration? Sync duration? Start > delay/back pressure?) Throughput? Recovery/startup? Are you being rate > limited by Amazon? > > Piotrek > > czw., 28 sty 2021 o 03:46 M

Re: question on checkpointing

2021-02-01 Thread Marco Villalobos
gt; checkpoint timeout. > > 2) What kind of effects are you worried about? > > On 1/28/2021 8:05 PM, Marco Villalobos wrote: > > Is it possible that checkpointing times out due to an operator taking > > too long? > > > > Also, does windowing affect the checkpoint barriers? > > >

Question about a possible use case for Iterative Streams.

2021-02-02 Thread Marco Villalobos
Hi everybody, I am brainstorming how it might be possible to perform database enrichment with the DataStream API, use keyed state for caching, and also utilize Async IO. Since AsyncIO does not support keyed state, is it possible to use an Iterative Stream that uses keyed state for caching in

Re: Question regarding a possible use case for Iterative Streams.

2021-02-03 Thread Marco Villalobos
Hi Gorden, Thank you very much for the detailed response. I considered using the state-state processor API, however, our enrichment requirements make the state-processor API a bit inconvenient. 1. if an element from the stream matches a record in the database then it can remain in the cache a v

org.codehaus.janino.CompilerFactory cannot be cast ....

2021-02-03 Thread Marco Villalobos
Please advise me. I don't know what I am doing wrong. After I added the blink table planner to my dependency management: dependency "org.apache.flink:flink-table-planner-blink_${scalaVersion}:${flinkVersion}" and added it as a dependency: implementation "org.apache.flink:flink-table-planner-

Re: org.codehaus.janino.CompilerFactory cannot be cast ....

2021-02-03 Thread Marco Villalobos
Oh, I found the solution. I simply need to not use TRACE log level for Flink. On Wed, Feb 3, 2021 at 7:07 PM Marco Villalobos wrote: > > Please advise me. I don't know what I am doing wrong. > > After I added the blink table planner to my my dependency managemen

threading and distribution

2021-02-05 Thread Marco Villalobos
As data flows from a source through a pipeline of operators and finally sinks, is there a means to control how many threads are used within an operator, and how an operator is distributed across the network? Where can I read up on these types of details specifically?

hybrid state backends

2021-02-05 Thread Marco Villalobos
Is it possible to use different state backends for different operators? There are certain situations where I want the state to reside completely in memory, and other situations where I want it stored in RocksDB.

Re: threading and distribution

2021-02-05 Thread Marco Villalobos
. On Fri, Feb 5, 2021 at 3:06 AM Marco Villalobos wrote: > as data flows from a source through a pipeline of operators and finally > sinks, is there a means to control how many threads are used within an > operator, and how an operator is distributed across the network? > > Where c

What is the difference between RuntimeContext state and ProcessWindowFunction.Context state when using a ProcessWindowFunction?

2021-02-09 Thread Marco Villalobos
Hi, I am having a difficult time distinguishing the difference between RuntimeContext state and global state when using a ProcessWindowFunction. A ProcessWindowFunction has access to three different kinds of state. 1. RuntimeContext state. 2. ProcessWindowFunction.Context global state 3. ProcessWind

clarification on backpressure metrics in Apache Flink Dashboard

2021-02-10 Thread Marco Villalobos
given: [source] -> [operator 1] -> [operator 2] -> [sink]. If within the dashboard, operator 1 shows that it has backpressure, does that mean I need to improve the performance of operator 2 in order to alleviate backpressure upon operator 1?

questions about broadcasts

2021-03-05 Thread Marco Villalobos
Is it possible for an operator to receive two different kinds of broadcasts? Is it possible for an operator to receive two different types of streams and a broadcast? For example, I know there is a KeyedCoProcessFunction, but is there a version of that which can also receive broadcasts?

questions regarding stateful functions

2021-04-06 Thread Marco Villalobos
Upon reading about stateful functions, it seems as though first, a data stream has to flow to an event ingress. Then, the stateful functions will perform computations via whatever functionality it provides. Finally, the results of said computations will flow to the event egress which will be yet a

Re: questions regarding stateful functions

2021-04-07 Thread Marco Villalobos
Fun within a DataStream application [1] > > [1] > https://ci.apache.org/projects/flink/flink-statefun-docs-master/docs/sdk/flink-datastream/ > > On Wed, Apr 7, 2021 at 2:49 AM Marco Villalobos > wrote: > >> Upon reading about stateful functions, it seems as though first, a

Re: questions regarding stateful functions

2021-04-07 Thread Marco Villalobos
eam to a stateful function can > emit a message that > in turn will be routed to that function using the data stream integration. > > > On Wed, Apr 7, 2021 at 7:16 PM Marco Villalobos > wrote: > >> Thank you for the clarification. >> >> BUT there was o

How can I demarcate which event elements are the boundaries of a window?

2021-04-19 Thread Marco Villalobos
I have a tumbling window that aggregates into a process window function. Downstream there is a keyed process function. [window aggregate into process function] -> keyed process function I am not quite sure how the keyed process knows which elements are at the boundary of the window. Is there a m

DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
Hi, I am using the DataStream API in Batch Execution Mode, and my "source" is an S3 bucket with about 500 GB of data spread across many files. Where does Flink store the results of processed / produced data between tasks? There is no way that 500GB will fit in memory. So I am very curious how

DataStream API Batch Execution Mode restarting...

2021-05-18 Thread Marco Villalobos
I have a DataStream job running in Batch Execution mode within YARN on EMR. My job failed an hour into the job two times in a row because the task manager heartbeat timed out. Can somebody point me to how to restart a job in this situation? I can't find that section of the documentation. Thank you.

Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-18 Thread Marco Villalobos
Questions Flink DataStream in BATCH execution mode scalability advice. Here is the problem that I am trying to solve. Input is an S3 bucket directory with about 500 GB of data across many files. The instance that I am running on only has 50GB of EBS storage. The nature of this data is time series

Re: DataStream API Batch Execution Mode restarting...

2021-05-18 Thread Marco Villalobos
it the job to execute it from the scratch. > > Best, > Yun > > > > ------Original Mail -- > *Sender:*Marco Villalobos > *Send Date:*Wed May 19 11:27:37 2021 > *Recipients:*user > *Subject:*DataStream API Batch Execution Mode restarting.

Re: DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
tories configured by io.tmp.dirs[2]. > > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/batch/blocking_shuffle/ > [2] > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#io-tmp-dirs > > --Original Mail ---

Re: Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-19 Thread Marco Villalobos
> On May 19, 2021, at 7:26 AM, Yun Gao wrote: > > Hi Marco, > > For the remaining issues, > > 1. For the aggregation, the 500GB of files are not required to be fit into > memory. > Rough speaking for the keyed().window().reduce(), the input records would be > first > sort according to the

Is it possible to leverage the sort order in DataStream Batch Execution Mode?

2021-05-21 Thread Marco Villalobos
Hello. I am using Flink 1.12.1 in EMR. I am processing historical time-series data with the DataStream API in Batch execution mode. I must average time series data into a fifteen minute interval and forward fill missing values. For example, this input: name, timestamp, value a,2019-06-23T00:07:
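The bucketing-and-forward-fill logic itself is independent of Flink and can be unit tested in plain Java before wiring it into a window function; a sketch for a single series (class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Sketch (made-up name): average values into 15-minute buckets and
 *  forward-fill buckets that received no samples. */
public class QuarterHourFill {
    static final long BUCKET = 15 * 60; // bucket width in seconds

    /** samples: {epochSecond, value} pairs for one series; returns one
     *  {bucketStart, average} pair per bucket from the first to the last
     *  observed bucket, repeating the previous average for empty buckets. */
    static List<double[]> averageAndFill(List<double[]> samples) {
        TreeMap<Long, double[]> acc = new TreeMap<>(); // bucketStart -> {sum, count}
        for (double[] s : samples) {
            long bucket = ((long) s[0] / BUCKET) * BUCKET;
            double[] agg = acc.computeIfAbsent(bucket, b -> new double[2]);
            agg[0] += s[1];
            agg[1] += 1;
        }
        List<double[]> out = new ArrayList<>();
        if (acc.isEmpty()) {
            return out;
        }
        double last = 0;
        for (long b = acc.firstKey(); b <= acc.lastKey(); b += BUCKET) {
            double[] agg = acc.get(b);
            if (agg != null) {
                last = agg[0] / agg[1]; // average for this bucket
            }
            out.add(new double[] {b, last}); // gaps repeat the previous average
        }
        return out;
    }
}
```

In batch execution mode this would typically run per key after a keyBy on the series name.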

Does Flink 1.12.1 DataStream API batch execution mode support side outputs?

2021-05-22 Thread Marco Villalobos
I have been struggling for two days with an issue using the DataStream API in Batch Execution mode. It seems as though my side-output has no elements available to downstream operators. However, I am certain that the downstream operator received events. I logged the side-output element just before

Re: Does Flink 1.12.1 DataStream API batch execution mode support side outputs?

2021-05-23 Thread Marco Villalobos
I found the problem. I tried to assign timestamps to the operator (I don't know why), and when I did that, because I chained the calls fluently, I was no longer referencing the operator that contained the side outputs. Disregard my question. On Sat, May 22, 2021 at 9:28 PM Marco Villa

How do you debug a DataStream flat join on common window?

2021-05-23 Thread Marco Villalobos
Hi, Stream one has one element. Stream two has 2 elements. Both streams derive from side-outputs. I am using the DataStream API in batch execution mode. I want to join them on a common key and window. I am certain the keys match, but the flat join does not seem to be working. I deduce that t

DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

2021-05-24 Thread Marco Villalobos
I am running with one job manager and three task managers. Each task manager is receiving at most 8 gb of data, but the job is timing out. What parameters must I adjust? Sink: back fill db sink) (15/32) (50626268d1f0d4c0833c5fa548863abd) switched from SCHEDULED to FAILED on [unassigned resource]

Is it possible to use OperatorState, when NOT implementing a source or sink function?

2021-06-04 Thread Marco Villalobos
Is it possible to use OperatorState, when NOT implementing a source or sink function? If yes, then how?

Re: Is it possible to use OperatorState, when NOT implementing a source or sink function?

2021-06-05 Thread Marco Villalobos
> Best regards, > JING ZHANG > > > Marco Villalobos wrote on Sat, Jun 5, 2021 at 1:55 PM: > >> Is it possible to use OperatorState, when NOT implementing a source or >> sink function? >> >> If yes, then how? >>

Re: Is it possible to use OperatorState, when NOT implementing a source or sink function?

2021-06-05 Thread Marco Villalobos
this operator, it only goes to one task manager node. I need state, but I don't really need it keyed. On Sat, Jun 5, 2021 at 4:56 AM Marco Villalobos wrote: > Does that work in the DataStream API in Batch Execution Mode? > > On Sat, Jun 5, 2021 at 12:04 AM JING ZHANG wrote: >

DataStream API in Batch Execution mode

2021-06-07 Thread Marco Villalobos
How do I use a hierarchical directory structure as a file source in S3 when using the DataStream API in Batch Execution mode? I have been trying to find out if the API supports that, because currently our data is organized by years, halves, quarters, and months, but before I launch the job, I flat

Re: DataStream API in Batch Execution mode

2021-06-09 Thread Marco Villalobos
ening an issue for lacking the document? > > [1] > https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/test/java/org/apache/flink/connector/file/src/FileSourceTextLinesITCase.java > Best, > Guowei > > > On Tue, Jun 8, 2021 at 5:59 AM Marc

Please advise bootstrapping large state

2021-06-15 Thread Marco Villalobos
I must bootstrap state from postgres (approximately 200 GB of data) and I notice that the state processor API requires the DataSet API in order to bootstrap state for the Stream API. I wish there were a way to use the SQL API and use a partitioned scan, but I don't know if that is even possible wit
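One possibly relevant detail: the JDBC table connector supports a partitioned scan, which splits the read across parallel source tasks by ranges of a numeric column. A hedged DDL sketch with hypothetical table name, columns, and bounds:

```sql
CREATE TABLE tag_state (
  id BIGINT,
  name STRING,
  val DOUBLE
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://db:5432/mydb',
  'table-name' = 'tag_state',
  'scan.partition.column' = 'id',
  'scan.partition.num' = '32',
  'scan.partition.lower-bound' = '1',
  'scan.partition.upper-bound' = '200000000'
);
```

Whether that output can feed the state processor API's DataSet-based bootstrap is a separate question, as the thread discusses.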

Re: Please advise bootstrapping large state

2021-06-16 Thread Marco Villalobos
) Using the DataSet API and state processor API. I would first try to see > how much effort it is to read the data using the DataSet API (could be less > convenient than the Flink SQL JDBC connector). > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/jdbc
