ElasticSearch failing when parallelised

2019-10-17 Thread Nicholas Walton
Hi, I’m running ElasticSearch as a sink for a batch job processing a CSV file of 6.2 million rows, with each row being 181 numeric values. It quite happily processes a small example of around 2,000 rows, running each column through a single parallel pipeline, keyed by column number. However,

Re: standalone flink savepoint restoration

2019-10-17 Thread Congxian Qiu
Hi, do you specify the operator id for all the operators [1][2]? I'm asking because, from the exception you gave, if you only add new operators and all the old operators have specified operator ids, such an exception should not be thrown. [1] https://ci.apache.org/projects/flink/flink-docs-stable/
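The point above can be illustrated with a toy model (plain Java, not Flink's actual savepoint machinery; all names here are illustrative): state is stored keyed by operator id, so a newly added operator simply finds no state, while an operator whose auto-generated id changed between versions leaves its old state orphaned.

```java
import java.util.HashMap;
import java.util.Map;

public class SavepointModel {
    // A savepoint is modeled as a map from operator uid to a state blob.
    static Map<String, String> savepoint = new HashMap<>();

    // Restore: an operator with a stable uid finds its old state; a brand-new
    // operator finds nothing (which is fine); an operator whose uid changed
    // looks like a new operator, and its old state becomes orphaned.
    static String restore(String uid) {
        return savepoint.getOrDefault(uid, null);
    }

    public static void main(String[] args) {
        savepoint.put("source-1", "offsets=42");
        savepoint.put("window-1", "counts={a:3}");

        // Same uids after an upgrade: state is found.
        System.out.println(restore("source-1")); // offsets=42

        // A newly added operator: no state, no error.
        System.out.println(restore("new-map-1")); // null

        // An auto-generated uid that changed between job versions: the old
        // state can no longer be mapped, which is the failure mode stable
        // explicit uids are meant to prevent.
        System.out.println(restore("window-7f3a")); // null
    }
}
```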

Re: Should I use static database connection pool?

2019-10-17 Thread John Smith
Usually the database connection pool is thread safe. When you say task, do you mean a single deployed Flink job? I still think a sink is only initialized once. You can prove it by putting logging in the open and close. On Thu., Oct. 17, 2019, 1:39 a.m. Xin Ma, wrote: > Thanks, John. If I don't static my

Re: Flink State Migration Version 1.8.2

2019-10-17 Thread ApoorvK
It is throwing the error below. The class I am adding variables to has other variables that are objects of classes which are also in state. Caused by: org.apache.flink.util.StateMigrationException: The new state typeSerializer for operator state must not be incompatible. at org.apache.flink.runtime.st

Re: Should I use static database connection pool?

2019-10-17 Thread John Smith
If by task you mean job then yes, global static variables initialized in the main of the job do not get serialized/transferred to the nodes where that job may get assigned. The other thing is also since it is a sink, the sink will be serialized to that node and then initialized so that static variab

Querying nested JSON stream?

2019-10-17 Thread srikanth flink
Hi there, I'm using the Flink SQL client to run the jobs for me. My stream is JSON with nested objects. I couldn't find much documentation on querying nested JSON, so I had to flatten the JSON and use it as: SELECT `source.ip`, `destination.ip`, `dns.query`, `organization.id`, `dns.answers.data` FROM sour

Re: JDBC Table Sink doesn't seem to sink to database.

2019-10-17 Thread Rong Rong
Hi John, You are right. IMO the batch interval setting is used to increase JDBC execution performance. The reason the exception doesn't happen for your INSERT INTO statement with a `non_existing_table` is that the JDBCAppendableSink does not check table existence beforehand. That

currentWatermark for Event Time is not increasing fast enough to go past the window.maxTimestamp

2019-10-17 Thread Vijay Balakrishnan
Hi, *Event Time Window: 15s* My currentWatermark for Event Time processing is not increasing fast enough to go past the window maxTimestamp. I have reduced the *bound* used for the watermark calculation to just *10 ms*. I have increased the parallelInput to process input from Kinesis in parallel across 2 slots
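A minimal sketch of the arithmetic involved in the question above, assuming a bounded-out-of-orderness style watermark (watermark = largest timestamp seen minus the bound) and a 15 s tumbling event-time window; method names are illustrative, not Flink API:

```java
public class WatermarkMath {
    // The watermark trails the largest timestamp seen so far by a fixed bound.
    static long watermark(long maxTimestampSeen, long boundMs) {
        return maxTimestampSeen - boundMs;
    }

    // A tumbling event-time window [start, start + size) fires once the
    // watermark passes its maxTimestamp, which is start + size - 1.
    static boolean windowFires(long windowStart, long windowSizeMs, long wm) {
        return wm >= windowStart + windowSizeMs - 1;
    }

    public static void main(String[] args) {
        long bound = 10;     // 10 ms, as in the question above
        long size = 15_000;  // 15 s tumbling window

        // An element at t=15_008 is not enough: watermark = 14_998 < 14_999.
        System.out.println(windowFires(0, size, watermark(15_008, bound))); // false

        // An element at t=15_009 pushes the watermark to 14_999, and the
        // window [0, 15_000) can fire.
        System.out.println(windowFires(0, size, watermark(15_009, bound))); // true
    }
}
```

The takeaway: with several parallel sources, the operator watermark is the minimum across inputs, so an idle Kinesis shard holds the whole window back regardless of how small the bound is.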

Re: JDBC Table Sink doesn't seem to sink to database.

2019-10-17 Thread John Smith
Yes correct, I set it to batch interval = 1 and it works fine. Anyway, I think the JDBC sink could have some improvements, like batchInterval + time interval execution: if the batch doesn't fill up, then execute whatever is left at that time interval. On Thu, 17 Oct 2019 at 12:22, Rong Rong wro

Re: Mirror Maker 2.0 cluster and starting from latest offset and other queries

2019-10-17 Thread Piotr Nowojski
Hi, It sounds like setting up Mirror Maker has very little to do with Flink. Shouldn’t you try asking this question on the Kafka mailing list? Piotrek > On 16 Oct 2019, at 16:06, Vishal Santoshi wrote: > > 1. still no clue, apart from the fact that ConsumerConfig gets it from > somewhere (

Re: Elasticsearch6UpsertTableSink how to trigger es delete index。

2019-10-17 Thread Piotr Nowojski
Hi, Take a look at the documentation. This [1] describes an example where a running query can produce updated results (and thus retract the previous results). [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html#table-to-stream-conversion

Re: JDBC Table Sink doesn't seem to sink to database.

2019-10-17 Thread Rong Rong
Yes, I think having a time interval execution (for the AppendableSink) should be a good idea. Can you please open a Jira issue[1] for further discussion. -- Rong [1] https://issues.apache.org/jira/projects/FLINK/issues On Thu, Oct 17, 2019 at 9:48 AM John Smith wrote: > Yes correct, I set it t

Re: Mirror Maker 2.0 cluster and starting from latest offset and other queries

2019-10-17 Thread Vishal Santoshi
oh shit.. sorry, wrong forum :) On Thu, Oct 17, 2019 at 1:41 PM Piotr Nowojski wrote: > Hi, > > It sounds like setting up Mirror Maker has very little to do with Flink. > Shouldn’t you try asking this question on the Kafka mailing list? > > Piotrek > > On 16 Oct 2019, at 16:06, Vishal Santo

Re: JDBC Table Sink doesn't seem to sink to database.

2019-10-17 Thread John Smith
I recorded two: Time interval: https://issues.apache.org/jira/browse/FLINK-14442 Checkpointing: https://issues.apache.org/jira/browse/FLINK-14443 On Thu, 17 Oct 2019 at 14:00, Rong Rong wrote: > Yes, I think having a time interval execution (for the AppendableSink) > should be a good idea. > Ca

Managing Job Deployments in Production

2019-10-17 Thread Peter Groesbeck
How are folks here managing deployments in production? We are deploying Flink jobs on EMR manually at the moment but would like to move towards some form of automation before anything goes into production. Adding additional EMR Steps to a long running cluster to deploy or update jobs seems like th

Re: Should I use static database connection pool?

2019-10-17 Thread John Smith
Also the pool should be transient because it holds connections which shouldn't/cannot be serialized. On Thu., Oct. 17, 2019, 9:39 a.m. John Smith, wrote: > If by task you mean job then yes global static variables initialized in > the main of the job do not get serialized/transferred to the nodes w
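The transient point above can be demonstrated with plain Java serialization (no Flink needed): transient fields are skipped when the object is serialized, so the deserialized copy (e.g. a sink shipped to a task manager) starts with the field null and must re-create the pool in open(). FakePool is a hypothetical stand-in for a real connection pool and is deliberately not Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {
    static class FakePool { } // not Serializable, like a real pool

    static class Sink implements Serializable {
        transient FakePool pool = new FakePool();
    }

    // Serialize and deserialize, mimicking shipping the sink to a worker.
    static Sink roundTrip(Sink s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(s);
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (Sink) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Sink copy = roundTrip(new Sink());
        // The pool did not travel with the object. Without `transient`,
        // serialization would instead fail with NotSerializableException.
        System.out.println(copy.pool); // null
    }
}
```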

Re: Broadcast state

2019-10-17 Thread Navneeth Krishnan
Ya, there will not be a problem of duplicates. But what I'm trying to achieve is: if there is a large static state which needs to be present just once per node rather than stored per slot, that would be ideal. The reason is that the state is quite large, around 100GB of mostly static data, and it

ProcessFunction Timer

2019-10-17 Thread Navneeth Krishnan
Hi All, I'm currently using a tumbling window of 5 seconds using TumblingTimeWindow, but due to a change in requirements I no longer have to window every incoming record. With that said, I'm planning to use a process function to achieve this selective windowing. I looked at the example provided in the do
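The selective-windowing idea above can be sketched as a toy simulation of what a KeyedProcessFunction does with processElement() + onTimer(): buffer elements per key, register a timer 5 s after the first element of a batch, and emit the buffer when the timer fires. This is a plain-Java model with illustrative names, not Flink code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class SelectiveWindow {
    private final long windowMs;
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final Map<String, Long> timers = new HashMap<>();
    final List<List<String>> emitted = new ArrayList<>();

    SelectiveWindow(long windowMs) { this.windowMs = windowMs; }

    // Like processElement(): buffer the value, and register a timer only
    // for the first element of a batch.
    void processElement(String key, String value, long now) {
        buffers.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        timers.putIfAbsent(key, now + windowMs);
    }

    // Like onTimer(): fire every timer whose deadline has passed and emit
    // that key's buffered batch.
    void advanceTime(long now) {
        Iterator<Map.Entry<String, Long>> it = timers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= now) {
                emitted.add(buffers.remove(e.getKey()));
                it.remove();
            }
        }
    }
}
```

Because the timer is only registered when a batch actually starts, keys with no incoming data cost nothing, which is the advantage over a fixed tumbling window.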

Data processing with HDFS local or remote

2019-10-17 Thread Pritam Sadhukhan
Hi, I am trying to process data stored on HDFS using Flink batch jobs. Our data is split across 16 data nodes. I am curious to know how the data will be pulled from the data nodes with the parallelism set to the same number as the data splits on HDFS, i.e. 16. Is the Flink task being executed locally on
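A toy illustration of the locality question above: each split has a host that stores its HDFS block, and a split whose assigned task runs on that host is a local read, while any other assignment means the bytes travel over the network. This only models the idea; whether reads are actually local depends on the scheduler and where task managers are deployed.

```java
public class LocalityModel {
    // Count splits whose assigned task runs on the host storing the block.
    static int localReads(String[] splitHosts, String[] taskHosts) {
        int local = 0;
        for (int i = 0; i < splitHosts.length; i++) {
            if (splitHosts[i].equals(taskHosts[i])) {
                local++;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        String[] splits = {"node1", "node2", "node3", "node4"};
        // Tasks co-located with the data: every read is local.
        System.out.println(localReads(splits, splits)); // 4
        // Tasks on different machines: every read goes over the network.
        System.out.println(localReads(splits, new String[]{"x", "x", "x", "x"})); // 0
    }
}
```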

Re: Flink State Migration Version 1.8.2

2019-10-17 Thread Paul Lam
Hi, Could you confirm that you’re using the POJOSerializer before and after migration? Best, Paul Lam > On 2019-10-17, at 21:34, ApoorvK wrote: > > It is throwing the error below. > The class I am adding variables to has other variables that are objects of classes > which are also in state. > > Caused by: org.apach

Re: JDBC Table Sink doesn't seem to sink to database.

2019-10-17 Thread Rong Rong
Splendid. Thanks for following up and moving the discussion forward :-) -- Rong On Thu, Oct 17, 2019 at 11:38 AM John Smith wrote: > I recorded two: > Time interval: https://issues.apache.org/jira/browse/FLINK-14442 > Checkpointing: https://issues.apache.org/jira/browse/FLINK-14443 > > > On Thu

Flink cluster on k8s with rocksdb state backend

2019-10-17 Thread dhanesh arole
Hello all, I am trying to provision a Flink cluster on k8s. Some of the jobs in our existing cluster use RocksDB state backend. I wanted to take a look at the Flink helm chart or deployment manifests that provision task managers with dynamic PV and how they manage it. We are running on kops manage

Re: Flink cluster on k8s with rocksdb state backend

2019-10-17 Thread Steven Nelson
You may want to look at using instances with local SSD drives. You don’t really need to keep the state data between instance stops and starts, because Flink will have to restore from a checkpoint or savepoint, so using ephemeral storage isn’t a problem. Sent from my iPhone > On Oct 17, 2019, at 11:31