Hi Vino,
Another use case would be that I want to build a DAG of batch sources, sinks
and transforms, and I want to schedule the jobs periodically. One could say it
is similar to Airflow, but a Flink API would be a lot better!
Sent from my iPhone
> On Jan 10, 2020, at 6:42 PM, vino yang wrote:
>
>
> Hi kant,
Hi Vino,
I am new to Flink. I was thinking more of a DAG builder API where I can build
a DAG of sources, sinks and transforms, and hopefully Flink takes care of the
entire life cycle of the DAG.
An example would be the CDAP pipeline API.
Sent from my iPhone
> On Jan 10, 2020, at 6:42 PM, vino yang
Hi kant,
Can you provide more context about your question? What do you mean by
"pipeline API"?
IMO, you can build an ETL pipeline by composing several Flink transform
APIs. Which transform APIs to choose depends on your business
logic.
Here is the list of the generic APIs. [1]
Best,
Vino
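Vino's suggestion above, composing sources, transforms, and sinks into a pipeline, can be sketched with a toy builder. This is not a Flink or CDAP API, just a minimal stdlib illustration of the fluent builder pattern being asked about; all class and method names here are made up.

```python
# Toy pipeline builder: NOT a Flink API, just an illustration of
# chaining a source, transforms, and a sink into a small linear DAG.

class Pipeline:
    def __init__(self, source):
        self._source = source          # callable producing an iterable
        self._transforms = []          # transform functions, applied in order

    def transform(self, fn):
        self._transforms.append(fn)
        return self                    # return self to allow fluent chaining

    def sink(self, collector):
        # Run the whole pipeline eagerly and push results into the sink.
        for record in self._source():
            for fn in self._transforms:
                record = fn(record)
            collector.append(record)
        return collector

out = []
Pipeline(lambda: range(5)) \
    .transform(lambda x: x * 2) \
    .transform(lambda x: x + 1) \
    .sink(out)
print(out)  # [1, 3, 5, 7, 9]
```

A real engine would of course build a lazy graph and hand it to a scheduler instead of executing eagerly; the sketch only shows the shape of the API.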
Hi All,
I am wondering if there are pipeline APIs for ETL?
Thanks!
Thanks Till, I will do some tests on this. Will this be a public
feature in the next release version or later?
Till Rohrmann wrote on Fri, Jan 10, 2020 at 6:15 AM:
> Hi,
>
> you would need to set the co-location constraint in order to ensure that
> the sub-tasks of operators are deployed to the same machine
Thanks Zhijiang, it looks like serialization will always be there in a keyed
stream.
Zhijiang wrote on Fri, Jan 10, 2020 at 12:08 AM:
> Only chained operators can avoid record serialization cost, but the
> chaining mode can not support keyed stream.
> If you want to deploy downstream with upstream in the same task manager
Thank you both for the suggestions.
I did a bit more analysis using the UI and identified at least one
problem that's occurring with the job right now. Going to fix it first and
then take it from there.
*Problem that I identified:*
I'm running with a parallelism of 26. For the checkpoints that are expiring,
one o
The error we get is the following:
org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy
Yarn session cluster
at
org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:423)
at
org.apache.flink.client.c
Hi Kostas,
I didn’t see a follow-up to this, and have also run into this same issue of
winding up with a bunch of .inprogress files when a bounded input stream ends
and the job terminates.
When StreamingFileSink.close() is called, shouldn't all buckets get
auto-rolled, so that the .inprogress
Hi:
I have a few questions about how state is shared among processors in Flink.
1. If I have a processor instantiated in the Flink app, and I use it
multiple times in the job - (a) if the tasks are in the same slot, do
they share the same processor instance on the taskmanager?
(b) if the tasks
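A toy sketch of the usual model in a parallel dataflow engine (hedged: this illustrates the general pattern, not Flink's exact internals): each parallel subtask works on its own copy of the operator, typically produced by serializing the operator at deployment time, so instance fields are not shared across subtasks.

```python
import copy

class Processor:
    def __init__(self):
        self.count = 0               # per-instance state
    def process(self, record):
        self.count += 1
        return record

prototype = Processor()
# Each parallel subtask gets its own copy of the operator
# (a real engine would ship a serialized copy to each task).
subtasks = [copy.deepcopy(prototype) for _ in range(3)]

subtasks[0].process("a")
subtasks[0].process("b")
subtasks[1].process("c")

print([t.count for t in subtasks])  # [2, 1, 0] -- state is not shared
```

Shared state across subtasks therefore has to go through an explicit mechanism (in Flink's case, keyed or operator state), never through plain instance fields.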
The error you got is due to an older ASM version, which is fixed for 1.10
in https://issues.apache.org/jira/browse/FLINK-13467 .
On 10/01/2020 15:58, KristoffSC wrote:
Hi,
Yangze Guo, Chesnay Schepler thank you very much for your answers.
I have actually a funny setup.
So I have a Flink Job mod
Hi,
Yangze Guo, Chesnay Schepler thank you very much for your answers.
I have actually a funny setup.
So I have a Flink Job module, generated from Flink's maven archetype.
This module has all operators and Flink environment config and execution.
This module is compiled by maven with "maven.compil
Hi Yangze,
Thanks for your reply. Those are the docs I have read and followed. (I was
also able to set up a standalone Flink cluster with secure HDFS, ZooKeeper
and Kafka.)
Could you please let me know what I am missing? Thanks
Best,
Ethan
> On Jan 10, 2020, at 6:28 AM, Yangze Guo wrote:
>
Hi,
Interesting! What problem are you seeing when you don't unset that
environment variable? From reading UserGroupInformation.java our code
should almost work when that environment variable is set.
Best,
Aljoscha
On 10.01.20 15:23, Juan Gentile wrote:
Hello Aljoscha!
The way we send the
Hello Aljoscha!
The way we send the DTs to Spark is by setting an env variable
(HADOOP_TOKEN_FILE_LOCATION), which interestingly we have to unset when we
run Flink, because even if we do kinit, that variable somehow affects Flink
and it doesn't work.
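A small sketch of the unset step Juan describes (the token path below is hypothetical, and the child process stands in for the Flink client): removing HADOOP_TOKEN_FILE_LOCATION from the environment before launching the process hides it from that process.

```python
import os
import subprocess
import sys

env = dict(os.environ)
env["HADOOP_TOKEN_FILE_LOCATION"] = "/tmp/tokens"  # hypothetical token file

# Equivalent of `unset HADOOP_TOKEN_FILE_LOCATION` before running Flink:
env.pop("HADOOP_TOKEN_FILE_LOCATION", None)

# The child process (here just Python, standing in for the Flink client)
# no longer sees the variable.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print('HADOOP_TOKEN_FILE_LOCATION' in os.environ)"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # False
```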
I’m not an expert but what you describe (We wou
Hi,
you would need to set the co-location constraint in order to ensure that
the sub-tasks of operators are deployed to the same machine. It effectively
means that subtasks a_i, b_i of operator a and b will be deployed to the
same slot. This feature is not super well exposed but you can take a look
Hi,
it seems I hit send too early; my mail was missing a small part. This is
the full mail again:
to summarize and clarify various emails: currently, you can only use
Kerberos authentication via tickets (i.e. kinit) or keytab. The relevant
bit of code is in the Hadoop security module [1]. Her
Hi Juan,
to summarize and clarify various emails: currently, you can only use
Kerberos authentication via tickets (i.e. kinit) or keytab. The relevant
bit of code is in the Hadoop security module: [1]. Here you can see that
we either use keytab.
I think we should be able to extend this to al
Hi
For an expired checkpoint, you can find something like "Checkpoint xxx of job
xx expired before completing" in jobmanager.log; then you can go to the
checkpoint UI to find which tasks did not ack, and go to these tasks to see
what happened.
If the checkpoint was declined, you can find something
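A minimal stdlib sketch of scanning a jobmanager log for the expiry message mentioned above. The log line format is taken from the quoted message; real logs also carry timestamps and logger names, and the sample lines here are invented.

```python
import re

# Sample lines as they might appear in jobmanager.log (invented examples;
# the "expired before completing" phrasing is from the message above).
log_text = """\
INFO  Triggering checkpoint 41 of job 7a9b
INFO  Checkpoint 41 of job 7a9b expired before completing
INFO  Completed checkpoint 42 of job 7a9b
"""

expired = [
    line for line in log_text.splitlines()
    if re.search(r"Checkpoint \d+ of job \S+ expired before completing", line)
]
print(expired)  # the single expired-checkpoint line
```

In practice you would read the real log file instead of the inline sample, then take the checkpoint IDs found there to the checkpoint UI to see which tasks never acknowledged.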
Hi, Ethan
You could first check your cluster following this guide [1] and check if
all the related config [2] is set correctly.
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/security-kerberos.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/config.html#secu
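For reference, the Kerberos-related entries in flink-conf.yaml typically look like the fragment below. The keytab path and principal are placeholders; check the config reference in [2] for the authoritative key list and defaults.

```yaml
# flink-conf.yaml -- Kerberos settings (placeholder values)
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.keytab: /path/to/user.keytab
security.kerberos.login.principal: user@EXAMPLE.COM
# components that should use the Kerberos credentials
security.kerberos.login.contexts: Client,KafkaClient
```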
Hi KristoffSC
As Zhu said, Flink enables slot sharing [1] by default. This feature has
nothing to do with the resources of your cluster. The benefit of this
feature is described in [1] as well. I mean, it will not detect how many
slots are in your cluster and adjust its behavior to this number. If
you
In regards to what we test:
We run our tests against Java 8 *and* Java 11, with the compilation and
testing being done with the same JDK.
In other words, we don't check whether Flink compiled with JDK 8 runs on
JDK 11, but we currently have no reason to believe that there is a
problem (and som
Hi Krzysztof
All the tests run with Java 11 after FLINK-13457 [1]. Its fix version
is set to 1.10, so I fear 1.9.1 is not guaranteed to run on
Java 11. I suggest you wait for the release of 1.10.
[1]https://issues.apache.org/jira/browse/FLINK-13457
Best,
Yangze Guo
On Fri, Jan 10, 2020 a
Hi Eva
If the checkpoint failed, please check the web UI or the jobmanager log to
see why it failed; it might have been declined by some specific task.
If the checkpoint expired, you can also check the web UI to see which tasks
did not respond in time; some hot task might not be able to respond in time.
Gen
Hi Zhu Zhu,
well, in my last test I did not change the job config, so I did not change
the parallelism level of any operator and I did not change the policy
regarding slot sharing (it stays as the default one). Operator chaining is
set to true without any extra actions like "start new chain, disable chain e
Hi,
Thank you for your answer. Btw, it seems that you sent the reply only to my
address and not to the mailing list :)
I'm looking forward to trying out 1.10-rc then.
Regarding the second thing you wrote, that
*"on Java 11, all the tests(including end to end tests) would be run with
Java 11 profile now.
Hi sunfulin,
Looks like the error happened in the sink instead of the source.
Caused by: java.lang.NullPointerException: Null result cannot be used for
atomic types.
at DataStreamSinkConversion$5.map(Unknown Source)
So the point is how you wrote to the sink. Can you share that code?
Best,
Jin
I don't see a particular reason why you see this behavior. Chesnay's
explanation is the only plausible way that this behavior can happen.
I fear that without a specific log, we cannot help further.
On Fri, Jan 10, 2020 at 5:06 AM Jayant Ameta wrote:
> Also, the ES version I'm using is 5.6.7
>
>
Only chained operators can avoid the record serialization cost, but chaining
can not support keyed streams.
If you deploy the downstream operator in the same task manager as the
upstream one, you can avoid the network shuffle cost, which still brings a
performance benefit.
As I know @Till Rohrmann has impleme
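The distinction Zhijiang draws above can be sketched with a toy example (hedged: Python's pickle stands in for Flink's serializers, and the records are invented; only the presence or absence of a ser/de step matters, not the specific calls). Records handed between chained operators are passed by reference in the same thread, while records crossing a task boundary must be serialized and deserialized.

```python
import pickle

record = {"key": "user-42", "value": list(range(100))}

def upstream(r):
    return r

def downstream(r):
    return len(r["value"])

# Chained: downstream receives the upstream output object directly,
# no serialization involved.
chained_result = downstream(upstream(record))

# Across a task boundary: the record is serialized, shipped over the
# network, and deserialized before the downstream operator sees it.
wire_bytes = pickle.loads  # noqa: placeholder to keep names obvious
wire = pickle.dumps(upstream(record))
shuffled_result = downstream(pickle.loads(wire))

print(chained_result, shuffled_result)  # 100 100 -- same result, extra cost
```

Same answer either way; the shuffled path just pays for the ser/de round trip, which is why keyed streams (which force a shuffle) cannot skip serialization even when the tasks land in the same task manager.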