Hi
Thanks for bringing this up.
The design looks very good to me in that
1. In the new per-job mode, we don't need to compile user programs in the
client and can directly run user programs with user jars. That makes
resource isolation in multi-tenant platforms easier and is much safer.
2. Th
Congratulations!
Regards,
Xiaogang
Guowei Ma wrote on Wed, Sep 11, 2019 at 7:07 PM:
> Congratulations Zili !
>
> Best,
> Guowei
>
>
> Fabian Hueske wrote on Wed, Sep 11, 2019 at 7:02 PM:
>
>> Congrats Zili Chen :-)
>>
>> Cheers, Fabian
>>
>> On Wed, Sep 11, 2019 at 12:48 PM, Biao Liu wrote:
>>
>>> Congrats Zili!
>>
ready started TaskManagers in P2P fashion, not to have a
> blocker on HDFS replication.
>
> A Spark job with the exact same size jar and 800 executors, without any
> tuning, can start without any issue on the same cluster in less than a
> minute.
>
> *Further questions:*
>
> *@ S
Hi Datashov,
We faced similar problems in our production clusters.
Currently, both the launching and stopping of containers are performed in
the main thread of YarnResourceManager. As containers are launched and
stopped one after another, it usually takes a long time to bootstrap large
jobs. Things get worse wh
Hi Prashantnayak
Thanks a lot for reporting this problem. Can you provide more details so
that we can address it?
I am guessing the master has to delete too many files when a checkpoint is
subsumed, which is very common in our cases. The number of files in the
recovery directory will increase if the master canno
Hi Vinay,
We observed a similar problem before. We found that RocksDB keeps a lot of
index and filter blocks in memory. With the growth in state size (in our
cases, most states are only cleared in windowed streams), these blocks will
occupy much more memory.
We now let RocksDB put these blocks in
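For context, newer Flink releases (after this thread) can bound RocksDB's memory, including the index and filter blocks mentioned above, through managed memory. A sketch of the relevant `flink-conf.yaml` keys from recent Flink versions; these options did not exist at the time of the original message:

```yaml
# Let RocksDB allocate its block cache, index/filter blocks, and memtables
# from Flink's managed memory budget instead of growing unbounded.
state.backend.rocksdb.memory.managed: true
# Alternatively, pin a fixed RocksDB memory budget per slot:
# state.backend.rocksdb.memory.fixed-per-slot: 256 mb
```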
Hi rhashmi
We are also experiencing slow checkpoints when there is back pressure.
It seems there is no good method to handle back pressure now.
We work around it by setting a larger checkpoint timeout. The
default value is 10min. But checkpoints usually take more time to complete
whe
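For reference, the checkpoint timeout described above can be raised in `flink-conf.yaml`. A sketch assuming a recent Flink version, where the key is named `execution.checkpointing.timeout` (older releases set it programmatically via `CheckpointConfig#setCheckpointTimeout`):

```yaml
# Give slow checkpoints under back pressure up to 30 minutes to complete
# before they are declared failed (default: 10 min).
execution.checkpointing.timeout: 30 min
```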
Hi Sathi,
According to the format specification of URIs, "abc-checkpoint" is the host
name in the given URI and the path is null. Therefore, FsStateBackend is
complaining about the usage of the root directory.
Maybe "s3:///abc-checkpoint" ("///" instead of "//") is the URI that you
want to use. I
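The two-slash vs. three-slash parsing behavior described above can be checked with any standards-compliant URI parser; a minimal sketch with Python's stdlib (the name `abc-checkpoint` is taken from the original message):

```python
from urllib.parse import urlsplit

# With "s3://", the first segment is parsed as the authority (host),
# leaving an empty path -- so the state backend sees no directory at all.
two_slashes = urlsplit("s3://abc-checkpoint")
print(two_slashes.netloc, repr(two_slashes.path))   # abc-checkpoint ''

# With "s3:///", the authority is empty and "/abc-checkpoint" becomes
# the path, which is what the state backend actually expects.
three_slashes = urlsplit("s3:///abc-checkpoint")
print(three_slashes.netloc, repr(three_slashes.path))   # '' '/abc-checkpoint'
```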
Hi Yunfan,
Jobs are supposed to correctly restart from both savepoints and checkpoints
with different parallelisms if only operator states and keyed states are
used. In cases where there exist unpartitionable states (e.g., those
produced by the Checkpointed interface), the job will fail to
Hi Yassine,
If I understand correctly, you need sorted state, which
unfortunately is not supported in Flink now.
We have some ideas to provide such sorted state to facilitate the
development of user applications, but it is still under discussion due to
concerns about backward compatibility.