Hi Yangze, Xintong,

Thank you for the instant response.

And big thanks for the hint on the TM launch command. It was indeed the
problem. I've added my own custom mesos-taskmanager.sh
<https://gist.github.com/Atlaster/305b5d63429e7dbf264d43a6cc4d72e5> to echo
the launch command (I switched logging to DEBUG level, but it didn't
display anything useful). May I suggest adding something like this
in future releases?

As for my particular case, the issue was in this mesos-appmaster.sh option:

-Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"

My custom launch script was slicing the argument array incorrectly.
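
For reference, here is a minimal sketch of a wrapper that logs the launch
command and forwards every argument unchanged. It is illustrative only and is
not the actual custom_launch_tm.sh; the function names log_launch_args and
launch_tm are hypothetical, and it assumes the wrapper only needs to drop its
own first argument (the target script) before handing over the rest.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper sketch, not the real custom_launch_tm.sh.

# Print each argument on its own line (to stderr) so word boundaries are visible.
log_launch_args() {
    local i=0
    local arg
    for arg in "$@"; do
        printf 'arg[%d]=%s\n' "$i" "$arg" >&2
        i=$((i + 1))
    done
}

# Run the command named by the first argument with all remaining arguments.
launch_tm() {
    local target="$1"
    shift                            # drop only the target itself
    log_launch_args "$target" "$@"
    "$target" "$@"                   # quoted "$@" keeps each -Dkey=value as one word
}
```

The quoted "$@" is the important part: an unquoted $@ or $* re-splits
arguments on whitespace, and shifting or slicing by the wrong offset can drop
or displace the dynamic -D options.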

Thanks for the help and regards,
Alex.

On Thu, Mar 12, 2020 at 15:46, Xintong Song <tonysong...@gmail.com> wrote:

> Hi Alex,
>
> Could you check and post your TM launch command? I suspect that
> there might be some unrecognized arguments that prevent the rest of
> the arguments from being parsed.
>
> The TM memory configuration process works as follows:
>
>    1. The resource manager will parse the configurations, checking which
>    options are configured and which are not, and calculate the size of each
>    memory component. (This is where ‘taskmanager.memory.process.size’ is
>    used.)
>    2. After deriving the memory component sizes, the resource manager
>    will generate the launch command for the task managers, with dynamic
>    configurations "-D <key=value>" overwriting the memory component sizes.
>    Therefore, even if you have not configured
>    'taskmanager.memory.task.heap.size', this config option is expected to
>    be available by the time the TM is launched.
>    3. When a task manager is started, it will not do the calculations
>    again, and will directly read the memory component sizes calculated by
>    resource manager from the dynamic configurations. That means it is not
>    reading ‘taskmanager.memory.process.size’ and deriving memory component
>    sizes from it again.
>
> One thing that might have caused your problem is that, when
> MesosTaskExecutorRunner parses the command line arguments (that's where
> the dynamic configurations are passed in), if it meets an unrecognized
> token it will stop parsing the rest of the arguments. That could be the
> reason that 'taskmanager.memory.task.heap.size' is missing. You can take a
> look at the launch command and see if there's anything unexpected before
> the memory dynamic configurations.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo <karma...@gmail.com> wrote:
>
>> Hi, Alexander
>>
>> I could not reproduce it in my local environment. Normally, the Mesos RM
>> calculates all the memory configs and adds them to the launch command.
>> Unfortunately, all the logs I could find for this command are at the
>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
>> anything about the taskmanager launch command you can find in the
>> current log?
>>
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
>> <as.kasyane...@gmail.com> wrote:
>> >
>> > Hi folks,
>> >
>> > I have a question about the new memory configuration introduced in
>> Flink 1.10. Has anyone encountered a similar problem?
>> > I'm trying to use the taskmanager.memory.process.size configuration
>> key with a Mesos session cluster, but I get an error like this:
>> >
>> > 2020-03-11 11:44:09,771 [main] ERROR
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Error while
>> starting the TaskManager
>> > org.apache.flink.configuration.IllegalConfigurationException: Failed to
>> create TaskExecutorResourceSpec
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
>> > at
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
>> > at java.base/java.security.AccessController.doPrivileged(Native Method)
>> > at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>> > at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>> > at
>> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> > at
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
>> > Caused by:
>> org.apache.flink.configuration.IllegalConfigurationException: The required
>> configuration option Key: 'taskmanager.memory.task.heap.size' , default:
>> null (fallback keys: []) is not set
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
>> > at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
>> > at
>> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
>> > ... 9 more
>> >
>> > But when the task manager is launched, it correctly parses the process
>> memory key:
>> >
>> > 2020-03-11 11:43:55,376 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>> --------------------------------------------------------------------------------
>> > 2020-03-11 11:43:55,377 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Starting
>> MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @
>> 19:18:19 CET)
>> > 2020-03-11 11:43:55,377 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  OS current
>> user: root
>> > 2020-03-11 11:43:57,347 [main] WARN
>> org.apache.hadoop.util.NativeCodeLoader                       - Unable to
>> load native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> > 2020-03-11 11:43:57,535 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM:
>> OpenJDK 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
>> > 2020-03-11 11:43:57,535 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Maximum
>> heap size: 746 MiBytes
>> > 2020-03-11 11:43:57,535 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JAVA_HOME:
>> (not set)
>> > 2020-03-11 11:43:57,539 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Hadoop
>> version: 2.6.5
>> > 2020-03-11 11:43:57,539 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  JVM
>> Options:
>> > 2020-03-11 11:43:57,539 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -Xmx781818251
>> > 2020-03-11 11:43:57,539 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -Xms781818251
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -XX:MaxDirectMemorySize=317424929
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -XX:MaxMetaspaceSize=100663296
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -Dlog.file=/var/log/flink-session-cluster/taskmanager.log
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -Dlog4j.configuration=file:/opt/flink/conf/log4j.properties
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>>  -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Program
>> Arguments: (none)
>> > 2020-03-11 11:43:57,540 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -  Classpath:
>> /opt/flink/lib/apache-log4j-extras-1.2.17.jar:/opt/flink/lib/flink-metrics-graphite-1.10.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.6.5-8.0.jar:/opt/flink/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink/lib/flink-table_2.12-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.12-1.10.0.jar:
>> > 2020-03-11 11:43:57,541 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     -
>> --------------------------------------------------------------------------------
>> > 2020-03-11 11:43:57,542 [main] INFO
>> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner     - Registered
>> UNIX signal handlers for [TERM, HUP, INT]
>> > 2020-03-11 11:43:57,550 [main] INFO
>> org.apache.flink.configuration.GlobalConfiguration            - Loading
>> configuration property: taskmanager.memory.process.size, 2g
>> > 2020-03-11 11:43:57,550 [main] INFO
>> org.apache.flink.configuration.GlobalConfiguration            - Loading
>> configuration property: taskmanager.cpu.cores, 2
>> > 2020-03-11 11:43:57,551 [main] INFO
>> org.apache.flink.configuration.GlobalConfiguration            - Loading
>> configuration property: taskmanager.numberOfTaskSlots, 4
>> > 2020-03-11 11:43:57,551 [main] INFO
>> org.apache.flink.configuration.GlobalConfiguration            - Loading
>> configuration property: parallelism.default, 1
>> > ...
>> >
>> > Judging by the docs, specifying the taskmanager.memory.process.size key
>> should be enough to launch the job, but it seems like this value is ignored.
>> > I would appreciate any suggestion.
>> >
>> > Regards and thanks in advance,
>> > Alex.
>>
>
