[
https://issues.apache.org/jira/browse/IGNITE-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anton Dmitriev updated IGNITE-9034:
-----------------------------------
Description:
TensorFlow distributed training has historically been based on workers,
parameter servers and manual assignments, but the new TensorFlow API (Estimator
API) allows running distributed training with minimal changes compared to
single-device execution. Take a look at [this
presentation|https://www.youtube.com/watch?v=bRMGoPqsn20] for more information.
The Estimator API requires the following configuration:
* a TF_CONFIG environment variable that contains JSON with the cluster description
(see [this
tutorial|https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details]),
* tf.contrib.distribute.MirroredStrategy(workers), which defines the distribution
strategy.
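As a rough illustration of the first item, the sketch below builds a TF_CONFIG value for one process of a cluster and exports it. It uses only the Python standard library; the helper name build_tf_config and the host addresses are hypothetical, and how Ignite would actually assign addresses and roles is not specified in this issue.

```python
import json
import os


def build_tf_config(cluster, task_type, task_index):
    """Serialize a cluster description plus this process's role as TF_CONFIG JSON.

    Hypothetical helper for illustration; not part of TensorFlow or Ignite.
    """
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range: %d" % task_index)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical addresses: one chief job and two workers.
cluster = {
    "chief": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}

# Each process gets the same "cluster" section but its own "task" section.
# Here the current process is configured as the second worker (index 1).
os.environ["TF_CONFIG"] = build_tf_config(cluster, "worker", 1)
```

Every process in the cluster would run with the same cluster section and a different task section, which is the part a cluster manager has to generate per node.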
The goal of this task is to allow:
* starting and maintaining a TensorFlow cluster on top of Apache Ignite that
contains workers and a chief job,
* submitting a job to such a cluster using a command line interface.
> [ML] Add Estimator API support to TensorFlow cluster on top of Apache Ignite
> ----------------------------------------------------------------------------
>
> Key: IGNITE-9034
> URL: https://issues.apache.org/jira/browse/IGNITE-9034
> Project: Ignite
> Issue Type: Improvement
> Components: ml
> Reporter: Yury Babak
> Assignee: Anton Dmitriev
> Priority: Major
> Fix For: 2.7