Easy demo
The tf.train.Server.create_local_server method creates a single-process cluster, with an in-process server.
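For instance, a minimal single-process sketch (assuming the TF 1.x tf.train API) looks like this:

```python
import tensorflow as tf  # TF 1.x-style API

# Start an in-process server; this forms a single-process "cluster".
server = tf.train.Server.create_local_server()

# Connect a session to the server's master and run a trivial op on it.
c = tf.constant("Hello, distributed TensorFlow!")
with tf.Session(server.target) as sess:
    print(sess.run(c))  # b'Hello, distributed TensorFlow!'
```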
Create a cluster
cluster
a set of “tasks” that participate in the distributed execution of a TensorFlow graph.
task
Each task is associated with a TensorFlow “server”.
- which contains a “master” that can be used to create sessions,
- and a “worker” that executes operations in the graph.
jobs
A cluster can also be divided into one or more “jobs”,
- where each job contains one or more tasks.
To create a cluster, start one TensorFlow server per task in the cluster.
Each task typically runs on a different machine, but you can run multiple tasks on the same machine.
In each task, do the following:
- Create a tf.train.ClusterSpec that describes all of the tasks in the cluster. This should be the same for each task.
- Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor, and identifying the local task with a job name and task index.
Create a tf.train.ClusterSpec to describe the cluster
The cluster specification dictionary maps job names to lists of network addresses.
Pass this dictionary to the tf.train.ClusterSpec constructor.
For example, this specification defines two tasks, /job:local/task:0 and /job:local/task:1:

```python
tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
```

And this specification defines the tasks /job:worker/task:0, /job:worker/task:1, /job:worker/task:2, /job:ps/task:0, and /job:ps/task:1:

```python
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]
})
```
Create a tf.train.Server instance in each task
A tf.train.Server object contains:
- a set of local devices,
- a set of connections to other tasks in its tf.train.ClusterSpec,
- and a tf.Session that can use these to perform a distributed computation.
Each server is a member of a specific named job and has a task index within that job.
A server can communicate with any other server in the cluster.
For example, to launch a cluster with two servers running on localhost:2222 and localhost:2223, run the following snippets in two different processes on the local machine:
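A minimal sketch of those two processes (assuming the TF 1.x tf.train API):

```python
# Process for /job:local/task:0
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
server.join()  # keep serving until the process is killed
```

```python
# Process for /job:local/task:1
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
server.join()
```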
Specifying distributed devices in your model
To place operations on a particular process, you can use the same tf.device function that is used to specify whether ops run on the CPU or GPU. For example:
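A sketch of such a placement (the two-layer model, the layer sizes, and the worker7.example.com address are illustrative assumptions):

```python
import tensorflow as tf  # TF 1.x-style API

# Pin the model parameters to two tasks in the "ps" job.
with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(tf.random_normal([784, 100]))
    biases_1 = tf.Variable(tf.zeros([100]))

with tf.device("/job:ps/task:1"):
    weights_2 = tf.Variable(tf.random_normal([100, 10]))
    biases_2 = tf.Variable(tf.zeros([10]))

# Pin the compute-intensive part of the model to a worker task.
with tf.device("/job:worker/task:7"):
    inputs = tf.placeholder(tf.float32, [None, 784])
    labels = tf.placeholder(tf.float32, [None, 10])
    layer_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)
    logits = tf.matmul(layer_1, weights_2) + biases_2
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Connect to the master of one of the servers to run the distributed graph.
with tf.Session("grpc://worker7.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_op, feed_dict={inputs: ..., labels: ...})  # real mini-batches
```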
In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).
Replicated training
A common training configuration, called “data parallelism,”
- involves multiple tasks in a worker job training the same model on different mini-batches of data,
- updating shared parameters hosted in one or more tasks in a ps job.
There are many ways to specify this structure in TensorFlow, and we are building libraries that will simplify the work of specifying a replicated model.
Possible approaches include:
- In-graph replication.
- In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps);
- and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.
- Between-graph replication.
- In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task.
- Each client builds a similar graph containing the parameters (pinned to /job:ps as before, using tf.train.replica_device_setter to map them deterministically to the same tasks),
- and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
- Asynchronous training.
- In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
- Synchronous training.
- In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together.
- It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).
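For the synchronous, between-graph case, a sketch of wrapping a plain optimizer in tf.train.SyncReplicasOptimizer (the toy model, replica count, and chief flag below are illustrative placeholders):

```python
import tensorflow as tf  # TF 1.x-style API

# Illustrative stand-ins; in a real trainer these come from the cluster
# spec (number of workers, chief task) and the model graph.
num_workers = 2
is_chief = True
x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
global_step = tf.train.get_or_create_global_step()

# Wrap a plain optimizer so that gradients from all replicas are
# aggregated before being applied, keeping the replicas in lockstep.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.01),
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# This hook manages the aggregation queues; pass it to
# tf.train.MonitoredTrainingSession(..., hooks=[sync_hook]).
sync_hook = opt.make_session_run_hook(is_chief)
```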
Putting it all together: example trainer program
The following code shows the skeleton of a distributed trainer program, implementing between-graph replication and asynchronous training.
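A possible skeleton, assuming the TF 1.x tf.train API; the flag names and the toy linear model are illustrative, and each process would be launched with flags identifying its job name and task index:

```python
import argparse
import sys

import tensorflow as tf  # TF 1.x-style API

FLAGS = None


def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Create a cluster from the parameter server and worker hosts.
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a server for the local task.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        # Parameter-server tasks just host variables and block forever.
        server.join()
    elif FLAGS.job_name == "worker":
        # replica_device_setter pins variables to /job:ps (round-robin) and
        # everything else to the local worker task.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            # Toy linear model standing in for a real network.
            x = tf.placeholder(tf.float32, [None, 1])
            y = tf.placeholder(tf.float32, [None, 1])
            w = tf.Variable(tf.zeros([1, 1]))
            b = tf.Variable(tf.zeros([1]))
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

            global_step = tf.train.get_or_create_global_step()
            train_op = tf.train.AdagradOptimizer(0.01).minimize(
                loss, global_step=global_step)

        # Stop after a fixed number of global steps.
        hooks = [tf.train.StopAtStepHook(last_step=100000)]

        # MonitoredTrainingSession handles initialization, checkpointing,
        # and recovery; the chief (task 0) takes on those duties.
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(FLAGS.task_index == 0),
                checkpoint_dir="/tmp/train_logs",
                hooks=hooks) as mon_sess:
            while not mon_sess.should_stop():
                # Each worker runs training steps asynchronously; feed real
                # mini-batches here instead of the constant toy example.
                mon_sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", type=str, default="",
                        help="Comma-separated list of host:port pairs")
    parser.add_argument("--worker_hosts", type=str, default="",
                        help="Comma-separated list of host:port pairs")
    parser.add_argument("--job_name", type=str, default="",
                        help="One of 'ps', 'worker'")
    parser.add_argument("--task_index", type=int, default=0,
                        help="Index of the task within its job")
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
```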
Glossary
Client
A client is typically a program that builds a TensorFlow graph and constructs a tensorflow::Session to interact with a cluster.
A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.
Cluster
A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”.
A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel.
A cluster is defined by a tf.train.ClusterSpec object.
Job
A job comprises a list of “tasks”, which typically serve a common purpose.
For example, a job named ps (for “parameter server”) typically hosts nodes that store and update variables, while a job named worker typically hosts stateless nodes that perform compute-intensive tasks.
The tasks in a job typically run on different machines.
The set of job roles is flexible: for example, a worker may maintain some state.
Master service
An RPC service that provides remote access to a set of distributed devices, and acts as a session target.
The master service implements the tensorflow::Session interface, and is responsible for coordinating work across one or more “worker services”.
All TensorFlow servers implement the master service.
Task
A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.
TensorFlow server
A process running a tf.train.Server instance, which is a member of a cluster, and exports a “master service” and “worker service”.
Worker service
An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements worker_service.proto. All TensorFlow servers implement the worker service.