Classification problems
precision
recall
$F_1$:
the harmonic mean of precision and recall;
$F_1$ is high only when both precision and recall are high.
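For reference, with precision $P$ and recall $R$:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$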
sequence-sequence alignment
profile-sequence alignment
HMM-sequence alignment
profile-profile alignment
MonitoredSession
https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession
hook
https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment
Experiment is a class containing all information needed to train a model.
By passing an Estimator and inputs for training and evaluation, an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.
Creates an Experiment instance.
None of the functions passed to this constructor are executed at construction time.
They are stored and used when a method is executed which requires it.
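A minimal sketch of constructing an Experiment; my_model_fn, my_train_input_fn, and my_eval_input_fn are assumed to be user-defined elsewhere:

```python
import tensorflow as tf

def experiment_fn(run_config, params):
    # The Estimator wraps a user-defined model_fn (my_model_fn is assumed
    # to exist, as are the two input functions below).
    estimator = tf.estimator.Estimator(
        model_fn=my_model_fn, params=params, config=run_config)
    # Nothing passed here runs at construction time; the Experiment only
    # stores these pieces and uses them when training/evaluation is invoked.
    return tf.contrib.learn.Experiment(
        estimator=estimator,
        train_input_fn=my_train_input_fn,
        eval_input_fn=my_eval_input_fn,
        train_steps=10000,
        eval_steps=100)
```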
tf.train.Supervisor.managed_session
The managed_session() call starts a few services, which run in their own threads and use the managed session to run ops in your graph.
If your graph contains an integer variable named global_step, the services use its value to measure the number of training steps executed.
If any tf.train.QueueRunner instances were added to the graph, the supervisor launches them in their own threads.
The check for stop in the main training loop is important and necessary.
Notice: managed_session()
takes care of catching exceptions raised from the training loop to report them to the supervisor.
The main loop does not need to do anything special about exceptions. It only needs to check for the stop condition.
If the training program shuts down or crashes, its most recent checkpoint and event files are left in the logdir.
When you restart the program, managed_session() restores the graph from the most recent checkpoint and resumes training where it stopped.
A new events file is created. If you start TensorBoard and point it to the logdir, it will know how to merge the contents of the two events files and will show the training resuming at the last global step from the checkpoint.
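A minimal sketch of this training-loop pattern (TF 1.x); the graph is reduced to a step counter so the loop has something to run, and the logdir path is illustrative:

```python
import tensorflow as tf

# Build the graph: a global_step variable plus a trivial stand-in train op.
global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)

# The Supervisor starts the checkpoint/summary services and writes to logdir.
sv = tf.train.Supervisor(logdir="/tmp/mydir")
with sv.managed_session() as sess:
    while not sv.should_stop():          # the stop check is mandatory
        step = sess.run(train_op)
        if step >= 1000:                 # illustrative stop condition
            sv.request_stop()
```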
Larger models may run out of memory when the summary service runs.
For a larger model you can tell the supervisor not to run the summary service and instead run it yourself in your main training loop: pass summary_op=None when constructing the supervisor.
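A sketch of that arrangement, with a stand-in scalar summary and an illustrative logdir:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)
my_summary_op = tf.summary.scalar("step", global_step)  # stand-in summaries

# summary_op=None disables the supervisor's own summary service.
sv = tf.train.Supervisor(logdir="/tmp/mydir", summary_op=None)
with sv.managed_session() as sess:
    while not sv.should_stop():
        step = sess.run(train_op)
        if step % 100 == 0:
            # Run the summaries in the main loop and append them to the
            # supervisor's events file.
            sv.summary_computed(sess, sess.run(my_summary_op))
        if step >= 1000:
            sv.request_stop()
```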
The managed_session() call takes care of initializing the model in the session
One common scenario is to initialize the model by loading a “pre-trained” checkpoint that was saved while training a usually slightly different model using a different dataset.
To load the pre-trained model, the init function needs a tf.train.Saver object.
The process is below:
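A sketch of that process, assuming an illustrative pre-trained checkpoint path and model variables:

```python
import tensorflow as tf

# Build the model; two variables stand in for the pre-trained weights.
w = tf.get_variable("w", shape=[10, 10])
b = tf.get_variable("b", shape=[10])
train_op = tf.assign_add(tf.train.get_or_create_global_step(), 1)

# A Saver restricted to the variables saved by the earlier training run.
pre_train_saver = tf.train.Saver([w, b])

def load_pretrain(sess):
    # Called by managed_session() only when no checkpoint exists in logdir.
    pre_train_saver.restore(sess, "/tmp/pretrain/model.ckpt")

sv = tf.train.Supervisor(logdir="/tmp/mydir", init_fn=load_pretrain)
with sv.managed_session() as sess:
    # The model is ready here: recovered from logdir, or initialized via
    # load_pretrain() if no checkpoint was found.
    while not sv.should_stop():
        sess.run(train_op)
```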
For example, to fetch different sets of summaries on a different schedule than the usual summary service, use the tf.train.Supervisor.loop method.
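A sketch of using Supervisor.loop for a custom summary schedule; the 60-second interval and the stand-in summary op are illustrative:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)
my_summary_op = tf.summary.scalar("step", global_step)  # stand-in summaries

def my_summary(sv, sess):
    # Runs in a service thread started by sv.loop().
    sv.summary_computed(sess, sess.run(my_summary_op))

sv = tf.train.Supervisor(logdir="/tmp/mydir", summary_op=None)
with sv.managed_session() as sess:
    # Call my_summary(sv, sess) every 60 seconds until the supervisor stops.
    sv.loop(60, my_summary, args=(sv, sess))
    while not sv.should_stop():
        step = sess.run(train_op)
        if step >= 1000:
            sv.request_stop()
```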
The supervisor always creates an events file in its logdir, as well as a tf.summary.FileWriter
to append events and summaries to that file.
If you want to write your own summaries, it is a good idea to append them to that same events file.
Method: tf.train.Supervisor.summary_computed
For more advanced usages: tf.train.Supervisor.summary_writer
https://www.tensorflow.org/api_docs/python/tf/train/Supervisor#summary_writer
The checkpointing service can be configured by the following keyword arguments to the Supervisor() constructor:
saver: the tf.train.Saver object to use for checkpointing.
The summary service can be configured by the following keyword arguments to the Supervisor() constructor:
summary_op: the op used to fetch summaries; if not specified, the supervisor uses the first op in the tf.GraphKeys.SUMMARY_OP graph collection, typically the op returned by tf.summary.merge_all().
global_step: the tensor used to count the global step in the graph.
The managed_session()
call takes care of initializing or recovering a session.
It returns a session with a fully initialized model, ready to run ops.
If a checkpoint exists in the logdir when managed_session()
is called, the model is initialized by loading that checkpoint
otherwise it is initialized by calling an init op and optionally an init function.
When no checkpoint is available, model initialization is controlled by the following keyword arguments to the Supervisor()
constructor:
init_op: if not specified, the supervisor uses the first op in the tf.GraphKeys.INIT_OP collection, falling back to an op created with tf.global_variables_initializer().
local_init_op: if not specified, the supervisor uses the first op in the tf.GraphKeys.LOCAL_INIT_OP collection, falling back to tf.tables_initializer() and tf.local_variables_initializer().
ready_op: if not specified, the supervisor checks that the model is initialized with tf.report_uninitialized_variables().
Checkpoint recovery is controlled by the following keyword arguments to the Supervisor() constructor: logdir, ready_op, local_init_op, and saver.
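Putting these together, an illustrative Supervisor construction using the keyword arguments above (the values are assumptions, not the defaults):

```python
import tensorflow as tf

# A small graph so the arguments below have something to act on.
v = tf.get_variable("v", shape=[], initializer=tf.zeros_initializer())
global_step = tf.train.get_or_create_global_step()

sv = tf.train.Supervisor(
    logdir="/tmp/mydir",                           # checkpoints, events, summaries
    saver=tf.train.Saver(),                        # checkpointing service
    summary_op=None,                               # disable the summary service
    global_step=global_step,                       # step counter for the services
    init_op=tf.global_variables_initializer(),     # used when no checkpoint exists
    local_init_op=tf.local_variables_initializer(),
    ready_op=tf.report_uninitialized_variables())  # readiness check after init/recovery
```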
The tf.train.Server.create_local_server
method creates a single-process cluster, with an in-process server.
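A small sketch of an in-process server:

```python
import tensorflow as tf

c = tf.constant("Hello, distributed TensorFlow!")
server = tf.train.Server.create_local_server()  # in-process, single-task cluster
sess = tf.Session(server.target)                # connect a session to that server
print(sess.run(c))
```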
A TensorFlow “cluster” is a set of “tasks” that participate in the distributed execution of a TensorFlow graph.
Each task is associated with a TensorFlow “server”.
A cluster can also be divided into one or more “jobs”,
To create a cluster, start one TensorFlow server per task in the cluster.
Each task typically runs on a different machine
but you can run multiple tasks on the same machine
In each task, do the following:
Create a tf.train.ClusterSpec that describes all of the tasks in the cluster; this should be the same for each task.
Create a tf.train.Server, passing the tf.train.ClusterSpec to the constructor and identifying the local task with a job name and task index.
To describe the cluster, create a tf.train.ClusterSpec: the cluster specification dictionary maps job names to lists of network addresses.
Pass this dictionary to the tf.train.ClusterSpec constructor.
```python
tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
```
Available tasks: /job:local/task:0, /job:local/task:1
```python
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]
})
```
Available tasks: /job:worker/task:0, /job:worker/task:1, /job:worker/task:2, /job:ps/task:0, /job:ps/task:1
A tf.train.Server object contains a set of local devices and a set of connections to other tasks in its tf.train.ClusterSpec.
Each server is a member of a specific named job and has a task index within that job.
A server can communicate with any other server in the cluster.
For example, to launch a cluster with two servers running on localhost:2222 and localhost:2223, run the following snippets in two different processes on the local machine:
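Sketches of the two processes (the ports come from the cluster spec above; server.join() simply blocks so the server keeps serving):

```python
import tensorflow as tf

# Process 1 on the local machine: serve /job:local/task:0.
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)
server.join()  # block forever, serving requests
```

```python
import tensorflow as tf

# Process 2 on the same machine: serve /job:local/task:1.
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)
server.join()
```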
To place operations on a particular process, you can use the same tf.device function that is used to specify whether ops run on the CPU or GPU. For example:
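A sketch of the placement pattern described in the next paragraph; the layer sizes, feature shapes, and task indices are illustrative:

```python
import tensorflow as tf

# Variables live on the two ps tasks...
with tf.device("/job:ps/task:0"):
    weights_1 = tf.get_variable("weights_1", shape=[784, 256])
    biases_1 = tf.get_variable("biases_1", shape=[256])

with tf.device("/job:ps/task:1"):
    weights_2 = tf.get_variable("weights_2", shape=[256, 10])
    biases_2 = tf.get_variable("biases_2", shape=[10])

# ...while the compute-intensive part of the model runs on a worker task.
with tf.device("/job:worker/task:0"):
    inputs = tf.placeholder(tf.float32, shape=[None, 784])
    labels = tf.placeholder(tf.int64, shape=[None])
    layer_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)
    logits = tf.matmul(layer_1, weights_2) + biases_2
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```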
In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).
A common training configuration, called “data parallelism,” involves multiple tasks in a worker job training the same model on different mini-batches of data, updating shared parameters hosted in one or more tasks in a ps job.
There are many ways to specify this structure in TensorFlow, and we are building libraries that will simplify the work of specifying a replicated model.
Possible approaches include:
Between-graph replication: each worker builds a similar graph, with the variables pinned to the ps job (using tf.train.replica_device_setter to map them deterministically to the same tasks).
The following code shows the skeleton of a distributed trainer program:
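A condensed, illustrative skeleton (between-graph replication with a MonitoredTrainingSession); the cluster addresses, model, and hyperparameters are assumptions:

```python
import tensorflow as tf

def main(job_name, task_index):
    # Illustrative cluster: one ps task and two worker tasks.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()
    elif job_name == "worker":
        # Pin variables to the ps tasks, other ops to the local worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % task_index,
                cluster=cluster)):
            global_step = tf.train.get_or_create_global_step()
            w = tf.get_variable("w", shape=[10])
            loss = tf.reduce_sum(tf.square(w))          # stand-in model/loss
            train_op = tf.train.AdagradOptimizer(0.01).minimize(
                loss, global_step=global_step)

        # Task 0 acts as chief and handles initialization and checkpointing.
        hooks = [tf.train.StopAtStepHook(last_step=100000)]
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(task_index == 0),
                checkpoint_dir="/tmp/train_logs",
                hooks=hooks) as mon_sess:
            while not mon_sess.should_stop():
                mon_sess.run(train_op)

if __name__ == "__main__":
    main("worker", 0)  # in practice, job name and index come from flags
```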
A client is typically a program that builds a TensorFlow graph and constructs a tensorflow::Session to interact with a cluster.
A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.
A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”.
A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel.
A cluster is defined by a tf.train.ClusterSpec object.
A job comprises a list of “tasks”, which typically serve a common purpose.
For example, a job named ps (for “parameter server”) typically hosts nodes that store and update variables
while a job named worker typically hosts stateless nodes that perform compute-intensive tasks.
The tasks in a job typically run on different machines.
The set of job roles is flexible: for example, a worker may maintain some state.
An RPC service that provides remote access to a set of distributed devices, and acts as a session target.
The master service implements the tensorflow::Session interface, and is responsible for coordinating work across one or more “worker services”.
All TensorFlow servers implement the master service.
A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.
A process running a tf.train.Server instance, which is a member of a cluster, and exports a “master service” and “worker service”.
An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements worker_service.proto. All TensorFlow servers implement the worker service.
Two basic arguments: model_fn and params
Notice:
The Estimator
also accepts the general configuration arguments model_dir and config.
input_fn: supplies the input data; train(), evaluate(), and predict() each take an input_fn.
tf.estimator.ModeKeys.TRAIN: the model_fn was invoked in training mode, namely via a train() call.
tf.estimator.ModeKeys.EVAL: the model_fn was invoked in evaluation mode, namely via an evaluate() call.
tf.estimator.ModeKeys.PREDICT: the model_fn was invoked in predict mode, namely via a predict() call.
optimizer: the algorithm used to minimize the loss values calculated by the loss function.
The model_fn must return a tf.estimator.EstimatorSpec object; for predict(), the EstimatorSpec must include the predictions.
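A minimal model_fn sketch covering the three modes; the feature key "x", the toy linear model, and the learning-rate parameter are illustrative assumptions:

```python
import tensorflow as tf

def my_model_fn(features, labels, mode, params):
    # A toy linear model; labels are assumed to have shape [batch, 1].
    predictions = tf.layers.dense(features["x"], units=1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss)

    # mode == tf.estimator.ModeKeys.TRAIN
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn, params={"learning_rate": 0.01}, model_dir="/tmp/model")
```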
tf.get_variable
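For example (names, shapes, and initializers are illustrative):

```python
import tensorflow as tf

weights = tf.get_variable("weights", shape=[784, 256],
                          initializer=tf.glorot_uniform_initializer())
biases = tf.get_variable("biases", shape=[256],
                         initializer=tf.zeros_initializer())
```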
collections: named lists of tensors or other objects, such as tf.Variable
instances.
By default every tf.Variable gets placed in the following two collections:
tf.GraphKeys.GLOBAL_VARIABLES
tf.GraphKeys.TRAINABLE_VARIABLES
If you don’t want a variable to be trainable, add it to the tf.GraphKeys.LOCAL_VARIABLES
collection instead.
The method is to pass collections=[tf.GraphKeys.LOCAL_VARIABLES] to tf.get_variable, or to specify trainable=False (see the sketch after this list).
There is no need to explicitly create a collection.
To add a variable (or any other object) to a collection:
To retrieve a list of all the variables (or other objects) you’ve placed in a collection:
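A sketch of these collection operations (names are illustrative):

```python
import tensorflow as tf

# Keep a variable out of TRAINABLE_VARIABLES via the collections argument...
my_local = tf.get_variable(
    "my_local", shape=(), collections=[tf.GraphKeys.LOCAL_VARIABLES])
# ...or via trainable=False.
my_non_trainable = tf.get_variable("my_non_trainable", shape=(), trainable=False)

# Add a variable (or any other object) to a named collection.
tf.add_to_collection("my_collection_name", my_local)

# Retrieve everything that has been placed in that collection.
items = tf.get_collection("my_collection_name")
```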
Placement method:
It is particularly important for variables to be in the correct device in distributed settings.
For this reason we provide tf.train.replica_device_setter
, which can automatically place variables in parameter servers.
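A sketch of both placement styles; the device string and cluster addresses are illustrative:

```python
import tensorflow as tf

# Place a variable on a specific device explicitly.
with tf.device("/device:GPU:1"):
    v = tf.get_variable("v", shape=[10, 10])

# In a distributed setting, replica_device_setter puts variables on ps tasks
# automatically while other ops stay on the local worker.
cluster_spec = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    w = tf.get_variable("w", shape=[20, 20])  # placed on a ps task
```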
In the low-level TensorFlow API, you must explicitly initialize the variables yourself.
Most high-level frameworks, such as tf.contrib.slim, tf.estimator.Estimator, and Keras, automatically initialize variables for you.
Explicit initialization is otherwise useful because it allows you not to rerun potentially expensive initializers when reloading a model from a checkpoint, and allows determinism when randomly initialized variables are shared in a distributed setting.
tf.global_variables_initializer()
To initialize variables yourself:
ask which variables have still not been initialized:
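A sketch of both calls (the variable name is illustrative):

```python
import tensorflow as tf

my_variable = tf.get_variable("my_variable", shape=[3],
                              initializer=tf.zeros_initializer())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # initialize everything at once
    sess.run(my_variable.initializer)             # or a single variable
    # Report any variables that have still not been initialized.
    print(sess.run(tf.report_uninitialized_variables()))
```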
By default, tf.global_variables_initializer does not specify the order in which variables are initialized.
If you use a variable’s value while initializing another variable, use variable.initialized_value() instead of the variable itself.
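For example:

```python
import tensorflow as tf

v = tf.get_variable("v", shape=(), initializer=tf.zeros_initializer())
# w's initializer reads v's initial value rather than the (possibly
# uninitialized) variable v itself.
w = tf.get_variable("w", initializer=v.initialized_value() + 1.0)
```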
To use the value of a tf.Variable in a TensorFlow graph, simply treat it like a normal tf.Tensor.
To assign a value to a variable, use methods such as assign and assign_add in the tf.Variable class.
To force a re-read of the value of a variable after something has happened, you can use tf.Variable.read_value.
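A small sketch of using, assigning, and re-reading a variable:

```python
import tensorflow as tf

v = tf.get_variable("v", shape=(), initializer=tf.zeros_initializer())
w = v + 1                      # use the variable like a normal tf.Tensor
assignment = v.assign_add(1)   # increments v when run
# Force a re-read of v after the assignment has happened.
with tf.control_dependencies([assignment]):
    v_after = v.read_value()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(v_after))   # runs the assignment first, then reads v -> 1.0
```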
https://www.tensorflow.org/api_docs/python/tf/train/Optimizer
https://www.tensorflow.org/programmers_guide/variables#saving_and_restoring
https://www.tensorflow.org/programmers_guide/variables#sharing_variables
https://www.tensorflow.org/get_started/tflearn