Klpek's Note Library



evaluation-measure

Posted on 2017-09-07

Classification problems

precision

recall

$F_1$

The harmonic mean of precision and recall.

When both precision and recall are high, the $F_1$ value is also high.
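Concretely, for precision $P$ and recall $R$:

$$F_1 = \frac{2PR}{P + R}$$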

To read later:

http://charleshm.github.io/2016/03/Model-Performance/

Untitled

Posted on 2017-09-01

Sequence alignment tools

  • BLAST: sequence-sequence alignment
  • PSI-BLAST: profile-sequence alignment
  • HMMER: HMM-sequence alignment
  • PROF_SIM and COMPASS: profile-profile alignment
  • HHSearch and HHSearch2: HMM-HMM alignment

Alignment Quality Evaluation

  • plain MaxSub score
  • Modeler’s score
  • Balanced Score

Untitled

Posted on 2017-08-27

MonitoredSession
https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession
Supports hooks; see the sketch below.
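A minimal sketch of the pattern (the graph, a global step for StopAtStepHook, and train_op are assumed to be defined elsewhere):

hooks = [tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredSession(
        session_creator=tf.train.ChiefSessionCreator(),
        hooks=hooks) as sess:
    while not sess.should_stop():
        # Hooks get callbacks around each run() call and can request a stop.
        sess.run(train_op)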

Experiment

https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment
Experiment is a class containing all information needed to train a model.
By passing an Estimator and inputs for training and evaluation, an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.

method

Constructor
__init__(
    estimator,
    train_input_fn,
    eval_input_fn,
    eval_metrics=None,
    train_steps=None,
    eval_steps=100,
    train_monitors=None,
    eval_hooks=None,
    local_eval_frequency=None,
    eval_delay_secs=120,
    continuous_eval_throttle_secs=60,
    min_eval_frequency=None,
    delay_workers_by_global_step=False,
    export_strategies=None,
    train_steps_per_iteration=None
)

Creates an Experiment instance.
None of the functions passed to this constructor are executed at construction time.
They are stored and used when a method is executed which requires it.

  • estimator
    • Object implementing the Estimator interface.
  • train_input_fn
    • function, returns features and labels for training.
  • eval_input_fn
    • function, returns features and labels for evaluation.
    • If eval_steps is None, this should be configured to produce input for only a finite number of batches (generally, 1 epoch over the evaluation data).
  • eval_metrics
    • dict of string, metric function.
    • If None, the default set is used. This should be None if the estimator is a tf.estimator.Estimator. If metrics are provided, they will be appended to the default set.
  • train_steps
    • Perform this many steps of training. None, the default, means train forever.
  • eval_steps
    • evaluate runs until input is exhausted (or another exception is raised), or for eval_steps steps, if specified.
  • train_monitors
    • A list of monitors to pass to the Estimator’s fit function.
  • eval_hooks
    • A list of SessionRunHook hooks to pass to the Estimator’s evaluate function.
  • eval_delay_secs
    • Start evaluating after waiting for this many seconds.
  • continuous_eval_throttle_secs
    • Used by continuous_eval(): re-evaluate only if the last evaluation started at least this many seconds ago.
  • min_eval_frequency
    • (applies only to train_and_evaluate).
    • the minimum number of steps between evaluations.
    • Of course, evaluation does not occur if no new snapshot is available, hence, this is the minimum. If 0, the evaluation will only happen after training. If None, defaults to 1, unless model_dir is on GCS, in which case the default is 1000.
  • delay_workers_by_global_step
    • if True delays training workers based on global step instead of time.
  • export_strategies
    • Iterable of ExportStrategys, or a single one, or None.
  • train_steps_per_iteration
    • (applies only to continuous_train_and_eval).
    • Perform this (integer) number of train steps for each training-evaluation iteration.
    • With a small value, the model will be evaluated more frequently with more checkpoints saved.
    • If None, will use a default value (which is smaller than train_steps if provided).
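A minimal sketch of constructing and running an Experiment (the estimator and the two input functions are assumed to be defined elsewhere; the step counts are illustrative):

experiment = tf.contrib.learn.Experiment(
    estimator=my_estimator,
    train_input_fn=my_train_input_fn,
    eval_input_fn=my_eval_input_fn,
    train_steps=10000,
    min_eval_frequency=1000)
# Interleave training with evaluation.
experiment.train_and_evaluate()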

supervisor

basic use process

  1. Create a Supervisor object,
    • parameter: logdir
    • the path to a directory where to save checkpoints and summaries.
  2. Ask the supervisor for a session with tf.train.Supervisor.managed_session
  3. Use the session to execute a train op, checking at each step if the supervisor requests that the training stops.
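A minimal sketch of these three steps (my_train_op and the rest of the graph are assumed to be defined above):

# 1. Create a Supervisor that saves checkpoints and summaries into logdir.
sv = tf.train.Supervisor(logdir="/my/training/directory")
# 2. Ask the supervisor for a managed session.
with sv.managed_session() as sess:
    # 3. Run the train op, checking whether the supervisor requests a stop.
    while not sv.should_stop():
        sess.run(my_train_op)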
started services

The managed_session() call starts a few services, which run in their own threads and use the managed session to run ops in your graph.
If your graph contains an integer variable named global_step, the services use its value to measure the number of training steps executed.

  • Checkpointing service: Saves a copy of the graph variables in the logdir.
    • The checkpoint filename uses the value of the global_step variable if one was added to your graph.
    • Runs every 10 minutes by default.
  • Summary service: Runs all the summary ops and appends their output to an events file in the logdir.
    • Runs every 2 minutes by default.
  • Step counter: Counts how many steps have been executed, by looking at changes in the global_step variable.
    • Appends a summary to the events file reporting the number of global steps per second.
    • The summary tag is “global_step/sec”.
    • This also runs every 2 minutes by default.
  • Queue Runners: If any tf.train.QueueRunner were added to the graph, the supervisor launches them in their own threads.
Checking for Stop

The check for stop in the main training loop is important and necessary.

  1. Exceptions raised in the service threads are reported to the supervisor which then sets its should_stop() condition to true.
  2. Other service threads notice that condition and terminate properly.
  3. The main training loop, within the managed_session() block, must also check for the stop condition and terminate.

Notice:
managed_session() takes care of catching exceptions raised from the training loop to report them to the supervisor.
The main loop does not need to do anything special about exceptions. It only needs to check for the stop condition.

Recovery

If the training program shuts down or crashes, its most recent checkpoint and event files are left in the logdir.
When you restart the program, managed_session() restores the graph from the most recent checkpoint and resumes training where it stopped.
A new events file is created. If you start TensorBoard and point it to the logdir, it will know how to merge the contents of the two events files and will show the training resuming at the last global step from the checkpoint.

Larger Model Scenario

Larger models may run out of memory when the summary service runs:

  • The summary ops are run in parallel with the main loop running the train op.
  • This can cause memory usage to peak at up to two times the normal use.

For a larger model you can tell the supervisor to not run the summary service and instead run it yourself in your main training loop:

  • pass summary_op=None when constructing the supervisor.
...create graph...
my_train_op = ...
my_summary_op = tf.summary.merge_all()

sv = tf.train.Supervisor(logdir="/my/training/directory",
                         summary_op=None)  # Do not run the summary service

with sv.managed_session() as sess:
    for step in range(100000):
        if sv.should_stop():
            break
        if step % 100 == 0:
            _, summ = sess.run([my_train_op, my_summary_op])
            sv.summary_computed(sess, summ)
        else:
            sess.run(my_train_op)

Pre-trained Model Scenario

The managed_session() call takes care of initializing the model in the session:

  • If a checkpoint is available, the model is restored from it.
  • Otherwise, it is initialized from scratch.

One common scenario is to initialize the model by loading a “pre-trained” checkpoint that was saved while training a usually slightly different model using a different dataset.

init function
  • is called only if the model needs to be initialized from scratch
  • not when the model can be recovered from a checkpoint in the logdir.

To load the pre-trained model, the init function needs a tf.train.Saver object.

  • This saver must only restore the pre-trained variables
  • This is usually a good idea because the new model may contain variables that are not present in the pre-trained checkpoint
  • If you were using the default saver, you could get an error trying to restore all the variables of the new model from the pre-trained checkpoint.

The process is below:

...create graph...
# Create a saver that restores only the pre-trained variables.
pre_train_saver = tf.train.Saver([pre_train_var1, pre_train_var2])

# Define an init function that loads the pretrained checkpoint.
def load_pretrain(sess):
    pre_train_saver.restore(sess, "<path to pre-trained-checkpoint>")

# Pass the init function to the supervisor.
#
# The init function is called _after_ the variables have been initialized
# by running the init_op.
sv = tf.train.Supervisor(logdir="/my/training/directory",
                         init_fn=load_pretrain)

with sv.managed_session() as sess:
    # Here sess was either initialized from the pre-trained-checkpoint or
    # recovered from a checkpoint saved in a previous run of this code.
    ...

Running Your Own Services

For example, you might want to fetch different sets of summaries on a different schedule than the usual summary service.
Use the tf.train.Supervisor.loop method:

  • It repeatedly calls a function of your choice on a timer until the supervisor stop condition becomes true
  • It plays nicely with the other services.
def my_additional_summaries(sv, sess):
    ...fetch and write summaries, see below...

...
sv = tf.train.Supervisor(logdir="/my/training/directory")
with sv.managed_session() as sess:
    # Call my_additional_summaries() every 1200s, or 20 minutes,
    # passing (sv, sess) as arguments.
    sv.loop(1200, my_additional_summaries, args=(sv, sess))
    ...main training loop...

Writing Summaries

The supervisor always creates an events file in its logdir, as well as a tf.summary.FileWriter to append events and summaries to that file.
If you want to write your own summaries, it is a good idea to append them to that same events file:

  • TensorBoard likes it better when only one events file in a directory is being actively appended to.

Method: tf.train.Supervisor.summary_computed

def my_additional_summaries(sv, sess):
    summaries = sess.run(my_additional_summary_op)
    sv.summary_computed(sess, summaries)

For more advanced usages:
tf.train.Supervisor.summary_writer
https://www.tensorflow.org/api_docs/python/tf/train/Supervisor#summary_writer

Supervisor Reference

Checkpointing: Where and When.

The checkpointing service can be configured by the following keyword arguments to the Supervisor() constructor:

  • logdir:
    • path where the checkpointing service creates checkpoints.
    • Passing None disables the checkpointing and the summary services.
  • checkpoint_basename
    • Name of the checkpoint files to create, defaults to “model.ckpt”.
    • If the model contains a scalar integer variable named global_step, the value of that variable is appended to the checkpoint filename.
  • save_model_secs
    • Number of seconds between each checkpoint. Defaults to 600, or 10 minutes.
  • saver
    • A tf.train.Saver object to use for checkpointing.
    • By default, one is created for you by calling tf.train.Saver(), which adds ops to save and restore all variables in your model.
    • For customized checkpointing behavior, create your own Saver and pass it here.
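For example, a sketch of a supervisor configured to checkpoint every 5 minutes with a customized Saver (my_saver and the values are illustrative):

my_saver = tf.train.Saver(max_to_keep=5)  # keep only the 5 newest checkpoints
sv = tf.train.Supervisor(logdir="/my/training/directory",
                         saver=my_saver,
                         save_model_secs=300)  # checkpoint every 5 minutes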
Summaries: Where and When

The summary service can be configured by the following keyword arguments to the Supervisor() constructor:

  • logdir
    • Path to a directory where the summary service creates event files.
    • Passing None disables the summary service as well as the checkpointing services.
  • save_summaries_secs
    • Number of seconds between each run of the summary service
    • Defaults to 120, or 2 minutes.
    • Pass 0 to disable the summary service.
  • summary_op
    • Op to use to fetch the summaries.
    • By default, uses the first op in the tf.GraphKeys.SUMMARY_OP graph collection.
    • If the collection is empty the supervisor creates an op that aggregates all summaries in the graph using tf.summary.merge_all()
    • Passing None disables the summary service.
  • global_step
    • Tensor to use to count the global step.
    • Default uses the first tensor in the tf.GraphKeys.GLOBAL_STEP graph collection.
    • If the collection is empty, the supervisor looks for a scalar integer variable named global_step in the graph.
    • If found, the global step tensor is used to measure the number of training steps executed
    • Note that your training op is responsible for incrementing the global step value.
Model Initialization and Recovery
initialization

The managed_session() call takes care of initializing or recovering a session.
It returns a session with a fully initialized model, ready to run ops.
If a checkpoint exists in the logdir when managed_session() is called, the model is initialized by loading that checkpoint;
otherwise it is initialized by calling an init op and optionally an init function.
When no checkpoint is available, model initialization is controlled by the following keyword arguments to the Supervisor() constructor:

  • init_op
    • Op to run to initialize the model.
    • Default uses the first op in the tf.GraphKeys.INIT_OP collection.
    • If the collection is empty, the supervisor adds an op to initialize all the variables in the graph by calling tf.global_variables_initializer().
    • Pass None to not use an init op.
  • init_fn
    • Python function to call to initialize the model.
    • If specified, called as init_fn(sess) where sess is the managed session.
    • If an init op is also used, the init function is called after the init op.
  • local_init_op
    • An additional op to initialize parts of the graph that are not saved in checkpoints such as tables and local variables.
    • The local init op is run before the init op and the init function.
    • If not specified, the supervisor uses the first op in the tf.GraphKeys.LOCAL_INIT_OP collection.
    • If the collection is empty the supervisor adds an op to initialize all the tables and local variables in the graph by calling tf.tables_initializer() and tf.local_variables_initializer().
    • Pass None to not use a local init op.
  • ready_op
    • Op to check if the model is initialized.
    • After running the local init op, the init op, and the init function, the supervisor verifies that the model is fully initialized by running the ready op.
    • This is an op that returns an empty string if the model is initialized, or a description of what parts of the model are not initialized if not.
    • If not specified, the supervisor uses the first op in the tf.GraphKeys.READY_OP collection.
    • If the collection is empty, the supervisor creates a ready op that verifies that all variables are initialized by calling tf.report_uninitialized_variables().
    • Pass None to disable the ready op. In that case the model is not checked after initialization.
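As a sketch, these defaults could also be passed explicitly (this mirrors the documented default behavior, so it changes nothing):

sv = tf.train.Supervisor(
    logdir="/my/training/directory",
    local_init_op=tf.group(tf.tables_initializer(),
                           tf.local_variables_initializer()),
    init_op=tf.global_variables_initializer(),
    ready_op=tf.report_uninitialized_variables())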
Recovery

Checkpoint recovery is controlled by the following keyword arguments to the Supervisor() constructor:

  • logdir
    • Path to a directory in which to look for checkpoints.
    • The checkpoint service saves a metadata file, named “checkpoint”, in the checkpoint directory that indicates the path to the most recent checkpoint.
    • This file is in text format. When in a pinch, you can edit it manually to recover from a different checkpoint than the most recent one.
  • ready_op: (see above).
    • The ready op is run before and after loading the checkpoint.
    • The first run checks if the model needs to be initialized
    • the second run verifies that the model is fully initialized.
  • local_init_op: (see above).
    • The local init op is run before running the ready op the first time, to initialize local variables and tables.
  • saver: (see above).
    • The Saver object used to load the checkpoint.

Untitled

Posted on 2017-08-23

Easy demo

# Start a TensorFlow server as a single-process "cluster".
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tf.train.Server.create_local_server()
>>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'

The tf.train.Server.create_local_server method creates a single-process cluster, with an in-process server.

Create a cluster

cluster

a set of “tasks” that participate in the distributed execution of a TensorFlow graph.

task

Each task is associated with a TensorFlow “server”.

  • which contains a “master” that can be used to create sessions,
  • and a “worker” that executes operations in the graph.
jobs

A cluster can also be divided into one or more “jobs”,

  • where each job contains one or more tasks.

create cluster

Start one TensorFlow server per task in the cluster.
Each task typically runs on a different machine, but you can run multiple tasks on the same machine.
In each task, do the following:

  1. Create a tf.train.ClusterSpec that describes all of the tasks in the cluster.
    • This should be the same for each task.
  2. Create a tf.train.Server
    • passing the tf.train.ClusterSpec to the constructor,
    • identifying the local task with a job name and task index.
Create a tf.train.ClusterSpec to describe the cluster

The cluster specification dictionary maps job names to lists of network addresses.
Pass this dictionary to the tf.train.ClusterSpec constructor.

tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})

/job:local/task:0
/job:local/task:1

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]})

/job:worker/task:0
/job:worker/task:1
/job:worker/task:2
/job:ps/task:0
/job:ps/task:1

Create a tf.train.Server instance in each task

A tf.train.Server object contains

  • a set of local devices and a set of connections to other tasks in its tf.train.ClusterSpec,
  • and a tf.Session that can use these to perform a distributed computation.

Each server is a member of a specific named job and has a task index within that job.
A server can communicate with any other server in the cluster.

For example, to launch a cluster with two servers running on localhost:2222 and localhost:2223, run the following snippets in two different processes on the local machine:

# In task 0:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=0)

# In task 1:
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=1)

Specifying distributed devices in your model

To place operations on a particular process, you can use the same tf.device function that is used to specify whether ops run on the CPU or GPU. For example:

with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(...)
    biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
    weights_2 = tf.Variable(...)
    biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
    input, labels = ...
    layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
    logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
    # ...
    train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
    for _ in range(10000):
        sess.run(train_op)

In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).

Replicated training

A common training configuration, called “data parallelism,”

  • involves multiple tasks in a worker job training the same model on different mini-batches of data,
  • updating shared parameters hosted in one or more tasks in a ps job.

There are many ways to specify this structure in TensorFlow, and we are building libraries that will simplify the work of specifying a replicated model.
Possible approaches include:

  • In-graph replication.
    • In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps);
    • and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.
  • Between-graph replication.
    • In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task.
    • Each client builds a similar graph containing the parameters
      • (pinned to /job:ps as before using tf.train.replica_device_setter to map them deterministically to the same tasks)
    • and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.
  • Asynchronous training.
    • In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.
  • Synchronous training.
    • In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together.
    • It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).
Putting it all together: example trainer program

The following code shows the skeleton of a distributed trainer program

import argparse
import sys

import tensorflow as tf

FLAGS = None


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Build model...
      loss = ...
      global_step = tf.contrib.framework.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    hooks = [tf.train.StopAtStepHook(last_step=1000000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir="/tmp/train_logs",
                                           hooks=hooks) as mon_sess:
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        mon_sess.run(train_op)


if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Glossary

Client

A client is typically a program that builds a TensorFlow graph and constructs a tensorflow::Session to interact with a cluster.
A single client process can directly interact with multiple TensorFlow servers (see “Replicated training” above), and a single server can serve multiple clients.

Cluster

A TensorFlow cluster comprises one or more “jobs”, each divided into lists of one or more “tasks”.
A cluster is typically dedicated to a particular high-level objective, such as training a neural network, using many machines in parallel.
A cluster is defined by a tf.train.ClusterSpec object.

Job

A job comprises a list of “tasks”, which typically serve a common purpose.
For example, a job named ps (for “parameter server”) typically hosts nodes that store and update variables
while a job named worker typically hosts stateless nodes that perform compute-intensive tasks.
The tasks in a job typically run on different machines.
The set of job roles is flexible: for example, a worker may maintain some state.

Master service

An RPC service that provides remote access to a set of distributed devices, and acts as a session target.
The master service implements the tensorflow::Session interface, and is responsible for coordinating work across one or more “worker services”.
All TensorFlow servers implement the master service.

Task

A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular “job” and is identified by its index within that job’s list of tasks.

TensorFlow server

A process running a tf.train.Server instance, which is a member of a cluster, and exports a “master service” and “worker service”.

Worker service

An RPC service that executes parts of a TensorFlow graph using its local devices. A worker service implements worker_service.proto. All TensorFlow servers implement the worker service.

Untitled

Posted on 2017-08-19

supervisor
https://www.tensorflow.org/programmers_guide/supervisor

Untitled

Posted on 2017-08-19

Untitled

Posted on 2017-08-19

Learning tf.estimator

Basic structure

Two basic parameters: model_fn and params

nn = tf.estimator.Estimator(model_fn=model_fn, params=model_params)

  • model_fn: A function object that contains all the aforementioned logic to support training, evaluation, and prediction. You are responsible for implementing that functionality.
  • params: An optional dict of hyperparameters (e.g., learning rate, dropout) that will be passed into the model_fn.

Notice:
The Estimator also accepts the general configuration arguments model_dir and config.

Building the model_fn

Basic skeleton
Input: accepted parameters
  • features
    • A dict containing the features passed to the model via input_fn.
  • labels
    • A Tensor containing the labels passed to the model via input_fn
    • Will be empty for predict() calls, as these are the values the model will infer.
  • mode
    • tf.estimator.ModeKeys.TRAIN The model_fn was invoked in training mode, namely via a train() call
    • tf.estimator.ModeKeys.EVAL The model_fn was invoked in evaluation mode, namely via an evaluate() call.
    • tf.estimator.ModeKeys.PREDICT The model_fn was invoked in predict mode, namely via a predict() call.
  • params
    • containing a dict of hyperparameters used for training
body: processing logic
  1. Configure the model
    • a neural network
  2. Define the loss function
    • to calculate how closely the model’s predictions match the target values
  3. Define the training operation
    • to specify the optimizer algorithm to minimize the loss values calculated by the loss function.
  4. Generate predictions
  5. Return predictions/loss/train_op/eval_metric_ops in EstimatorSpec object
return EstimatorSpec(mode, predictions, loss, train_op, eval_metric_ops)
output: return value

The model_fn must return a tf.estimator.EstimatorSpec object.

return EstimatorSpec(mode, predictions, loss, train_op, eval_metric_ops)

  • mode (required in all modes)
    • Pass the mode argument of the model_fn straight through.
  • predictions (required in PREDICT mode)
    • A dict that maps key names of your choice to Tensors containing the predictions from the model.
    • In PREDICT mode, the dict that you return in EstimatorSpec will then be returned by predict()
    • you can construct it in the format in which you’d like to consume it
  • loss (required in EVAL and TRAIN mode)
    • A Tensor containing a scalar loss value
    • the output of the model’s loss function
    • calculated over all the input examples
    • is used in TRAIN mode for error handling and logging
    • is automatically included as a metric in EVAL mode.
  • train_op (required only in TRAIN mode).
    • An Op that runs one step of training
  • eval_metric_ops (optional)
    • A dict of name/value pairs specifying the metrics that will be calculated when the model runs in EVAL mode.
    • The name is a label of your choice for the metric
    • the value is the result of your metric calculation.
    • The tf.metrics module provides predefined functions for a variety of common metrics.
      eval_metric_ops = { "accuracy": tf.metrics.accuracy(labels, predictions) }
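Putting the pieces together, a minimal model_fn skeleton might look like the sketch below (the one-layer network, the "x" feature key, and the metric choice are illustrative assumptions, not from the source):

def model_fn(features, labels, mode, params):
    # 1. Configure the model (a single dense layer, purely illustrative).
    logits = tf.layers.dense(features["x"], units=1)
    predictions = {"results": logits}
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    # 2. Define the loss function.
    loss = tf.losses.mean_squared_error(labels, logits)
    # 3. Define the training operation, reading a hyperparameter from params.
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    # 4./5. Return predictions/loss/train_op/eval_metric_ops in an EstimatorSpec.
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(labels, logits)}
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op,
                                      eval_metric_ops=eval_metric_ops)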
Related APIs

optimizers


tf.Variable and tf.Tensor

Posted on 2017-08-18

tf.Variable

Create variable

tf.get_variable

  • makes it easy to define models which reuse layers
my_int_variable = tf.get_variable("my_int_variable", [1, 2, 3], dtype=tf.int32,
                                  initializer=tf.zeros_initializer)

other_variable = tf.get_variable("other_variable", dtype=tf.int32,
                                 initializer=tf.constant([23, 42]))

Variable collections

collections: named lists of tensors or other objects, such as tf.Variable instances.

Predefined collections

By default every tf.Variable gets placed in the following two collections:

  • tf.GraphKeys.GLOBAL_VARIABLES
    • variables that can be shared across multiple devices
  • tf.GraphKeys.TRAINABLE_VARIABLES
    • variables for which TensorFlow will calculate gradients

If you don’t want a variable to be trainable, add it to the tf.GraphKeys.LOCAL_VARIABLES collection instead.
The method is:

my_local = tf.get_variable("my_local", shape=(),
                           collections=[tf.GraphKeys.LOCAL_VARIABLES])

or

my_non_trainable = tf.get_variable("my_non_trainable", shape=(), trainable=False)

Custom collections

There is no need to explicitly create a collection.
To add a variable (or any other object) to a collection:

tf.add_to_collection("my_collection_name", my_local)

To retrieve a list of all the variables (or other objects) you’ve placed in a collection:

tf.get_collection("my_collection_name")

Variable placement

Placement method:

with tf.device("/gpu:1"):
    v = tf.get_variable("v", [1])

It is particularly important for variables to be in the correct device in distributed settings.
For this reason we provide tf.train.replica_device_setter, which can automatically place variables in parameter servers.

cluster_spec = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"]}
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    v = tf.get_variable("v", shape=[20, 20])  # this variable is placed
                                              # in the parameter server
                                              # by the replica_device_setter

Initializing variables

In the low-level TensorFlow API, you must explicitly initialize the variables yourself

  • that is, when you are explicitly creating your own graphs and sessions

Most high-level frameworks automatically initialize variables for you

  • tf.contrib.slim, tf.estimator.Estimator and Keras

Explicit initialization is otherwise useful because it

  • allows you not to rerun potentially expensive initializers when reloading a model from a checkpoint,
  • and allows determinism when randomly-initialized variables are shared in a distributed setting.

tf.global_variables_initializer()

  • To initialize all trainable variables
  • To be called before training starts
session.run(tf.global_variables_initializer())
# Now all variables are initialized.

To initialize variables yourself:

session.run(my_variable.initializer)

To ask which variables have still not been initialized:

print(session.run(tf.report_uninitialized_variables()))

Notice

By default, tf.global_variables_initializer does not specify the order in which variables are initialized.
If you use a variable’s value while initializing another variable, use variable.initialized_value() instead of variable:

v = tf.get_variable("v", shape=(), initializer=tf.zeros_initializer())
w = tf.get_variable("w", initializer=v.initialized_value() + 1)

Using variables

To use the value of a tf.Variable in a TensorFlow graph, simply treat it like a normal tf.Tensor.
To assign a value to a variable, use the methods such as assign, assign_add in the tf.Variable class.
To force a re-read of the value of a variable after something has happened, you can use tf.Variable.read_value.

v = tf.get_variable("v", shape=(), initializer=tf.zeros_initializer())
assignment = v.assign_add(1)
assignment2 = v.assign_add(1)
with tf.control_dependencies([assignment, assignment2]):
    w = v.read_value()  # w is guaranteed to reflect v's value after the
                        # assign_add operation.
# v: 0
# w: 2

tf.train.Optimizer

https://www.tensorflow.org/api_docs/python/tf/train/Optimizer

Saving and Restoring

https://www.tensorflow.org/programmers_guide/variables#saving_and_restoring

Sharing variables

https://www.tensorflow.org/programmers_guide/variables#sharing_variables

Tensors

https://www.tensorflow.org/programmers_guide/tensors

tf.contrib.learn reading notes

Posted on 2017-08-18

Basic usage of tf.contrib.learn

https://www.tensorflow.org/get_started/tflearn

Basic workflow

  1. Load file containing training/test data into a TensorFlow Dataset
  2. Construct a neural network classifier
  3. Fit the model using the training data
  4. Evaluate the accuracy of the model
  5. Classify new samples / run inference (see the sketch below)
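A minimal sketch of these five steps, loosely following the linked guide (the CSV file names and the four-feature, three-class setup are assumptions):

import numpy as np
import tensorflow as tf

# 1. Load the training and test data into Datasets.
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename="iris_training.csv", target_dtype=np.int, features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename="iris_test.csv", target_dtype=np.int, features_dtype=np.float32)

# 2. Construct a neural network classifier with three hidden layers.
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3)

# 3. Fit the model using the training data.
classifier.fit(x=training_set.data, y=training_set.target, steps=2000)

# 4. Evaluate the accuracy of the model on the test data.
accuracy = classifier.evaluate(x=test_set.data, y=test_set.target)["accuracy"]

# 5. Classify new samples.
new_samples = np.array([[6.4, 3.2, 4.5, 1.5]], dtype=np.float32)
predictions = list(classifier.predict(new_samples, as_iterable=True))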