MonitoredSession
https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession
hook
Experiment
https://www.tensorflow.org/api_docs/python/tf/contrib/learn/Experiment
Experiment is a class containing all information needed to train a model.
By passing an Estimator and inputs for training and evaluation, an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.
method
Constructor
Creates an Experiment instance.
None of the functions passed to this constructor are executed at construction time.
They are stored and used when a method that requires them is executed (a construction sketch follows the parameter list below).
- estimator
- Object implementing the Estimator interface.
- train_input_fn
- function, returns features and labels for training.
- eval_input_fn
- function, returns features and labels for evaluation.
- If eval_steps is None, this should be configured to produce only a finite number of batches (generally, 1 epoch over the evaluation data).
- eval_metrics
- dict of string, metric function.
- If None, the default set is used. This should be None if the estimator is a tf.estimator.Estimator. If metrics are provided, they will be appended to the default set.
- train_steps
- Perform this many steps of training. None, the default, means train forever.
- eval_steps
- evaluate runs until input is exhausted (or another exception is raised), or for eval_steps steps, if specified.
- train_monitors
- A list of monitors to pass to the Estimator’s fit function.
- eval_hooks
- A list of SessionRunHook hooks to pass to the Estimator’s evaluate function.
- eval_delay_secs
- Start evaluating after waiting for this many seconds.
- continuous_eval_throttle_secs
- For continuous_eval(), re-evaluate only when the last evaluation was started at least this many seconds ago.
- min_eval_frequency
- (applies only to train_and_evaluate).
- the minimum number of steps between evaluations.
- Of course, evaluation does not occur if no new snapshot is available, hence, this is the minimum. If 0, the evaluation will only happen after training. If None, defaults to 1, unless model_dir is on GCS, in which case the default is 1000.
- delay_workers_by_global_step
- If True, delays training workers based on global step instead of time.
- export_strategies
- Iterable of ExportStrategys, or a single one, or None.
- train_steps_per_iteration
- (applies only to continuous_train_and_eval).
- Perform this many (integer) number of train steps for each training-evaluation iteration.
- With a small value, the model will be evaluated more frequently with more checkpoints saved.
- If None, will use a default value (which is smaller than train_steps if provided).
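Putting the arguments together, a minimal construction sketch (my_estimator and the two input functions are assumed to be defined elsewhere; the step counts are only example values):
```python
import tensorflow as tf

# Assumed placeholders: my_estimator, my_train_input_fn and my_eval_input_fn
# are built elsewhere (an Estimator plus two input functions).
experiment = tf.contrib.learn.Experiment(
    estimator=my_estimator,
    train_input_fn=my_train_input_fn,
    eval_input_fn=my_eval_input_fn,
    train_steps=10000,        # None would mean "train forever"
    eval_steps=None,          # evaluate until the eval input is exhausted
    min_eval_frequency=1000)  # evaluate at most once per 1000 train steps

# Typical entry points once the Experiment exists:
# experiment.train_and_evaluate()   # interleaved training and evaluation
# experiment.continuous_eval()      # evaluation-only loop
```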
supervisor
basic use process
- Create a Supervisor object,
- parameter: logdir
- the path to a directory where to save checkpoints and summaries.
- Ask the supervisor for a session with
tf.train.Supervisor.managed_session
- Use the session to execute a train op, checking at each step whether the supervisor requests that the training stop (see the sketch below).
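Those three steps together look like this minimal sketch (train_op stands for whatever training op your graph defines):
```python
import tensorflow as tf

# ...build the graph here, including train_op and (ideally) a global_step...

# Create a Supervisor that saves checkpoints and summaries into logdir.
sv = tf.train.Supervisor(logdir="/tmp/mydir")

# Ask for a session; the model is initialized or restored automatically.
with sv.managed_session() as sess:
    # Keep training until the supervisor (or one of its services) asks to stop.
    while not sv.should_stop():
        sess.run(train_op)
```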
started services
The managed_session() call starts a few services, which run in their own threads and use the managed session to run ops in your graph.
If your graph contains an integer variable named global_step, the services use its value to measure the number of training steps executed.
- Checkpointing service: Saves a copy of the graph variables in the logdir.
- The checkpoint filename uses the value of the global_step variable if one was added to your graph.
- Runs every 10 minutes by default.
- Summary service: Runs all the summary ops and appends their output to an events file in the logdir.
- Runs every 2 minutes by default.
- Step counter: Counts how many steps have been executed, by looking at changes in the global_step variable.
- Appends a summary to the events file reporting the number of global steps per second.
- The summary tag is “global_step/sec”.
- This also runs every 2 minutes by default.
- Queue Runners: If any tf.train.QueueRunner were added to the graph, the supervisor launches them in their own threads.
Checking for Stop
The check for stop in the main training loop is important and necessary.
- Exceptions raised in the service threads are reported to the supervisor which then sets its should_stop() condition to true.
- Other service threads notice that condition and terminate properly.
- The main training loop, within the managed_session() block, must also check for the stop condition and terminate.
Notice: managed_session() takes care of catching exceptions raised from the training loop to report them to the supervisor.
The main loop does not need to do anything special about exceptions. It only needs to check for the stop condition.
Recovery
If the training program shuts down or crashes, its most recent checkpoint and event files are left in the logdir.
When you restart the program, managed_session() restores the graph from the most recent checkpoint and resumes training where it stopped.
A new events file is created. If you start TensorBoard and point it to the logdir, it will know how to merge the contents of the two events files and will show the training resuming at the last global step from the checkpoint.
Larger Model Scenario
Larger models may run out of memory when the summary service runs:
- The summary ops are run in parallel with the main loop running the train op.
- This can cause memory usage to peak at up to two times the normal use.
For a larger model you can tell the supervisor to not run the summary service and instead run it yourself in your main training loop:
- Pass summary_op=None when constructing the supervisor (see the sketch below).
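A sketch of this pattern, assuming train_op and my_summary_op are built elsewhere and that a summary every 2 minutes is acceptable:
```python
import time
import tensorflow as tf

# Pass summary_op=None so the supervisor does not start the summary service.
sv = tf.train.Supervisor(logdir="/tmp/mydir", summary_op=None)

with sv.managed_session() as sess:
    next_summary_time = time.time()
    while not sv.should_stop():
        sess.run(train_op)
        # Run the summaries from the main loop, never concurrently with train_op.
        if time.time() >= next_summary_time:
            summary = sess.run(my_summary_op)
            sv.summary_computed(sess, summary)
            next_summary_time = time.time() + 120  # e.g. every 2 minutes
```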
Pre-trained Model Scenario
The managed_session() call takes care of initializing the model in the session:
- If a checkpoint is available in the logdir, the model is restored from it.
- Otherwise, it is initialized from scratch.
One common scenario is to initialize the model by loading a “pre-trained” checkpoint that was saved while training a (usually slightly different) model on a different dataset.
init function
- is called only if the model needs to be initialized from scratch
- not when the model can be recovered from a checkpoint from the logdir.
To load the pre-trained model, the init function needs a tf.train.Saver object.
- This saver must only restore the pre-trained variables
- This is usually a good idea because the new model may contain variables that are not present in the pre-trained checkpoint
- If you were using the default saver, you could get an error trying to restore all the variables of the new model from the pre-trained checkpoint.
The process is below:
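A sketch of that process (pre_train_var1, pre_train_var2, train_op and the checkpoint path are placeholders):
```python
import tensorflow as tf

# ...build the new model graph...

# A saver that restores only the variables present in the pre-trained
# checkpoint (pre_train_var1 / pre_train_var2 are assumed examples).
pre_train_saver = tf.train.Saver([pre_train_var1, pre_train_var2])

def load_pretrain(sess):
    # Called only when the model cannot be recovered from a checkpoint in logdir.
    pre_train_saver.restore(sess, "<path to pre-trained checkpoint>")

# Pass the init function to the supervisor; it runs after the init op.
sv = tf.train.Supervisor(logdir="/tmp/mydir", init_fn=load_pretrain)

with sv.managed_session() as sess:
    while not sv.should_stop():
        sess.run(train_op)
```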
Running Your Own Services
For example, to fetch different sets of summaries on a different schedule than the usual summary service.
Use the tf.train.Supervisor.loop method:
- It repeatedly calls a function of your choice on a timer until the supervisor stop condition becomes true
- It plays nicely with the other services.
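A sketch of scheduling such a service with loop() (my_additional_summaries is the example function defined in the next section; 1200 seconds is just an example interval):
```python
import tensorflow as tf

sv = tf.train.Supervisor(logdir="/tmp/mydir")
with sv.managed_session() as sess:
    # Call my_additional_summaries(sv, sess) every 1200 seconds (20 minutes)
    # in its own thread; the loop ends when the supervisor stop condition is set.
    sv.loop(1200, my_additional_summaries, args=(sv, sess))
    while not sv.should_stop():
        sess.run(train_op)
```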
Writing Summaries
The supervisor always creates an events file in its logdir, as well as a tf.summary.FileWriter to append events and summaries to that file.
If you want to write your own summaries, it is a good idea to append them to that same events file:
- TensorBoard likes it better when only one events file in a directory is being actively appended to.
Method: tf.train.Supervisor.summary_computed
For more advanced usages: tf.train.Supervisor.summary_writer
https://www.tensorflow.org/api_docs/python/tf/train/Supervisor#summary_writer
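A minimal sketch of such a summary function, assuming my_additional_summary_op is a summary op built separately from the default collection:
```python
def my_additional_summaries(sv, sess):
    # Fetch the extra summaries and hand them to the supervisor, which
    # appends them to the events file it already manages.
    summaries = sess.run(my_additional_summary_op)
    sv.summary_computed(sess, summaries)
```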
Supervisor Reference
Checkpointing: Where and When.
The checkpointing service can be configured by the following keyword arguments to the Supervisor() constructor:
- logdir:
- path where the checkpointing service creates checkpoints.
- Passing None disables the checkpointing and the summary services.
- checkpoint_basename
- Name of the checkpoint files to create, defaults to “model.ckpt”.
- If the model contains a scalar integer variable named global_step, the value of that variable is appended to the checkpoint filename.
- save_model_secs
- Number of seconds between each checkpoint. Defaults to 600, or 10 minutes.
- saver
- A tf.train.Saver object to use for checkpointing.
- Default creates one for you by calling tf.train.Saver(), which adds ops to save and restore all variables in your model.
- If you need customized saving behavior, create your own Saver and pass it in.
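A sketch showing these checkpointing arguments on the constructor (the values are examples, not requirements):
```python
sv = tf.train.Supervisor(
    logdir="/tmp/mydir",                  # None disables checkpointing and summaries
    checkpoint_basename="model.ckpt",     # the default basename
    save_model_secs=600,                  # checkpoint every 10 minutes
    saver=tf.train.Saver(max_to_keep=5))  # a customized Saver
```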
Summaries: Where and When
The summary service can be configured by the following keyword arguments to the Supervisor() constructor:
- logdir
- Path to a directory where the summary service creates event files.
- Passing None disables the summary service as well as the checkpointing services.
- save_summaries_secs
- Number of seconds between each run of the summary service
- Defaults to 120, or 2 minutes.
- Pass 0 to disable the summary service.
- summary_op
- Op to use to fetch the summaries.
- Default uses the first op in the tf.GraphKeys.SUMMARY_OP graph collection.
- If the collection is empty, the supervisor creates an op that aggregates all summaries in the graph using tf.summary.merge_all().
- Passing None disables the summary service.
- global_step
- Tensor to use to count the global step.
- Default uses the first tensor in the tf.GraphKeys.GLOBAL_STEP graph collection.
- If the collection is empty, the supervisor looks for a scalar integer variable named global_step in the graph.
- If found, the global step tensor is used to measure the number of training steps executed.
- Note that your training op is responsible for incrementing the global step value.
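A sketch showing these summary arguments (again with example values; the global step is obtained here with tf.train.get_or_create_global_step, one common way to get such a tensor):
```python
# One common way to obtain a global step tensor (assumed here).
my_global_step = tf.train.get_or_create_global_step()

sv = tf.train.Supervisor(
    logdir="/tmp/mydir",
    save_summaries_secs=120,            # 0 disables the summary service
    summary_op=tf.summary.merge_all(),  # None also disables the summary service
    global_step=my_global_step)
```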
Model Initialization and Recovery
initialization
The managed_session() call takes care of initializing or recovering a session.
It returns a session with a fully initialized model, ready to run ops.
If a checkpoint exists in the logdir when managed_session() is called, the model is initialized by loading that checkpoint; otherwise it is initialized by calling an init op and optionally an init function.
When no checkpoint is available, model initialization is controlled by the following keyword arguments to the Supervisor() constructor:
- init_op
- Op to run to initialize the model.
- Default uses the first op in the tf.GraphKeys.INIT_OP collection.
- If the collection is empty, the supervisor adds an op to initialize all the variables in the graph by calling tf.global_variables_initializer().
- Pass None to not use an init op.
- init_fn
- Python function to call to initialize the model.
- If specified, called as init_fn(sess) where sess is the managed session.
- If an init op is also used, the init function is called after the init op.
- local_init_op
- An additional op to initialize parts of the graph that are not saved in checkpoints such as tables and local variables.
- The local init op is run before the init op and the init function.
- If not specified, the supervisor uses the first op in the tf.GraphKeys.LOCAL_INIT_OP collection.
- If the collection is empty, the supervisor adds an op to initialize all the tables and local variables in the graph by calling tf.tables_initializer() and tf.local_variables_initializer().
- Pass None to not use a local init op.
- ready_op
- Op to check if the model is initialized.
- After running the local init op, the init op, and the init function, the supervisor verifies that the model is fully initialized by running the ready op.
- This is an op that returns an empty string if the model is initialized, or a description of what parts of the model are not initialized if not.
- If not specified, the supervisor uses the first op in the tf.GraphKeys.READY_OP collection.
- If the collection is empty, the supervisor creates a ready op that verifies that all variables are initialized by calling tf.report_uninitialized_variables().
- Pass None to disable the ready op. In that case the model is not checked after initialization.
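A sketch that spells these defaults out explicitly (each argument can be replaced or set to None as described above; my_init_fn is a placeholder):
```python
def my_init_fn(sess):
    # Runs only when the model is initialized from scratch, after the init op.
    pass

sv = tf.train.Supervisor(
    logdir="/tmp/mydir",
    init_op=tf.global_variables_initializer(),
    init_fn=my_init_fn,
    local_init_op=tf.group(tf.tables_initializer(),
                           tf.local_variables_initializer()),
    ready_op=tf.report_uninitialized_variables())
```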
Recovery
Checkpoint recovery is controlled by the following keyword arguments to the Supervisor() constructor:
- logdir
- Path to a directory in which to look for checkpoints.
- The checkpoint service saves a metadata file, named “checkpoint”, in the checkpoint directory that indicates the path to the most recent checkpoint.
- This file is in text format. When in a pinch, you can edit it manually to recover from a different checkpoint than the most recent one.
- ready_op: (see above).
- The ready op is run before and after loading the checkpoint.
- The first run checks if the model needs to be initialized
- the second run verifies that the model is fully initialized.
- local_init_op: (see above).
- The local init op is run before running the ready op the first time, to initialize local variables and tables.
- saver: (see above). Saver object used to load the checkpoint.
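As a side note, the standard helpers below read that metadata file, which can be handy when checking what recovery would load (the path is a placeholder):
```python
# Path of the most recent checkpoint recorded in logdir (None if there is none).
ckpt_path = tf.train.latest_checkpoint("/tmp/mydir")

# Parsed contents of the "checkpoint" metadata file itself.
ckpt_state = tf.train.get_checkpoint_state("/tmp/mydir")
```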