tf.contrib.learn Basics
https://www.tensorflow.org/get_started/tflearn
Basic workflow
- Load file containing training/test data into a TensorFlow Dataset
- Construct a neural network classifier
- Fit the model using the training data
- Evaluate the accuracy of the model
- Classify new samples (inference)
Logging and Monitoring Basics with tf.contrib.learn
https://www.tensorflow.org/get_started/monitors
When training a model, you often want to track and evaluate its progress in real time.
Without any logging, model training feels like a bit of a black box; you can’t see what’s happening as TensorFlow steps through gradient descent, get a sense of whether the model is converging appropriately, or audit to determine whether early stopping might be appropriate.
One way to address this problem would be to split model training into multiple fit calls with smaller numbers of steps in order to evaluate accuracy more progressively. However, this is not recommended practice, as it greatly slows down model training.
tf.contrib.learn offers another solution: a Monitor API designed to help you log metrics and evaluate your model while training is in progress.
Enabling Logging
TensorFlow uses five different levels for log messages: DEBUG, INFO, WARN, ERROR, and FATAL.

```python
tf.logging.set_verbosity(tf.logging.INFO)
```
When tracking model training, you'll want to adjust the level to INFO, which will provide additional feedback as fit operations are in progress.
With INFO-level logging, tf.contrib.learn automatically outputs training-loss metrics to stderr after every 100 steps.
Configuring a ValidationMonitor for Streaming Evaluation
tf.contrib.learn provides several high-level Monitors you can attach to your fit operations to further track metrics and/or debug lower-level TensorFlow operations during model training, including:
| Monitor | Description |
|---|---|
| CaptureVariable | Saves a specified variable's values into a collection at every n steps of training |
| PrintTensor | Logs a specified tensor's values at every n steps of training |
| SummarySaver | Saves tf.Summary protocol buffers for a given tensor using a tf.summary.FileWriter at every n steps of training |
| ValidationMonitor | Logs a specified set of evaluation metrics at every n steps of training, and, if desired, implements early stopping under certain conditions |
Evaluating Every N Steps
While logging training loss, you might also want to simultaneously evaluate against test data to see how well the model is generalizing. You can accomplish this by configuring a ValidationMonitor with the test data and setting how often to evaluate with every_n_steps. The default value of every_n_steps is 100.
ValidationMonitors rely on saved checkpoints to perform evaluation operations, so you'll want to modify instantiation of the classifier to add a tf.contrib.learn.RunConfig that includes save_checkpoints_secs, which specifies how many seconds should elapse between checkpoint saves during training.
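A minimal sketch of wiring these together, assuming Iris-style NumPy arrays (training_data/training_labels, test_data/test_labels) loaded elsewhere:

```python
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_data,
    test_labels,
    every_n_steps=50)

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3,
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))

# Pass the monitor to fit() so evaluation runs during training.
classifier.fit(x=training_data,
               y=training_labels,
               steps=2000,
               monitors=[validation_monitor])
```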
Customizing the Evaluation Metrics with MetricSpec
To specify the exact metrics you’d like to run in each evaluation pass, you can add a metrics param to the ValidationMonitor constructor.
metrics takes a dict of key/value pairs, where each key is the name you’d like logged for the metric, and the corresponding value is a MetricSpec object.
The MetricSpec constructor takes four parameters: metric_fn, prediction_key, label_key, and weights_key.
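For example, a sketch following the tutorial's pattern (the metric names and the "classes" prediction_key are assumptions for a classifier that emits class predictions):

```python
validation_metrics = {
    "accuracy": tf.contrib.learn.MetricSpec(
        metric_fn=tf.contrib.metrics.streaming_accuracy,
        prediction_key="classes"),
    "precision": tf.contrib.learn.MetricSpec(
        metric_fn=tf.contrib.metrics.streaming_precision,
        prediction_key="classes"),
}
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_data,
    test_labels,
    every_n_steps=50,
    metrics=validation_metrics)
```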
Early Stopping with ValidationMonitor
In addition to logging eval metrics, ValidationMonitors make it easy to implement early stopping when specified conditions are met, via three params:
| Param | Description |
|---|---|
| early_stopping_metric | Metric that triggers early stopping (e.g., loss or accuracy) under conditions specified in early_stopping_rounds and early_stopping_metric_minimize. Default is "loss". |
| early_stopping_metric_minimize | True if desired model behavior is to minimize the value of early_stopping_metric; False if desired model behavior is to maximize the value of early_stopping_metric. Default is True. |
| early_stopping_rounds | Sets a number of steps during which, if the early_stopping_metric does not decrease (if early_stopping_metric_minimize is True) or increase (if early_stopping_metric_minimize is False), training will be stopped. Default is None, which means early stopping will never occur. |
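Putting it together, a sketch that stops training if loss has not decreased for 200 steps:

```python
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_data,
    test_labels,
    every_n_steps=50,
    metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=200)
```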
Building Input Functions
https://www.tensorflow.org/get_started/input_fn
How to construct an input_fn to preprocess and feed data into your models.
tf.contrib.learn supports using a custom input function (input_fn) to encapsulate the logic for preprocessing and piping data into your models.
Anatomy of an input_fn
The basic skeleton for an input function:
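A minimal sketch of that skeleton:

```python
def my_input_fn():
    # Preprocess your data here...

    # ...then return 1) a mapping of feature columns to Tensors with the
    # corresponding feature data, and 2) a Tensor containing labels.
    return feature_cols, labels
```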
The body of the input function contains the specific logic for preprocessing your input data, such as scrubbing out bad examples or feature scaling.
Input functions must return the following two values containing the final feature and label data to be fed into your model (as shown in the above code skeleton):
- feature_cols: A dict of key/value pairs that map feature column names to Tensors (or SparseTensors) containing the corresponding feature data.
- labels: A Tensor containing your label (target) values: the values your model aims to predict.
Converting Feature Data to Tensors
For continuous data, you can create and populate a Tensor using tf.constant:
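For instance (the feature values are hypothetical):

```python
# A continuous feature with three example values.
age = tf.constant([25.0, 32.0, 47.0])
```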
For sparse, categorical data (data where the majority of values are 0), you’ll instead want to populate a SparseTensor, which is instantiated with three arguments:
- dense_shape: The shape of the tensor. Takes a list indicating the number of elements in each dimension.
- indices: The indices of the elements in your tensor that contain nonzero values. Takes a list of terms, where each term is itself a list containing the index of a nonzero element.
- values: A one-dimensional tensor of values. Term i in values corresponds to term i in indices and specifies its value.

```python
sparse_tensor = tf.SparseTensor(indices=[[0, 1], [2, 4]],
                                values=[6, 0.5],
                                dense_shape=[3, 5])
```

This defines the following dense tensor:

```
[[0, 6, 0, 0, 0  ]
 [0, 0, 0, 0, 0  ]
 [0, 0, 0, 0, 0.5]]
```
Passing input_fn Data to Your Model
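Pass your input function to fit via the input_fn parameter; note that you pass the function object itself, not the return value of calling it:

```python
classifier.fit(input_fn=my_input_fn, steps=2000)
```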
Threading and Queues
Queues are a powerful mechanism for asynchronous computation using TensorFlow.
A queue is a node in a TensorFlow graph. In particular, nodes can enqueue new items into the queue, or dequeue existing items from the queue.
N.B. Queue methods (such as q.enqueue(…)) must run on the same device as the queue. Incompatible device placement directives will be ignored when creating these operations.
Queue usage overview
Queues, such as tf.FIFOQueue and tf.RandomShuffleQueue, are important TensorFlow objects for computing tensors asynchronously in a graph.
A typical input architecture is to use a RandomShuffleQueue to prepare inputs for training a model:
- Multiple threads prepare training examples and push them in the queue.
- A training thread executes a training op that dequeues mini-batches from the queue.
Benefits
- Queues are used throughout the input-reading pipelines described in Reading data below.
- The Session object is multithreaded, so multiple threads can easily use the same session and run ops in parallel.

However, multithreading is not always easy in Python: all threads must be able to stop together, exceptions must be caught and reported, and queues must be properly closed when stopping. TensorFlow provides two classes to help: tf.train.Coordinator and tf.train.QueueRunner.
- tf.train.Coordinator
  - helps multiple threads stop together
  - reports exceptions to a program that waits for them to stop
- tf.train.QueueRunner
  - creates a number of threads cooperating to enqueue tensors in the same queue
Coordinator
A Coordinator helps multiple threads stop together.
Methods:
- tf.train.Coordinator.should_stop: returns True if the threads should stop.
- tf.train.Coordinator.request_stop: requests that threads should stop.
- tf.train.Coordinator.join: waits until the specified threads have stopped.
Basic Usage
- First create a Coordinator object.
- Then create a number of threads that use the coordinator.
- The threads typically run loops that stop when coordinator.should_stop() returns True.
- Any thread can decide that the computation should stop by calling coordinator.request_stop(), asking all the threads to stop. To cooperate with the request, each thread must check coord.should_stop() on a regular basis; coord.should_stop() returns True as soon as coord.request_stop() has been called.
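A runnable sketch of this pattern (the stop condition is an arbitrary assumption):

```python
import threading
import tensorflow as tf

def my_loop(coord):
    # Worker body: loop until any thread requests a stop.
    step = 0
    while not coord.should_stop():
        step += 1
        if step >= 1000:          # hypothetical stop condition
            coord.request_stop()  # asks every thread to stop

coord = tf.train.Coordinator()
threads = [threading.Thread(target=my_loop, args=(coord,)) for _ in range(4)]
for t in threads:
    t.start()
coord.join(threads)  # wait until all threads have stopped
```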
More detail:
https://www.tensorflow.org/api_docs/python/tf/train/Coordinator
QueueRunner
The QueueRunner class creates a number of threads that repeatedly run an enqueue op.
These threads can use a coordinator to stop together.
In addition, a queue runner runs a closer thread that automatically closes the queue if an exception is reported to the coordinator.
Basic usage
Step 1: create a queue and add the related ops to the graph.
- First create a queue (e.g. a tf.RandomShuffleQueue).
- Add ops that process examples and enqueue them in the queue.
- Create dequeue ops from the queue, returning a data tensor.
- Use the data tensor to build the TensorFlow graph.

```python
example = ...ops to create one example...
# Create a queue, and an op that enqueues examples one at a time in the queue.
queue = tf.RandomShuffleQueue(...)
enqueue_op = queue.enqueue(example)
# Create a training graph that starts by dequeuing a batch of examples.
inputs = queue.dequeue_many(batch_size)
train_op = ...use 'inputs' to build the training part of the graph...
```
Step 2: create a QueueRunner and combine it with a Coordinator
- Create a QueueRunner that will run a few threads to process and enqueue examples.
- Launch the graph.
- Create a coordinator and launch the queue runner threads.
- Run the training loop, controlling termination with the coordinator.
- When done, ask the threads to stop.
- Wait for all threads to actually stop.
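The corresponding pattern (queue, enqueue_op, and train_op come from the Step 1 sketch above):

```python
# Create a queue runner that will run 4 threads in parallel to enqueue examples.
qr = tf.train.QueueRunner(queue, [enqueue_op] * 4)

# Launch the graph.
sess = tf.Session()
# Create a coordinator and launch the queue runner threads.
coord = tf.train.Coordinator()
enqueue_threads = qr.create_threads(sess, coord=coord, start=True)
# Run the training loop, controlling termination with the coordinator.
for step in range(1000000):
    if coord.should_stop():
        break
    sess.run(train_op)
# When done, ask the threads to stop, then wait for them to do so.
coord.request_stop()
coord.join(enqueue_threads)
```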
Handling exceptions
Threads started by queue runners do more than just run the enqueue ops: they also catch and handle exceptions generated by the queue, including the tf.errors.OutOfRangeError exception, which is used to report that a queue was closed.
A coordinator must similarly catch and report exceptions in its main loop.
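A training program using a coordinator would therefore catch exceptions in its main loop, along these lines:

```python
try:
    for step in range(1000000):
        if coord.should_stop():
            break
        sess.run(train_op)
except Exception as e:
    # Report exceptions to the coordinator.
    coord.request_stop(e)
finally:
    # Terminate as usual. It is safe to call coord.request_stop() twice.
    coord.request_stop()
    coord.join(enqueue_threads)
```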
Reading data
https://www.tensorflow.org/programmers_guide/reading_data
Reading from files
A typical pipeline for reading records from files has the following stages:
Step 1: Create string_input_producer
Creates a FIFO queue for holding the filenames until the reader needs them
Arguments:
- A list of filenames. Each filename in the list is either a constant string Tensor or comes from the tf.train.match_filenames_once function.
- shuffle and num_epochs (the maximum number of epochs).
- A queue runner adds the whole list of filenames to the queue once for each epoch, shuffling the filenames within an epoch if shuffle=True.
- This procedure provides a uniform sampling of files, so that examples are not under- or over-sampled relative to each other.
- The queue runner works in a thread separate from the reader that pulls filenames from the queue, so the shuffling and enqueuing process does not block the reader.
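For example (the CSV filenames are hypothetical):

```python
filename_queue = tf.train.string_input_producer(
    ["file0.csv", "file1.csv"], num_epochs=None, shuffle=True)
```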
Step 2: create a Reader
A. Select the reader that matches your input file format
B. Then pass the filename queue to the reader’s read method.
- The read method outputs a key and a scalar string value.
- key: identifies the file and record (useful for debugging if you have some weird records).
- Each execution of read reads a single record (e.g. a single line from the file for a TextLineReader).
C. Use one (or more) of the decoder and conversion ops to decode this string into the tensors that make up an example.
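Continuing the CSV example, a sketch with five integer columns (the column layout is an assumption):

```python
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# record_defaults sets the type of each column and the default used when a
# value is missing.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])
```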
Step 3: call tf.train.start_queue_runners
Call tf.train.start_queue_runners to populate the queue before you call run or eval to execute the read; otherwise read will block while it waits for filenames from the queue.
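A typical session loop (features and col5 come from the decoder sketch above):

```python
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    for _ in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop()
    coord.join(threads)
```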
Standard TensorFlow format - TFRecords
more details:
https://www.tensorflow.org/api_guides/python/python_io#tfrecords_format_details
https://www.github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/how_tos/reading_data/convert_to_records.py
Preprocessing
You can then do any preprocessing of these examples you want. Examples include normalization of your data, picking a random slice, and adding noise or distortions.
more details:
https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10/cifar10_input.py
Batching
At the end of the pipeline we use another queue to batch together examples for training, evaluation, or inference. For this we use a queue that randomizes the order of examples, via tf.train.shuffle_batch.
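A sketch of such a pipeline, assuming a read_my_file_format() helper that wraps the reader and decoder from Step 2:

```python
def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # A larger min_after_dequeue gives better shuffling but a slower start-up
    # and more memory; capacity must be larger than min_after_dequeue.
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
```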
Multiple readers with a single filename queue
If you need more parallelism or shuffling of examples between files, use multiple reader instances via tf.train.shuffle_batch_join.
We still use only a single filename queue that is shared by all the readers.
- That way we ensure that the different readers use different files from the same epoch until all the files from the epoch have been started.
- It is also usually sufficient for a single thread to fill the filename queue.
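For example (using the same hypothetical read_my_file_format() helper):

```python
def input_pipeline(filenames, batch_size, read_threads, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    # One (reader, decoder) copy per thread, all sharing the filename queue.
    example_list = [read_my_file_format(filename_queue)
                    for _ in range(read_threads)]
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch_join(
        example_list, batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
```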
How many threads do you need?
The tf.train.shuffle_batch* functions add a summary to the graph that indicates how full the example queue is. If you have enough reading threads, that summary will stay above zero.
Creating threads to prefetch using QueueRunner objects
Many of the tf.train functions listed above add tf.train.QueueRunner objects to your graph. These require that you call tf.train.start_queue_runners before running any training or inference steps, or the pipeline will hang forever.
This is best combined with a tf.train.Coordinator to cleanly shut down these threads when there are errors.
The recommended code pattern:
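As given in the Reading data guide (train_op stands in for your training step):

```python
# Create the graph, etc.
init_op = tf.global_variables_initializer()

# Create a session for running operations in the Graph.
sess = tf.Session()

# Initialize the variables (like the epoch counter).
sess.run(init_op)

# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    while not coord.should_stop():
        # Run training steps or whatever.
        sess.run(train_op)
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
```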
Summary
The data-reading process:
First we create the graph.
It will have a few pipeline stages that are connected by queues.
- The first stage: generate filenames to read and enqueue them in the filename queue.
- The second stage: consumes filenames (using a Reader), produces examples, and enqueues them in an example queue.
- Depending on how you have set things up, you may actually have a few independent copies of the second stage (in other words, several readers), so that you can read from multiple files in parallel.
- At the end is an enqueue operation, which enqueues into a queue that the next stage dequeues from.
We want to start threads running these enqueuing operations, so that our training loop can dequeue examples from the example queue.
Method: add a tf.train.QueueRunner to the graph using the tf.train.add_queue_runner function.
- Each QueueRunner is responsible for one stage, and holds the list of enqueue operations that need to be run in threads.
Once the graph is constructed, the tf.train.start_queue_runners function asks each QueueRunner in the graph to start its threads running the enqueuing operations.
If all goes well, you can now run your training steps and the queues will be filled by the background threads.
If you have set an epoch limit, at some point an attempt to dequeue examples will raise a tf.errors.OutOfRangeError (analogous to EOF).
The last ingredient is the tf.train.Coordinator. This is responsible for letting all the threads know if anything has signalled a shut down.
- Most commonly this would be because an exception was raised, for example one of the threads got an error when running some operation (or an ordinary Python exception).
Filtering records or producing multiple examples per record
Instead of examples with shapes [x, y, z], you will produce a batch of examples with shape [batch, x, y, z]. The batch size can be 0 if you want to filter this record out (maybe it is in a hold-out set?), or bigger than 1 if you are producing multiple examples per record. Then simply set enqueue_many=True when calling one of the batching functions (such as shuffle_batch or shuffle_batch_join).
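A sketch under assumptions: some_decoder is a hypothetical op that produces zero or more examples per record, and enqueue_many=True tells the batching function to treat the leading dimension as separate examples:

```python
# `record` is a scalar string from a reader; the hypothetical decoder returns
# a batch of examples with shape [b, x, y, z], where b may be 0 (record
# filtered out) or greater than 1 (several examples per record).
examples = some_decoder(record)
example_batch = tf.train.shuffle_batch(
    [examples], batch_size=32, capacity=50000,
    min_after_dequeue=10000, enqueue_many=True)
```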
Using the Dataset
https://www.tensorflow.org/programmers_guide/datasets
Two API abstractions
tf.contrib.data.Dataset
A Dataset contains a sequence of elements. There are two distinct ways to create a dataset:
- Creating a source (e.g. Dataset.from_tensor_slices()) constructs a dataset from one or more tf.Tensor objects.
- Applying a transformation (e.g. Dataset.batch()) constructs a dataset from one or more tf.contrib.data.Dataset objects.
tf.contrib.data.Iterator
Provides the main way to extract elements from a dataset.
The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model.
For more sophisticated uses, the Iterator.initializer operation enables you to reinitialize and parameterize an iterator with different datasets, so that you can, for example, iterate over training and validation data multiple times in the same program.
Basic mechanics
Define a source
- Construct a Dataset from some tensors in memory:
  - use tf.contrib.data.Dataset.from_tensors()
  - use tf.contrib.data.Dataset.from_tensor_slices()
- Construct a Dataset from input data on disk in the recommended TFRecord format:
  - use tf.contrib.data.TFRecordDataset
Transform a Dataset into a new Dataset
- Apply per-element transformations such as Dataset.map() (to apply a function to each element), and multi-element transformations such as Dataset.batch().
Consume values from a Dataset
- The most common way is to make an iterator object that provides access to one element of the dataset at a time, for example by calling Dataset.make_one_shot_iterator().
- A tf.contrib.data.Iterator provides two operations:
  - Iterator.initializer: enables you to (re)initialize the iterator's state
  - Iterator.get_next(): returns tf.Tensor objects that correspond to the symbolic next element
- Depending on your use case, you might choose a different type of iterator; the options are outlined below.
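A minimal end-to-end sketch of these mechanics (the shapes and the doubling map are arbitrary choices):

```python
# Source: slice a [100, 2] tensor into 100 elements of shape [2].
dataset = tf.contrib.data.Dataset.from_tensor_slices(
    tf.random_uniform([100, 2]))
# Transformations: a per-element map, then a multi-element batch.
dataset = dataset.map(lambda x: x * 2).batch(10)
# Consume values through a one-shot iterator.
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # a [10, 2] batch
```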
Dataset structure
A dataset comprises elements that each have the same structure; the parts of an element are called components.
- Each component has a tf.DType and a tf.TensorShape.
- The Dataset.output_types and Dataset.output_shapes properties allow you to inspect the inferred types and shapes of each component of a dataset element.
- The nested structure of these properties maps to the structure of an element, which may be a single tensor, a tuple of tensors, or a nested tuple of tensors.
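For example:

```python
dataset1 = tf.contrib.data.Dataset.from_tensor_slices(
    tf.random_uniform([4, 10]))
print(dataset1.output_types)   # ==> "tf.float32"
print(dataset1.output_shapes)  # ==> "(10,)"

dataset2 = tf.contrib.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)   # ==> "(tf.float32, tf.int32)"
print(dataset2.output_shapes)  # ==> "((), (100,))"
```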
- You can use collections.namedtuple or a dictionary mapping strings to tensors to represent a single element of a Dataset.
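For example, with a dictionary-structured element:

```python
dataset = tf.contrib.data.Dataset.from_tensor_slices(
    {"a": tf.random_uniform([4]),
     "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)   # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"
```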
The Dataset transformations support datasets of any structure. When using the Dataset.map(), Dataset.flat_map(), and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of the function:
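For instance, with the tuple-structured dataset2 from above, the mapped function receives one argument per component:

```python
# dataset2 elements are (scalar, vector) pairs, so the function takes two
# arguments:
dataset2 = dataset2.map(lambda x, y: (x * 2.0, y))
# filter() expects a function that returns a scalar tf.bool:
dataset2 = dataset2.filter(lambda x, y: x > 0.5)
```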
Creating an iterator
After building a Dataset, the next step is to create an Iterator to access elements from that dataset.
The Dataset API currently supports three kinds of iterator, in increasing level of sophistication:
one-shot
- supports iterating once through a dataset
- no need for explicit initialization
- handle almost all of the cases that the existing queue-based input pipelines support
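For example:

```python
dataset = tf.contrib.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(100):
        value = sess.run(next_element)
        assert i == value
```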
initializable
- must run an explicit iterator.initializer operation before using it
- enables you to parameterize the definition of the dataset, using one or more tf.placeholder() tensors that can be fed when you initialize the iterator
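For example:

```python
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.contrib.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # Initialize an iterator over a dataset with 10 elements.
    sess.run(iterator.initializer, feed_dict={max_value: 10})
    for i in range(10):
        assert i == sess.run(next_element)
```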
reinitializable
- can be initialized from multiple different Dataset objects
- A reinitializable iterator is defined by its structure.
- For example, a training input pipeline uses random perturbations while the validation input pipeline uses unmodified data.
- These pipelines will typically use different Dataset objects that have the same structure (i.e. the same types and compatible shapes for each component).
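For example:

```python
# Training dataset with random perturbations; validation dataset without.
training_dataset = tf.contrib.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.contrib.data.Dataset.range(50)

# A reinitializable iterator is defined by structure; both datasets match.
iterator = tf.contrib.data.Iterator.from_structure(
    training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

with tf.Session() as sess:
    for _ in range(20):
        sess.run(training_init_op)     # one pass over the training data
        for _ in range(100):
            sess.run(next_element)
        sess.run(validation_init_op)   # one pass over the validation data
        for _ in range(50):
            sess.run(next_element)
```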
feedable
- used together with tf.placeholder to select which Iterator to use in each call to tf.Session.run
- offers the same functionality as a reinitializable iterator, but does not require you to initialize the iterator from the start of a dataset when you switch between iterators
- use tf.contrib.data.Iterator.from_string_handle to define a feedable iterator that allows you to switch between the two datasets
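For example:

```python
training_dataset = tf.contrib.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64)).repeat()
validation_dataset = tf.contrib.data.Dataset.range(50)

# A feedable iterator is defined by a string handle placeholder and the
# common structure of the datasets.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.contrib.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

with tf.Session() as sess:
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())

    for _ in range(10):
        # Alternate: 200 training elements, then a full validation pass.
        for _ in range(200):
            sess.run(next_element, feed_dict={handle: training_handle})
        sess.run(validation_iterator.initializer)
        for _ in range(50):
            sess.run(next_element, feed_dict={handle: validation_handle})
```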
Consuming values from an iterator
If the iterator reaches the end of the dataset, executing the Iterator.get_next() operation will raise a tf.errors.OutOfRangeError.
You must initialize the iterator again if you want to use it further.
If each element of the dataset has a nested structure, the return value of Iterator.get_next() will be one or more tf.Tensor objects in the same nested structure:
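For example, zipping the dataset1 and dataset2 defined earlier (this also defines the next1, next2, and next3 referenced below):

```python
dataset3 = tf.contrib.data.Dataset.zip((dataset1, dataset2))
iterator = dataset3.make_initializable_iterator()
next1, (next2, next3) = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    # A single expression that consumes all components together.
    print(sess.run([next1, next2, next3]))
```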
Notice:
- evaluating any of next1, next2, or next3 will advance the iterator for all components
- A typical consumer of an iterator will include all components in a single expression.
Reading input data
Consuming NumPy arrays
Dataset + tf.placeholder() + Iterator + feed
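A sketch, assuming features and labels are NumPy arrays already loaded in memory; feeding them through placeholders avoids embedding the arrays in the graph as constants:

```python
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.contrib.data.Dataset.from_tensor_slices(
    (features_placeholder, labels_placeholder))
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    sess.run(iterator.initializer,
             feed_dict={features_placeholder: features,
                        labels_placeholder: labels})
```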
Consuming TFRecord data
tf.contrib.data.TFRecordDataset takes the filenames as input.
- If you have two sets of files for training and validation, you can use a tf.placeholder(tf.string) for the filenames together with an initializable iterator.
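A sketch (the file paths are placeholders for illustration, and _parse_function is a record parser like the one sketched under "Parsing tf.Example protocol buffer messages" below):

```python
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function).repeat().batch(32)
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    # Initialize with training files; later, re-initialize with validation
    # files without rebuilding the graph.
    training_filenames = ["/path/train-1.tfrecord", "/path/train-2.tfrecord"]
    sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

    validation_filenames = ["/path/validation-1.tfrecord"]
    sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
```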
Consuming text data
tf.contrib.data.TextLineDataset
- Given one or more filenames, produces one string-valued element per line of those files.
- Like TFRecordDataset, it can take a tf.placeholder(tf.string) for the filenames.
- Remove unwanted lines using Dataset.skip() and Dataset.filter(); with more than one filename, use Dataset.flat_map() in addition to apply them separately to each file.
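For example, skipping a header line and filtering out comment lines in each file:

```python
filenames = ["/path/file1.txt", "/path/file2.txt"]
dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)

# Use Dataset.flat_map() so that skip() and filter() apply per file, then
# flatten all files into a single dataset of lines.
dataset = dataset.flat_map(
    lambda filename: (
        tf.contrib.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
```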
Preprocessing data with Dataset.map()
The Dataset.map() transformation applies a given function to each element of the input dataset.
Parsing tf.Example protocol buffer messages
For a TFRecordDataset reading TFRecord-format files.
Each tf.train.Example record contains one or more “features”, and the input pipeline typically converts these features into tensors.
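A sketch (the feature keys "image" and "label" are assumptions for illustration):

```python
# Transforms a scalar string `example_proto` into a pair of a scalar string
# and a scalar integer, representing an image and its label.
def _parse_function(example_proto):
    features = {"image": tf.FixedLenFeature((), tf.string, default_value=""),
                "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features["image"], parsed_features["label"]

filenames = ["/path/data-1.tfrecord", "/path/data-2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)
```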
Decoding image data and resizing it
It is often necessary to convert images of different sizes to a common size, so that they may be batched into a fixed-size tensor.
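A sketch (the file paths, labels, and 28x28 target size are arbitrary):

```python
# Reads an image from a file, decodes it into a dense tensor, and resizes
# it to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

filenames = tf.constant(["/path/img1.jpg", "/path/img2.jpg"])
labels = tf.constant([0, 1])

dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
```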
Applying arbitrary Python logic with tf.py_func()
It is sometimes useful to call upon external Python libraries when parsing your input data.
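A sketch using OpenCV as the external library (cv2 and the grayscale choice are assumptions; filenames and labels are the tensors from the previous example):

```python
import cv2

# An ordinary Python function: decode an image file with OpenCV.
def _read_py_function(filename, label):
    image_decoded = cv2.imread(filename.decode(), cv2.IMREAD_GRAYSCALE)
    return image_decoded, label

dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        _read_py_function, [filename, label], [tf.uint8, label.dtype])))
```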
Batching dataset elements
Simple batching
The Dataset.batch() transformation stacks n consecutive elements of a dataset into a single element.
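For example:

```python
inc_dataset = tf.contrib.data.Dataset.range(100)
dec_dataset = tf.contrib.data.Dataset.range(0, -100, -1)
dataset = tf.contrib.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # ==> ([0, 1, 2, 3], [0, -1, -2, -3])
```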
Batching tensors with padding
The Dataset.padded_batch() transformation allows you to set different padding for each dimension of each component.
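For example, padding variable-length vectors to the longest element in each batch:

```python
dataset = tf.contrib.data.Dataset.range(100)
# Element x is the value x repeated x times: [], [1], [2, 2], [3, 3, 3], ...
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
# padded_shapes=[None] pads each vector to the longest one in the batch.
dataset = dataset.padded_batch(4, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))
    # ==> [[0, 0, 0], [1, 0, 0], [2, 2, 0], [3, 3, 3]]
```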
Training workflows
Processing multiple epochs
The Dataset API offers two main ways to process multiple epochs of the same data.
- The simplest way: Dataset.repeat(count).
  - Called with no arguments, it repeats the input indefinitely.
  - It concatenates epochs without signaling the end of one epoch and the beginning of the next.
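For example, a dataset that repeats its input for 10 epochs:

```python
filenames = ["/path/data-1.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)  # parser sketched earlier
dataset = dataset.repeat(10)            # omit the argument to repeat forever
dataset = dataset.batch(32)
```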
- Loop and catch the tf.errors.OutOfRangeError.
  - If you want to receive a signal at the end of each epoch, use a training loop that catches the tf.errors.OutOfRangeError raised when the dataset is exhausted.
  - No repeat() is needed in this case.
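For example:

```python
filenames = ["/path/data-1.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function).batch(32)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # Compute for 100 epochs, re-initializing the iterator each time.
    for _ in range(100):
        sess.run(iterator.initializer)
        while True:
            try:
                sess.run(next_element)
            except tf.errors.OutOfRangeError:
                break
        # [Perform end-of-epoch calculations here.]
```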
Randomly shuffling input data
The Dataset.shuffle() transformation
- Randomly shuffles the input dataset using a similar algorithm to tf.RandomShuffleQueue.
- It maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.
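For example:

```python
filenames = ["/path/data-1.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function)
dataset = dataset.shuffle(buffer_size=10000)  # size of the shuffle buffer
dataset = dataset.batch(32)
dataset = dataset.repeat()
```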
Using high-level APIs
tf.train.MonitoredTrainingSession
Simplifies many aspects of running TensorFlow in a distributed setting.
- It uses the tf.errors.OutOfRangeError to signal that training has completed.
- When used with the Dataset API, using Dataset.make_one_shot_iterator() is recommended.
Demo:
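A sketch (model_function is a hypothetical function that builds the model and returns its loss):

```python
filenames = ["/path/data-1.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function).shuffle(10000).batch(32).repeat(5)
iterator = dataset.make_one_shot_iterator()

next_example, next_label = iterator.get_next()
loss = model_function(next_example, next_label)

training_op = tf.train.AdagradOptimizer(0.01).minimize(loss)

with tf.train.MonitoredTrainingSession() as sess:
    # should_stop() becomes True once the iterator raises OutOfRangeError.
    while not sess.should_stop():
        sess.run(training_op)
```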
tf.estimator.Estimator
When using a Dataset in the input_fn of a tf.estimator.Estimator, using Dataset.make_one_shot_iterator() is also recommended.
Demo:
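A sketch (the feature keys and file path are assumptions):

```python
def dataset_input_fn():
    filenames = ["/path/data-1.tfrecord"]
    dataset = tf.contrib.data.TFRecordDataset(filenames)

    def parser(record):
        keys_to_features = {
            "image_data": tf.FixedLenFeature((), tf.string, default_value=""),
            "label": tf.FixedLenFeature((), tf.int64, default_value=0),
        }
        parsed = tf.parse_single_example(record, keys_to_features)
        image = tf.decode_raw(parsed["image_data"], tf.uint8)
        return {"image": image}, parsed["label"]

    dataset = dataset.map(parser).shuffle(10000).batch(32).repeat(10)
    iterator = dataset.make_one_shot_iterator()

    # An Estimator's input_fn must return a dict of feature tensors and a
    # label tensor.
    features, labels = iterator.get_next()
    return features, labels
```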
To read later:
https://stackoverflow.com/questions/41175011/tf-contrib-learn-tutorial-deprecation-warning