statstream.exact.streaming_cov

statstream.exact.streaming_cov(X, steps=None)

Covariance matrix of a streaming dataset.

Computes the covariance matrix of a dataset from a stream of batches of samples. The data must be provided by an iterator yielding batches of samples. Either a number of steps is specified, or the iterator is assumed to be exhausted after a finite number of steps. In the first case, only the given number of batches is drawn from the iterator and used for the covariance calculation, even if the iterator could yield more data.

Samples are given along the first axis. The shape of the covariance is the shape of the remaining axes repeated twice, e.g. batches of shape [batch_size, d1, ..., dN] will produce a covariance of shape [d1, ..., dN, d1, ..., dN].
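The shape rule can be sketched with a small helper. The name expected_cov_shape is hypothetical and not part of statstream; it only restates the rule described above.

```python
def expected_cov_shape(batch_shape):
    """Shape of the covariance for batches of the given shape.

    Hypothetical helper for illustration only (not a statstream function).
    The first axis indexes samples; the remaining axes describe one sample.
    """
    sample_shape = tuple(batch_shape[1:])
    # The covariance pairs every sample entry with every other sample entry,
    # so the sample shape appears twice.
    return sample_shape + sample_shape


shape = expected_cov_shape((32, 3, 4))  # batches of 32 samples, each 3 x 4
```

Here shape is (3, 4, 3, 4), independent of the batch size.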

This function consumes the iterator, so a finite iterator will be empty after a call to this function unless steps is set to a number smaller than the number of batches in the iterator.
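The consumption behaviour can be illustrated without statstream. The function take_batches below is a hypothetical stand-in that only draws batches the way the description above suggests; it is an assumption about the behaviour, not the library's code.

```python
from itertools import islice


def take_batches(batches, steps=None):
    """Draw at most `steps` batches from an iterator (all if steps is None).

    Hypothetical stand-in for illustration; it does not compute anything.
    """
    return list(islice(batches, steps))


# With steps=2, only two batches are drawn and the third stays in the iterator.
stream = iter([[1, 2], [3, 4], [5, 6]])
used = take_batches(stream, steps=2)
leftover = list(stream)

# With steps=None, the iterator is fully consumed.
stream_all = iter([[1, 2], [3, 4], [5, 6]])
take_batches(stream_all)
remaining = list(stream_all)
```

After this runs, leftover still holds the last batch, while remaining is empty. To reuse the data after a full pass, the iterator has to be recreated.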

Parameters:
X : iterable

An iterator yielding batches of samples.

steps : int, optional

The number of batches to use from the iterator (all available batches are used if set to None). The default is None.

Returns:
array

The covariance matrix of the seen data samples.

Warning

Use this function only on data sets of reasonably small dimensions.

Full covariance matrices are costly to compute and require large amounts of memory. The number of entries in the covariance matrix is the square of the number of entries in each individual sample in the data set.
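A rough memory estimate makes the quadratic growth concrete. The helpers below are hypothetical, and the figure of 8 bytes per entry assumes 64-bit floats, which may not match the dtype actually used.

```python
from math import prod


def cov_entries(sample_shape):
    """Number of entries in the full covariance of samples with this shape."""
    n_features = prod(sample_shape)
    return n_features * n_features


def cov_bytes(sample_shape, bytes_per_entry=8):
    """Approximate size in bytes; 8 bytes per entry assumes float64."""
    return cov_entries(sample_shape) * bytes_per_entry


# Small RGB images of 32 x 32 pixels already have 3072 entries per sample.
entries = cov_entries((3, 32, 32))
size = cov_bytes((3, 32, 32))
```

For this sample shape the covariance has 9,437,184 entries, about 72 MiB under the float64 assumption. Doubling each image side multiplies that by 16.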

If your data is high dimensional and you do not need the exact covariance matrix, consider using streaming_mean_and_low_rank_cov or streaming_low_rank_cov from statstream.approximate instead.

See also

streaming_mean_and_cov

Computes the mean and covariance in a single pass.

Notes

Computing the covariance necessarily includes computing the mean, so there is no computational benefit in using streaming_cov over streaming_mean_and_cov.

The streamed covariance calculation is a generalization of the streamed variance calculation as described in [1].
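A minimal pure-Python sketch of such a batch-wise update is shown below. It is an illustration in the spirit of the pairwise formulae in [1], not the statstream implementation. The name streaming_cov_sketch is hypothetical, samples are assumed to be flat vectors, and the n - 1 normalization is an assumption that may differ from what the library uses.

```python
from itertools import islice


def streaming_cov_sketch(batches, steps=None):
    """Covariance of vector samples from a stream of batches (illustration).

    Keeps only the running count, mean, and co-moment matrix in memory.
    """
    n, mean, comoment = 0, None, None
    for batch in islice(batches, steps):
        n_b = len(batch)
        if n_b == 0:
            continue
        d = len(batch[0])
        # Mean and centered co-moment of the current batch alone.
        mean_b = [sum(row[i] for row in batch) / n_b for i in range(d)]
        comoment_b = [
            [sum((row[i] - mean_b[i]) * (row[j] - mean_b[j]) for row in batch)
             for j in range(d)]
            for i in range(d)
        ]
        if n == 0:
            n, mean, comoment = n_b, mean_b, comoment_b
            continue
        # Merge batch statistics into the running statistics.
        total = n + n_b
        delta = [mb - m for mb, m in zip(mean_b, mean)]
        weight = n * n_b / total
        comoment = [
            [comoment[i][j] + comoment_b[i][j] + weight * delta[i] * delta[j]
             for j in range(d)]
            for i in range(d)
        ]
        mean = [m + dl * n_b / total for m, dl in zip(mean, delta)]
        n = total
    if n < 2:
        raise ValueError("need at least two samples")
    # Assumption: unbiased estimate with n - 1 in the denominator.
    return [[c / (n - 1) for c in row] for row in comoment]


cov = streaming_cov_sketch(iter([[[1, 2], [3, 4]], [[5, 7]]]))
```

The correction term weight * delta[i] * delta[j] accounts for the shift between the running mean and the batch mean, which is what allows the result to match a two-pass computation over all samples at once.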

References

[1]

Tony F. Chan & Gene H. Golub & Randall J. LeVeque, “Updating formulae and a pairwise algorithm for computing sample variances”, 1979.