statstream.approximate.streaming_low_rank_cov

statstream.approximate.streaming_low_rank_cov(X, rank, steps=None, tree=False, reset=None)

Low rank factorization of the covariance matrix of a streaming dataset.

Computes a factorization of the covariance matrix of a dataset from a stream of batches of samples. If the full data set was given in a matrix A of shape [N, M], where N is the number of data samples and M is the dimensionality of each sample, then the covariance matrix is 1/(N-1)*(A-mean(A)).T @ (A-mean(A)) and of shape [M, M]. The function computes a matrix L of shape [M, K] such that L.T @ L is an approximation of the covariance matrix of rank at most K. This is done in a two-pass algorithm that first computes the mean from a stream of batches and then the covariance using streaming_low_rank_autocorrelation shifted by the precomputed mean.

This is done from a stream of sample batches without ever explicitly forming matrices of the full shape [M, M]. Batches can be combined in a linear streaming way (which gives more relative weight to later batches) or in binary tree mode, where batches are combined pairwise, then the results are combined again pairwise and so on (this leads to an additional memory requirement of a factor of log(N)).

The data has to be provided by an iterator yielding batches of samples. Either a number of steps can be specified, or the iterator is assumed to be emptied in a finite number of steps. In the first case only the given number of batches is extracted from the iterator and used for the covariance calculation, even if the iterator could yield more data.

Samples are given along the first axis. The full covariance would have the squared shape as the remaining axes, e.g. batches of shape [batch_size, d1, ..., dN] would result in a covariance maxtrix of shape [d1, ..., dN, d1, ..., dN]. The low-rank covariance factor L will have shape [K, d1, ..., dN].

This function consumes an iterator twice, thus only finite iterators can be handled and the given iterator will be empty after a call to this function, unless steps is set to a smaller number than batches in the iterator. For restarting the iterator for the second pass, a reset function needs to be available. This can either be passed as a seperate argument or be part of the iterator itself. If no reset function is provided as argument, the iterator X is assumed to have a reset() method, which is called after the mean computation.

Parameters:
Xiterable

An iterator yielding the batches of samples.

rankint

The maximal rank of the approximate decomposition factor.

stepsint, optional

The number of batches to use from the iterator (all available batches are used if set to None). The defaul is None.

treebool, optional

Use the binary tree mode to combine batches more evenly at the cost of additional memory requirement. The default is False.

resetcallable() or None, optional

A function handle to reset the iterator after the first pass for the mean calculation. The reset function must accept the iterator as argument and return a resetted iterator. If set to None the iterator is assumed to have a reset method, which will then be used. The default is None.

Returns:
array

A low-rank factor of a symmetric decomposition of the covariance matrix of the seen data.

Notes

Computing covariances necessarily includes computing the mean, so there is no computational benefit of using streaming_low_rank_cov over using streaming_mean_and_low_rank_cov. In fact this function internally uses the latter and simly discards the mean.

The algorithm for combining SVD based low rank approximations is described in more detail in [1].

References

[1]

Radim, Rehurek, “Scalability of Semantic Analysis in Natural Language Processing”, 2011.