statstream.approximate.streaming_mean_and_low_rank_cov¶
- statstream.approximate.streaming_mean_and_low_rank_cov(X, rank, steps=None, tree=False, reset=None)¶
Mean and a low rank factorization of the covariance matrix of a streaming dataset.
Computes the mean and a factorization of the covariance matrix of a dataset from a stream of batches of samples. If the full data set was given in a matrix
A
of shape[N, M]
, whereN
is the number of data samples andM
is the dimensionality of each sample, then the covariance matrix is1/(N-1)*(A-mean(A)).T @ (A-mean(A))
and of shape[M, M]
. The function computes a matrixL
of shape[M, K]
such thatL.T @ L
is an approximation of the covariance matrix of rank at mostK
. This is done in a two-pass algorithm that first computes the mean from a stream of batches and then the covariance usingstreaming_low_rank_autocorrelation
shifted by the precomputed mean.This is done from a stream of sample batches without ever explicitly forming matrices of the full shape
[M, M]
. Batches can be combined in a linear streaming way (which gives more relative weight to later batches) or in binary tree mode, where batches are combined pairwise, then the results are combined again pairwise and so on (this leads to an additional memory requirement of a factor oflog(N)
).The data has to be provided by an iterator yielding batches of samples. Either a number of steps can be specified, or the iterator is assumed to be emptied in a finite number of steps. In the first case only the given number of batches is extracted from the iterator and used for the mean and covariance calculation, even if the iterator could yield more data.
Samples are given along the first axis. The mean has the same shape as the remaining axes, e.g. batches of shape
[batch_size, d1, ..., dN]
will produce a mean of shape[d1, ..., dN]
. The covariance factorL
will have shape[K, d1, ..., dN]
.This function consumes an iterator twice, thus only finite iterators can be handled and the given iterator will be empty after a call to this function, unless
steps
is set to a smaller number than batches in the iterator. For restarting the iterator for the second pass, a reset function needs to be available. This can either be passed as a seperate argument or be part of the iterator itself. If no reset function is provided as argument, the iteratorX
is assumed to have areset()
method, which is called after the mean computation.- Parameters:
- Xiterable
An iterator yielding the batches of samples.
- rank
int
The maximal rank of the approximate decomposition factor.
- steps
int
, optional The number of batches to use from the iterator (all available batches are used if set to
None
). The defaul isNone
.- treebool, optional
Use the binary tree mode to combine batches more evenly at the cost of additional memory requirement. The default is
False
.- reset
callable()
orNone
, optional A function handle to reset the iterator after the first pass for the mean calculation. The reset function must accept the iterator as argument and return a resetted iterator. If set to
None
the iterator is assumed to have a reset method, which will then be used. The default isNone
.
- Returns:
Notes
The algorithm for combining SVD based low rank approximations is described in more detail in [1].
References
[1]Radim, Rehurek, “Scalability of Semantic Analysis in Natural Language Processing”, 2011.