statstream.approximate.streaming_low_rank_autocorrelation

statstream.approximate.streaming_low_rank_autocorrelation(X, rank, steps=None, shift=0.0, tree=False)

Low rank factorization of the sample autocorrelation matrix of a streaming dataset.

Computes a factorization of the autocorrelation matrix of a dataset from a stream of batches of samples. If the full data set was given in a matrix A of shape [N, M], where N is the number of data samples and M is the dimensionality of each sample, then the autocorrelation matrix is 1/(N-1)*A.T @ A and of shape [M, M]. The function computes a matrix L of shape [M, K] such that L.T @ L is an approximation of the autocorrelation matrix of rank at most K.

This is done from a stream of sample batches without ever explicitly forming matrices of the full shape [M, M]. Batches can be combined in a linear streaming way (which gives more relative weight to later batches) or in binary tree mode, where batches are combined pairwise, then the results are combined again pairwise and so on (this leads to an additional memory requirement of a factor of log(N)).

The data has to be provided by an iterator yielding batches of samples. Either a number of steps can be specified, or the iterator is assumed to be emptied in a finite number of steps. In the first case only the given number of batches is extracted from the iterator and used for the correlation calculation, even if the iterator could yield more data.

Samples are given along the first axis. The correlation has the squared shape as the remaining axes, e.g. batches of shape [batch_size, d1, ..., dN] will result in a correlation factor of shape [K, d1, ..., dN].

This function consumes an iterator, thus finite iterators will be empty after a call to this function, unless steps is set to a smaller number than batches in the iterator.

Parameters:
Xiterable

An iterator yielding the batches of samples.

rankint

The maximal rank of the approximate decomposition factor.

stepsint, optional

The number of batches to use from the iterator (all available batches are used if set to None). The defaul is None.

shiftarray, optional

Apply a shift of data samples before calculating correlations, that is use (X-shift) instead of X (must be broadcastable to the shape of batches from X). The default is 0.0, that is no shift is used.

treebool, optional

Use the binary tree mode to combine batches more evenly at the cost of additional memory requirement. The default is False.

Returns:
array

A low-rank factor of a symmetric decomposition of the autocorrelation matrix of the seen data.

Notes

The algorithm for combining SVD based low rank approximations is described in more detail in [1].

References

[1]

Radim, Rehurek, “Scalability of Semantic Analysis in Natural Language Processing”, 2011.