Overview
statstream provides functions for obtaining statistical insights from a
stream of data. We think the core concepts of the package are best shown by
a simple demonstration.
Let us begin by synthetically generating a stream of random data.
>>> import numpy as np
>>> def data_generator(n):
...     for _ in range(n):
...         yield np.random.randn(2, 3)
>>> data_stream = data_generator(8)
We have defined a generator function that produces a stream of random data
chunks of shape (2, 3). The first axis is interpreted as the batch_size, the
remaining axes as data dimensions. So here, our stream provides chunks of
3-dimensional vectors, each chunk containing two such vectors. Calling
data_generator(n) with n=8 gives us an iterator providing eight such chunks.
Let us have a look at one such chunk.
>>> next(data_stream)
array([[-0.96083597, -0.86513521, -0.70060355],
[ 0.2771605 , -1.58573487, 1.16854072]]) # random
The full data set contains sixteen vectors (eight chunks of two vectors each). In this simple example we could read the full iterator into a single array and compute statistics such as the component-wise mean or variance of the data set. Since the call to next above already consumed one chunk, we first create a fresh stream.
>>> data_stream = data_generator(8)
>>> arr = np.concatenate(list(data_stream), axis=0)
>>> arr.shape
(16, 3)
>>> np.mean(arr, axis=0)
array([ 0.30680563, -0.30335616, -0.06189916]) # random
>>> np.var(arr, axis=0)
array([0.93472631, 1.4720951 , 1.13723147]) # random
However, if the data set iterator cannot be converted into a list, for
example because it is too large to fit in memory or is generated in
real time, computing the mean and variance in this way is impossible. Instead
we need an algorithm that can handle streaming data. This is what statstream
is for.
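To give a rough idea of what such an algorithm can look like, here is a minimal sketch of a single-pass update of the mean and variance, based on the well-known pairwise combination formula of Chan et al. It is not taken from statstream, whose internals may differ. The helper name running_mean_var is hypothetical, and the sketch assumes that every chunk has shape (batch_size, ...) with identical data dimensions.

```python
import numpy as np


def running_mean_var(stream):
    """Single-pass mean and population variance along the batch axis.

    Hypothetical helper for illustration only; not part of statstream.
    """
    count = 0    # number of vectors seen so far
    mean = None  # running component-wise mean
    m2 = None    # running sum of squared deviations from the mean
    for chunk in stream:
        chunk = np.asarray(chunk, dtype=float)
        n_b = chunk.shape[0]
        if n_b == 0:
            continue
        # Statistics of the current chunk alone.
        mean_b = chunk.mean(axis=0)
        m2_b = ((chunk - mean_b) ** 2).sum(axis=0)
        if count == 0:
            count, mean, m2 = n_b, mean_b, m2_b
            continue
        # Merge the chunk statistics into the running statistics.
        delta = mean_b - mean
        total = count + n_b
        mean = mean + delta * n_b / total
        m2 = m2 + m2_b + delta ** 2 * count * n_b / total
        count = total
    if count == 0:
        raise ValueError("stream is empty")
    return mean, m2 / count


# Compare against the in-memory result on a small, reproducible stream.
rng = np.random.default_rng(0)
chunks = [rng.standard_normal((2, 3)) for _ in range(8)]
mean, var = running_mean_var(iter(chunks))
full = np.concatenate(chunks, axis=0)
```

Only the running count, mean, and sum of squared deviations are kept between chunks, so memory use does not grow with the length of the stream. The result agrees with np.mean(full, axis=0) and np.var(full, axis=0) up to floating-point rounding.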
In the above example we have exhausted our streaming data source, so
data_stream is now empty. Let's quickly generate a new stream and compute
the mean from it.
>>> data_stream = data_generator(8)
>>> from statstream.exact import streaming_mean, streaming_var
>>> streaming_mean(data_stream)
array([-0.41722521, 0.0331529 , 0.14349293]) # random
The same can of course also be done for the variance.
>>> data_stream = data_generator(8)
>>> streaming_var(data_stream)
array([0.64355842, 1.44646403, 0.74006582]) # random
Note that we again have to create a new data_stream, since the previous one
was consumed by streaming_mean and is empty afterwards.
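The exhaustion behaviour is a property of Python generators in general and can be checked without statstream. The following small sketch reuses data_generator from above and counts the chunks obtained in successive passes; the variable names are chosen for illustration only.

```python
import numpy as np


def data_generator(n):
    for _ in range(n):
        yield np.random.randn(2, 3)


data_stream = data_generator(8)

# The first pass consumes all eight chunks.
first_pass = sum(1 for _ in data_stream)

# A second pass over the same generator finds nothing left.
second_pass = sum(1 for _ in data_stream)

# Calling the generator function again starts a new stream.
fresh_pass = sum(1 for _ in data_generator(8))
```

Keep in mind that each call to data_generator draws new random numbers, so the mean and variance computed above come from two different random samples rather than from the same data.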