statstream: Statistics for Streaming Data

Release v22.2.0.dev (What's new?).

statstream is a lightweight Python package providing data analysis and statistics utilities for streaming data.

Its main goal is to provide single-pass variants of conventional numpy data analysis and statistics functionality for streaming data that is either generated on the fly or too large to be handled at once. Data can be streamed in chunks called mini-batches, which makes statstream useful in combination with machine learning and deep learning packages like keras, tensorflow, or pytorch.
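To make the single-pass idea concrete, here is a minimal sketch in plain numpy of how a mean and variance can be accumulated over mini-batches without ever holding the full data set in memory. It uses a pairwise update in the spirit of Chan, Golub and LeVeque. The function name streaming_mean_var and its signature are hypothetical and chosen for illustration only; this is not statstream's API, so please consult the API Reference for the actual functions.

```python
import numpy as np


def streaming_mean_var(batches):
    """Single-pass mean and population variance over an iterable of mini-batches.

    Hypothetical helper for illustration, not part of statstream.
    Each batch has shape (batch_size, ...); statistics run along axis 0.
    """
    count = 0    # number of samples seen so far
    mean = None  # running mean
    m2 = None    # running sum of squared deviations from the running mean

    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        n_b = batch.shape[0]
        if n_b == 0:
            continue  # skip empty batches

        # Statistics of the current mini-batch only.
        mean_b = batch.mean(axis=0)
        m2_b = ((batch - mean_b) ** 2).sum(axis=0)

        if count == 0:
            count, mean, m2 = n_b, mean_b, m2_b
            continue

        # Merge the batch statistics into the running statistics.
        total = count + n_b
        delta = mean_b - mean
        mean = mean + delta * n_b / total
        m2 = m2 + m2_b + delta ** 2 * count * n_b / total
        count = total

    if count == 0:
        raise ValueError("no data received")
    return mean, m2 / count


# Usage: compare against numpy on data that does fit in memory.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
chunks = (data[i:i + 64] for i in range(0, len(data), 64))

mean, var = streaming_mean_var(chunks)
assert np.allclose(mean, data.mean(axis=0))
assert np.allclose(var, data.var(axis=0))
```

Only three small accumulators are kept between batches, so memory use depends on the shape of a single sample rather than on the length of the stream.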

Getting Started

statstream is a Python-only package hosted on PyPI. The recommended installation method is pip-installing into a virtual environment.

$ pip install statstream

The following steps should get you up and running in no time:

  • The Overview section will show you a simple example of statstream in action and introduce you to its core ideas.

  • The Examples section will give you a comprehensive tour of statstream’s features. After reading, you will know about our advanced features and how to use them.

  • The API Reference is a quick way to look up details of all features and their options.

  • If at any point you get confused by some terminology, please check out our Glossary.

Project Information

statstream is released under the MIT license. Its documentation lives at Read the Docs, the code is on GitHub, and the latest release can be found on PyPI. It’s tested on Python 2.7 and 3.5+.

If you’d like to contribute to statstream, you’re most welcome. We have written a short guide to help you get started!

Further Reading

Additional information on the algorithmic aspects of statstream can be found in the following works:

  • Tony F. Chan & Gene H. Golub & Randall J. LeVeque, “Updating formulae and a pairwise algorithm for computing sample variances”, 1979

  • Radim Řehůřek, “Scalability of Semantic Analysis in Natural Language Processing”, 2011


Full Table of Contents