Stream DaQ

A highly-configurable, real-time data quality monitoring tool designed for streaming data

Remember the joy of bath time with those trusty rubber ducks, keeping us company while floating through the bubbles? Well, think of Stream DaQ as the duck for your data — keeping your streaming data clean and afloat in a sea of information. Just like those bath ducks helped make our playtime fun and carefree, Stream DaQ keeps an eye on your data and lets you know the moment things get messy, so you can take action in real time!

Stream Data Quality logo: a rubber duck and text

The project

Stream DaQ is developed in Python and internally leverages the Pathway stream processing library, which is also an open source project. Previous versions of the project featured additional Python stream processing libraries, namely Faust and Bytewax. You can find source code using these frameworks in the faust-vs-bytewax branch of the repository, along with comparisons between the two libraries. Our immediate plan is to extend the functionality of Stream DaQ primarily in Pathway. The latest advancements of the tool will always be available in the main branch (you are here).

The project is developed by the Data Engineering Team (DELAB) of Datalab AUTh, under the supervision of Prof. Anastasios Gounaris.

Key functionality

Stream DaQ keeps an eye on your data stream, letting you know in real time when the data travelling through it are not as expected, so that you can take action. Several key aspects make the tool a powerful option for data quality monitoring on data streams:

  1. Highly configurable: Stream DaQ comes with plenty of built-in data quality measurements, so that you can choose which of them fit your use case. We know that every data-centric application is different, so being able to **define** what "data quality" means for you is precious.
  2. Real time alerts: Stream DaQ defines highly meaningful data quality checks for data streams, letting the check results form a stream of their own. This architectural choice enables real time alerts whenever the standards or thresholds you have defined are not met!

Stream Data Quality dashboard animation

Current Suite of Measurements

The following list provides a brief overview of the current Suite of Measurements supported by Stream DaQ. Note that for all the functionalities below, Stream DaQ comes with a wide range of implementation variations, in order to cover a broad and heterogeneous spectrum of DQ needs, depending on the domain, the task and the system at hand. For all of them, Stream DaQ leaves room for extensive customization via custom thresholding and variation selection with only a couple of lines of code (see the sketch right after the list below). Stay tuned, since new functionalities are on their way to be incorporated into Stream DaQ! Until then, Happy Qua(ck)litying!

  • Descriptive statistics: min, max, mean, median, std, number/fraction above mean
  • Availability: at least a value in a window
  • Frozen stream detection: all/most values are the same in a window
  • Range/Set validation: values fall within a range or set of accepted values
  • Regex validation: values comply with a provided regex
  • Unordered stream detection: elements violate (time) ordering
  • Volume & Cardinalities: count, distinct, unique, most_frequent, heavy hitters (approx)
  • Distribution percentiles
  • Correlation analysis between fields
  • Value Lengths: {min, max, mean, median} length (String-specific)
  • Integer - Fractional parts: {min, max, mean, median} integer or fractional part (Float-specific)
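
To give a feel for what variation selection and custom thresholding could look like in practice, here is a minimal sketch in the spirit of the example in the next section. The configure/add/watch_out calls mirror that example; the measure names fraction_above_mean and range_conformance and their parameters are illustrative assumptions rather than the exact built-in API, so please check the source code for the real names and signatures.

from StreamDaQ import StreamDaQ
from DaQMeasures import DaQMeasures as dqm
from Windows import tumbling
from datetime import timedelta

# Sketch only: the last two measure names below are assumptions standing in for
# the built-in "fraction above mean" and "range validation" functionalities;
# the actual names and parameters may differ.
daq = StreamDaQ().configure(
    window=tumbling(duration=timedelta(minutes=5)),
    wait_for_late=timedelta(seconds=10),
    time_format="%H:%M:%S"
)

daq.add(dqm.min('temperature'), "min") \
    .add(dqm.fraction_above_mean('temperature'), "frac_above_mean") \
    .add(dqm.range_conformance('temperature', low=-20, high=60), "in_range")

daq.watch_out()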

Example code

Data quality monitoring is a highly case-specific task. We feel you! That's why we have tried to make it easier than ever to define what data quality means for your very own case. You just need a couple of lines to define what you want monitored, and that's it! Sit back and focus on the important stuff: the real time results. Stream DaQ reliably handles all the rest for you. See it in a toy example!

from StreamDaQ import StreamDaQ
from DaQMeasures import DaQMeasures as dqm
from Windows import tumbling
from datetime import timedelta

# Step 1: Configure monitoring parameters
daq = StreamDaQ().configure(
    window=tumbling(duration=timedelta(hours=1)),
    wait_for_late=timedelta(seconds=10),
    time_format="%H:%M:%S"
)

# Step 2: Define what data quality means for your unique use-case
daq.add(dqm.count('items'), "count") \
    .add(dqm.min('items'), "min") \
    .add(dqm.median('items'), "std") \
    .add(dqm.most_frequent('items'), "most_frequent") \
    .add(dqm.number_of_distinct('items'), "distinct")

# Step 3: Kick-off DQ monitoring
daq.watch_out()

In this simple example, we define five different measurements:

  • the number of elements (count)
  • their minimum (min)
  • their median (median)
  • the most frequent values (most_frequent)
  • the number of distinct values (number_of_distinct)

The above data quality measures are monitored and reported in real time for every window of the stream. After defining them, Stream DaQ takes over to continuously monitor (watch_out) the stream as new data arrive. The monitoring results are reported in real time as a meta stream; that is, the result is itself a new stream object, with one row per window, as follows:

window_start | count | min | median | most_frequent | distinct
18:06:40     | 12    | 1   | 5.5    | (9, 1)        | 7
18:07:00     | 3     | 4   | 7      | (4, 7)        | 3
18:07:20     | 17    | 1   | 5      | 3             | 10
18:07:40     | 9     | 1   | 3      | (3, 1)        | 7
  .             .      .     .           .            .
  .             .      .     .           .            .
  .             .      .     .           .            .

The above measurements are just a small subset of the many built-in measurements ready for use. A detailed list of all the available measurements will be included shortly.
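
To illustrate how such per-window results could drive alerting downstream, here is a short, self-contained sketch that treats the rows of the example output above as plain Python dictionaries and flags windows violating user-defined thresholds. How the rows actually reach your alerting code (file sink, message queue, callback) is not specified here and is purely an assumption of the sketch.

# The sample rows mirror the example output table above; in a real deployment
# they would arrive continuously from wherever Stream DaQ emits its results
# (an assumption of this sketch, not a documented interface).
result_rows = [
    {"window_start": "18:06:40", "count": 12, "min": 1, "median": 5.5, "most_frequent": (9, 1), "distinct": 7},
    {"window_start": "18:07:00", "count": 3, "min": 4, "median": 7, "most_frequent": (4, 7), "distinct": 3},
    {"window_start": "18:07:20", "count": 17, "min": 1, "median": 5, "most_frequent": 3, "distinct": 10},
]

EXPECTED_MIN_COUNT = 5      # alert if a window holds fewer elements than this
EXPECTED_MAX_DISTINCT = 8   # alert if a window holds more distinct values than this

def check_window(row):
    """Return alert messages for a single window's measurements."""
    alerts = []
    if row["count"] < EXPECTED_MIN_COUNT:
        alerts.append(f"Low volume at {row['window_start']}: count={row['count']}")
    if row["distinct"] > EXPECTED_MAX_DISTINCT:
        alerts.append(f"Cardinality spike at {row['window_start']}: distinct={row['distinct']}")
    return alerts

for row in result_rows:
    for alert in check_window(row):
        print(alert)  # or forward to the alerting channel of your choice

Keep in mind that Stream DaQ's own checks already run inside the stream; the sketch only shows what a consumer of the meta stream might do with the reported values.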

Execution

The easiest way to run the code in this repository is to create a new conda environment and install the required packages. To do so, execute the following commands in a terminal:

conda env create --file environment.yml
conda activate daq
pip install -r requirements.txt

The above three commands are required only the first time you run the code. For every subsequent run, simply activate the conda environment daq:

conda activate daq

and then follow these steps:

  1. Starting from the root folder of the project, go to the pathway directory:
    cd pathway
    # all the commands from now on should be executed in this directory
  2. Run the main file:
    python main.py

Work in progress

The project is under active development, so more functionalities, documentation, examples and demonstrations are on their way and will be included shortly. We thank you for your patience.

Acknowledgements

Special thanks to Maria Kavouridou for putting effort and love into bringing the Stream DaQ logo to life.
