Do you want to visualize missing values in your data? There are plenty of amazing methods (check missingno, for example), but they all look bulky when your data has too many columns. nafig will help you build a perfect NA figure!
$ pip install -U nafig
or install with Poetry
$ poetry add nafig
Here are some usage examples for both simulated and real-world data. Check this notebook to play with the code yourself!
First, let's import the core function and other useful things:
>>> from nafig.plots import na_text_barplot # The core function
>>> from nafig.utils import create_example_data # To simulate data
>>> import pandas as pd # To work with tables
>>> df, feature_types = create_example_data()
df is just a pandas dataframe with missing values. feature_types is an array containing a data type description for each column. This is just an example, so the labels don't correspond to actual data types.
>>> feature_types[:10]
array(['Categorical', 'Categorical', 'Binary', 'Continuous', 'Continuous',
'Continuous', 'Binary', 'Continuous', 'Continuous', 'Binary'],
dtype='<U11')
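If you would rather build comparable inputs by hand, the sketch below shows one way to do it with NumPy and pandas. It is not how create_example_data works internally; the names toy_df and toy_feature_types, the sizes, and the per-column missing probabilities are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows, n_cols = 100, 12

# Random numeric values for every cell
values = rng.normal(size=(n_rows, n_cols))

# Each column gets its own probability of a value being missing
na_probability = rng.uniform(0, 1, size=n_cols)
mask = rng.uniform(size=(n_rows, n_cols)) < na_probability
values[mask] = np.nan

toy_df = pd.DataFrame(values, columns=[f"feature_{i}" for i in range(n_cols)])

# One arbitrary label per column, in column order (hypothetical labels)
toy_feature_types = rng.choice(["Categorical", "Binary", "Continuous"], size=n_cols)

print(toy_df.isna().mean().round(2))
```

Any dataframe with missing values and an array with one label per column should be usable in the same way as df and feature_types in the examples here.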
This toy dataframe contains 300 columns. Visualizing its missing data with a heatmap would unfortunately be too bulky. How can you explore the distribution of missing data in this dataset? Try an NA text barplot!
>>> na_text_barplot(df, hue=feature_types, line_height=1.5)
Columns of the dataset are binned by the percentage of missing data in them. Colouring by feature type helps to understand which types of data are missing. On the Y-axis you can see the number of features in each group.
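To make the binning idea concrete, here is a minimal sketch of how columns could be grouped by their share of missing values using plain pandas. It only illustrates the concept; the helper bin_columns_by_na_share is hypothetical and nafig's actual implementation may differ (for example, in how bin edges are chosen).

```python
import numpy as np
import pandas as pd


def bin_columns_by_na_share(df: pd.DataFrame, num_bins: int = 10) -> pd.Series:
    """Assign every column to a bin by its percentage of missing values."""
    # Percentage of missing values per column, from 0 to 100
    na_percent = df.isna().mean() * 100
    # Equal-width bin edges covering the whole 0-100% range
    edges = np.linspace(0, 100, num_bins + 1)
    # include_lowest=True keeps columns without missing values in the first bin
    return pd.cut(na_percent, bins=edges, include_lowest=True)


toy_df = pd.DataFrame(
    {
        "complete": [1.0, 2.0, 3.0, 4.0],
        "quarter_missing": [1.0, None, 3.0, 4.0],
        "mostly_missing": [None, None, None, 4.0],
    }
)
bins = bin_columns_by_na_share(toy_df, num_bins=4)

# Number of columns per bin, which corresponds to the bar heights
features_per_bin = bins.value_counts(sort=False)
print(features_per_bin)
```

In this sketch the bins are right-closed, so a column with exactly 25% missing values lands in the same bin as a complete column when num_bins=4.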
You can vary the number of bins using the num_bins parameter:
>>> na_text_barplot(df, hue=feature_types, line_height=1.5, num_bins=20)
>>> na_text_barplot(df, hue=feature_types, line_height=2, num_bins=2, fig_width=8, font_size=3)
Now let's see some real data examples!
Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
>>> DATA_PATH = "data/house-prices/train.csv"
>>> house_prices_df = pd.read_csv(DATA_PATH, index_col=0)
This is a reasonably good dataset with most of the values present. But thanks to this plot, we can see which features are the bad guys!
>>> na_text_barplot(house_prices_df, fig_width=17, num_bins=20, line_height=1.5)
Note that if you don't pass the hue parameter, features will be colored by the data type of the column. If you don't want to colorize features at all, set hue to False.
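If you want to see which labels a dtype-based colouring would produce, or prepare your own labels, something like the sketch below may help. It assumes only pandas and NumPy; how nafig derives its default labels internally is not shown in this README, and the "Numeric"/"Other" grouping is a made-up example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy_df = pd.DataFrame(
    {
        "rooms": [1, 2, 3],
        "price": [100.5, None, 250.0],
        "city": ["Seattle", None, "Boston"],
    }
)

# One label per column, taken from the pandas dtype of that column
dtype_labels = toy_df.dtypes.astype(str).to_numpy()

# A coarser, hand-made alternative: numeric columns versus everything else
coarse_labels = np.array(
    ["Numeric" if is_numeric_dtype(toy_df[col]) else "Other" for col in toy_df.columns]
)
print(dtype_labels, coarse_labels)
```

An array like coarse_labels has one entry per column, which is the same shape as the feature_types array passed as hue in the earlier examples.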
By setting remove_empty_bins to True, you can remove the empty bins. This requires the reader to pay more attention to the X-axis but saves you some space.
>>> na_text_barplot(house_prices_df, fig_width=10, num_bins=20,
...                 line_height=1.5, remove_empty_bins=True)
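The effect of dropping empty bins can be illustrated without plotting anything. The sketch below uses hypothetical per-column percentages and plain pandas; it is an assumption about the idea behind remove_empty_bins, not a copy of nafig's code.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column percentages of missing values
na_percent = pd.Series({"a": 0.0, "b": 2.5, "c": 4.0, "d": 97.0})

# 20 equal-width bins over the 0-100% range
edges = np.linspace(0, 100, 21)
counts = pd.cut(na_percent, bins=edges, include_lowest=True).value_counts(sort=False)

# Keep only the bins that contain at least one column
non_empty = counts[counts > 0]
print(f"{len(counts)} bins in total, {len(non_empty)} non-empty")
```

With mostly complete data, nearly all columns fall into the first few bins, so removing the empty ones shortens the X-axis considerably, at the cost of an axis that is no longer evenly spaced.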
Data source: https://www.kaggle.com/datasets/airbnb/seattle
>>> airbnb_df = pd.read_csv("data/airbnb/listings.csv")
This dataset has a bit more missing data. On the plot we can see that all integer features are almost complete, while some object and floating-point columns contain missing values.
>>> na_text_barplot(airbnb_df, fig_width=18, line_height=1.8, font_size=9, remove_empty_bins=True)
Feel free to explore other parameters! There are more to help you create a perfect missing values visualization.
- Support for Python 3.9 and higher.
- Poetry as the dependencies manager. See configuration in pyproject.toml and setup.cfg.
- Automatic codestyle with black, isort and pyupgrade.
- Ready-to-use pre-commit hooks with code-formatting.
- Type checks with mypy; docstring checks with darglint; security checks with safety and bandit.
- Testing with pytest.
- Ready-to-use .editorconfig, .dockerignore, and .gitignore. You don't have to worry about those things.
- GitHub integration: issue and PR templates.
- GitHub Actions with predefined build workflow as the default CI/CD.
- Everything is already set up for security checks, codestyle checks, code formatting, testing, linting, docker builds, etc. with Makefile. More details in makefile-usage.
- Dockerfile for your package.
- Always up-to-date dependencies with @dependabot. You only need to enable it.
- Automatic drafts of new releases with Release Drafter. You may see the list of labels in release-drafter.yml. Works perfectly with Semantic Versions specification.
Makefile contains a lot of functions for faster development.
1. Download and remove Poetry
To download and install Poetry run:
make poetry-download
To uninstall it, run:
make poetry-remove
2. Install all dependencies and pre-commit hooks
Install requirements:
make install
Pre-commit hooks can be installed after git init via
make pre-commit-install
3. Codestyle
Automatic formatting uses pyupgrade, isort and black.
make codestyle
# or use synonym
make formatting
Codestyle checks only, without rewriting files:
make check-codestyle
Note: check-codestyle uses the isort, black and darglint libraries.
Update all dev libraries to the latest version using one command:
make update-dev-deps
4. Code security
make check-safety
This command launches Poetry integrity checks and identifies security issues with Safety and Bandit.
5. Type checks
Run the mypy static type checker:
make mypy
6. Tests with coverage badges
Run pytest
make test
7. All linters
Of course there is a command to run all linters in one:
make lint
the same as:
make test && make check-codestyle && make mypy && make check-safety
8. Docker
make docker-build
which is equivalent to:
make docker-build VERSION=latest
Remove docker image with
make docker-remove
More information about docker.
9. Cleanup
Delete pycache files
make pycache-remove
Remove package build
make build-remove
Delete .DS_STORE files
make dsstore-remove
Remove .mypycache
make mypycache-remove
Or to remove all above run:
make cleanup
You can see the list of available releases on the GitHub Releases page.
We follow Semantic Versions specification.
We use Release Drafter. As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you’re ready. With the categories option, you can categorize pull requests in release notes using labels.
Label | Title in Releases |
---|---|
enhancement, feature | 🚀 Features |
bug, refactoring, bugfix, fix | 🔧 Fixes & Refactoring |
build, ci, testing | 📦 Build System & CI/CD |
breaking | 💥 Breaking Changes |
documentation | 📝 Documentation |
dependencies | ⬆️ Dependencies updates |
You can update it in release-drafter.yml.
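As a rough illustration of the label-to-section mapping from the table above, here is a small Python sketch. It is not Release Drafter's code or configuration format; the function release_sections is hypothetical, and how Release Drafter itself treats a pull request whose labels match several sections is not covered here.

```python
# Label-to-section mapping copied from the table above
SECTION_BY_LABEL = {
    "enhancement": "🚀 Features",
    "feature": "🚀 Features",
    "bug": "🔧 Fixes & Refactoring",
    "refactoring": "🔧 Fixes & Refactoring",
    "bugfix": "🔧 Fixes & Refactoring",
    "fix": "🔧 Fixes & Refactoring",
    "build": "📦 Build System & CI/CD",
    "ci": "📦 Build System & CI/CD",
    "testing": "📦 Build System & CI/CD",
    "breaking": "💥 Breaking Changes",
    "documentation": "📝 Documentation",
    "dependencies": "⬆️ Dependencies updates",
}


def release_sections(pr_labels):
    """Return the release-note sections matching a pull request's labels."""
    sections = []
    for label in pr_labels:
        section = SECTION_BY_LABEL.get(label)
        # Skip unknown labels and avoid listing the same section twice
        if section is not None and section not in sections:
            sections.append(section)
    return sections


print(release_sections(["bug", "fix", "ci"]))
print(release_sections(["question"]))
```

Labels that are missing from the mapping, such as the made-up "question" label here, simply match no section in this sketch.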
GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository when you need them.
This project is licensed under the terms of the MIT license. See LICENSE for more details.
@misc{nafig,
author = {VladimirShitov},
title = {Package for plotting figures with NA data distribution},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/VladimirShitov/nafig}}
}
This project was generated with python-package-template