Skip to content

Commit

Permalink
Merge pull request #129 from paulsengroup/feat/stats
Browse files Browse the repository at this point in the history
Implement several methods to efficiently compute various descriptive statistics given PixelSelector object
  • Loading branch information
robomics authored Nov 30, 2024
2 parents 7d4bb26 + c4c1d73 commit 25a957e
Show file tree
Hide file tree
Showing 20 changed files with 2,164 additions and 60 deletions.
11 changes: 11 additions & 0 deletions .github/workflows/fuzzy-testing.yml
Original file line number Diff line number Diff line change
Expand Up @@ -266,6 +266,17 @@ jobs:
*".${{ matrix.format }}" \
*.cool
- name: Run test (describe)
run: |
test/scripts/fuzzer.py \
--resolution ${{ steps.test-params.outputs.resolution }} \
--duration '${{ steps.test-params.outputs.duration }}' \
--normalization ${{ matrix.normalization }} \
--nproc $(nproc) \
--format describe \
*".${{ matrix.format }}" \
*.cool
fuzzy-testing-status-check:
name: Status Check (fuzzy-testing)
if: ${{ always() }}
Expand Down
38 changes: 36 additions & 2 deletions docs/api/generic.rst
Original file line number Diff line number Diff line change
Expand Up @@ -48,15 +48,49 @@ Generic API

.. automethod:: coord1
.. automethod:: coord2
.. automethod:: nnz
.. automethod:: sum
.. automethod:: to_arrow
.. automethod:: to_coo
.. automethod:: to_csr
.. automethod:: to_df
.. automethod:: to_numpy
.. automethod:: to_pandas

**Statistics**

:py:class:`hictkpy.PixelSelector` exposes several methods to compute or estimate several statistics efficiently.

The main features of these methods are:

* All statistics are computed by traversing the data only once and without caching interactions.
* All methods can be tweaked to include or exclude non-finite values.
* All functions implemented using short-circuiting to detect scenarios where the required statistics can be computed without traversing all pixels.

The following statistics are guaranteed to be exact:

* nnz
* sum
* min
* max
* mean

The rest of the supported statistics (currently variance, skewness, and kurtosis) are estimated and are thus not guaranteed to be exact.
However, in practice, the estimation is usually very accurate (relative error < 1.0e-6).

You can instruct hictkpy to compute the exact statistics by passing ``exact=True`` to :py:meth:`hictkpy.PixelSelector.describe()` and related methods.
It should be noted that for large queries this will result in slower computations and higher memory usage.

.. automethod:: describe
.. automethod:: kurtosis
.. automethod:: max
.. automethod:: mean
.. automethod:: min
.. automethod:: nnz
.. automethod:: skewness
.. automethod:: sum
.. automethod:: variance

**Iteration**

.. automethod:: __iter__

.. code-block:: ipythonconsole
Expand Down
4 changes: 4 additions & 0 deletions src/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,10 @@ find_package(
)

find_package(Arrow REQUIRED)
find_package(Boost REQUIRED)
find_package(FMT REQUIRED)
find_package(nanobind REQUIRED)
find_package(phmap REQUIRED)
find_package(spdlog REQUIRED)
find_package(Filesystem REQUIRED)

Expand Down Expand Up @@ -55,7 +57,9 @@ target_link_libraries(
hictk::hic
hictk::transformers
Arrow::arrow_$<IF:$<BOOL:${BUILD_SHARED_LIBS}>,shared,static>
Boost::headers
fmt::fmt-header-only
phmap
spdlog::spdlog_header_only
std::filesystem
)
Expand Down
4 changes: 2 additions & 2 deletions src/cooler_file_writer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,7 @@ void CoolerFileWriter::add_pixels(const nb::object &df) {

std::visit(
[&](const auto &n) {
using N = hictk::remove_cvref_t<decltype(n)>;
using N = remove_cvref_t<decltype(n)>;
const auto pixels = coo_format ? coo_df_to_thin_pixels<N>(df, true)
: bg2_df_to_thin_pixels<N>(_w->bins(), df, true);
lck.reset();
Expand Down Expand Up @@ -138,7 +138,7 @@ hictk::File CoolerFileWriter::finalize(std::string_view log_lvl_str, std::size_t
try {
std::visit(
[&](const auto &num) {
using N = hictk::remove_cvref_t<decltype(num)>;
using N = remove_cvref_t<decltype(num)>;
_w->aggregate<N>(_path.string(), false, _compression_lvl, chunk_size, update_freq);
},
_w->open("0").pixel_variant());
Expand Down
15 changes: 15 additions & 0 deletions src/include/hictkpy/common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,15 +6,30 @@

#include <hictk/common.hpp>
#include <hictk/numeric_variant.hpp>
#include <hictk/type_traits.hpp>
#include <stdexcept>
#include <string>
#include <string_view>
#include <type_traits>

namespace hictkpy {

#define HICTKPY_LIKELY HICTK_LIKELY
#define HICTKPY_UNLIKELY HICTK_UNLIKELY

[[noreturn]] inline void unreachable_code() { hictk::unreachable_code(); }

template <typename T, typename U>
[[maybe_unused]] [[nodiscard]] constexpr T conditional_static_cast(U value) {
return hictk::conditional_static_cast<T>(value);
}

template <typename T>
using remove_cvref = hictk::remove_cvref<T>;

template <typename T>
using remove_cvref_t = typename remove_cvref<T>::type;

template <typename T>
[[nodiscard]] constexpr std::string_view map_type_to_dtype() {
if constexpr (std::is_unsigned_v<T>) {
Expand Down
135 changes: 135 additions & 0 deletions src/include/hictkpy/pixel_aggregator.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
// Copyright (C) 2024 Roberto Rossini <[email protected]>
//
// SPDX-License-Identifier: MIT

#pragma once

#include <parallel_hashmap/phmap.h>

#include <array>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

namespace hictkpy {

struct Stats {
std::optional<std::int64_t> nnz{};
std::optional<std::variant<std::int64_t, double>> sum{};
std::optional<std::variant<std::int64_t, double>> min{};
std::optional<std::variant<std::int64_t, double>> max{};
std::optional<double> mean{};
std::optional<double> variance{};
std::optional<double> skewness{};
std::optional<double> kurtosis{};
};

class PixelAggregator {
template <typename N>
using Accumulator = boost::accumulators::accumulator_set<
N, boost::accumulators::stats<
boost::accumulators::droppable<boost::accumulators::tag::count>,
boost::accumulators::droppable<boost::accumulators::tag::sum>,
boost::accumulators::droppable<boost::accumulators::tag::min>,
boost::accumulators::droppable<boost::accumulators::tag::max>,
boost::accumulators::droppable<boost::accumulators::tag::mean>,
boost::accumulators::droppable<boost::accumulators::tag::variance>,
boost::accumulators::droppable<boost::accumulators::tag::skewness>,
boost::accumulators::droppable<boost::accumulators::tag::kurtosis>>>;

// When using ints there is a high chance that computing the 3rd or 4th moments causes an int
// overflow, leading to completely incorrect results.
// At the same time we cannot use doubles everywhere, because for large numbers the value of
// e.g. sum, mean etc. will no longer be exact
using KurtosisAccumulator = boost::accumulators::accumulator_set<
double, boost::accumulators::stats<boost::accumulators::tag::kurtosis>>;

std::variant<Accumulator<std::int64_t>, Accumulator<double>> _accumulator{
Accumulator<std::int64_t>{}};
std::optional<KurtosisAccumulator> _kurtosis_accumulator{};

bool _compute_count{true};
bool _compute_sum{true};
bool _compute_min{true};
bool _compute_max{true};
bool _compute_mean{true};
bool _compute_variance{true};
bool _compute_skewness{true};
bool _compute_kurtosis{true};

bool _finite_found{false};
bool _nan_found{false};
bool _neg_inf_found{false};
bool _pos_inf_found{false};

public:
static constexpr std::array<std::string_view, 8> valid_metrics{
"nnz", "sum", "min", "max", "mean", "variance", "skewness", "kurtosis"};

PixelAggregator() = default;

template <typename N, bool keep_nans, bool keep_infs, typename PixelSelector>
[[nodiscard]] Stats compute(const PixelSelector& sel,
const phmap::flat_hash_set<std::string>& metrics, bool exact);

private:
static void validate_metrics(const phmap::flat_hash_set<std::string>& metrics);

template <bool keep_nans, bool keep_infs, typename N>
void update_finiteness_counters(N n) noexcept;
template <bool keep_nans, bool keep_infs, bool skip_kurtosis, typename N, typename It,
typename StopCondition>
[[nodiscard]] std::pair<It, std::size_t> process_pixels_until_true(Accumulator<N>& accumulator,
It first, It last,
StopCondition break_loop);

template <typename N>
[[nodiscard]] bool sum_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool min_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool max_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] bool mean_can_be_skipped(Accumulator<N>& accumulator) const;
template <typename N>
void disable_redundant_accumulators(Accumulator<N>& accumulator);

template <bool keep_nans, bool keep_infs, typename N, typename It>
void process_all_remaining_pixels(Accumulator<N>& accumulator, It&& first, It&& last);

template <typename N>
void reset(const phmap::flat_hash_set<std::string>& metrics);

template <typename N>
[[nodiscard]] Stats extract(const phmap::flat_hash_set<std::string>& metrics);
template <typename N, std::size_t keep_nans, std::size_t keep_infs, typename It>
[[nodiscard]] std::optional<Stats> handle_edge_cases(
It first, It last, const phmap::flat_hash_set<std::string>& metrics);

template <typename N>
[[nodiscard]] static std::int64_t extract_nnz(const Accumulator<N>& accumulator);
template <typename N>
[[nodiscard]] N extract_sum(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] N extract_min(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] N extract_max(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_mean(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_variance(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_skewness(const Accumulator<N>& accumulator) const;
template <typename N>
[[nodiscard]] double extract_kurtosis(const Accumulator<N>& accumulator) const;
};

} // namespace hictkpy

#include "../../pixel_aggregator_impl.hpp"
26 changes: 18 additions & 8 deletions src/include/hictkpy/pixel_selector.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
#include <string_view>
#include <tuple>
#include <variant>
#include <vector>

#include "hictkpy/nanobind.hpp"

Expand Down Expand Up @@ -54,14 +55,23 @@ struct PixelSelector {
[[nodiscard]] auto get_coord2() const -> GenomicCoordTuple;

[[nodiscard]] nanobind::iterator make_iterable() const;
[[nodiscard]] nanobind::object to_arrow(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_df(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_pandas(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_coo(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_csr(std::string_view span = "upper_triangle") const;
[[nodiscard]] nanobind::object to_numpy(std::string_view span = "full") const;
[[nodiscard]] nanobind::object sum() const;
[[nodiscard]] std::int64_t nnz() const;
[[nodiscard]] nanobind::object to_arrow(std::string_view span) const;
[[nodiscard]] nanobind::object to_df(std::string_view span) const;
[[nodiscard]] nanobind::object to_pandas(std::string_view span) const;
[[nodiscard]] nanobind::object to_coo(std::string_view span) const;
[[nodiscard]] nanobind::object to_csr(std::string_view span) const;
[[nodiscard]] nanobind::object to_numpy(std::string_view span) const;

[[nodiscard]] nanobind::dict describe(const std::vector<std::string>& metrics, bool keep_nans,
bool keep_infs, bool exact) const;
[[nodiscard]] std::int64_t nnz(bool keep_nans, bool keep_infs) const;
[[nodiscard]] nanobind::object sum(bool keep_nans, bool keep_infs) const;
[[nodiscard]] nanobind::object min(bool keep_nans, bool keep_infs) const;
[[nodiscard]] nanobind::object max(bool keep_nans, bool keep_infs) const;
[[nodiscard]] double mean(bool keep_nans, bool keep_infs) const;
[[nodiscard]] double variance(bool keep_nans, bool keep_infs, bool exact) const;
[[nodiscard]] double skewness(bool keep_nans, bool keep_infs, bool exact) const;
[[nodiscard]] double kurtosis(bool keep_nans, bool keep_infs, bool exact) const;

[[nodiscard]] static auto parse_span(std::string_view span) -> QuerySpan;
[[nodiscard]] static auto parse_count_type(std::string_view type) -> PixelVar;
Expand Down
Loading

0 comments on commit 25a957e

Please sign in to comment.