Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement EulerFD approximate discovering FD algorithm #414

Merged
merged 6 commits into from
Jan 11, 2025

Conversation

mitya-y
Copy link
Contributor

@mitya-y mitya-y commented May 11, 2024

Implement approximate FD (Functional Dependencies) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, including a custom random option to ensure consistent test results across different systems. Utilize a custom random function to calculate answers of the current EulerFD version for each test dataset with a selected seed.
Integrate the EulerFD algorithm into Python bindings and the Python console.

For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.

@mitya-y mitya-y force-pushed the euler-fd-development branch 3 times, most recently from 4c95ec1 to 2af6292 Compare May 13, 2024 17:23
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
}

void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) {
auto node_copy = node;

This comment was marked as resolved.

This comment was marked as resolved.

}

std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) {
auto current_node = root_;

This comment was marked as resolved.

This comment was marked as resolved.

std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) {
return left.count() > right.count();
});
fd_num += Invert(real_rhs, neg);

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

src/core/config/custom_random/option.cpp Outdated Show resolved Hide resolved
src/core/config/descriptions.h Outdated Show resolved Hide resolved
src/python_bindings/py_util/py_to_any.cpp Outdated Show resolved Hide resolved
src/tests/test_fd_util.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/mlfq.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/search_tree.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved

// in each column mapping string values into integer values.
// using only hash isnt good idea because colisions dont processing
std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to use string_view to alleviate unnecessary copy

Copy link
Contributor Author

@mitya-y mitya-y Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no unnecessary copy, because all maps "indexes" (strings) was moved (values[std::move(line[i])] = id, line 54) from std::vector<std::string>, on which allocation I can't influence, because it is a result of input_table_->GetNextRow() method.

src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved

class MLFQ {
private:
using Queue = std::pair<std::queue<Cluster *>, double>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot find where double is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be barrier values, but now it isn't used, because I use log instead.
Should I remove it?


namespace algos {

void Cluster::ShuffleData(RandomStrategy rand) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use std::shuffle instead of custom shuffling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).

This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.

}

double Cluster::GetAverage() const {
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

Copy link
Contributor Author

@mitya-y mitya-y Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where const?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const double sum = ....

just for readability, not necessary


Cluster *MLFQ::Get() {
if (actual_queue_ >= 0) {
Cluster *save = queues_[actual_queue_].first.front();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why such name save?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I save a pointer on cluster in this variable, and return it in end of function.

double EulerFD::SamlingInCluster(Cluster *cluster) {
return cluster->Sample([this](size_t t1, size_t t2) -> size_t {
Bitset agree_set = BuildAgreeSet(t1, t2);
auto &&[_, result] = invalids_.insert(agree_set);

This comment was marked as resolved.

src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.h Outdated Show resolved Hide resolved
break;
}

tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));
tuples_.emplace_back(number_of_attributes_);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think first is variant is better, because it is more explicit, but hasn't overhead, because will be called move constructor.

@mitya-y mitya-y force-pushed the euler-fd-development branch from 2af6292 to 012944f Compare September 23, 2024 09:28
src/core/algorithms/fd/eulerfd/eulerfd.cpp Show resolved Hide resolved
}

double Cluster::GetAverage() const {
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const double sum = ....

just for readability, not necessary

src/core/config/custom_random/type.h Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
src/tests/test_fd_approximate.cpp Outdated Show resolved Hide resolved
@mitya-y mitya-y force-pushed the euler-fd-development branch from 012944f to 056a15e Compare January 6, 2025 21:27
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/core/algorithms/fd/eulerfd/eulerfd.cpp Outdated Show resolved Hide resolved
@mitya-y mitya-y force-pushed the euler-fd-development branch 2 times, most recently from afd46e6 to b651c44 Compare January 8, 2025 20:17
…for set random strategy and seed (it will be necessary for writing tests)
…om random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed
@chernishev chernishev force-pushed the euler-fd-development branch from f279426 to 21a2efb Compare January 10, 2025 09:46
@mitya-y mitya-y force-pushed the euler-fd-development branch from 8a37db6 to 1c91ed4 Compare January 10, 2025 15:28
@mitya-y mitya-y force-pushed the euler-fd-development branch 2 times, most recently from eb3c766 to b3812ab Compare January 10, 2025 15:37
@mitya-y mitya-y force-pushed the euler-fd-development branch from b3812ab to 2a4cca3 Compare January 10, 2025 15:45
examples/datasets/adult.csv Outdated Show resolved Hide resolved
src/python_bindings/py_util/py_to_any.cpp Outdated Show resolved Hide resolved
src/python_bindings/py_util/py_to_any.cpp Outdated Show resolved Hide resolved
src/core/algorithms/fd/eulerfd/cluster.cpp Outdated Show resolved Hide resolved
src/core/config/descriptions.h Outdated Show resolved Hide resolved
@mitya-y mitya-y force-pushed the euler-fd-development branch from 2a4cca3 to 3edd29f Compare January 11, 2025 00:44
Copy link
Collaborator

@BUYT-1 BUYT-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks okay

@chernishev chernishev merged commit 1f4bcbd into Desbordante:main Jan 11, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants