Implement EulerFD approximate discovering FD algorithm #414

mitya-y · 2024-05-11T15:13:00Z

Implement approximate FD (Functional Dependencies) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, together with a custom random option that keeps test results consistent across different systems. The expected answers for each test dataset are calculated by the current EulerFD version using the custom random function with a selected seed.
Integrate the EulerFD algorithm into Python bindings and the Python console.

For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.

src/core/algorithms/fd/eulerfd/cluster.cpp

src/core/algorithms/fd/eulerfd/mlfq.h

src/core/algorithms/fd/eulerfd/search_tree.cpp

+}
+
+void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) {
+    auto node_copy = node;


src/core/algorithms/fd/eulerfd/search_tree.cpp

+}
+
+std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) {
+    auto current_node = root_;


src/core/algorithms/fd/eulerfd/eulerfd.cpp

+        std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) {
+            return left.count() > right.count();
+        });
+        fd_num += Invert(real_rhs, neg);


src/core/config/custom_random/option.cpp

src/core/config/descriptions.h

src/python_bindings/py_util/py_to_any.cpp

src/tests/test_fd_util.h

src/core/algorithms/fd/eulerfd/cluster.cpp

src/core/algorithms/fd/eulerfd/mlfq.h

src/core/algorithms/fd/eulerfd/mlfq.cpp

src/core/algorithms/fd/eulerfd/search_tree.h

src/core/algorithms/fd/eulerfd/search_tree.cpp

src/core/algorithms/fd/eulerfd/eulerfd.cpp

slesarev-hub · 2024-07-09T18:52:31Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+
+    // In each column, map string values to integer values.
+    // Using only a hash is not a good idea because collisions would not be handled.
+    std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_);


Try using string_view to avoid an unnecessary copy.

There is no unnecessary copy: every map key (string) is moved (values[std::move(line[i])] = id, line 54) out of the std::vector<std::string> returned by input_table_->GetNextRow(), and I cannot influence how that vector is allocated.
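
The move-into-map idea from this reply can be sketched as follows. This is a minimal illustration, not the code from the PR: `EncodeRow` is a hypothetical helper, and it assumes `columns` has at least as many entries as the row has fields.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: encodes one row of strings as per-column integer ids.
// columns[i] maps each distinct value seen in column i to a dense id.
std::vector<std::size_t> EncodeRow(
        std::vector<std::string> line,
        std::vector<std::unordered_map<std::string, std::size_t>>& columns) {
    std::vector<std::size_t> ids(line.size());
    for (std::size_t i = 0; i < line.size(); ++i) {
        auto& values = columns[i];
        // try_emplace moves the string into the map only when the key is new;
        // for an existing key the stored id is reused and nothing is copied.
        auto [it, inserted] = values.try_emplace(std::move(line[i]), values.size());
        ids[i] = it->second;
    }
    return ids;
}
```

Because the row is already an owning `std::vector<std::string>`, moving each string into the map transfers the buffer instead of copying it, which is the point made in the reply.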

src/core/algorithms/fd/eulerfd/eulerfd.cpp

slesarev-hub · 2024-08-05T20:10:23Z

src/core/algorithms/fd/eulerfd/mlfq.h

+
+class MLFQ {
+private:
+    using Queue = std::pair<std::queue<Cluster *>, double>;


I cannot find where the double is used.

It was meant to hold the barrier value, but it is unused now because I use a logarithm instead.
Should I remove it?

slesarev-hub · 2024-08-05T20:14:16Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+
+namespace algos {
+
+void Cluster::ShuffleData(RandomStrategy rand) {


Could we use std::shuffle instead of custom shuffling?

No, because the implementation of std::shuffle depends on the STL implementation, but we want the same permutation of the array on every platform and compiler (this is necessary for consistent hash values in the tests).

This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.

slesarev-hub · 2024-08-05T20:16:36Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+}
+
+double Cluster::GetAverage() const {
+    double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);


Where is const?

const double sum = ....

just for readability, not necessary
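
For illustration, the suggested `const` could look like the sketch below. `Average` is a hypothetical free function standing in for `Cluster::GetAverage`; the division by the element count and the handling of an empty history are assumptions, since the rest of the method is not shown in the thread.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for the averaging logic discussed above.
double Average(std::vector<double> const& hist_effects) {
    if (hist_effects.empty()) {
        return 0.0;  // assumption: an empty history averages to zero
    }
    // const documents that the sum is computed once and never modified.
    double const sum = std::accumulate(hist_effects.begin(), hist_effects.end(), 0.0);
    return sum / static_cast<double>(hist_effects.size());
}
```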

slesarev-hub · 2024-08-05T20:40:02Z

src/core/algorithms/fd/eulerfd/mlfq.cpp

+
+Cluster *MLFQ::Get() {
+    if (actual_queue_ >= 0) {
+        Cluster *save = queues_[actual_queue_].first.front();


Why the name save?

Because I save a pointer to the cluster in this variable and return it at the end of the function.
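
The pattern described in this reply can be sketched in isolation. `TakeFront` is a hypothetical helper, not part of the PR; it only shows why the front element has to be stored before popping, since `std::queue::pop` returns nothing.

```cpp
#include <queue>

// Hypothetical helper: removes and returns the front pointer of a queue,
// or nullptr when the queue is empty.
template <typename T>
T* TakeFront(std::queue<T*>& queue) {
    if (queue.empty()) {
        return nullptr;
    }
    T* save = queue.front();  // keep the pointer, because pop() returns void
    queue.pop();
    return save;
}
```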

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+double EulerFD::SamlingInCluster(Cluster *cluster) {
+    return cluster->Sample([this](size_t t1, size_t t2) -> size_t {
+        Bitset agree_set = BuildAgreeSet(t1, t2);
+        auto &&[_, result] = invalids_.insert(agree_set);


src/core/algorithms/fd/eulerfd/eulerfd.h

slesarev-hub · 2024-08-05T20:59:48Z

src/core/algorithms/fd/eulerfd/eulerfd.cpp

+            break;
+        }
+
+        tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));


Suggested change

tuples_.emplace_back(std::vector<size_t>(number_of_attributes_));

tuples_.emplace_back(number_of_attributes_);

I think the first variant is better: it is more explicit and has no real overhead, because the move constructor will be called.
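
The two variants from the suggestion can be compared side by side. The function names below are hypothetical and exist only for this sketch; both functions append a row of `n` zero-initialized elements.

```cpp
#include <cstddef>
#include <vector>

// Explicit variant: a temporary row is built and then move-constructed
// into the new element.
void AppendRowExplicit(std::vector<std::vector<std::size_t>>& tuples, std::size_t n) {
    tuples.emplace_back(std::vector<std::size_t>(n));
}

// Forwarding variant: n is passed to the size constructor of the inner
// vector, so the row is constructed directly in place.
void AppendRowInPlace(std::vector<std::vector<std::size_t>>& tuples, std::size_t n) {
    tuples.emplace_back(n);
}
```

The results are identical. The difference is one extra move of a vector, which is cheap, so the choice is mostly about readability, as both participants note.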

src/core/algorithms/fd/eulerfd/eulerfd.cpp

slesarev-hub · 2024-10-12T18:33:53Z

src/core/algorithms/fd/eulerfd/cluster.cpp

+}
+
+double Cluster::GetAverage() const {
+    double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0);


const double sum = ....

just for readability, not necessary

src/core/config/custom_random/type.h

src/core/algorithms/fd/eulerfd/eulerfd.cpp

src/tests/test_fd_approximate.cpp

github-actions

clang-tidy made some suggestions

src/core/algorithms/fd/eulerfd/eulerfd.cpp


examples/basic/mining_fd_approximate.py

examples/advanced/comparison_mining_fd_approximate.py

examples/basic/mining_fd_approximate.py

examples/datasets/adult.csv

src/python_bindings/py_util/py_to_any.cpp

src/core/algorithms/fd/eulerfd/cluster.cpp

src/core/config/descriptions.h

BUYT-1

looks okay

mitya-y force-pushed the euler-fd-development branch 3 times, most recently from 4c95ec1 to 2af6292 (May 13, 2024 17:23)

Firsov62121 reviewed May 31, 2024

slesarev-hub reviewed Jul 6, 2024

slesarev-hub reviewed Jul 9, 2024

slesarev-hub reviewed Aug 5, 2024

mitya-y force-pushed the euler-fd-development branch from 2af6292 to 012944f (September 23, 2024 09:28)

slesarev-hub approved these changes Oct 16, 2024

BUYT-1 requested changes Oct 25, 2024

src/core/config/custom_random/type.h (outdated, resolved)

src/core/algorithms/fd/eulerfd/eulerfd.cpp (outdated, resolved)

BUYT-1 requested changes Oct 26, 2024

src/tests/test_fd_approximate.cpp (outdated, resolved)

mitya-y force-pushed the euler-fd-development branch from 012944f to 056a15e (January 6, 2025 21:27)

github-actions bot reviewed Jan 6, 2025

src/core/algorithms/fd/eulerfd/eulerfd.cpp (outdated, resolved)

mitya-y force-pushed the euler-fd-development branch 2 times, most recently from afd46e6 to b651c44 (January 8, 2025 20:17)

mitya-y added 2 commits January 10, 2025 12:46

Implement the algorithm according to the EulerFD article, add option for set random strategy and seed (it will be necessary for writing tests)

9e7b9b0

Create test cases for approximate fd discovering algortihms. Use custom random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed

4aa0665

chernishev force-pushed the euler-fd-development branch from f279426 to 21a2efb (January 10, 2025 09:46)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

chernishev reviewed Jan 10, 2025

examples/advanced/comparison_mining_fd_approximate.py (outdated, resolved)

mitya-y force-pushed the euler-fd-development branch from 8a37db6 to 1c91ed4 (January 10, 2025 15:28)

mitya-y added 2 commits January 10, 2025 18:33

Add EulerFD algorithm and custom_random option in python bindings

a1893f6

Fix some code style issues and add draft examples of using EulerFD

74aa527

mitya-y force-pushed the euler-fd-development branch 2 times, most recently from eb3c766 to b3812ab (January 10, 2025 15:37)

chernishev reviewed Jan 10, 2025

examples/basic/mining_fd_approximate.py (outdated, resolved)

mitya-y force-pushed the euler-fd-development branch from b3812ab to 2a4cca3 (January 10, 2025 15:45)

BUYT-1 requested changes Jan 10, 2025

mitya-y added 2 commits January 11, 2025 03:43

Refactor custom random seed option

603dd8c

Refactor examples of EulerFD using

3edd29f

mitya-y force-pushed the euler-fd-development branch from 2a4cca3 to 3edd29f (January 11, 2025 00:44)

BUYT-1 approved these changes Jan 11, 2025

chernishev merged commit 1f4bcbd into Desbordante:main Jan 11, 2025
20 checks passed


