-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement EulerFD approximate discovering FD algorithm #414
Conversation
4c95ec1
to
2af6292
Compare
} | ||
|
||
void SearchTreeEulerFD::UpdateInterAndUnion(std::shared_ptr<Node> const& node) { | ||
auto node_copy = node; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
} | ||
|
||
std::shared_ptr<SearchTreeEulerFD::Node> SearchTreeEulerFD::FindNode(Bitset const& set) { | ||
auto current_node = root_; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
std::sort(neg.begin(), neg.end(), [](Bitset const &left, Bitset const &right) { | ||
return left.count() > right.count(); | ||
}); | ||
fd_num += Invert(real_rhs, neg); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
||
// in each column mapping string values into integer values. | ||
// using only hash isnt good idea because colisions dont processing | ||
std::vector<std::unordered_map<std::string, size_t>> columns(number_of_attributes_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to use string_view
to alleviate unnecessary copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no unnecessary copy, because all maps "indexes" (strings) was moved (values[std::move(line[i])] = id
, line 54) from std::vector<std::string>
, on which allocation I can't influence, because it is a result of input_table_->GetNextRow()
method.
|
||
class MLFQ { | ||
private: | ||
using Queue = std::pair<std::queue<Cluster *>, double>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot find where double
is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be barrier values, but now it isn't used, because I use log instead.
Should I remove it?
|
||
namespace algos { | ||
|
||
void Cluster::ShuffleData(RandomStrategy rand) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use std::shuffle
instead of custom shuffling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because implementation of std::shuffle depends on STL, but we want have same permutation of array on any platforms and compilers (it is necessary for consistent hash values in test).
This is an important piece of information, so I believe it is a good idea to leave it as a comment in the code somewhere near this method.
} | ||
|
||
double Cluster::GetAverage() const { | ||
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const double sum = ....
just for readability, not necessary
|
||
Cluster *MLFQ::Get() { | ||
if (actual_queue_ >= 0) { | ||
Cluster *save = queues_[actual_queue_].first.front(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why such name save
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I save a pointer on cluster in this variable, and return it in end of function.
double EulerFD::SamlingInCluster(Cluster *cluster) { | ||
return cluster->Sample([this](size_t t1, size_t t2) -> size_t { | ||
Bitset agree_set = BuildAgreeSet(t1, t2); | ||
auto &&[_, result] = invalids_.insert(agree_set); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
break; | ||
} | ||
|
||
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuples_.emplace_back(std::vector<size_t>(number_of_attributes_)); | |
tuples_.emplace_back(number_of_attributes_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think first is variant is better, because it is more explicit, but hasn't overhead, because will be called move constructor.
2af6292
to
012944f
Compare
} | ||
|
||
double Cluster::GetAverage() const { | ||
double sum = std::accumulate(hist_effects_.begin(), hist_effects_.end(), 0.0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const double sum = ....
just for readability, not necessary
012944f
to
056a15e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
afd46e6
to
b651c44
Compare
…for set random strategy and seed (it will be necessary for writing tests)
…om random from utilities. Check EulerFD in these test cases : calculating answer of algorithm in custom seed
f279426
to
21a2efb
Compare
8a37db6
to
1c91ed4
Compare
eb3c766
to
b3812ab
Compare
b3812ab
to
2a4cca3
Compare
2a4cca3
to
3edd29f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks okay
Implement approximate FD (Functional Dependencies) discovery algorithm based on the article "EulerFD: An Efficient Double-Cycle Approximation of Functional Dependencies" by Qiongqiong Lin, Yunfan Gu, Jingyan Sai.
Add unit tests for the approximate FD algorithms, including a custom random option to ensure consistent test results across different systems. Utilize a custom random function to calculate answers of the current EulerFD version for each test dataset with a selected seed.
Integrate the EulerFD algorithm into Python bindings and the Python console.
For more information on EulerFD, refer to the presentation: EulerFD Overview.
Detailed development information and test results can be found here: Development Details.