Starter Project #1: Deduplicating alarms to improve usability of Acto #215
Comments
Discussed the task with @Spedoske and he will take it. The plan is to first use Acto to test the RabbitMQ operator and inspect the testing results, which should give @Spedoske a good sense of how duplicated the alarms are. Then we can implement the dedup feature. @Spedoske please read the papers linked in the task and think about how to do it. It's actually a very challenging task.
I can run Acto now. I got the alarm report and I can also reproduce some of the trials.
@Spedoske before you start implementing anything, let's make sure we do the following two:
Description
Acto finds 56 bugs in 11 operators, but it reports more than two thousand alarms in total, because it reports duplicated alarms for the same bug.
The large number of alarms poses a usability problem for Acto: users have to inspect a large number of alarms every time they run Acto, only to find a few bugs. It also makes Acto's evaluation labor-intensive.
We want to reduce or eliminate the duplicated alarms users have to inspect for each unique bug.
Solution
I have two solutions in mind. The first requires users to inspect the alarms first and then write rules to automate the alarm inspection. It is very actionable, would improve the usability of Acto, and would dramatically reduce Acto's evaluation overhead. The second aims to deduplicate the alarms automatically. It is more ambitious but less concrete than the first solution.
Solution 1: making alarm inspection "one time effort" by writing rules
This solution aims to make alarm inspection for each bug a "one-time effort".
Our experience in inspecting alarms is that the alarms caused by the same bug share similar triggering conditions and root causes. For example, Acto found a bug in cass-operator: cass-operator is unable to delete labels from Pods/Services. This bug is triggered every time Acto tries to delete annotations/labels in the CR. It causes duplicated alarms because multiple properties in cass-operator's CR correspond to annotations/labels.
To make the inspection a one-time effort, we can provide a way for users to describe the mapping from alarms to the bug. For example, from the cass-operator's label/annotation bug described above, we know that the bug will be triggered when Acto deletes any label/annotation property in the CR, and we know which properties in the CR are labels/annotations. Then we can describe the mapping by writing a rule like the following:
This way any alarm corresponding to this bug can be automatically inspected in the future.
The alarm inspection can also be turned into an interactive process: users inspect one alarm and write one rule, and the rule then automatically inspects the remaining alarms corresponding to this bug so that users don't have to inspect them.
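To make the rule idea concrete, here is a minimal Python sketch of how a rule could map alarms to a known bug. It is only an illustration under assumptions: the `Alarm` fields, the `Rule` shape, the property paths, the trial names, and the bug id are all hypothetical and are not Acto's actual alarm format or rule syntax.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; field names are assumptions, not Acto's schema."""
    trial: str
    property_path: str  # CR property that was changed
    operation: str      # kind of change, e.g. "delete" or "update"


@dataclass(frozen=True)
class Rule:
    """Maps every alarm with a matching operation and property path to one bug."""
    bug_id: str
    operation: str
    path_patterns: tuple  # lowercase shell-style patterns over property paths

    def matches(self, alarm: Alarm) -> bool:
        path = alarm.property_path.lower()
        return alarm.operation == self.operation and any(
            fnmatchcase(path, pattern) for pattern in self.path_patterns
        )


def triage(alarms, rules):
    """Split alarms into those explained by a rule and those still to inspect."""
    known, unknown = {}, []
    for alarm in alarms:
        rule = next((r for r in rules if r.matches(alarm)), None)
        if rule is None:
            unknown.append(alarm)
        else:
            known.setdefault(rule.bug_id, []).append(alarm)
    return known, unknown


# Hypothetical rule for the label/annotation deletion bug described above.
label_rule = Rule(
    bug_id="cass-operator-label-deletion",
    operation="delete",
    path_patterns=("*labels*", "*annotations*"),
)

# Hypothetical alarms; the property paths are made up for illustration.
alarms = [
    Alarm("trial-00", "spec.additionalLabels.key", "delete"),
    Alarm("trial-01", "spec.additionalAnnotations.key", "delete"),
    Alarm("trial-02", "spec.size", "update"),
]

known, unknown = triage(alarms, [label_rule])
```

With this input, the two label/annotation deletions are grouped under the single bug id, and only the `spec.size` alarm is left for manual inspection. Matching on the operation plus a path pattern mirrors the observation that alarms from the same bug share a triggering condition.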
Actions:
Solution 2: deduplicating alarms
This is a much more ambitious solution, which is to deduplicate alarms without any manual effort.
There are existing works for bucketing failed tests:
But I am not sure how these techniques can be applied to Acto's alarms.
Acto's alarms are hard to deduplicate because we do not have much information:
The first step for this solution would be to deduplicate the bugs with explicit errors, e.g. crash bugs.
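One possible starting point for crash bugs is to bucket alarms by a normalized signature of the error text. The sketch below is an assumption-heavy illustration, not a technique taken from the works mentioned above: it assumes each crash alarm carries Go-style panic text, and the trial names and stack traces are invented for the example.

```python
import hashlib
import re
from collections import defaultdict

# Details that vary between runs of the same crash and should not affect the signature.
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_GOROUTINE = re.compile(r"goroutine \d+")
_LINE_NO = re.compile(r":\d+")


def crash_signature(error_text: str, top_lines: int = 5) -> str:
    """Hash the first few normalized, non-empty lines of a crash message."""
    lines = []
    for line in error_text.splitlines():
        line = line.strip()
        if not line:
            continue
        line = _HEX.sub("0x?", line)                # pointer values and offsets
        line = _GOROUTINE.sub("goroutine N", line)  # goroutine ids
        line = _LINE_NO.sub(":N", line)             # source line numbers
        lines.append(line)
    key = "\n".join(lines[:top_lines])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def bucket_crashes(crashes: dict) -> dict:
    """Group trial names by crash signature."""
    buckets = defaultdict(list)
    for trial, text in crashes.items():
        buckets[crash_signature(text)].append(trial)
    return dict(buckets)


# Invented crash texts: the first two differ only in goroutine id and addresses.
crashes = {
    "trial-00": (
        "panic: runtime error: invalid memory address or nil pointer dereference\n"
        "goroutine 42 [running]:\n"
        "example.com/operator/controllers.reconcile(0xc000123456)\n"
        "\t/workspace/controllers/reconcile.go:88 +0x1f"
    ),
    "trial-01": (
        "panic: runtime error: invalid memory address or nil pointer dereference\n"
        "goroutine 317 [running]:\n"
        "example.com/operator/controllers.reconcile(0xc000abcdef)\n"
        "\t/workspace/controllers/reconcile.go:88 +0x2a"
    ),
    "trial-02": (
        "panic: runtime error: index out of range [3] with length 3\n"
        "goroutine 7 [running]:\n"
        "example.com/operator/controllers.scale(0xc000111111)\n"
        "\t/workspace/controllers/scale.go:15 +0x3c"
    ),
}

buckets = bucket_crashes(crashes)
```

Here the first two trials fall into one bucket and the third into another, so a user would inspect two crashes instead of three. Hashing only the top lines is a deliberate trade-off: it tolerates differences deeper in the stack, at the risk of merging distinct bugs that crash at the same site.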