Specifically, we repeat the fuzzing experiment (with the same set of extracted constraints) eight times. For each run, we generate 2,000 test inputs for the baseline and 2,000 test inputs for DocTer for all APIs from the three libraries. In each run, we use the same random seed for both the baseline and DocTer; different runs use different seeds. Since it requires significant manual effort to inspect the detected bugs from all eight runs (a total of 2,301 API crashes to examine), we use the number of buggy APIs (i.e., the number of APIs that crash) to measure the fuzzers' effectiveness. Overall, across the eight runs, DocTer detects 172.0 buggy APIs on average, while the baseline detects 115.6 on average. We perform the Mann-Whitney U-test and confirm that the improvement is statistically significant, with a p-value of 0.0004 and a Cohen's d effect size of 8.96 (an effect size above 2.0 is considered huge).
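The two statistics above can be reproduced with a short script. The sketch below uses a pure-Python implementation of Cohen's d (pooled standard deviation) and a normal-approximation Mann-Whitney U test; the per-run counts shown are placeholder values whose means match the reported averages (172.0 and 115.6), not the actual per-run numbers from our experiments.

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (continuity-corrected, no tie correction -- adequate for illustration)."""
    pooled_sorted = sorted(a + b)
    # Assign average ranks, shared among tied values.
    rank = {}
    i = 0
    while i < len(pooled_sorted):
        j = i
        while j < len(pooled_sorted) and pooled_sorted[j] == pooled_sorted[i]:
            j += 1
        for v in pooled_sorted[i:j]:
            rank[v] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(a), len(b)
    u1 = sum(rank[v] for v in a) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu + 0.5) / sigma
    return u, min(1.0, math.erfc(-z / math.sqrt(2)))

# Placeholder per-run buggy-API counts (illustrative only).
docter   = [170, 173, 171, 174, 172, 170, 173, 173]
baseline = [115, 116, 114, 117, 115, 116, 115, 117]
u_stat, p_value = mann_whitney_u(docter, baseline)
d = cohens_d(docter, baseline)
```

In practice, `scipy.stats.mannwhitneyu` computes the same test (including exact p-values for small samples); the hand-rolled version above only keeps the example dependency-free.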
Here, we list the full results of all eight runs for the baseline (Table A) and DocTer (Table B), as well as the breakdown into CI (Table C) and VI (Table D) modes.