
Benchmarking CVD

This page is not normative

This page is not considered a core part of the Vultron Protocol as proposed in the main documentation. Although within the page we might provide guidance in terms of SHOULD, MUST, etc., the content here is not normative.

Our observational analysis supports the conclusion that vulnerability disclosure as currently practiced demonstrates skill. In both data sets examined, our estimated $\alpha_d$ is positive for most $d \in \mathbb{D}$. However, there is uncertainty in our estimates due to the application of the principle of indifference to unobserved data. This principle assumes a uniform distribution across event transitions in the absence of CVD, an assumption we cannot readily test. The spread of the estimates in Observing Skill represents the variance in our samples, not this assumption-based uncertainty. We therefore interpret $\alpha_d$ values near zero as an absence of evidence rather than evidence that skill is absent. While we cannot rule definitively on luck or low skill, values of $\alpha_d > 0.9$ should reliably indicate skillful defenders.

If, as seems plausible from the evidence, further observations of $h$ turn out to be significantly skewed toward the higher end of the poset $(\mathcal{H}, \leq_{\mathbb{D}})$, then it may be useful to calibrate our metrics empirically rather than using the a priori frequencies in Reasoning over Histories as our baseline. An empirical baseline would indicate "more skillful than the average for some set of teams" rather than "more skillful than blind luck."

  • CVD Benchmarks examines what "reasonable" should mean in the context of a "reasonable baseline expectation."
  • MPCVD suggests how the model might be applied to establish benchmarks for CVD processes involving any number of participants.

CVD Benchmarks

As described above, in an ideal CVD situation, each observed history would achieve all 12 desiderata $\mathbb{D}$. Realistically, this is unlikely to happen. We can at least state that we would prefer that most cases reach fix ready before attacks ($\mathbf{F} \prec \mathbf{A}$).

Per the Event Frequency table in Reasoning Over Possible Histories (reproduced below for convenience), even in a world without skill we would expect $\mathbf{F} \prec \mathbf{A}$ to hold in 37.5% of cases.

Expected frequency of $row \prec col$ when events are chosen uniformly from possible transitions in each state:

| $\prec$ | $\mathbf{V}$ | $\mathbf{F}$ | $\mathbf{D}$ | $\mathbf{P}$ | $\mathbf{X}$ | $\mathbf{A}$ |
|---|---|---|---|---|---|---|
| $\mathbf{V}$ | 0 | 1 | 1 | 0.333 | 0.667 | 0.750 |
| $\mathbf{F}$ | 0 | 0 | 1 | 0.111 | 0.333 | 0.375 |
| $\mathbf{D}$ | 0 | 0 | 0 | 0.037 | 0.167 | 0.187 |
| $\mathbf{P}$ | 0.667 | 0.889 | 0.963 | 0 | 0.500 | 0.667 |
| $\mathbf{X}$ | 0.333 | 0.667 | 0.833 | 0.500 | 0 | 0.500 |
| $\mathbf{A}$ | 0.250 | 0.625 | 0.812 | 0.333 | 0.500 | 0 |

This means that $\alpha_{\mathbf{F} \prec \mathbf{A}} < 0$ for anything less than a 0.375 success rate.
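That threshold behavior can be checked directly with the skill metric $\alpha_d = (f^{obs}_d - f_d)/(1 - f_d)$ defined in Discriminating Skill and Luck. A minimal sketch in Python (the function and constant names are ours):

```python
# Skill metric from Discriminating Skill and Luck:
#   alpha_d = (f_obs - f_d) / (1 - f_d)
# where f_d is the no-skill baseline frequency of desideratum d and
# f_obs is the observed frequency across a set of cases.

def alpha(f_obs: float, f_d: float) -> float:
    """Skill relative to the no-skill baseline frequency f_d."""
    return (f_obs - f_d) / (1 - f_d)

# For F preceding A, the table above gives a baseline of 0.375, so
# alpha is negative exactly when the observed rate falls below it.
F_PREC_A_BASELINE = 0.375

assert alpha(0.375, F_PREC_A_BASELINE) == 0.0  # luck-level performance
assert alpha(0.250, F_PREC_A_BASELINE) < 0     # worse than luck
assert alpha(0.800, F_PREC_A_BASELINE) > 0     # better than luck
```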


In fact, we propose to generalize this for any $d \in \mathbb{D}$, such that $\alpha_d$ should be at least some benchmark constant $c_d$:

$$\alpha_d \geq c_d \geq 0$$

where $c_d$ is based on observations of $\alpha_d$ collected across some collection of CVD cases.

We propose as a starting point a naïve benchmark of $c_d = 0$. This is a low bar: it requires only that CVD do better than a process in which possible events are independent and identically distributed (i.i.d.) within each case. For example, given a history in which $\mathbf{V}$, $\mathbf{F}$, and $\mathbf{P}$ have already happened (i.e., the case is in state $q \in VFdPxa$), the events $\mathbf{D}$, $\mathbf{X}$, and $\mathbf{A}$ are equally likely to occur next.
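A sketch of how such a benchmark check might look in practice (the function name and the example thresholds are illustrative, not part of the protocol):

```python
def meets_benchmark(f_obs: float, f_d: float, c_d: float = 0.0) -> bool:
    """True when alpha_d = (f_obs - f_d) / (1 - f_d) is at least c_d."""
    alpha_d = (f_obs - f_d) / (1 - f_d)
    return alpha_d >= c_d

# Naive benchmark c_d = 0: any observed frequency above the no-skill
# baseline passes. For F preceding A, the baseline is 0.375 (per the
# event frequency table).
assert meets_benchmark(0.50, 0.375)
assert not meets_benchmark(0.30, 0.375)

# An empirically calibrated benchmark raises the bar:
assert not meets_benchmark(0.50, 0.375, c_d=0.5)  # alpha = 0.2 < 0.5
```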

The i.i.d. assumption may not be warranted.

We anticipate that event ordering probabilities might be conditional on history: for example, exploit publication may be more likely when the vulnerability is public ($p(\mathbf{X} \mid q \in \mathcal{Q}_P) > p(\mathbf{X} \mid q \in \mathcal{Q}_p)$), or attacks may be more likely when an exploit is public ($p(\mathbf{A} \mid q \in \mathcal{Q}_X) > p(\mathbf{A} \mid q \in \mathcal{Q}_x)$). If the i.i.d. assumption fails to hold for the transition events $\sigma \in \Sigma$, observed frequencies of $h \in \mathcal{H}$ could differ significantly from the rates predicted by the uniform probability assumption behind the Event Frequency table above.

Supporting Observations

Some suggestive observations are:

  • There is reason to suspect that only a fraction of vulnerabilities ever reach the exploit public event $\mathbf{X}$, and fewer still reach the attack event $\mathbf{A}$. Recent work by the Cyentia Institute found that "5% of all CVEs are both observed within organizations AND known to be exploited," which suggests that $f_{\mathbf{D} \prec \mathbf{A}} \geq 0.95$.

  • Likewise, $\mathbf{D} \prec \mathbf{X}$ holds in 28 of the 70 possible histories (0.4) $h \in \mathcal{H}$. However, Cyentia found that "15.6% of all open vulnerabilities observed across organizational assets in our sample have known exploits," which suggests that $f_{\mathbf{D} \prec \mathbf{X}} \geq 0.844$.
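The arithmetic behind those two bounds is simply the complement of each reported percentage; a quick check, assuming the Cyentia percentages translate directly into case frequencies:

```python
# Complement arithmetic for the two Cyentia figures quoted above.
exploited_and_observed = 0.05  # "5% of all CVEs ... known to be exploited"
open_with_exploit = 0.156      # "15.6% ... have known exploits"

# If attacks (A) occur in at most 5% of cases, then fix deployment (D)
# preceded any attack in at least 95% of cases; likewise for public
# exploits (X).
f_obs_d_prec_a = 1 - exploited_and_observed
f_obs_d_prec_x = 1 - open_with_exploit

assert abs(f_obs_d_prec_a - 0.95) < 1e-9
assert abs(f_obs_d_prec_x - 0.844) < 1e-9
```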

We might therefore expect to find many vulnerabilities remaining indefinitely in state $VFDPxa$.

On their own, these observations could equally well support the idea that we are broadly observing skill in vulnerability response rather than that the world is biased by some other cause. However, we could choose a slightly different goal than differentiating skill from the "blind luck" represented by the i.i.d. assumption: one could instead aim to measure "more skillful than the average for some set of teams" rather than "more skillful than blind luck."

If this were the "reasonable" baseline expectation, the primary limitation would be the availability of observations. This model helps overcome that limitation by providing a clear path toward collecting relevant observations. For example, by collecting dates for the six events $\sigma \in \Sigma$ across a large sample of vulnerabilities, we can better estimate the relative frequency of each history $h \in \mathcal{H}$ in the real world. Better data would likely serve more to improve benchmarks than to change expectations about the role of chance.
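That collection process can be sketched as follows, assuming a hypothetical per-case record of event dates (the record format and field names are ours, not part of the protocol):

```python
from collections import Counter
from datetime import date

# Hypothetical input: one dict per case mapping each of the six events
# V, F, D, P, X, A to its observed date, or None if the event has not
# (yet) occurred. Sorting a case's observed events by date recovers its
# history h; tallying histories across cases yields empirical
# frequencies against which benchmarks could be calibrated.

EVENTS = ["V", "F", "D", "P", "X", "A"]

def history(dates: dict) -> tuple:
    """Return the event sequence h for one case, ordered by date."""
    observed = [(d, e) for e in EVENTS if (d := dates.get(e)) is not None]
    return tuple(e for _, e in sorted(observed))

cases = [  # toy data, purely illustrative
    {"V": date(2024, 1, 2), "F": date(2024, 2, 1), "P": date(2024, 2, 15),
     "D": date(2024, 3, 1), "X": date(2024, 4, 1), "A": None},
    {"P": date(2024, 1, 1), "V": date(2024, 1, 3), "F": date(2024, 2, 9),
     "D": date(2024, 5, 1), "X": None, "A": None},
]

freq = Counter(history(c) for c in cases)
```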

Interpreting Frequency Observations as Skill Benchmarks

As an applied example, if we take the first item in the list above as a broad observation of $f^{obs}_{\mathbf{D} \prec \mathbf{A}} = 0.95$, we can plug it into

$$\alpha_d \stackrel{\mathsf{def}}{=} \frac{f^{obs}_d - f_d}{1 - f_d}$$

from Discriminating Skill and Luck to get a potential benchmark of $\alpha_{\mathbf{D} \prec \mathbf{A}} = 0.94$, which is considerably higher than the naïve generic benchmark of $\alpha_d = 0$. It also implies that we should expect actual observations of histories $h \in \mathcal{H}$ to skew toward the 19 histories in which $\mathbf{D} \prec \mathbf{A}$ nearly 20x as often as toward the 51 histories in which $\mathbf{A} \prec \mathbf{D}$. Similarly, if we interpret the second item as a broad observation of $f^{obs}_{\mathbf{D} \prec \mathbf{X}} = 0.844$, we can compute a benchmark $\alpha_{\mathbf{D} \prec \mathbf{X}} = 0.81$, again a significant improvement over the naïve $\alpha_d = 0$ benchmark.
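Both benchmark values can be reproduced from figures already given: the no-skill baselines come from the Event Frequency table ($f_{\mathbf{D} \prec \mathbf{A}} = 0.187$, $f_{\mathbf{D} \prec \mathbf{X}} = 0.167$) and the observed frequencies from the Cyentia-derived estimates (the function name is ours):

```python
# Reproducing the two worked benchmarks above from the document's own
# numbers: baselines f_d from the event frequency table, observed
# frequencies f_obs from the Cyentia figures.

def alpha(f_obs: float, f_d: float) -> float:
    """Skill metric from Discriminating Skill and Luck."""
    return (f_obs - f_d) / (1 - f_d)

alpha_d_prec_a = alpha(0.95, 0.187)   # D precedes A
alpha_d_prec_x = alpha(0.844, 0.167)  # D precedes X

assert round(alpha_d_prec_a, 2) == 0.94
assert round(alpha_d_prec_x, 2) == 0.81
```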