Quantifying R Package Dependency Risk

We recently commented on excess package dependencies as representing risk in the R package ecosystem.

The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practices (or at least correlates with other good pracices)?

Well, it turns out we can quantify it: each additional non-core package declared as an “Imports” or “Depends” is associated with an extra 11% relative chance of a package having an indicated issue on CRAN. At over 5 non-core “Imports” plus “Depends” a package has significantly elevated risk.

The number of dependent packages in use versus modeled issue probability can be summed up in the following graph.

Unnamed chunk 6 2

In the above graph the dashed horizontal line is the overall rate that packages have issues on CRAN. Notice the curve cross the line well before 5 on-trivial dependencies.

In fact packages importing more than 5 non-trivial dependencies have issues on CRAN at an empirical rate of 32%, above the model prediction and almost double the overall rate of 17%. Doubling a risk is considered very significant. And almost half the packages using more than 10 non-trivial dependencies have known issues on CRAN.

A very short analysis deriving the above can be found here.

Obviously we are using lack of problems on CRAN as a rough approximation for package quality, and number of non-trivial package Imports and Depends as rough proxy for package complexity. It would be interesting to quantify and control for other measures.

Our theory is the imports are not so much causing problems, but a “code smell” correlated with other package issues. We feel this is evidence that the principles that guide some package developers to prefer packages with well defined purposes and low dependencies are part of a larger set of principles that lead to higher quality software.

A table of all scored packages and their problem/risk estimate can be found here.