Optimal Package Installation
Splitting software into packages and defining dependencies between them is a good thing. It simplifies security updates and enables code reuse. However, it comes with a whole set of problems, collectively known as dependency hell. All package managers have to deal with it; some of them get it right (e.g., Meteor, or libsolv as used by Zypper and DNF) and many of them do not (e.g., pip). In this post I want to explore an optimal way to install a desired package and its requirements.
Approach without Versions
We start with a simple example. It is Saturday evening, the weather is awful and so you decide to install a new computer game. Because games are complex software products, they usually depend on multiple libraries. These libraries may also depend on libraries and so on. The resulting relations can be visualized by the following graph:
Each node represents a package (i.e., a game, program, or library) and an arrow from A to B means “A depends on B to be runnable”. I left out the concept of versions here, but even without them you can see the following inconvenient properties of the graph:
- one package can require multiple packages
- one package can be required by multiple packages
- cyclic dependencies can occur
- even for a simple game the graph can get quite “large” (you do not want to see the graph for bigger and more complex projects)
There is a good and a bad thing about this: The good thing is that the set of required packages can easily be determined by a simple [breadth-first traversal](https://en.wikipedia.org/wiki/Breadth-first_search) (do not forget to keep a record of the visited nodes, because this graph is not a tree). The bad thing is: It gets more complicated when introducing versions.
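To make the traversal concrete, here is a minimal sketch in Python. The package names are taken from the example; the edges are my own guess at the pictured graph and purely illustrative:

```python
from collections import deque

# Illustrative dependency graph: package name -> direct dependencies.
# The edges are assumptions, not the exact graph from the figure.
DEPENDENCIES = {
    "game": ["GUI lib", "graphics lib"],
    "GUI lib": ["graphics lib", "resourceloader"],
    "graphics lib": ["framebuffer"],
    "resourceloader": [],
    "framebuffer": [],
}

def required_packages(root):
    """Collect all transitive dependencies of `root` breadth-first.

    The `visited` set guards against cycles, since the graph is not a tree.
    """
    visited = {root}
    queue = deque([root])
    while queue:
        package = queue.popleft()
        for dependency in DEPENDENCIES[package]:
            if dependency not in visited:
                visited.add(dependency)
                queue.append(dependency)
    return visited

print(sorted(required_packages("game")))
```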
A Graph-like Structure
Sadly, not all packages out there use Semantic Versioning. If you do not, do the world a favour and start using it! Now! Anyway, this misbehaviour means that we need an abstract version concept, so a version \(v \in V\) can basically be anything. Even using colours as version tags is allowed. For the following example I stay with \(V = \mathbb{N}_0 \times \mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}^2\). We also assume that at most one version of a package can be installed. This seems legitimate because most package managers out there work like this; that is why some package names include the major version, e.g., `python2` on Debian. For our example we now get more than one node for `GUI lib`:
Now we continue with the requirements. Each package version can have its own set of requirements (aka dependencies), drawn from an abstract requirement space \(R\). Because these requirements are not standardized either, we have to define an abstract function that checks whether a version satisfies a requirement:
$$ \text{check} \colon R \times V \rightarrow \{0,1\} $$
For our example this would be:
$$ \begin{array}{lcl} V &=& \mathbb{N}_0 \times \mathbb{N}_0 \\ R &=& (\mathbb{N}_0 \times \mathbb{N}_0) \times (\mathbb{N}_0 \times \mathbb{N}_0) \\ \text{check}\big(((r_{a,\text{begin}}, r_{b,\text{begin}}), (r_{a,\text{end}}, r_{b,\text{end}})), (v_a, v_b)\big) &=& [r_{a,\text{begin}} < v_a < r_{a,\text{end}}] \\ & & \vee\, [r_{a,\text{begin}} = v_a] \wedge [r_{b,\text{begin}} \leq v_b] \\ & & \vee\, [r_{a,\text{end}} = v_a] \wedge [v_b < r_{b,\text{end}}] \end{array} $$

That is, a requirement is a half-open interval \([\text{begin}, \text{end})\) of versions, and \(\text{check}\) tests whether the version lies in it under lexicographic ordering.
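Python’s lexicographic tuple comparison implements this interval check directly; a minimal sketch:

```python
# A version is a (major, minor) pair; a requirement is a half-open
# interval [begin, end) of versions. Python compares tuples
# lexicographically, which matches the case analysis above.
def check(requirement, version):
    """Return True if `version` satisfies `requirement`."""
    begin, end = requirement
    return begin <= version < end

# E.g., a hypothetical requirement ">= 1.0 and < 2.0":
assert check(((1, 0), (2, 0)), (1, 1))
assert not check(((1, 0), (2, 0)), (2, 0))
```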
This might lead to a situation where multiple versions of a dependency satisfy a requirement. In the real world this is often true for minor versions. We express this by an OR-connection:
As seen in the simple example without versions, a package can have multiple dependencies and they all have to be satisfied. This works like a logical AND:
As a last part we need an entry point that says “I want to install `game 1.0`”:
Now we can put everything together:
As you might recognize, I made a small modification by uniting the outgoing connections of similar versions. This deduplication can be done whenever two versions have the same requirement set. It not only helps to simplify the graph, but also keeps its transformation shorter. In the real world you can have many patch/bugfix releases of a package that do not differ in their requirements.
I also want to mention that AND- or OR-nodes can end up with only one outgoing connection. In this case they can be ignored, just like \(\{\sum,\prod,\bigcup,\bigcap,\ldots\}\) applied to a single element.
Pseudo-Boolean Optimization
We now step back for a minute and focus on a more theoretical topic: Pseudo-Boolean Optimization (PBO). You may know SAT and SMT; pseudo-boolean problems sit somewhere in between. Imagine a finite set of boolean variables \(V\). They are called pseudo-boolean because their boolean state is encoded as a natural number, so that \(\text{true} \equiv 1\) and \(\text{false} \equiv 0\). For a concrete problem they have to satisfy constraints of the following form:
$$ a_1 \cdot v_1 + a_2 \cdot v_2 + \ldots + a_n \cdot v_n \,\{=, \geq\}\, x $$ $$ \begin{array}{ll} a_1, a_2, \ldots, a_n, x &\in \mathbb{Q} \\ v_1, v_2, \ldots, v_n &\in V \end{array} $$
In contrast to SMT problems, the constraints are plain linear (in)equations. Because there can be many (sometimes exponentially many) solutions to such a problem, you can additionally define an optimization objective:
$$ \min b_1 \cdot v_1 + b_2 \cdot v_2 + \ldots + b_m \cdot v_m $$ $$ \begin{array}{ll} b_1, b_2, \ldots, b_m &\in \mathbb{Q} \\ v_1, v_2, \ldots, v_m &\in V \end{array} $$
There is a standard format for encoding these problems, the OPB format used by the Pseudo-Boolean Competitions. I have experimented with two solvers: MiniSat+ and SAT4J Pseudo. While the first one took ages to generate a solution, I was very satisfied with the latter. You might want to take a look at the Pseudo-Boolean Competition 2012 and the upcoming Pseudo-Boolean Evaluation 2015 for a list of available solvers and their performance.
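As an illustration, here is a small sketch that writes such an instance in the OPB text format. Note that the format requires integer coefficients, so rational ones would have to be scaled first; the helper names are mine:

```python
def opb_terms(terms):
    """Render [(coefficient, variable index), ...] as '+1 x1 -2 x2 ...'."""
    return " ".join(f"{c:+d} x{i}" for c, i in terms)

def write_opb(path, objective, constraints, num_vars):
    """Write a pseudo-boolean instance in OPB format.

    objective   -- list of (coefficient, variable index) terms to minimize
    constraints -- list of (terms, relation, bound), relation in {">=", "="}
    """
    with open(path, "w") as f:
        f.write(f"* #variable= {num_vars} #constraint= {len(constraints)}\n")
        f.write(f"min: {opb_terms(objective)} ;\n")
        for terms, relation, bound in constraints:
            f.write(f"{opb_terms(terms)} {relation} {bound} ;\n")

# Example: minimize v1 + 2*v2 subject to v1 + v2 >= 1.
write_opb("example.opb", [(1, 1), (2, 2)], [([(1, 1), (1, 2)], ">=", 1)], 2)
```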
Graph Transformation
Now that we have solvers for this class of problems, let us talk about transforming the graph-like structure into a PBO instance. The following listing presents the building blocks of the graph and the constraints that represent them. Afterwards there will be a short discussion of the optimization objective.
Normal Requirement
$$ (-1) \cdot n_1 + (1) \cdot n_2 \geq 0 $$
AND Node
$$ (-N) \cdot n_{\cap} + (1) \cdot n_1 + (1) \cdot n_2 + \ldots + (1) \cdot n_N \geq 0 $$
OR Node
$$ (-1) \cdot n_{\cup} + (1) \cdot n_1 + (1) \cdot n_2 + \ldots + (1) \cdot n_N \geq 0 $$
Maximum One Package Version
$$ (-1) \cdot n_1 + (-1) \cdot n_2 + \ldots + (-1) \cdot n_N \geq -1 $$
Initial Requirement
$$ (1) \cdot n \geq 1 $$
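Putting the listing into code: a sketch that generates the constraints above in the `(terms, relation, bound)` shape used by `write_opb` from the previous section. The integer variable indices stand for graph nodes; the mapping from nodes to indices is assumed to exist:

```python
def requires(n1, n2):
    """Normal requirement: installing n1 forces n2 (n1 -> n2)."""
    return ([(-1, n1), (1, n2)], ">=", 0)

def and_node(parent, children):
    """AND node: if parent is selected, all N children must be selected."""
    return ([(-len(children), parent)] + [(1, c) for c in children], ">=", 0)

def or_node(parent, children):
    """OR node: if parent is selected, at least one child must be selected."""
    return ([(-1, parent)] + [(1, c) for c in children], ">=", 0)

def at_most_one_version(versions):
    """At most one version of a package may be installed."""
    return ([(-1, v) for v in versions], ">=", -1)

def initial(n):
    """Entry point: this node must be installed."""
    return ([(1, n)], ">=", 1)
```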
Optimization Objective
Using the given constraints, there might be far too many possible solutions to the problem; their number can grow exponentially in the number of included packages. For an automatic package installation process, the system should emit exactly one solution. To achieve this we need an optimization function. Abstractly speaking, for every package \(p \in P\) and every version \(v \in V\) we need a number expressing how “bad” the installation of this version would be:
$$ \text{cost} \colon V \times P \rightarrow \mathbb{N}_0 $$
For package installations, the following non-exhaustive list of properties of the cost function is desirable:
- every package costs ⇒ install a minimal set of packages
- older versions have higher cost than newer versions ⇒ progress
- packages with known security problems have very high cost ⇒ security
Please note that the version numbers from the example (and most version numbers out there) are not order-embeddable in \(\mathbb{N}\): between two major versions there can be arbitrarily many minor versions. This makes satisfying the second property harder. There exist at least the following two workarounds:
- assume a maximum major and minor version
- order all versions of a package and use the position in this timeline as the score
You might want to tune the second approach to increase the cost difference between major versions.
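A minimal sketch of this timeline-based scoring, with a hypothetical extra penalty per major-version step back (the concrete numbers are tuning knobs, not from the post):

```python
MAJOR_PENALTY = 10  # hypothetical tuning knob per major-version step back

def costs(versions):
    """Map each (major, minor) version to a cost; the newest is cheapest."""
    ordered = sorted(versions, reverse=True)  # newest first
    newest_major = ordered[0][0]
    return {
        v: 1 + rank + MAJOR_PENALTY * (newest_major - v[0])
        for rank, v in enumerate(ordered)
    }

print(costs([(1, 0), (1, 1), (2, 0)]))
# {(2, 0): 1, (1, 1): 12, (1, 0): 13}
```

Every version costs at least 1, so installing fewer packages is always cheaper; the resulting numbers become the coefficients of the objective terms.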
Using a well-designed cost function, we can now write down the optimization objective:
$$ \min \begin{array}{lllll} &\phantom{+}\; \text{cost}(v_1, p_1) \cdot n_{1,1} &+\, \text{cost}(v_2, p_1) \cdot n_{2,1} &+ \ldots &+\, \text{cost}(v_{N_1}, p_1) \cdot n_{N_1,1} \\ &+\, \text{cost}(v_1, p_2) \cdot n_{1,2} &+\, \text{cost}(v_2, p_2) \cdot n_{2,2} &+ \ldots &+\, \text{cost}(v_{N_2}, p_2) \cdot n_{N_2,2} \\ &\phantom{+}\; \vdots &\phantom{+}\; \vdots & &\phantom{+}\; \vdots \\ &+\, \text{cost}(v_1, p_M) \cdot n_{1,M} &+\, \text{cost}(v_2, p_M) \cdot n_{2,M} &+ \ldots &+\, \text{cost}(v_{N_M}, p_M) \cdot n_{N_M,M} \end{array} $$
Result
Finally, we now know which packages need to be installed to make the game playable:
- framebuffer 201.0
- game 1.0
- graphics lib 1.1
- GUI lib 2.0
- resourceloader 1.1
By the way: no oshacks are required, and therefore they get eliminated by the optimization function.
Should you wonder about the performance of this approach: the entire problem for Invenio produces ≈60k variables and ≈70k constraints, and even my five-year-old notebook finds a solution in less than 7 seconds.
Final Thoughts
This approach can be extended to support other packaging features, e.g., `extra_requirements` as found in the Python ecosystem. Take a look at my Extended Python Requirements Calculator (eprc) for a practical implementation of this approach. It can also be extended to enable systems where multiple versions of a package can be installed, or to deal with the fact that some packages might already be installed on the system.
Keep in mind that this approach does not automatically resolve conflicts, as for example npm does. The reason is that it does not assume any specific structure of version numbers; hence, the result of an automatic conflict resolution would be a matter of luck rather than a stable and closed set of package requirements.
There are plenty of use cases for solvers and optimizers out there. If you liked this post, visit Daniel’s blog post on Satisfiability Modulo Theories For Parallel Cooking And Other Optimizations. If you like well-written and well-researched posts (who does not?!), you might also be interested in his other posts.