Minimal Key Set is NP hard

It usually gives us a chuckle when we find some natural and seemingly easy data science question is NP-hard. For instance we have written that variable pruning is NP-hard when one insists on finding a minimal sized set of variables.

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.

Problem: Minimum Cardinality Primary Key

Instance: Vectors x1 through xm elements of {0,1}n and positive integer k.

Question: Is there a “primary key” of size no more then k? That is: is there a subset P of {1, …, n} with |P| ≤ k such that for any integers a, b with 1 ≤ a < b ≤ n we can find an j in P such that xa(j) doesn’t equal xb(j) (i.e. xa and xb differ at some index named in P, and hence can be distinguished or “told apart”).

Now the standard reference on NP-hardness (Garey and Johnson, Computers and Intractability, Freeman, 1979) does have some NP-hard database examples (such as SR26 Minimum Cardinality Key). However the stated formulations are a bit hard to decode, so we will relate the above problem directly to a more accessible problem: SP8 Hitting Set.

Problem: SP8 Hitting Set

Instance: Collection C of subsets of a finite set S, positive integer K ≤ |S|.

Question: Is there a subset S contained in S with |S| ≤ K such that S contains at least one element from each subset in C?

The idea is: SP8 is thought to be difficult to solve, so if we show how Minimum Cardinality Primary Key could be used to easily solve SP8 this is then evidence Minimum Cardinality Primary Key is also difficult to solve.

So suppose we have an arbitrary instance of SP8 in front of us. Without loss of generality assume S = {1, …, n}, C = {C1, …, Cm}, and all of the Ci are non-empty and distinct.

We build an instance of the Minimum Cardinality Primary Key problem by defining a table with columns named s1 through sn plus d1 through dm.

Now we define the rows of our table:

  • Let r0 be the row of all zeros.
  • For i from 1 to m let zi be the row with zi(di) = 1 and all other columns equal to zero.
  • For i from 1 to m let xi be the row with xi(di) = 1, xi(sj) = 1 for all j in Ci, and all other columns equal to zero.

Now let’s look at what sets of columns form primary keys for the collection of rows r0, zi, xi.

We must have all of di in P, as each di is the unique index of the only difference between zi and r0. Also, for any i we must have a j such that zi(dj)=1 and j in Ci, as if there were none we could not tell zi from xi (as they differ only in indices named by Ci).

This lets us confirm a good primary key set P is such that S = {j | sj in P} is itself a good hitting set for the SP8 problem. And for any hitting set S we have P = {sj | j in S} union {di, … dm} is a good solution for the Minimum Cardinality Primary Key problem (the di allow us to distinguish r0 from zi, the zi from themselves, r0 from xi, and the xi from them selves; the set hitting property lets us distinguish zi from the corresponding xi, completing the unique keying of rows by the chosen column set). And the solution sizes are always such that |P| = |S’| + m.

So: if we had a method to solve arbitrary instances of the Minimum Cardinality Primary Key problem, we could then use it to solve arbitrary instances of the SP8 Hitting Set Problem. We would just re-encode the SP8 problem as described above, solve the Minimum Cardinality Primary Key problem, and use the strong correspondence between solutions to these two problems to map the solution back to the SP8 problem. Thus the Minimum Cardinality Primary Key problem is itself NP-hard.

What made the problem hard was, as is quite common: the solution size constraint. Without that constraint the problem is trivial. The set of all columns either forms a primary key or does not, and it is simple calculation to check that. As with the variable pruning problem we can even try step-wise deleting columns to explore subsets of columns that are also primary table keys, moving us to a non-redundant key set (but possibly not of minimal size).