Calibrating patterns in structured data: No easy answers here.

“No easy answers” . . . Hey, that’s a title that’s pure anti-clickbait, a veritable kryptonite for social media . . .

Anyway, here’s the story. Adam Przedniczek writes:

I am trying to devise new statistical tests, or tune up existing ones, for assessing the rate of occurrence of some larger compound structures, but the trickiest part is taking into account their substructures and building blocks.

To make it as simple as possible, let’s say we are particularly interested in a test for enrichment or over-representation of given structures, e.g. quadruples, over two groups. Everything is clearly depicted in this document.

And here the doubts arise: I have a strong suspicion that I should take into consideration their inner structure and constituent pairs. In the attachment I show such an adjustment for enrichment of pairs, but I don’t know how to properly extend this approach to larger (more compound) structures.

Hey—this looks like a fun probability problem! (Readers: click on the above link if you haven’t done so already.) The general problem reminds me of things I’ve seen in social networks, where people summarize a network by statistics such as the diameter, the number of open and closed triplets, the number of loops and disconnected components, etc.
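To make those network summaries concrete, here’s a minimal sketch in Python using the networkx library. The karate-club graph is just a stand-in for whatever network you actually have; nothing here comes from Adam’s problem.

```python
import networkx as nx

G = nx.karate_club_graph()  # toy placeholder for your actual network

# The diameter is only defined on a connected graph, so use the largest component.
largest = G.subgraph(max(nx.connected_components(G), key=len))

summaries = {
    "nodes": G.number_of_nodes(),
    "edges": G.number_of_edges(),
    "connected components": nx.number_connected_components(G),
    "diameter of largest component": nx.diameter(largest),
    # Each triangle (closed triplet) appears three times in the per-node counts.
    "closed triplets (triangles)": sum(nx.triangles(G).values()) // 3,
    # transitivity = 3 * triangles / (open + closed triplets)
    "transitivity": nx.transitivity(G),
}

for name, value in summaries.items():
    print(f"{name}: {value}")
```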

My quick answer is that there are two completely different ways to approach the problem. It’s not clear which is best; I guess it could make sense to do both.

The first approach is with a generative model. The advantage of the generative model is that you can answer any question you’d like. The disadvantage is that with structured dependence, it can be really hard to come up with a generative model that captures the features of the data you care about. With network data, researchers are still playing around with variants of that horribly oversimplified Erdos-Renyi model of complete independence. Generative modeling can be a great way to learn, but any particular generative model can be a trap if there are important aspects of the data it does not capture.
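Here’s a hedged sketch of what that generative route might look like in Python: fit the Erdos-Renyi model to an observed graph (its only parameter is the edge probability), simulate replicate graphs from it, and check whether a statistic you care about, say the triangle count, is reproduced. The data and the statistic are placeholders, not Adam’s.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

observed = nx.karate_club_graph()   # placeholder for real data
n = observed.number_of_nodes()
p_hat = nx.density(observed)        # fitted Erdos-Renyi edge probability

def triangle_count(G):
    return sum(nx.triangles(G).values()) // 3

# Simulate replicate graphs from the fitted model and compare the statistic.
simulated = [
    triangle_count(nx.gnp_random_graph(n, p_hat, seed=int(rng.integers(10**9))))
    for _ in range(200)
]

print("observed triangles:", triangle_count(observed))
print("simulated mean, sd:", round(np.mean(simulated), 1), round(np.std(simulated), 1))
# A big discrepancy is the model telling you it misses structure you care about.
```

Real social networks typically show far more clustering than the complete-independence model produces, which is exactly the trap: the model will answer every question you ask, just not necessarily about your data.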

The second approach is more phenomenological: you compare different groups using the raw data and then do some sort of permutation testing or bootstrapping to get a sense of the variation in your summary statistics. This approach has its own problems, though, in that you need to decide how to do the permutations or sampling. Complete randomness can give misleading answers, and there’s a whole literature, with no good answers, on how to bootstrap or perform permutation tests on time series, spatial, and network data. Indeed, when you get right down to it, a permutation test or a bootstrapping rule corresponds to a sampling model, and that gets you close to the difficulties of generative models that we’ve already discussed.
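As a sketch of that second route, here is a generic two-group permutation test in Python. The per-unit counts are invented placeholders for whatever structure counts you have, and permuting the group labels wholesale is itself an assumption (complete exchangeability across units), which is just the difficulty described above.

```python
import numpy as np

rng = np.random.default_rng(1)

group_a = np.array([4, 7, 3, 6, 5, 8])  # e.g., quadruple counts per unit, group A
group_b = np.array([2, 3, 1, 4, 2, 3])  # same summary for group B

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Reshuffle the group labels many times and recompute the statistic each time.
perm_stats = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    perm_stats.append(shuffled[:n_a].mean() - shuffled[n_a:].mean())
perm_stats = np.array(perm_stats)

# Two-sided p-value: how often does a random relabeling look at least as extreme?
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.4f}")
```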

So . . . no easy answers! But, whatever procedure you do, I recommend you check it using fake-data simulation.
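For what that check might look like: simulate fake data where you know the truth, run whatever procedure you settled on, and see whether it behaves as advertised. A minimal sketch, assuming the permutation test above and a null where both groups really are the same:

```python
import numpy as np

rng = np.random.default_rng(2)

def permutation_p(a, b, n_perm=999):
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        s = rng.permutation(pooled)
        if abs(s[:len(a)].mean() - s[len(a):].mean()) >= abs(observed):
            hits += 1
    return hits / n_perm

# Fake-data check under the null: both groups drawn from the same distribution,
# so a well-behaved test should reject at roughly its nominal 5% rate.
n_sims, rejections = 500, 0
for _ in range(n_sims):
    a = rng.poisson(lam=5, size=8)
    b = rng.poisson(lam=5, size=8)
    if permutation_p(a, b) < 0.05:
        rejections += 1

print(f"false-rejection rate at alpha = 0.05: {rejections / n_sims:.3f}")
# If this is far from 0.05 (or, in a power check, if the test rarely finds a
# real difference you built in), the procedure isn't doing what you think.
```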