(This article was originally published at Statistics – John D. Cook, and syndicated at StatsBlogs.)

Suppose you measure people on two independent attributes, *X* and *Y*, and take those for whom *X*+*Y* is above some threshold. Then even though *X* and *Y* are uncorrelated in the full population, they will be *negatively* correlated in your sample.
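A minimal sketch of that selection effect, checking the correlation before and after selecting on the sum (the cutoff of 230 and the sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent attributes
x = rng.normal(100, 15, n)
y = rng.normal(100, 15, n)

# In the full population the correlation is essentially zero
r_all = np.corrcoef(x, y)[0, 1]

# Keep only pairs whose sum clears a high threshold
keep = x + y > 230
r_sel = np.corrcoef(x[keep], y[keep])[0, 1]

print(r_all)  # near zero
print(r_sel)  # clearly negative
```

Intuitively, selecting on the sum means a pair can clear the bar with a high *X* and a low *Y* or vice versa, so within the selected group one attribute tends to be high when the other is low.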

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the threshold on the sum of the two attributes increases, the correlation between the two attributes within the selected sample becomes more negative.

```python
from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [corr(x[z > low], y[z > low]) for low in span]

plt.plot(span, c)
plt.xlabel("minimum sum")
plt.ylabel("correlation coefficient")
plt.show()
```
