Find variables common to multiple data sets

February 6, 2013

(This article was originally published at The DO Loop, and syndicated at StatsBlogs.)

Last week the SAS Training Post blog posted a short article on an easy way to find variables in common to two data sets. The article used PROC CONTENTS (with the SHORT option) to print out the names of variables in SAS data sets so that you can visually determine whether the data sets have any variables in common. The article also mentioned using the COMPARE procedure or writing a PROC SQL query that interrogates DICTIONARY tables.

But what if you want to find variable names that are common to many data sets?

The PROC SQL approach is a programming solution, so it might be up to the challenge. A quick internet search reveals one way to use PROC SQL to find common variables in two data sets (see p. 4 of the linked paper). I am not a PROC SQL expert, but the approach in that paper seems difficult to generalize to the case of multiple data sets.

Because I like the SAS/IML language, this article shows how to find all variables that are common to multiple data sets. The following statements define six SAS data sets:

data D1 D2;
A=1; b=2; C=3; D=4; E=5; F=6; g=7; h=8; I=9; J=10;
data D3 D4;
j=1; f=2; h=3; a=4; N=7; L=6; c=7;
data D5;
J=1; D=2; A=3; g=4; h=5; P=6; q=7;
data D6;
C=1; M=2; F=3; a=4; j=5; H=6; B=7; R=8; K=9;

I would have a hard time visually determining which variables are common to all of the data sets, so I'm going to write a program. I will use two SAS/IML functions to help:

  • The CONTENTS function returns a sorted list of the variables in a SAS data set. Use the UPCASE function in Base SAS to get the names in uppercase format so that you can perform case-insensitive comparisons.
  • The XSECT function returns the intersection between two or more arrays of values.

With those two functions, you can obtain the variables names that are common to the data sets D1–D6, as follows:

proc iml;
DSNames = "D1":"D6";
InCommon = upcase(contents(DSNames[1]));     /* get all vars in D1      */
do i = 2 to ncol(DSNames);                   /* loop over data sets     */
   varNames = upcase(contents(DSNames[i]));  /* get variable names      */
   InCommon = xsect(InCommon, varNames);     /* intersect with previous */
print InCommon;

The variables that are common to all the SAS data sets are A, H, and J. If you want to generalize the problem even more, you can use the SAS/IML DATASETS function to get the names of all data sets in a library. For example, you could use DSNames = T(datasets("work")) instead of hard-coding the data set names in this example.

I invite you to submit your own solution in the comments.

tags: Data Analysis, Statistical Programming, Tips and Techniques

Please comment on the article here: The DO Loop

Tags: , , ,