James Tauber : Finding Dependencies in Tabular Data

I have a file with tabular data and I want to find out if the range of possible values in one column is affected by the value of another column. Here is how I did it in Python 2.4.

This is related to relational database normalization and the code below could be modified to specifically find functional dependencies for normalization. My goal was a little more general, however. I wanted to know any time the range of possible values for one column given another column's value is a proper subset of the possible values for the column when no other column values are fixed.

I have two data structures. One is a tuple of sets of possible values for each column. The second is a set of the rows (each a tuple).

possible_values = (set(), set(), set(), set(), set())
rows = set()

(Note, I've hard-coded the length of the possible_values tuple to 5 to match my current data.)

Next I load in the data — it's in a whitespace-delimited file, one row per line:

for line in file("data.txt"):
    row = tuple(line.strip().split())
    for col_i in range(len(row)):
        possible_values[col_i].add(row[col_i])
    rows.add(row)

Now here's my function for finding whether one column's value restricts the possible values of another.

def find_dependencies(col_i, col_j):
    for i_value in possible_values[col_i]:
        j_values = set()
        for row in rows:
            if row[col_i] == i_value:
                j_values.add(row[col_j])
        if j_values < possible_values[col_j]:
            yield i_value, j_values

It goes through each possible value in the i column and finds out if fixing that reduces the possible values in the j column.

Notice that it makes use of < as a set operator for proper subset. It's a generator too. It will yield any value in the i column that restricts the values in the j column, along with what the restriction is.

What if you want to check the dependency, not between just two columns but two groups of columns?

The solution turned out to involve a couple of cool modifications which I'll save for a followup post.