James Tauber : Finding Dependencies in Tabular Data, Part 2

Yesterday I wrote about some Python 2.4 code for finding out whether the range of possible values in one column of tabular data is affected by the value in another column.

There I posed the question: what if you want to check for a dependency not just between two columns, but between two groups of columns?

Here is the original function for reference:

def find_dependencies(col_i, col_j):
    for i_value in possible_values[col_i]:
        j_values = set()
        for row in rows:
            if row[col_i] == i_value:
                j_values.add(row[col_j])
        if j_values < possible_values[col_j]:
            yield i_value, j_values

and here is a modified version that takes two sequences of column indices (rather than two column indices):

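def find_dependencies_2(cols_i, cols_j):
    # Like find_dependencies, this reads the global rows and possible_values.
    # Try every combination of values the cols_i columns could take.
    for i_value in cartesian_product(non_contig_slice(possible_values, cols_i)):
        # Collect the cols_j tuples that actually occur alongside i_value.
        j_values = set()
        for row in rows:
            if non_contig_slice(row, cols_i) == i_value:
                j_values.add(non_contig_slice(row, cols_j))
        # A proper subset means i_value rules out some cols_j combinations.
        if j_values < set(cartesian_product(non_contig_slice(possible_values, cols_j))):
            yield i_value, j_values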

So find_dependencies_2((0,1,2), (3,4)) yields each tuple of values from columns 0, 1 and 2 of a row that reduces the possible values of the tuple made up of columns 3 and 4 of that row, together with the reduced set of values.
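
To make that concrete, here is a minimal sketch run against a small made-up table. The sample rows and the way possible_values is built are assumptions for illustration, and the code is written so that it also runs on Python 3:

```python
# Invented sample table: each row is (colour, size, shape).
rows = [
    ("red", "small", "circle"),
    ("red", "large", "circle"),
    ("blue", "small", "square"),
    ("blue", "large", "circle"),
    ("blue", "large", "square"),
]

# Assumption: possible_values[i] is the set of values seen in column i.
possible_values = [set(column) for column in zip(*rows)]


def non_contig_slice(seq, indices):
    result = ()
    for i in indices:
        result += (seq[i],)
    return result


def cartesian_product(sets, done=()):
    if sets:
        for element in sets[0]:
            for tup in cartesian_product(sets[1:], done + (element,)):
                yield tup
    else:
        yield done


def find_dependencies_2(cols_i, cols_j):
    for i_value in cartesian_product(non_contig_slice(possible_values, cols_i)):
        j_values = set()
        for row in rows:
            if non_contig_slice(row, cols_i) == i_value:
                j_values.add(non_contig_slice(row, cols_j))
        if j_values < set(cartesian_product(non_contig_slice(possible_values, cols_j))):
            yield i_value, j_values


# Which (colour, size) pairs restrict the shape?
found = dict(find_dependencies_2((0, 1), (2,)))
for i_value, j_values in sorted(found.items()):
    print(i_value, sorted(j_values))
# ('blue', 'small') [('square',)]
# ('red', 'large') [('circle',)]
# ('red', 'small') [('circle',)]
```

The pair ('blue', 'large') is not reported because both shapes occur with it. Note that a combination of cols_i values that never occurs in any row produces an empty j_values, which is a proper subset of any non-empty set, so it would be reported too.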

What was interesting in writing it is that I merely needed to change

row[col_n]

to

non_contig_slice(row, cols_n)

and

possible_values[col_n]

to

cartesian_product(non_contig_slice(possible_values, cols_n))

where cartesian_product is defined as:

def cartesian_product(sets, done=()):
    if sets:
        for element in sets[0]:
            for tup in cartesian_product(sets[1:], done + (element,)):
                yield tup
    else:
        yield done
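
On later Python versions (2.6 onwards) the standard library offers itertools.product, which should produce the same tuples in the same order. The following sketch compares the two on invented input; it is a cross-check, not a claim about how the function was originally tested:

```python
from itertools import product


def cartesian_product(sets, done=()):
    if sets:
        for element in sets[0]:
            for tup in cartesian_product(sets[1:], done + (element,)):
                yield tup
    else:
        yield done


# Lists are used here (rather than sets) so the iteration order is predictable.
value_lists = [["a", "b"], [1, 2], ["x"]]

recursive = list(cartesian_product(value_lists))
library = list(product(*value_lists))

print(recursive)
# [('a', 1, 'x'), ('a', 2, 'x'), ('b', 1, 'x'), ('b', 2, 'x')]
print(recursive == library)
# True
```

The done argument accumulates the tuple built so far; once no sets remain, the finished tuple is yielded.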

and non_contig_slice is defined as:

def non_contig_slice(seq, indices):
    result = ()
    for i in indices:
        result += (seq[i],)
    return result
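
A short illustration of why one helper serves both purposes: applied to a row it picks out a tuple of values, and applied to possible_values it picks out a tuple of sets, ready to be passed to cartesian_product. The row and value sets below are invented for the example:

```python
def non_contig_slice(seq, indices):
    result = ()
    for i in indices:
        result += (seq[i],)
    return result


# Invented row and per-column value sets.
row = ("red", "large", "circle", "matte", "yes")
possible_values = [
    {"red", "blue"},
    {"small", "large"},
    {"circle", "square"},
    {"matte", "gloss"},
    {"yes", "no"},
]

picked = non_contig_slice(row, (0, 2, 4))
print(picked)
# ('red', 'circle', 'yes')

# A single index still gives a tuple, so comparisons stay tuple-to-tuple.
single = non_contig_slice(row, (1,))
print(single)
# ('large',)

# The same call on possible_values returns a tuple of sets.
value_sets = non_contig_slice(possible_values, (3, 4))
```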

Successive applications of find_dependencies_2 with different combinations of column indices can be used to determine what dependencies exist between columns in tabular data.
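
One possible way to drive those successive applications is sketched below. The helpers column_groups and scan_dependencies are hypothetical names introduced here, the sample rows are invented, and itertools.combinations needs Python 2.6 or later:

```python
from itertools import combinations

# Invented sample table; possible_values[i] is the set of values in column i.
rows = [
    ("red", "small", "circle"),
    ("red", "large", "circle"),
    ("blue", "small", "square"),
    ("blue", "large", "circle"),
    ("blue", "large", "square"),
]
possible_values = [set(column) for column in zip(*rows)]


def non_contig_slice(seq, indices):
    result = ()
    for i in indices:
        result += (seq[i],)
    return result


def cartesian_product(sets, done=()):
    if sets:
        for element in sets[0]:
            for tup in cartesian_product(sets[1:], done + (element,)):
                yield tup
    else:
        yield done


def find_dependencies_2(cols_i, cols_j):
    for i_value in cartesian_product(non_contig_slice(possible_values, cols_i)):
        j_values = set()
        for row in rows:
            if non_contig_slice(row, cols_i) == i_value:
                j_values.add(non_contig_slice(row, cols_j))
        if j_values < set(cartesian_product(non_contig_slice(possible_values, cols_j))):
            yield i_value, j_values


def column_groups(n_cols, max_size):
    """Yield every group of 1..max_size column indices (hypothetical helper)."""
    for size in range(1, max_size + 1):
        for group in combinations(range(n_cols), size):
            yield group


def scan_dependencies(n_cols, max_size):
    """Try every ordered pair of disjoint column groups (hypothetical helper)."""
    for cols_i in column_groups(n_cols, max_size):
        for cols_j in column_groups(n_cols, max_size):
            if set(cols_i) & set(cols_j):
                continue  # a column cannot sit on both sides
            found = dict(find_dependencies_2(cols_i, cols_j))
            if found:
                yield cols_i, cols_j, found


results = list(scan_dependencies(len(possible_values), max_size=2))
for cols_i, cols_j, found in results:
    print(cols_i, "->", cols_j, len(found), "restricting value(s)")
```

The number of group pairs grows quickly with the number of columns and with max_size, so on wider tables it is probably worth keeping max_size small.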