I posed the question there: what if you want to check the dependency not just between two columns, but between two groups of columns?
Here is the original function for reference:
    def find_dependencies(col_i, col_j):
        for i_value in possible_values[col_i]:
            j_values = set()
            for row in rows:
                if row[col_i] == i_value:
                    j_values.add(row[col_j])
            if j_values < possible_values[col_j]:
                yield i_value, j_values
and here is a modified version that takes two sequences of column indices (rather than two column indices):
    def find_dependencies_2(cols_i, cols_j):
        for i_value in cartesian_product(non_contig_slice(possible_values, cols_i)):
            j_values = set()
            for row in rows:
                if non_contig_slice(row, cols_i) == i_value:
                    j_values.add(non_contig_slice(row, cols_j))
            if j_values < set(cartesian_product(non_contig_slice(possible_values, cols_j))):
                yield i_value, j_values
So find_dependencies_2((0,1,2), (3,4)) yields each tuple of values from columns 0, 1 and 2 that restricts the values the tuple formed from columns 3 and 4 can take, together with the restricted set of those values.
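What was interesting in writing it is that I merely needed to change each row[col] lookup to non_contig_slice(row, cols) and each possible_values[col] to cartesian_product(non_contig_slice(possible_values, cols)) (wrapped in set() for the subset comparison), where cartesian_product is defined as: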
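    def cartesian_product(sets, done=()):
        if sets:
            # Take each element of the first set in turn, then recurse on the rest.
            for element in sets[0]:
                for tup in cartesian_product(sets[1:], done + (element,)):
                    yield tup
        else:
            # No sets left: the accumulated tuple is one complete combination.
            yield done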
and non_contig_slice is defined as:
    def non_contig_slice(seq, indices):
        # Build a tuple of the items of seq at the given (not necessarily adjacent) indices.
        result = ()
        for i in indices:
            result += (seq[i],)
        return result
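To see the pieces working together, here is a small self-contained sketch. The sample rows are made up for illustration, and I am assuming possible_values is a list holding, for each column, the set of values seen in that column. Note that cartesian_product iterates over sets[0], the elements of the first set, rather than over sets itself.

```python
# Hypothetical sample data: each row is a tuple of four column values.
rows = [
    ("a", "x", 1, "p"),
    ("a", "x", 1, "q"),
    ("a", "y", 2, "p"),
    ("b", "x", 2, "q"),
    ("b", "y", 1, "p"),
    ("b", "y", 1, "q"),
]

# Assumption: possible_values[c] is the set of values seen in column c.
possible_values = [set(column) for column in zip(*rows)]


def cartesian_product(sets, done=()):
    if sets:
        for element in sets[0]:
            for tup in cartesian_product(sets[1:], done + (element,)):
                yield tup
    else:
        yield done


def non_contig_slice(seq, indices):
    return tuple(seq[i] for i in indices)


def find_dependencies_2(cols_i, cols_j):
    for i_value in cartesian_product(non_contig_slice(possible_values, cols_i)):
        j_values = set()
        for row in rows:
            if non_contig_slice(row, cols_i) == i_value:
                j_values.add(non_contig_slice(row, cols_j))
        if j_values < set(cartesian_product(non_contig_slice(possible_values, cols_j))):
            yield i_value, j_values


# Columns 0 and 1 together pin down column 2 completely in this data.
deps_to_col2 = dict(find_dependencies_2((0, 1), (2,)))

# Columns 0 and 1 only restrict column 3 for some combinations.
deps_to_col3 = dict(find_dependencies_2((0, 1), (3,)))
```

One thing worth knowing: a combination from cols_i that never occurs in rows would be reported with an empty set, because the empty set is a proper subset of any non-empty set.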
Successive applications of find_dependencies_2 with different combinations of column indices can be used to determine what dependencies exist between columns in tabular data.
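As a rough sketch of what those successive applications might look like: the scan_dependencies driver below is hypothetical, not something from the original code. It tries every pair of disjoint column groups up to a given size. For self-containment it passes rows and possible_values in as arguments and uses itertools.product in place of cartesian_product; the sample rows are invented.

```python
from itertools import combinations, product


def non_contig_slice(seq, indices):
    return tuple(seq[i] for i in indices)


def find_dependencies_2(rows, possible_values, cols_i, cols_j):
    # Same logic as the version in this post, with the data passed in
    # explicitly and itertools.product standing in for cartesian_product.
    all_j_values = set(product(*non_contig_slice(possible_values, cols_j)))
    for i_value in product(*non_contig_slice(possible_values, cols_i)):
        j_values = {
            non_contig_slice(row, cols_j)
            for row in rows
            if non_contig_slice(row, cols_i) == i_value
        }
        if j_values < all_j_values:
            yield i_value, j_values


def scan_dependencies(rows, max_group_size=2):
    # Hypothetical driver: yields (cols_i, cols_j, dependencies) for every
    # pair of disjoint column groups that shows at least one dependency.
    num_cols = len(rows[0])
    possible_values = [set(column) for column in zip(*rows)]
    groups = [
        group
        for size in range(1, max_group_size + 1)
        for group in combinations(range(num_cols), size)
    ]
    for cols_i in groups:
        for cols_j in groups:
            if set(cols_i) & set(cols_j):
                continue  # skip overlapping groups
            found = dict(find_dependencies_2(rows, possible_values, cols_i, cols_j))
            if found:
                yield cols_i, cols_j, found


# Invented sample data with three columns.
sample_rows = [
    ("a", "x", 1),
    ("a", "y", 1),
    ("b", "x", 1),
    ("b", "y", 2),
]

single_column_deps = list(scan_dependencies(sample_rows, max_group_size=1))
```

The number of group pairs grows quickly with the number of columns and the group size, so in practice you would probably want to cap max_group_size or prune groups early.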
More on that soon.
The original post was in the category: python but I'm still in the process of migrating categories over.