I have a file with tabular data and I want to find out if the range of possible values in one column is affected by the value of another column. Here is how I did it in Python 2.4.
This is related to relational database normalization and the code below could be modified to specifically find functional dependencies for normalization. My goal was a little more general, however. I wanted to know any time the range of possible values for one column given another column's value is a proper subset of the possible values for the column when no other column values are fixed.
I have two data structures. One is a tuple of sets of possible values for each column. The second is a set of the rows (each a tuple).
possible_values = (set(), set(), set(), set(), set()) rows = set()
(Note, I've hard-coded the length of the possible_values tuple to 5 to match my current data.)
Next I load in the data — it's in a whitespace-delimited file, one row per line:
for line in file("data.txt"): row = tuple(line.strip().split()) for col_i in range(len(row)): possible_values[col_i].add(row[col_i]) rows.add(row)
Now here's my function for finding whether one column's value restricts the possible values of another.
def find_dependencies(col_i, col_j): for i_value in possible_values[col_i]: j_values = set() for row in rows: if row[col_i] == i_value: j_values.add(row[col_j]) if j_values < possible_values[col_j]: yield i_value, j_values
It goes through each possible value in the i column and finds out if fixing that reduces the possible values in the j column.
Notice that it makes use of < as a set operator for proper subset. It's a generator too. It will yield any value in the i column that restricts the values in the j column, along with what the restriction is.
What if you want to check the dependency, not between just two columns but two groups of columns?
The solution turned out to involve a couple of cool modifications which I'll save for a followup post.