Finding Dependencies in Tabular Data
I have a file with tabular data and I want to find out if the range of possible values in one column is affected by the value of another column. Here is how I did it in Python 2.4.
This is related to relational database normalization and the code below could be modified to specifically find functional dependencies for normalization. My goal was a little more general, however. I wanted to know any time the range of possible values for one column given another column's value is a proper subset of the possible values for the column when no other column values are fixed.
I have two data structures. One is a tuple of sets of possible values for each column. The second is a set of the rows (each a tuple).
possible_values = (set(), set(), set(), set(), set()) rows = set()
(Note, I've hard-coded the length of the possible_values tuple to 5 to match my current data.)
Next I load in the data — it's in a whitespace-delimited file, one row per line:
for line in file("data.txt"):
row = tuple(line.strip().split())
for col_i in range(len(row)):
possible_values[col_i].add(row[col_i])
rows.add(row)
Now here's my function for finding whether one column's value restricts the possible values of another.
def find_dependencies(col_i, col_j):
for i_value in possible_values[col_i]:
j_values = set()
for row in rows:
if row[col_i] == i_value:
j_values.add(row[col_j])
if j_values < possible_values[col_j]:
yield i_value, j_values
It goes through each possible value in the i column and finds out if fixing that reduces the possible values in the j column.
Notice that it makes use of < as a set operator for proper subset. It's a generator too. It will yield any value in the i column that restricts the values in the j column, along with what the restriction is.
What if you want to check the dependency, not between just two columns but two groups of columns?
The solution turned out to involve a couple of cool modifications which I'll save for a followup post.
Comments (0)
Add a Comment
Last Modified: May 26, 2005
Author: jtauber