In this post, I demonstrate how to analyze an ARFF file to find out how frequently a given feature occurs in a certain class. For simplicity, I assume only binary features, i.e., attributes of type NUMERIC whose value is either 0 or 1.
The numbers in the comments are examples from my dataset.
Find out the column number of an attribute
Determine the first line containing an @ATTRIBUTE declaration:
grep -n @ATTRIBUTE data.arff | head -1 | cut -d : -f 1 # 8
Determine the line containing the desired feature (HasAtLeastOne_auch):
grep -n '@ATTRIBUTE.*HasAtLeastOne_auch' data.arff | head -1 | cut -d : -f 1 # 191 (pattern quoted so the shell cannot expand the *)
This means that the desired feature is in column 191 - 8 + 1 = 184, so there are 183 columns before it:
echo "191 - 8 +1 - 1" | bc # 183
Of course, this whole calculation can be done automatically in a script.
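A minimal sketch of such a script could look like the following. The function name columns_before is made up for illustration. It assumes one @ATTRIBUTE declaration per line, no blank or comment lines between the declarations, and an unquoted attribute name without regex metacharacters.

```shell
#!/bin/sh
# Print how many attribute columns precede the named attribute in an ARFF file.
# Usage: columns_before FILE ATTRIBUTE_NAME
# Hypothetical helper; assumes one @ATTRIBUTE declaration per line with
# nothing else in between, and an unquoted attribute name.
columns_before() {
    file=$1
    name=$2
    # Line number of the first @ATTRIBUTE declaration (keyword is case-insensitive)
    first=$(grep -n -i '^@attribute' "$file" | head -n 1 | cut -d : -f 1)
    # Line number of the declaration of the requested attribute
    target=$(grep -n -i "^@attribute[[:space:]]\{1,\}${name}[[:space:]]" "$file" \
        | head -n 1 | cut -d : -f 1)
    if [ -z "$first" ] || [ -z "$target" ]; then
        echo "attribute not found: $name" >&2
        return 1
    fi
    # Same arithmetic as above: (target - first + 1) - 1
    echo $((target - first))
}
```

Called as columns_before data.arff HasAtLeastOne_auch, this performs the same subtraction as the bc call above, so line numbers 8 and 191 yield 183.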
Counting the per-class occurrences of that attribute
This is where the simplifying assumption kicks in: the following expressions expect every column to contain either a 0 or a 1. They match a sequence of 183 0's or 1's, each followed by a comma, and then a 1 (presence) or a 0 (absence). We expect the class label (A or B) to be the last entry on each line. (-E makes grep accept POSIX extended regular expressions, EREs.)
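# Count presence of feature for classes A and B
grep -E '^([01],){183}1.*A' data.arff | wc -l # 757
grep -E '^([01],){183}1.*B' data.arff | wc -l # 196
# Share of class A among the instances that have the feature
# (bc -l enables fractional output; plain bc would print the integer 79)
echo "100 * 757 / (757+196)" | bc -l # ≈ 79.43 [%]
# Count absence of feature for classes A and B
grep -E '^([01],){183}0.*A' data.arff | wc -l # 740
grep -E '^([01],){183}0.*B' data.arff | wc -l # 96
# Share of class A among the instances that lack the feature
echo "100 * 740 / (740+96)" | bc -l # ≈ 88.52 [%]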
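As an alternative to the four grep calls, the same counts could be collected in one pass with awk. This is only a sketch, not what I ran above: the function name class_counts is made up, and it assumes comma-separated data rows, the class label in the last field, and Unix line endings.

```shell
#!/bin/sh
# Print, for each class label, how often the feature in a given column is 1 or not.
# Usage: class_counts FILE COLUMN   (COLUMN is 1-based, e.g. 184)
# Hypothetical helper; assumes the class label is the last comma-separated field.
class_counts() {
    awk -F, -v col="$2" '
        # Skip the header: ARFF comments (%), declarations (@...) and blank lines
        /^[%@]/ || NF == 0 { next }
        {
            label = $NF                      # class label is the last field
            if ($col == 1) present[label]++  # feature present
            else           absent[label]++   # feature absent
            seen[label] = 1
        }
        END {
            for (label in seen)
                printf "%s present=%d absent=%d\n", label, present[label], absent[label]
        }
    ' "$1" | sort   # awk iterates arrays in unspecified order, so sort by label
}
```

For the feature above, the call would be class_counts data.arff 184. Unlike the regular expressions, this variant does not rely on the preceding 183 columns being binary.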