Determining per-class feature frequencies in ARFF files on the command line

In this post, I demonstrate how to analyze an ARFF file to find out how frequently a given feature occurs in a certain class. For simplicity, I assume only binary features, i.e., attributes of type NUMERIC whose values are either 0 or 1.

The numbers in the comments are examples from my dataset.

Find out the column number of an attribute

Determine the first line containing an @ATTRIBUTE:

grep -n @ATTRIBUTE data.arff | head -1 | cut -d : -f 1
# 8

Determine the line containing the desired feature (HasAtLeastOne_auch). The pattern is quoted so that the shell does not try to expand the * as a glob:

grep -n '@ATTRIBUTE.*HasAtLeastOne_auch' data.arff | head -1 | cut -d : -f 1
# 191

This means that the desired feature is in column 191 - 8 + 1 = 184, so there are 183 columns before it:

echo "191 - 8 +1 - 1" | bc
# 183

Of course, this whole calculation can be automated in a script.
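A minimal sketch of such a script, written as a shell function. The name columns_before is made up for this post. It assumes that every @ATTRIBUTE declaration sits on its own line with no blank or comment lines in between, and that the attribute name is unquoted and contains no regex metacharacters. Since ARFF keywords are case-insensitive, grep is called with -i.

```shell
# Hypothetical helper: prints how many columns precede the given attribute.
# Usage: columns_before FILE ATTRIBUTE_NAME
columns_before() {
    file=$1
    name=$2
    # Line number of the first @ATTRIBUTE declaration
    first=$(grep -n -i '^@ATTRIBUTE' "$file" | head -1 | cut -d : -f 1)
    # Line number of the declaration of the desired attribute
    target=$(grep -n -i "^@ATTRIBUTE[[:space:]]\{1,\}$name[[:space:]]" "$file" | head -1 | cut -d : -f 1)
    # Give up if either lookup found nothing
    [ -n "$first" ] && [ -n "$target" ] || return 1
    # Same arithmetic as above: target - first + 1 - 1
    echo $((target - first))
}
```

Called as columns_before data.arff HasAtLeastOne_auch, it should print the number that goes into the {183} repetition count used in the next section.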

Counting the per-class occurrences of that attribute

This is where the simplifying assumption kicks in: the following expressions expect every column to contain either a 0 or a 1. They match a sequence of 183 0's or 1's, each followed by a comma, and then a 1 (presence) or 0 (absence). We expect the class label (A or B) to be the last entry on each line. (The -E option makes grep accept POSIX extended regular expressions (EREs).)

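# Count presence of feature for classes A and B
grep -E '^([01],){183}1.*A' data.arff | wc -l # 757
grep -E '^([01],){183}1.*B' data.arff | wc -l # 196
# Share of class A among the instances that have the feature
# (awk instead of bc, because bc does integer division by default)
awk 'BEGIN { printf "%.2f\n", 100 * 757 / (757 + 196) }'
# 79.43 [%]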

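# Count absence of feature for classes A and B
grep -E '^([01],){183}0.*A' data.arff | wc -l # 740
grep -E '^([01],){183}0.*B' data.arff | wc -l #  96
# Share of class A among the instances that lack the feature
# (awk instead of bc, because bc does integer division by default)
awk 'BEGIN { printf "%.2f\n", 100 * 740 / (740 + 96) }'
# 88.52 [%]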
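If the binary-only assumption does not hold for every column, the same counts can probably be obtained by letting awk split each row on commas instead of matching a long regular expression. The following is only a sketch: the function name class_counts is made up, and it assumes dense (not sparse) ARFF rows, no quoted values containing commas, and the class label in the last field.

```shell
# Hypothetical helper: per-class counts of the values in one column.
# Usage: class_counts FILE COLUMN   (COLUMN is 1-based, e.g. 184)
class_counts() {
    awk -F , -v col="$2" '
        # Data rows start after the @DATA line (the keyword is case-insensitive)
        tolower($0) ~ /^@data/ { in_data = 1; next }
        # Skip the header, blank lines and % comment lines
        !in_data || /^[ \t]*(%|$)/ { next }
        # Tally "feature value,class label" pairs; the label is the last field
        { count[$col "," $NF]++ }
        # Print one line per pair: value,class,count
        END { for (key in count) print key "," count[key] }
    ' "$1" | sort
}
```

Calling class_counts data.arff 184 prints one line per combination of feature value and class in the form value,class,count, which covers presence and absence for both classes in a single pass over the file.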

 

 
