P317 Code Entropy

I took an Information Theory class back in 2013. Here's a synopsis of my final project.

Entropy is a measure of disorder. In the hard sciences, it quantifies the number of possible states a system can be in. In information theory, entropy is quantified in a similar way, as the amount of information gained by observing the state of a system. A static system has zero entropy; a fair coin flip has an entropy of exactly 1 bit.
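
As a quick sanity check, here's a minimal sketch (using only the standard math module) of how the fair-coin case works out:

from math import log2

# A fair coin has two equally likely outcomes.
p_heads, p_tails = 0.5, 0.5

# Shannon entropy: H = -sum(p * log2(p)) over all outcomes.
H = -(p_heads * log2(p_heads) + p_tails * log2(p_tails))
print(H)  # 1.0 bit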

Finding the entropy of natural language has certainly been done before. Any string of words may be treated as a list of characters; the probability of each character can then be measured and the entropy computed with Shannon's equation. However, code is not natural language. We may type it character by character, but not every character is treated equally when the computer interprets our program. The biggest difference is that runs of alphanumeric characters group together into words. Any such word, along with most punctuation in a programming language, can be referred to as a "token": essentially any object that can be interpreted by the computer. A token is the smallest unit that is significant in code, and thus the smallest unit with which we should measure the entropy of code.

My task became clearer at this point. To find the entropy of a program, I first had to tokenize it and collect a list of all the tokens. Here's the function in charge of making that list:

import string

# Module-level state: the token currently being built and the list of all tokens seen.
tokens = []
token = ''

def tokenize(byte):
    global token
    if byte == ' ' or byte == '\n':
        # Whitespace ends the current token.
        if token != '':
            tokens.append(token)
        token = ''
    elif byte in string.punctuation:
        # Punctuation ends the current token and also counts as a token itself.
        if token != '':
            tokens.append(token)
        tokens.append(byte)
        token = ''
    else:
        # Otherwise keep accumulating characters into the current token.
        token += byte
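
For context, here is roughly how the tokenizer might be driven, one character at a time; the filename is just a placeholder:

# Hypothetical driver: feed the program to tokenize() byte by byte.
with open('program.py') as source:
    for byte in source.read():
        tokenize(byte)
tokenize('\n')  # flush any trailing token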

Once I had iterated through the entire program, the entropy of the resulting token list could be calculated from each token's empirical probability. Shannon's formula is H = -Σ p(x) log2 p(x), summed over each distinct token x; here it is as a Python function:

from math import log2

def entropy(items):
    # Shannon entropy: -sum(p * log2(p)) over each distinct item in the list.
    entropySum = 0.0
    seen = []
    for x in items:
        if x not in seen:
            probability = items.count(x) / len(items)
            entropySum += probability * log2(probability)
            seen.append(x)
    return -entropySum
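
Putting the two functions together, an end-to-end run might look like this; the sample string is purely illustrative:

# Hypothetical end-to-end run on a tiny snippet of source code.
tokens, token = [], ''  # reset the shared tokenizer state
sample = 'x = x + 1\nprint(x)\n'
for byte in sample:
    tokenize(byte)
print(tokens)           # ['x', '=', 'x', '+', '1', 'print', '(', 'x', ')']
print(entropy(tokens))  # token-level entropy, in bits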

Full source code can be found here.

It is interesting to note the varying effects that tokenizing a string can have on its entropy. If one takes the entropy of a relatively simple program without tokenizing it and compares that to the tokenized entropy, the tokenized entropy will likely be lower, because the tokenized string has fewer members than one in which every character is considered independently. However, the more complex a program gets, the greater the tokenized entropy becomes, eventually far surpassing the un-tokenized entropy. This is because the number of possible characters is far smaller than the number of possible combinations of those characters: the token vocabulary keeps growing with the program, so tokenized entropy grows roughly linearly in very long or complex code, whereas the usual character-level calculation levels off in a more logarithmic pattern as the code grows in complexity.
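
To illustrate, assuming the tokenize and entropy functions above, the two measures can be compared directly on the same source string; for a short snippet like this, the token-level value comes out lower:

# Hypothetical comparison: character-level vs. token-level entropy.
source = 'x = x + 1\nprint(x)\n'

char_entropy = entropy(list(source))   # every character treated independently

tokens, token = [], ''                 # reset the shared tokenizer state
for byte in source:
    tokenize(byte)
token_entropy = entropy(tokens)

print(char_entropy, token_entropy)     # the tokenized value is lower here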

To compare the entropy of different languages, one should use code that accomplishes the same task. For example, here's the 'reverse string' program in each of the languages I analyzed:

Brainfuck reverse string

,----- ----- [+++++ +++++ > , ----- -----]<[.<]+++++

C++ reverse string

int main() {
    std::string s;
    std::getline(std::cin, s);
    std::reverse(s.begin(), s.end());
    std::cout << s << std::endl;
    return 0;
}

Java reverse string

public static String reverseString(String s) {
    return new StringBuffer(s).reverse().toString();
}

Python reverse string

string[::-1]

Scheme reverse string

(define (string-reverse s)
  (list->string (reverse (string->list s))))

Haskell reverse string

reverse = foldl (flip (:)) []

I chose these languages to represent a broad spectrum of programming languages, though further research might widen the scope of languages analyzed. My general hypothesis was that the languages could be ordered by ascending complexity/entropy as follows: Brainfuck, Scheme, Haskell, Python, Java, C++. To test this, I analyzed code from each language across several tasks, using Rosetta Code as a database of standardized examples of each task in each programming language.
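
A rough sketch of how such a batch comparison could be scripted, assuming the tokenize and entropy functions above and placeholder filenames for the per-language samples:

# Hypothetical batch run over one task implemented in several languages.
samples = {
    'Brainfuck': 'reverse.bf',
    'C++':       'reverse.cpp',
    'Haskell':   'reverse.hs',
    'Java':      'reverse.java',
    'Python':    'reverse.py',
    'Scheme':    'reverse.scm',
}

for language, path in samples.items():
    tokens, token = [], ''              # reset the shared tokenizer state
    with open(path) as source:
        for byte in source.read():
            tokenize(byte)
    tokenize('\n')                      # flush any trailing token
    print(language, round(entropy(tokens), 2))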

Results

Token-level entropy (bits) per language and task:

Language     Hello World   Reverse String   Bubble Sort
Brainfuck    1.85          2.05             2.59
C++          4.28          4.40             5.48
Haskell      2.95          3.10             4.23
Java         4.50          3.76             4.80
Python       2.95          2.52             4.39
Scheme       2.25          3.01             3.59

The results roughly supported my hypothesis, with Brainfuck and Scheme being less complex, C++ and Java being more complex, and Python and Haskell sitting around the middle. I'd be interested to see whether this trend would continue when vast amounts of code are analyzed, including standard libraries.

And there you have it, my little foray into Information Theory. It was one of my favorite classes so I hope you found this project interesting!
