Language Model Watermarking Theory & Code

An exploration of Kirchenbauer et al.'s LM detection algorithm

January 8, 2024 Anno Domini

Purpose Statement

This article will explore the watermarking algorithm described in the 2023 Kirchenbauer et al. paper, which proposes a method to allow easy identification of language-model-generated text. I shall provide code to implement the algorithm and derive some of the statistics that determine the degree of certainty for a classification. The code herein is written for ease of understanding, not optimality.

Introduction

Language models, especially those able to compose coherent and original text from a prompt, can be used maliciously for various purposes, such as academic dishonesty or mass internet propaganda. Moreover, widespread use of such language models pollutes the blind data-scraping sources used in future language-model training regimens, because a model could ingest a large amount of non-human-generated content, leading to a situation where a new language model is trained on the output of past language models.

It is hence a useful ability to classify a text as having been human written or language-model written; for thereby could a cheater be identified, propaganda be filtered, and text datasets be cleaned. Kirchenbauer et al. provide this ability through their watermarking algorithm.

The watermarking algorithm is implemented at generation time for an auto-regressive language model, that is, a model that generates one token (word or word segment) at a time based on all previous tokens. The algorithm embeds a pattern in the generated text so that the pattern can be detected later to distinguish between language-model-generated and human-generated text. It functions by randomly creating green and red lists of tokens, then promoting the use of green-list tokens during generation. Whereas a human-generated text is expected to have an even split of green and red tokens, a text generated by the language model will contain a disproportionate number of green tokens. Hence, the ratio of green tokens to total tokens in a text indicates the source of composition.

The watermarking algorithm is shown to produce reliably detectable generations without intolerable harm to the quality of the generated text. Of course, this is not a perfect solution, because the algorithm must be employed at the generation stage, meaning that text generated without the algorithm cannot be detected. Still, the algorithm has its use cases, and exploring it involves some teachable tidbits.

Language Model Setup

Before we can implement watermarking, we need to have an auto-regressive language model in our code with which to generate text. We shall leverage Hugging Face in Python for ease of setup. If you have not already, you will need to install the Hugging Face transformers package for Python. Additionally, we will be using PyTorch, which also can be installed as a Python package.

First, let us make our imports and initialize our language model, which here will be GPT-2:

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Import factorial
from math import factorial as fac

# for random sampler
import numpy as np

# for random number generator
import random

# Pytorch functions
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

This language model continues an input text. Each token is one from the total vocabulary, which has size 50,257 for GPT-2. As input, the model takes a sequence of tokens, something like [30, 4801, 1289], and then returns a vector of size 50,257. Each of these 50,257 indices contains a logit value, which is a number indicating the likelihood that the corresponding token should be generated next. The higher a logit value, the more favored the token is when determining the next token to append to the sequence. For example, if the logit at index 1169 is the highest, then the token corresponding with that number ("the" in this case) would be the most favored token to generate next.

We here create a function to return the logit values for the next token-to-generate:

def next_logits(input_ids):
    '''Given an input sequence of tokens,
    return the logits for the next token'''
    return model(input_ids.unsqueeze(0)).logits[0, -1]

In Hugging Face, .logits gives us more than we actually want (a logit vector for every position in the sequence, with a leading batch dimension), which is why we index the model's output down to the logits for just the final position.

Now, let us create an input using our tokenizer, which translates plain text to token values, and then get the next token by feeding the tokens into the model:

# Get tokens in PyTorch tensor
tokens = tokenizer("Love is my",
                   return_tensors = "pt").input_ids[0]

# Get logits for next token
logits = next_logits(tokens)

# Get text to append to sequence
next_token = logits.argmax()
next_text = tokenizer.decode(next_token)

print(next_text) # Out: " favorite"

This code gave us "Love is my favorite". We could then loop this process to have the model continue the sequence by more than just one token.

Watermarking

Every token, under the watermarking method, is assigned either to be green or red (i.e. not green). However, each token is not assigned universally to be one of the two. Instead, a token is either green or red with respect to the previous token. While apple, for example, may be green after ate, it could be red after the.

The following code randomly assigns a token to be green or red with a 50% split given the previous token:

def is_green(token, previous):
    '''Determines if a token is green
    given the previous token'''

    # The same input will always return
    # the same output
    random.seed(f"{token}+{previous}")

    if random.random() > 0.5:
        return False
    else:
        return True

I use random.random to generate a number between 0 and 1. If the number is greater than 0.5 (which occurs half of the time), then the token is not green (i.e. red); else, the token is green. The random.seed call ensures that the result of random.random will always be the same for a given token-previous-token pair. The seed is set by concatenating the token ID with + and then with the previous token ID.
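
As a quick sanity check, the snippet below (restating is_green in compressed form so that it stands alone) confirms both that the assignment is deterministic and that it colors roughly half of a sample of token IDs green:

```python
import random

def is_green(token, previous):
    '''Determines if a token is green
    given the previous token'''
    # Seeding makes the same pair always
    # yield the same answer
    random.seed(f"{token}+{previous}")
    return random.random() <= 0.5

# Determinism: the same pair gives the same color every time
assert is_green(1169, 30) == is_green(1169, 30)

# Roughly half of a sample of token IDs should be green
# with respect to a fixed previous token
greens = sum(is_green(t, 30) for t in range(10_000))
print(greens / 10_000)  # close to 0.5
```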

Now that we can classify each token as either green or red, we can boost the generation of green tokens. This is done by adding some value to each green token's corresponding logit value before picking which token to generate next. Remember, a logit is the value associated with the favorability of generating a certain token next; and by increasing each green token's logit value by some constant δ, we thus increase the favorability of generating green tokens.

Here is code for a function to get logits with a watermark applied, that is, with the logit values of green tokens boosted by δ:

def get_watermarked_logits(input_ids, delta):
    '''Given an input sequence of tokens and
    a delta, return the watermarked logits
    for the next token
    '''
    # Get non-watermarked logits
    logits = next_logits(input_ids)

    # Only grab the top 100 tokens for speed
    for i in logits.topk(100)[1]:
        # Boost green tokens
        if is_green(i, input_ids[-1]):
            logits[i] += delta

    return logits

Properly, we would apply the watermark to the entire logits vector. However, in order to save computation time, I have coded this only to boost the green tokens that are within the 100 highest logit-valued tokens.

The function below, given a prompt and a watermarking δ, uses our language model to generate a specified number of watermarked tokens:

def generate(prompt, delta, num_tokens):
    '''Given a text prompt, a watermarking
    delta, and a number of tokens, generate
    num_tokens watermarked tokens with the
    language model'''

    # Get token prompt in PyTorch tensor as ids
    tokens = tokenizer(prompt,
                       return_tensors = "pt").input_ids[0]

    # Get number of tokens in prompt
    prompt_len = len(tokens)

    # Ensure generations stay consistent for testing
    np.random.seed(31415926)

    # Generate num_tokens tokens
    for _ in range(num_tokens):

        # Get logits for next token
        logits = get_watermarked_logits(tokens, delta)

        # Sample token to append to sequence;
        # renormalize in float64 so the probabilities
        # sum to 1 within NumPy's tolerance
        probs = torch.softmax(logits, 0)
        probs = probs.detach().numpy().astype("float64")
        probs /= probs.sum()
        next_token = np.random.choice(
            len(probs), p = probs)

        # Append generated token to sequence 
        tokens = torch.concat(
            (
                tokens, 
                torch.Tensor([next_token])
            )).type(torch.int64)

    # Return only the generated tokens
    return tokenizer.decode(tokens[prompt_len:])

The call to torch.softmax converts the vector of logits, wherein each token has some value associated with it, to a probability distribution, wherein each token has a probability associated with it and all the probabilities together sum to 1.
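
To make the conversion concrete, here is a minimal softmax on a made-up three-value logit vector, written with plain math.exp rather than PyTorch so the arithmetic is visible:

```python
from math import exp

logits = [2.0, 1.0, 0.5]  # made-up logit values

# Softmax: exponentiate each logit, then divide by the
# total so the results form a probability distribution
exps = [exp(v) for v in logits]
total = sum(exps)
probs = [e / total for e in exps]

print(probs)       # each value in (0, 1), order preserved
print(sum(probs))  # 1.0 (up to floating-point rounding)
```

Note that softmax preserves ordering: the token with the highest logit receives the highest probability.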

Setting δ to 0 results in non-watermarked generation, and setting it high results in generation of only green tokens. Calling generate("Love is my", 0, 30), which does not apply watermarking because δ is 0, we get

Love is my third trip to the show. The second quarter was comedienne Savannah Williams got herself into a fight with Carey Mulligan, and that didn't turn out

Calling generate("Love is my", 100, 30), which has so high a δ that only green tokens should be generated, we get

Love is my name," said the man. "I lost those babies today when it happens."\n\nThe woman eventually agreed to have an inquiry into the incident.

We now need a way to detect the watermark, for which we shall compose the next two functions, which count the total number of tokens and number of green tokens respectively.

def count_tokens(text):
    '''Counts the number of tokens in a
    text'''
    return len(tokenizer(text).input_ids)

def count_green(prompt, text):
    '''Counts the number of green tokens
    in a text'''

    # Get tokenized input
    tokens = tokenizer(text).input_ids

    # Include last prompt token if a prompt was input
    tokens = tokenizer(prompt).input_ids[-1:] + tokens

    # Get total number of tokens
    num_tokens = len(tokens)

    # Count number of green tokens
    num_green = 0
    for i in range(1, num_tokens):
        if is_green(tokens[i], tokens[i-1]):
            num_green += 1

    return num_green

For count_green, we need to pass in the prompt because whether or not the first generated token is green is determined by the last token in the prompt.

Let us see the results of these functions on our example generations above:

# With no watermark
print(count_tokens(
    generate("Love is my", 0, 30)))
print(count_green("Love is my",
    generate("Love is my", 0, 30)))
# Out [1]: 30
# Out [2]: 11

# With watermark
print(count_tokens(
    generate("Love is my", 100, 30)))
print(count_green("Love is my",
    generate("Love is my", 100, 30)))
# Out [3]: 30
# Out [4]: 30

The generation without the watermark produced a bit less than half of its tokens green. The generation with the watermark, for which we set δ so high that we should see almost all green tokens, indeed generated only green tokens. This clues us off to the fact that the language model was used for this second generation.

Detection Statistics

How many green tokens can we find in a text before we conclude that the language model generated it? Statistics can help us to answer this question.

Since we set the probability of a token being green to 50%, we expect that half of all tokens in a given non-watermarked text will be green. This is to say, a human writing a document without the watermarked language model will randomly compose the document with about half green and half red tokens.

Intuitively, if we inspect a text and find that about half or less of the tokens are green, then we can conclude that the watermarked language model was not used; but if we find that significantly more tokens than half are green, then we have reason to suspect that the watermarked model was used to compose the text.

Setup

Let us consider the probability that exactly k of the tokens are green in a non-watermarked text with 100 tokens. We can assume that the color of each of the 100 tokens is statistically independent of the other 99. This means that each token has a 50% chance of being green, regardless of the color of the others.

In this situation we have 100 trials (100 chances for a token to be green) and each trial is independent and identically distributed, meaning that the outcome of one token being green does not affect another being green and each of the 100 tokens has the same chance of being green, in this case a 50% chance. Such situations are modeled with a binomial distribution, here with n = 100 and p = 0.5. The chance of getting k green tokens in a text of n tokens is thus given by

P(k) = (n choose k) * p^k * (1 - p)^(n - k)

(n choose k) is read "en choose kay", and it is an integer value equaling the number of ways k items can be chosen from a set of size n. This is to say that

(n choose k) = n! / (k! * (n - k)!)
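
As a check on that factorial formula, the snippet below computes a binomial coefficient with it and compares the result against Python's built-in math.comb:

```python
from math import factorial as fac, comb

def n_choose_k(n, k):
    '''Number of ways to choose k items
    from a set of size n'''
    # Integer division: the result is always a whole number
    return fac(n) // (fac(k) * fac(n - k))

print(n_choose_k(100, 2))  # 4950, i.e. 100 * 99 / 2
print(n_choose_k(100, 2) == comb(100, 2))  # True
```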

Remembering that we imported the factorial function as fac from the math library earlier, let us now implement the binomial distribution as a function:

def binomial(k,n,p):
    '''For a binomial with parameters n,p,
    returns the probability of k successes'''
    n_c_k = fac(n) / (fac(k) * fac(n - k))
    return  n_c_k * p**k * (1 - p) ** (n - k)

If we want the probability that k successes occur in naturally composed text, i.e. that k green tokens are generated in a text with n tokens that was written without the watermarked language model, we can use this function. More formally, we have this: Assuming that a text of n tokens was composed without knowledge of the watermark, the probability that k green tokens be present in the text is binomial(k,n,0.5).
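
Two quick checks on this (restating binomial so the snippet stands alone): the probabilities over all k should sum to 1, since some number of green tokens between 0 and n must occur; and, when p is 0.5, each probability should match the exact count-based form comb(n, k) / 2**n:

```python
from math import factorial as fac, comb

def binomial(k, n, p):
    '''For a binomial with parameters n,p,
    returns the probability of k successes'''
    n_c_k = fac(n) / (fac(k) * fac(n - k))
    return n_c_k * p**k * (1 - p) ** (n - k)

n = 100

# The outcomes 0, 1, ..., n green tokens cover every
# possibility, so their probabilities must total 1
total = sum(binomial(k, n, 0.5) for k in range(n + 1))
print(total)  # 1.0, up to floating-point error

# With p = 0.5, every token sequence coloring is equally
# likely, so P(k) is just the count of favorable colorings
# over the 2**n possible ones
print(abs(binomial(60, n, 0.5) - comb(n, 60) / 2**n) < 1e-12)  # True
```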

Another measure to have is the probability of x or more green tokens in a non-watermarked text. We calculate this with the following function:

def cbinomial(x, n, p):
    '''For a binomial with parameters n,p,
    returns the probability of x successes
    or more'''

    total_prob = 0
    for k in range(x, n + 1):

        # Get the probability of k successes
        n_c_k = fac(n) / (fac(k) * fac(n - k))
        prob_k = n_c_k * p**k * (1 - p) ** (n - k)

        # Add the probability to the total
        total_prob += prob_k

    return total_prob

The formal statement corresponding to this function is this: Assuming that a text of n tokens was composed without knowledge of the watermark, the probability that k or more green tokens be present in the text is cbinomial(k,n,0.5).

Detection threshold

What we want is a statement saying, For a text of n tokens, we conclude it to have been written by the watermarked language model if k or more tokens are green. Our final statement will depend on the level of certainty that we want for a positive classification, which says that the text was composed with the watermarked language model. For if we were trying to throw out language-model-generated text from a corpus before training, then we would be fine if some human-generated texts were thrown out as well, meaning that we could allow a lower level of certainty for positive classifications; but if we were working to prove that a student used the language model to cheat on a final paper, we should need to be extra certain of a positive classification, because we would hate to fail a student who really did write his paper, only to be dealt an unfortunate false positive by our system.

We can use cbinomial(k, n, 0.5) to see how certain we can be of a positive classification. If we find that k or more green tokens appearing has a probability of 0.1, that means that one in ten human-composed texts would have k or more green tokens. Therefore, if we set the threshold number of green tokens to this same k, we would mistakenly classify one in ten human-composed texts as having been composed by the watermarked language model. This could be fine for some situations but would definitely yield too high a false positive rate when trying to identify cheaters on a final paper.
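
This threshold-picking logic can be sketched as a small search: given a tolerable false positive rate (called alpha here, a name I introduce for illustration), find the smallest k whose tail probability falls below it. The sketch uses math.comb rather than the factorial-based cbinomial above so that it stands alone:

```python
from math import comb

def tail(x, n, p=0.5):
    '''Probability of x or more successes for
    a binomial with parameters n, p'''
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(x, n + 1))

def pick_threshold(n, alpha):
    '''Smallest green-token count whose tail
    probability falls below alpha'''
    for x in range(n + 1):
        if tail(x, n) < alpha:
            return x

# Demand a false positive rate below one in a billion
print(pick_threshold(100, 1e-9))  # 80
```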

This code finds, for different k, the probability of k or more green tokens in a text with 100 tokens:

print(cbinomial(50, 100, 0.5))
# Out [1]: 0.5397946186935894

print(cbinomial(51, 100, 0.5))
# Out [2]: 0.46020538130641064

print(cbinomial(55, 100, 0.5))
# Out [3]: 0.1841008086633481

print(cbinomial(60, 100, 0.5))
# Out [4]: 0.028443966820490392

print(cbinomial(70, 100, 0.5))
# Out [5]: 3.925069822796834e-05

print(cbinomial(75, 100, 0.5))
# Out [6]: 2.818141017102701e-07

print(cbinomial(80, 100, 0.5))
# Out [7]: 5.579544528625976e-10

If we pick a threshold of 80 green tokens, then we would only be classifying 5.58e-8 percent of all human-composed texts as having been composed with the watermarked language model. This means that we would only falsely accuse about 1 in every 1.8 billion honest students of having cheated, were we to use this threshold in the aforementioned cheat-detector scenario. Therefore, we use this statement for the cheat-detector scenario: For a text of 100 tokens, we conclude it to have been written by the watermarked language model if 80 or more tokens are green.
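
The one-in-1.8-billion figure comes from inverting that tail probability; here is a quick check of the arithmetic using math.comb:

```python
from math import comb

n = 100

# Probability that 80 or more of 100 tokens come up green
# purely by chance: favorable colorings over 2**n total
p = sum(comb(n, k) for k in range(80, n + 1)) / 2**n
print(p)  # about 5.58e-10

# One false accusation per this many honest texts
print(round(1 / p))  # about 1.8 billion
```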

I have a poem and I want to see if it was composed by the watermarked language model. I run the following code to see:

poem_text = '''There was a man who spoke to itching ears,
To those who sought to prosper in the earth;
He said, A man should stay away from tears
And learn to keep a life of balanced mirth.
For in a life of labor one can find
A meaningful endeavor, he would say;
And play, in moderation, serves in rhyme:
For labors merit joy throughout the day.
Yet though he led his flock beneath the sun
For meaning and to leave what's more behind,
His teaching, in the end, came well undone,
When drunk upon his vanity he died.
  In life, he hadn't shared an ounce of freedom;
  But, in his death, he left a word of wisdom.'''

# We do not know any prompt that was used
# so pass in empty prompt
num_green = count_green("", poem_text)

# Subtract one from total token count because
# we do not know whether first is green
num_tokens = count_tokens(poem_text) - 1


print(cbinomial(num_green, num_tokens, 0.5))
# Out [1] : 0.8681021272009385

The result of 86.8% does not allow me to say that the poem was composed by the watermarked language model, because we expect 86.8% of human-composed texts of the same size to have as many or more green tokens as the poem.

The reason for choosing a high threshold is to ensure, beyond reasonable doubt, that a text was indeed composed by the watermarked language model. We may still want a lower threshold, though, because higher thresholds, while avoiding false positives, also create more false negatives, which falsely classify a watermarked text as not having been composed by the watermarked language model.

Conclusion

With the functions coded and discussed above, we have the tools to watermark a language model, to count the green, i.e. marked, tokens in a given text, and to make statistical inferences about a text based on the expected number of green tokens and the number of green tokens actually present. Not covered was determining the number of false negatives to expect.

Determining how many false negatives we will have is a little trickier, not only because we need to consider the threshold that we set, but also because the number of green tokens that will appear in a watermarked text depends on the selected value of δ. For now, I leave this as something to be learned from the original paper.