NSFW: GPT-4o Tokenizer
On May 13, 2024, OpenAI released GPT-4o and its associated tokenizer, o200k_base.
In the past, we have noted that multilingual models with Simplified Chinese capabilities, or those intended for deployment in China, often include specialized tokens related to political correctness. Given the recent rumors of Apple closing in on a deal with OpenAI to integrate ChatGPT into iPhones, we believe it is an opportune time to examine GPT-4o’s readiness to meet PRC regulatory requirements. Therefore, we will conduct a quick analysis of its word tokens.
Byte Pair Encoding
OpenAI employs Byte Pair Encoding (BPE) for tokenization, a method that combines the benefits of both character-level and word-level tokenization. BPE starts with individual characters and iteratively merges the most frequent pairs of characters or character sequences to form new tokens. This process continues until a predefined vocabulary size is reached.
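To make the merge loop concrete, here is a toy character-level BPE training sketch of our own. It is not OpenAI's implementation, which operates on bytes and pre-splits text with a regex; the tiny corpus and target vocabulary size are arbitrary choices for illustration.

from collections import Counter

# Toy character-level BPE sketch; real tokenizers such as o200k_base
# work on bytes and pre-split text with a regex before merging.
corpus = ["low", "lower", "lowest", "newest", "widest"]
target_vocab_size = 20

words = [list(w) for w in corpus]          # each word starts as characters
vocab = {ch for w in words for ch in w}    # initial vocabulary: characters

while len(vocab) < target_vocab_size:
    # Count adjacent symbol pairs across the corpus.
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    if not pairs:
        break
    # Merge the most frequent pair into a single new token.
    (a, b), _ = pairs.most_common(1)[0]
    vocab.add(a + b)
    new_words = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(w[i])
                i += 1
        new_words.append(merged)
    words = new_words

print(sorted(vocab, key=len, reverse=True))

Frequent sequences like "est" and "low" end up as single tokens, which is exactly how long, frequently repeated phrases can become single tokens in a production vocabulary.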
Preprocessing
We will need to install the necessary libraries for this quick EDA.
- tiktoken: OpenAI's tokenizer library
- langdetect: a language detection library ported from Google's language-detection. Not the most modern option, but it does the job and does not require installing models such as fastText or spaCy.
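Both libraries are published on PyPI and can be installed with pip:

pip install tiktoken langdetect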
With the necessary libraries installed, we start by collecting every token id and the length of its decoded string into a standalone Python dictionary. We then sort the entries by token length, as we have done in past analyses.
import tiktoken
import langdetect
import concurrent.futures

tokenizer = tiktoken.get_encoding("o200k_base")

# Map every token id to the character length of its decoded string.
vocabs = {}
for i in range(tokenizer.n_vocab):
    try:
        vocabs[i] = len(tokenizer.decode([i]))
    except Exception:
        pass

# Sort token ids by decoded length, longest first.
vocabs_by_length = dict(sorted(vocabs.items(), key=lambda item: -item[1]))

print(f"Number of Tokens in gpt-4o's tokenizer 'o200k_base' is: {len(vocabs_by_length)}")
Output:
Number of Tokens in gpt-4o's tokenizer 'o200k_base' is: 200000
Looking at Tokens
With the tokens now in a Python dictionary, let's grab the 100 longest tokens classified as Simplified Chinese (zh-cn) and see what they are. Note that OpenAI uses BPE, so the tokens are generated directly from its training corpus.
Since langdetect runs fairly slowly, we write a helper function that classifies tokens concurrently as Simplified Chinese or not. With that language detection function in place, we iterate through all the Simplified Chinese tokens, sorted by decoded length, and print out the first 100.
def detect_language(vocab):
    # Return (token_id, token) when the decoded token is detected as Simplified Chinese.
    try:
        token = tokenizer.decode([vocab])
        if langdetect.detect(token) == "zh-cn":
            return vocab, token
    except Exception:
        pass
    return None

top_n = 0
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Submit a detection job for every token id, longest decoded tokens first.
    futures = {executor.submit(detect_language, vocab): vocab for vocab in vocabs_by_length}
    for future in concurrent.futures.as_completed(futures):
        result = future.result()
        if result:
            vocab, token = result
            print(vocab, token)
            top_n += 1
            if top_n == 100:
                break
Weird Tokens
The following is a truncated output of the above script.
185118 _日本毛片免费视频观看
116852 中国福利彩票天天
128031 久久免费热在线精品
148388 微信的天天中彩票
154809 无码不卡高清免费v
172750 大发快三大小单双
177431 给主人留下些什么吧
181679 qq的天天中彩票
184969 _日本一级特黄大片
187822 大发快三开奖结果
49649 彩神争霸邀请码
89409 免费视频在线观看
122333 无码不卡高清免费
122712 无码一区二区三区
128600 大发时时彩计划
133274 】【:】【“】【
135161 大发时时彩开奖
149168 大发时时彩怎么
150771 彩神争霸电脑版
160029 大发快三是国家
160131 大发快三是不是
176039 精品一区二区三区
186348 大发快三是什么
187516 大发快三走势图
187810 在线观看中文字幕
191179 大发快三怎么看
193825 中国特色社会主义
194062 彩神争霸是不是
Origins of These Tokens
We observe that these tokens are lengthy composite phrases commonly found on gambling and adult websites targeting PRC nationals. Now, what if we tokenize the individual components directly? Let's break the full phrase 无码不卡高清免费v down into 无码 and 不卡高清免费v.
print(f"token_id for '无码不卡高清免费v': {tokenizer.encode('无码不卡高清免费v')}")
print(f"token_id for '无码': {tokenizer.encode('无码')}\ntoken_id for '不卡高清免费v': {tokenizer.encode('不卡高清免费v')}")
print(f"token_id for '不卡': {tokenizer.encode('不卡')}\ntoken_id for '高清免费': {tokenizer.encode('高清免费')}")
Outputs:
token_id for '无码不卡高清免费v': [154809]
token_id for '无码': [9070]
token_id for '不卡高清免费v': [20652, 63642, 85]
token_id for '不卡': [20652]
token_id for '高清免费': [63642]
From this decomposition, we find that sub-phrases such as 无码 in the phrase 无码不卡高清免费v have their own individual tokens. This indicates that these long composite-phrase tokens likely result from token extension during either continual pretraining or a parallel tokenization process.
Further investigation of the tokenizer revealed additional NSFW terms commonly used on adult websites. These tokens are also composites and have high token IDs. Since BPE tokenizers are trained from a set corpus, the presence of these composite tokens suggests that they might have been added or extended through a later or parallel tokenization process on a disjoint corpus.
import tiktoken

tokenizer = tiktoken.get_encoding("o200k_base")

adult_tokens = [182974, 191391, 191547, 197701]
for t in adult_tokens:
    print(f"token_id {t} decodes into {tokenizer.decode([t])}")

Outputs:
token_id 182974 decodes into gangbang
token_id 191391 decodes into analsex
token_id 191547 decodes into JAV
token_id 197701 decodes into bbc
Implications
These tokens remind us of the SolidGoldMagikarp tokens found for GPT-3.5 in early 2023. They are potential attack vectors against GPT-4o that could elicit unexpected behaviors. There could be security implications if bad actors use these tokens to alter model and LLM system behaviors.
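As a rough first-pass audit (our own heuristic, not a method from OpenAI), one could scan the vocabulary for unusually long single tokens; here we restrict the scan to non-ASCII text to reduce noise from ordinary long English words, and the 9-character cutoff is an arbitrary choice.

import tiktoken

tokenizer = tiktoken.get_encoding("o200k_base")

# Collect single tokens whose decoded text is long and non-ASCII; these are
# candidates for closer inspection, not confirmed glitch tokens.
candidates = []
for token_id in range(tokenizer.n_vocab):
    try:
        text = tokenizer.decode([token_id])
    except Exception:
        continue
    if len(text) >= 9 and not text.isascii():
        candidates.append((token_id, text))

print(f"Found {len(candidates)} long non-ASCII single tokens")
for token_id, text in candidates[:10]:
    print(token_id, repr(text))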
@article{leehanchung,
  author = {Lee, Hanchung},
  title = {NSFW: GPT-4o Tokenizer},
  year = {2024},
  month = {05},
  howpublished = {\url{https://leehanchung.github.io}},
  url = {https://leehanchung.github.io/blogs/2024/05/09/gpt-4o-tokenizer/}
}