Mini essays, AI

Tokenization and embedding of song lyrics “We will, we will…”

I know you know how it ends. But have you ever wondered what an LLM would say? Let's find out. I asked Claude Sonnet 4.5 about the embeddings, tokenization, and the probability of figuring out the lyrics for “We will, we will…”. The prompt 🙂

Can you show me the tokenization, embeddings, metadata and probabilities for “We will, we will …”

Tokenization

Any kind of machine learning works with numbers behind the scenes. The text is split into tokens, each mapped to an ID (example values):

| Token ID | Token Text | Type | Position | Length |
|----------|------------|-------|----------|--------|
| 1234 | We | PRON | 0 | 2 |
| 5678 | will | AUX | 3 | 4 |
| 9012 | , | PUNCT | 7 | 1 |
| 1234 | we | PRON | 9 | 2 |
| 5678 | will | AUX | 12 | 4 |

– Case sensitivity and diacritics may be normalized
– The tokenizer treats the comma as a separate token
– Repetitions are tokenized identically
– That’s 6 tokens (counting the trailing “…”) for 18 characters – about 3 characters per token.
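The step above can be sketched in a few lines. This is a toy illustration, not a real subword tokenizer (real models use learned BPE or SentencePiece vocabularies), and the IDs come from the example table above, not from any actual model:

```python
import re

# Toy vocabulary: IDs are the example values from the table above.
# Note "We" and "we" share an ID here, i.e. case is normalized.
VOCAB = {"We": 1234, "we": 1234, "will": 5678, ",": 9012}

def tokenize(text):
    """Split text into word/punctuation tokens and map each to
    (token_id, token_text, position, length). Unknown tokens get ID 0."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok = match.group()
        tokens.append((VOCAB.get(tok, 0), tok, match.start(), len(tok)))
    return tokens

for token in tokenize("We will, we will …"):
    print(token)
```

Running this reproduces the positions and lengths from the table, with the trailing “…” as a sixth, unknown token.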

Through 32 transformer layers, tokens “look at” each other. For example, “will” might put around 70% of its attention weight on itself. This is how the model builds context and recognizes patterns.
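The “looking at each other” is scaled dot-product attention. Here is a minimal sketch with toy 4-dimensional vectors (real models use thousands of dimensions and many heads; these numbers are made up for illustration):

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, dim):
    """Scaled dot-product attention: softmax(q·k / sqrt(d)) over all keys."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    return softmax(scores)

# Toy vectors standing in for the tokens "We", "will", ","
keys = [[0.8, -0.3, 0.4, 0.6],
        [0.3, 0.8, -0.6, 0.2],
        [0.1, 0.0, 0.1, -0.1]]
query = keys[1]  # "will" attending over the whole sequence
weights = attention_weights(query, keys, dim=4)
print([round(w, 3) for w in weights])  # "will" puts the most weight on itself
```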


Embeddings

Each token becomes a 4,096-dimensional vector. Six tokens = 24,576 numbers. This is where we discover meaning, relationships, context.

| Token | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| we | 0.82 | -0.31 | 0.45 | 0.67 | -0.12 | 0.28 | 0.91 | -0.54 |
| will | 0.34 | 0.78 | -0.62 | 0.19 | 0.88 | -0.43 | 0.15 | 0.72 |
| context vector | 0.58 | 0.24 | -0.09 | 0.43 | 0.38 | -0.08 | 0.53 | 0.09 |

Semantic embedding properties
– Dims 1–2: collectivity; “we” → high value
– Dims 3–4: future modality; “will” → high value
– Dims 5–6: determination/intention
– Dims 7–8: emotional arousal, energy
– …and many, many more dimensions

Embeddings are roughly 4,600× larger than the raw text, but they encode meaning, not just letters. Think of raw text as a name tag and embeddings as a full personality profile.
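Mechanically, an embedding is just a row lookup in a big learned matrix. A minimal sketch, using 8 toy dimensions instead of 4,096 and randomly initialized values (in a real model the matrix is learned during training):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 10_000, 8  # real models: ~100k tokens × ~4,096 dims

# Random stand-in for a learned embedding matrix.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(DIM)]
                    for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Map each token ID to its embedding vector (a row of the matrix)."""
    return [embedding_matrix[tid] for tid in token_ids]

# Six token IDs for "We will, we will …" (example values, incl. a toy ID
# for the ellipsis)
vectors = embed([1234, 5678, 9012, 1234, 5678, 42])
print(len(vectors), "tokens ×", len(vectors[0]), "dimensions")
# With 4,096 dimensions this would be 6 × 4,096 = 24,576 numbers.
```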

Contextual Metadata

Tokenization and embedding lead to another step: one that categorizes the input, e.g. by syntax, sentiment, and thematic association, so the model can better refine its predictions. This adds even more context.

| Property | Value | Weight |
|----------|-------|--------|
| POS Tags | [PRON, AUX, PUNCT] | 1.0 |
| Dependency | nsubj → aux | 0.95 |
| Sentiment Polarity | Positive (+0.72) | 0.88 |
| Arousal Level | High (0.89) | 0.91 |
| Subject | Sport/Music/Motivation | 0.85 |
| Collocation | rock you (0.94), champions (0.87) | 0.92 |
| Phrase frequency | Very high (top 0.1%) | 0.98 |

Continuation Probabilities

Tokenization and embeddings are the basis for calculating the probability of the next token, derived from the context, training data, prompt, and any other inputs. “Rock” dominates because of cultural context and repetition patterns.

The model has seen this phrase thousands of times in lyrics, articles, and memes. It recognizes repetition, rhythm, and cultural context.

| Rank | Token | Probability | Logit | Perplexity |
|------|-------|-------------|-------|------------|
| 1 | rock | 0.437 | 3.82 | 1.18 |
| 2 | you | 0.312 | 3.45 | 1.29 |
| 3 | we | 0.089 | 2.12 | 2.41 |
| 4 | not | 0.043 | 1.67 | 3.78 |
| 5 | never | 0.029 | 1.23 | 4.92 |
| 6 | always | 0.021 | 0.98 | 6.14 |
| 7 | fight | 0.018 | 0.87 | 7.32 |
| 8 | stand | 0.014 | 0.71 | 8.91 |
| 9 | march | 0.011 | 0.58 | 10.5 |
| 10 | sing | 0.009 | 0.43 | 12.8 |
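The probabilities in the table come from applying a softmax to the logits. A small sketch, using the table's logits (restricted to just these ten candidates, so the numbers only approximate the listed probabilities):

```python
import math

logits = {"rock": 3.82, "you": 3.45, "we": 2.12, "not": 1.67,
          "never": 1.23, "always": 0.98, "fight": 0.87,
          "stand": 0.71, "march": 0.58, "sing": 0.43}

def softmax(scores):
    """Exponentiate each logit, then normalize so probabilities sum to 1."""
    exps = {tok: math.exp(v) for tok, v in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
top = max(probs, key=probs.get)
print(top, round(probs[top], 3))  # "rock" dominates the distribution
```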

Most Valuable Features for Prompts

Features, discovered through analysis, that drive strong predictions in the output.

| Feature | Importance Score | Use Case |
|---------|------------------|----------|
| Structural repetition | 0.94 | Chorus generation |
| Modal verb “will” | 0.91 | Expressing determination |
| Pronoun “we” | 0.89 | Collective actions |
| Rhythmic phrase | 0.87 | Song lyrics/chants |
| Emotional energy | 0.85 | Motivational content |
| Association with “rock you” | 0.82 | Pop culture references |

Cosine Similarity to Other Phrases

Semantic relation (closeness) for related music phrases and motivational speeches. Cosine similarity measures the angle between two embedding vectors. Separately, position information is added to each dimension using a sinusoidal function: value[i] += sin(position / 10000^(i/4096)).

| Phrase | Cosine Similarity | Context |
|--------|-------------------|---------|
| we are the champions | 0.91 | Queen, sports |
| never give up | 0.84 | Motivation |
| fight for your right | 0.79 | Rock, rebellion |
| we shall overcome | 0.77 | Anthems, struggle |
| yes we can | 0.73 | Politics, unity |
| here we go | 0.68 | Cheerleading |
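Both calculations mentioned above fit in a few lines. The vectors below are toy values chosen for illustration, and the positional-encoding function follows the simplified formula from this post (real Transformers alternate sin and cos across dimension pairs):

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (a·b) / (|a||b|): 1.0 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def add_positional_encoding(vector, position):
    """Add sinusoidal position info: value[i] += sin(pos / 10000^(i/dim))."""
    dim = len(vector)
    return [v + math.sin(position / 10000 ** (i / dim))
            for i, v in enumerate(vector)]

# Toy embeddings standing in for "we will" and "we are the champions"
we_will = [0.82, -0.31, 0.45, 0.67]
champions = [0.70, 0.10, 0.50, 0.40]
print(round(cosine_similarity(we_will, champions), 2))  # ~0.91
```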

Summary

Tokenization and embedding of song lyrics is a nice example because you can predict the outcome with common sense. Those two steps are the most crucial part. Clean, proper, well-maintained data is the MVP in all of this: we need predictions and data we can trust. Now we know a bit more about how AI transforms text and creates all those embeddings to derive value and understanding. From token IDs to embeddings to probability distributions, each step adds meaning and context.

Tokenization is just the beginning. In the background (backend? :)) AI transforms simple text into massive vectors, applies attention algorithms and sinusoidal position functions, and predicts based on the patterns it knows.

The LLM predicts “rock” not by chance but because of learned patterns: cultural, rhythmic, and semantic associations. Keep that in mind when you prompt!
When you prompt “Queen” it could be any queen, but “rock Queen” is a lot more precise. This is why context and specific input are so crucial.

Piotr Kowalski