Tokenization and embedding of song lyrics “We will, we will…”
I know you know how it ends, but have you ever wondered what an LLM would say? Let's find out. I asked Claude Sonnet 4.5 about the embeddings, tokenization, and the probability of figuring out the lyrics for the “We will, we will…” prompt 🙂
Can you show me the tokenization, embeddings, metadata and probabilities for “We will, we will …”?
Tokenization
Any kind of machine learning works with numbers behind the scenes. The text is split into tokens, each mapped to an ID (example values below, with a runnable sketch after the notes):
| Token ID | Token Text | Type | Position | Length |
|---|---|---|---|---|
| 1234 | We | PRON | 0 | 2 |
| 5678 | will | AUX | 3 | 4 |
| 9012 | , | PUNCT | 7 | 1 |
| 1234 | we | PRON | 9 | 2 |
| 5678 | will | AUX | 12 | 4 |
– Case and diacritics may be normalized
– Tokenizer treats comma as separate token
– Repetitions are tokenized identically
– Counting the trailing ellipsis, that’s 6 tokens for 21 characters – about 3.5 characters per token.
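You can try this yourself. Here is a minimal sketch using the open-source tiktoken library – a GPT-style BPE tokenizer, so the IDs will differ from the example values above (Claude’s actual tokenizer is not public):

```python
# Tokenize the lyric with tiktoken (GPT-style BPE; IDs are illustrative,
# not Claude's -- Anthropic's tokenizer is not publicly available).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "We will, we will ..."

ids = enc.encode(text)
for token_id in ids:
    print(f"{token_id:>6}  {enc.decode([token_id])!r}")

print(f"{len(ids)} tokens for {len(text)} characters "
      f"= {len(text) / len(ids):.1f} chars/token")
```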
Through 32 transformer layers, tokens “look at” each other via attention. In this example, “will” focuses 70% of its attention on itself. This is how the model builds up context and patterns; a toy version is sketched below.
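This sketch shows scaled dot-product attention in a few lines of numpy. The tiny random matrices are purely for shape – a real model has 4,096 dimensions and learned weights – and the 70% self-focus figure above is illustrative:

```python
# Toy scaled dot-product attention over 6 tokens (random weights; a real
# model uses learned projection matrices and far larger dimensions).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # tiny embedding dim for the demo
X = rng.normal(size=(6, d))              # 6 token embeddings

Q = X @ rng.normal(size=(d, d))          # queries
K = X @ rng.normal(size=(d, d))          # keys
V = X @ rng.normal(size=(d, d))          # values

scores = Q @ K.T / np.sqrt(d)            # how much each token "looks at" the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                     # context-mixed token representations

print(weights[1].round(2))               # attention row for the second token ("will")
```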

Embeddings
Each token becomes a 4,096-dimensional vector. Six tokens = 24,576 numbers. This is where we discover meaning, relationships, context.
| Token | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 |
|---|---|---|---|---|---|---|---|---|
| we | 0.82 | -0.31 | 0.45 | 0.67 | -0.12 | 0.28 | 0.91 | -0.54 |
| will | 0.34 | 0.78 | -0.62 | 0.19 | 0.88 | -0.43 | 0.15 | 0.72 |
| context vector | 0.58 | 0.24 | -0.09 | 0.43 | 0.38 | -0.08 | 0.53 | 0.09 |
Semantic embedding properties
– Dim 1-2: Collectivity, “we” → high value
– Dim 3-4: Future modality, “will” → high value
– Dim 5-6: Determination/intention
– Dim 7-8: Emotional arousal, energy
– Dims 9–4,096: many, many more…
Stored as 4-byte floats, the embeddings are about 4,600× larger than the raw text, but they encode meaning, not just letters. Think of raw text as a name tag and embeddings as a full personality profile.
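The lookup step itself is simple; the learned table is where the “meaning” lives. A sketch in PyTorch (random weights here, so the numbers carry no semantics). Note that the “context vector” row in the table above is exactly the mean of the “we” and “will” rows – a simple mean-pooling, which this sketch also shows:

```python
# Embedding lookup sketch in PyTorch. Weights are random in this demo;
# in a trained model the embedding table encodes the semantic structure
# described above.
import torch

vocab_size, dim = 50_000, 4096           # assumed sizes for illustration
embedding = torch.nn.Embedding(vocab_size, dim)

ids = torch.tensor([1234, 5678, 9012, 1234, 5678])   # token IDs from the table
vectors = embedding(ids)                             # shape: (5, 4096)

# Mean-pooling over token vectors gives a single "context vector",
# matching how the table's context row averages "we" and "will".
context = vectors.mean(dim=0)
print(vectors.shape, context.shape)      # torch.Size([5, 4096]) torch.Size([4096])
```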
Contextual Metadata
Tokenization and embedding lead to another step: categorizing the input by e.g. syntax, sentiment, and thematic association, so the predictions can be refined with even more context (a sketch follows the table).
| Property | Value | Weight |
|---|---|---|
| POS Tags | [PRON, AUX, PUNCT] | 1.0 |
| Dependency | nsubj → aux | 0.95 |
| Sentiment Polarity | Positive(+0.72) | 0.88 |
| Arousal Level | High(0.89) | 0.91 |
| Subject | Sport/Music/Motivation | 0.85 |
| Collocation | rock you (0.94), champions (0.87) | 0.92 |
| Phrase frequency | Very high (top 0.1%) | 0.98 |
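An LLM doesn’t literally run a POS tagger – it learns these patterns implicitly – but you can reproduce this kind of metadata with spaCy (assuming the `en_core_web_sm` model is installed via `python -m spacy download en_core_web_sm`):

```python
# Reproducing the POS / dependency metadata with spaCy, an external NLP
# library (the LLM itself picks up these regularities during training).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We will, we will ...")

for token in doc:
    print(f"{token.text:<6} pos={token.pos_:<6} dep={token.dep_}")
```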

Continuation Probabilities
Tokenization and embeddings form the basis for calculating the probability of the next token, conditioned on the context, the training data, the prompt, and any other inputs. “rock” dominates because of cultural context and repetition patterns (a quick sanity check follows the table).
The model has seen this phrase thousands of times in lyrics, articles, and memes. It recognizes repetition, rhythm, and cultural context.
| Rank | Token | Probability | Logit | Perplexity |
|---|---|---|---|---|
| 1 | rock | 0.437 | 3.82 | 1.18 |
| 2 | you | 0.312 | 3.45 | 1.29 |
| 3 | we | 0.089 | 2.12 | 2.41 |
| 4 | not | 0.043 | 1.67 | 3.78 |
| 5 | never | 0.029 | 1.23 | 4.92 |
| 6 | always | 0.021 | 0.98 | 6.14 |
| 7 | fight | 0.018 | 0.87 | 7.32 |
| 8 | stand | 0.014 | 0.71 | 8.91 |
| 9 | march | 0.011 | 0.58 | 10.5 |
| 10 | sing | 0.009 | 0.43 | 12.8 |
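You can sanity-check the table yourself: applying a softmax to just the ten logits shown reproduces the listed probabilities to within rounding (the small drift comes from the real softmax running over the full vocabulary, not just the top 10):

```python
# Softmax over the ten logits from the table: p_i = exp(l_i) / sum_j exp(l_j).
import math

logits = {"rock": 3.82, "you": 3.45, "we": 2.12, "not": 1.67, "never": 1.23,
          "always": 0.98, "fight": 0.87, "stand": 0.71, "march": 0.58, "sing": 0.43}

total = sum(math.exp(v) for v in logits.values())
for token, logit in logits.items():
    print(f"{token:<7} p = {math.exp(logit) / total:.3f}")   # rock -> ~0.436
```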

Most Valuable Features for Prompts
Features discovered through this analysis that drive strong predictions in the outputs.
| Feature | Importance Score | Use Case |
|---|---|---|
| Structural repetition | 0.94 | Chorus generation |
| Modal verb “will” | 0.91 | Expressing determination |
| Pronoun “we” | 0.89 | Collective actions |
| Rhythmic phrase | 0.87 | Song lyrics/chants |
| Emotional energy | 0.85 | Motivational content |
| Association with “rock you” | 0.82 | Pop culture references |
Cosine Similarity to Other Phrases
Semantic closeness of related music phrases and motivational slogans, measured as the cosine of the angle between their embedding vectors. (The sine formula you may have seen, value[i] += sin(position / 10000^(i/4096)), is something different: sinusoidal positional encoding, which adds position information to each embedding dimension. A sketch of both follows the table.)
| Phrase | Cosine Similarity | Context |
|---|---|---|
| we are the champions | 0.91 | Queen, sports |
| never give up | 0.84 | Motivation |
| fight for your right | 0.79 | Rock, rebellion |
| we shall overcome | 0.77 | Anthems, struggle |
| yes we can | 0.73 | Politics, unity |
| here we go | 0.68 | Cheerleading |
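Both calculations fit in a few lines. The phrase vectors below are random stand-ins – in practice you would embed each phrase with a sentence-embedding model first – and the positional-encoding function follows the simplified sin-only formula quoted above:

```python
import numpy as np

# Cosine similarity: the normalized dot product between two vectors.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sinusoidal positional encoding per the simplified formula above
# (real transformers alternate sin and cos across dimensions).
def positional_encoding(position: int, dim: int = 4096) -> np.ndarray:
    i = np.arange(dim)
    return np.sin(position / 10000 ** (i / dim))

rng = np.random.default_rng(42)
we_will = rng.normal(size=4096)                     # hypothetical phrase vector
champions = we_will + 0.4 * rng.normal(size=4096)   # a deliberately nearby vector

print(round(cosine_similarity(we_will, champions), 2))   # high => semantically close
print(positional_encoding(1)[:4])                        # position info per dimension
```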

Summary
Tokenization and embedding of song lyrics is a nice example because you can predict the outcome with common sense. Those two steps are the most crucial part: clean, proper, well-maintained data is the MVP in all of this. We need predictions and data we can trust. Now we know a bit more about how AI transforms text into all those embeddings to derive value and understanding: from token IDs to embeddings and probability distributions, each step adds meaning and context.
Tokenization is just the beginning. In the background (backend? 🙂) the AI transforms simple text into massive vectors, applies attention algorithms and sinusoidal positional encodings, and predicts based on patterns it knows.
The LLM predicts “rock” not by chance but because of learned patterns: cultural, rhythmic, and semantic associations. Keep that in mind when you prompt!
When you prompt “Queen” you may get any queen, but “rock Queen” will be a lot more precise. This is why context and providing specific input are so crucial.


