Tokenization and embedding of song lyrics “We will, we will…”
I know you know how it ends, but have you ever wondered what an LLM would say? Let's find out. I asked Claude Sonnet 4.5 about the embeddings, tokenization, and the probability of figuring out the lyrics for the “We will, we will…” prompt 🙂
Can you show me the tokenization, embeddings, metadata and probabilities for “We will, we will …”?
Tokenization
Any kind of machine learning works with numbers behind the scenes. The text is split into tokens, each mapped to an ID (example values below, with a runnable sketch after the notes):
| Token ID | Token Text | Type | Position | Length |
|---|---|---|---|---|
| 1234 | We | PRON | 0 | 2 |
| 5678 | will | AUX | 3 | 4 |
| 9012 | , | PUNCT | 7 | 1 |
| 1234 | we | PRON | 9 | 2 |
| 5678 | will | AUX | 12 | 4 |
– Case and diacritics may be normalized
– Tokenizer treats comma as separate token
– Repetitions are tokenized identically
– Counting the trailing ellipsis, that’s 6 tokens for 21 characters – about 3.5 characters per token.
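You can try this yourself. Here is a minimal sketch using the open-source tiktoken library – a GPT-style BPE tokenizer, so the IDs will differ from the example values above (Claude’s actual tokenizer is not public):

```python
# Tokenize the lyric with tiktoken (GPT-style BPE; IDs are illustrative,
# not Claude's -- Anthropic's tokenizer is not publicly available).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "We will, we will ..."

ids = enc.encode(text)
for token_id in ids:
    print(f"{token_id:>6}  {enc.decode([token_id])!r}")

print(f"{len(ids)} tokens for {len(text)} characters "
      f"= {len(text) / len(ids):.1f} chars/token")
```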
Through 32 transformer layers, tokens “look at” each other via attention. In this example, “will” focuses 70% of its attention on itself. This is how the model builds up context and patterns; a toy version is sketched below.
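This sketch shows scaled dot-product attention in a few lines of numpy. The tiny random matrices are purely for shape – a real model has 4,096 dimensions and learned weights – and the 70% self-focus figure above is illustrative:

```python
# Toy scaled dot-product attention over 6 tokens (random weights; a real
# model uses learned projection matrices and far larger dimensions).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # tiny embedding dim for the demo
X = rng.normal(size=(6, d))              # 6 token embeddings

Q = X @ rng.normal(size=(d, d))          # queries
K = X @ rng.normal(size=(d, d))          # keys
V = X @ rng.normal(size=(d, d))          # values

scores = Q @ K.T / np.sqrt(d)            # how much each token "looks at" the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                     # context-mixed token representations

print(weights[1].round(2))               # attention row for the second token ("will")
```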

Embeddings
Each token becomes a 4,096-dimensional vector. Six tokens = 24,576 numbers. This is where we discover meaning, relationships, context.
| Token | Dim 1 | Dim 2 | Dim 3 | Dim 4 | Dim 5 | Dim 6 | Dim 7 | Dim 8 |
|---|---|---|---|---|---|---|---|---|
| we | 0.82 | -0.31 | 0.45 | 0.67 | -0.12 | 0.28 | 0.91 | -0.54 |
| will | 0.34 | 0.78 | -0.62 | 0.19 | 0.88 | -0.43 | 0.15 | 0.72 |
| context vector | 0.58 | 0.24 | -0.09 | 0.43 | 0.38 | -0.08 | 0.53 | 0.09 |
Semantic embedding properties
– Dim 1-2: Collectivity, “we” → high value
– Dim 3-4: Future modality, “will” → high value
– Dim 5-6: Determination/intention
– Dim 7-8: Emotional arousal, energy
– Dims 9–4,096: many, many more…
Stored as 4-byte floats, the embeddings are about 4,600× larger than the raw text, but they encode meaning, not just letters. Think of raw text as a name tag and embeddings as a full personality profile.
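The lookup step itself is simple; the learned table is where the “meaning” lives. A sketch in PyTorch (random weights here, so the numbers carry no semantics). Note that the “context vector” row in the table above is exactly the mean of the “we” and “will” rows – a simple mean-pooling, which this sketch also shows:

```python
# Embedding lookup sketch in PyTorch. Weights are random in this demo;
# in a trained model the embedding table encodes the semantic structure
# described above.
import torch

vocab_size, dim = 50_000, 4096           # assumed sizes for illustration
embedding = torch.nn.Embedding(vocab_size, dim)

ids = torch.tensor([1234, 5678, 9012, 1234, 5678])   # token IDs from the table
vectors = embedding(ids)                             # shape: (5, 4096)

# Mean-pooling over token vectors gives a single "context vector",
# matching how the table's context row averages "we" and "will".
context = vectors.mean(dim=0)
print(vectors.shape, context.shape)      # torch.Size([5, 4096]) torch.Size([4096])
```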
Contextual Metadata
Tokenization and embedding lead to another step: categorizing the input by e.g. syntax, sentiment, and thematic association, so the predictions can be refined with even more context (a sketch follows the table).
| Property | Value | Weight |
|---|---|---|
| POS Tags | [PRON, AUX, PUNCT] | 1.0 |
| Dependency | nsubj → aux | 0.95 |
| Sentiment Polarity | Positive(+0.72) | 0.88 |
| Arousal Level | High(0.89) | 0.91 |
| Subject | Sport/Music/Motivation | 0.85 |
| Collocation | rock you (0.94), champions (0.87) | 0.92 |
| Phrase frequency | Very high (top 0.1%) | 0.98 |
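An LLM doesn’t literally run a POS tagger – it learns these patterns implicitly – but you can reproduce this kind of metadata with spaCy (assuming the `en_core_web_sm` model is installed via `python -m spacy download en_core_web_sm`):

```python
# Reproducing the POS / dependency metadata with spaCy, an external NLP
# library (the LLM itself picks up these regularities during training).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We will, we will ...")

for token in doc:
    print(f"{token.text:<6} pos={token.pos_:<6} dep={token.dep_}")
```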

Continuation Probabilities
Tokenization and embeddings form the basis for calculating the probability of the next token, conditioned on the context, the training data, the prompt, and any other inputs. “rock” dominates because of cultural context and repetition patterns (a quick sanity check follows the table).
The model has seen this phrase thousands of times in lyrics, articles, and memes. It recognizes repetition, rhythm, and cultural context.
| Rank | Token | Probability | Logit | Perplexity |
|---|---|---|---|---|
| 1 | rock | 0.437 | 3.82 | 1.18 |
| 2 | you | 0.312 | 3.45 | 1.29 |
| 3 | we | 0.089 | 2.12 | 2.41 |
| 4 | not | 0.043 | 1.67 | 3.78 |
| 5 | never | 0.029 | 1.23 | 4.92 |
| 6 | always | 0.021 | 0.98 | 6.14 |
| 7 | fight | 0.018 | 0.87 | 7.32 |
| 8 | stand | 0.014 | 0.71 | 8.91 |
| 9 | march | 0.011 | 0.58 | 10.5 |
| 10 | sing | 0.009 | 0.43 | 12.8 |
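You can sanity-check the table yourself: applying a softmax to just the ten logits shown reproduces the listed probabilities to within rounding (the small drift comes from the real softmax running over the full vocabulary, not just the top 10):

```python
# Softmax over the ten logits from the table: p_i = exp(l_i) / sum_j exp(l_j).
import math

logits = {"rock": 3.82, "you": 3.45, "we": 2.12, "not": 1.67, "never": 1.23,
          "always": 0.98, "fight": 0.87, "stand": 0.71, "march": 0.58, "sing": 0.43}

total = sum(math.exp(v) for v in logits.values())
for token, logit in logits.items():
    print(f"{token:<7} p = {math.exp(logit) / total:.3f}")   # rock -> ~0.436
```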

Most Valuable Features for Prompts
Features discovered through this analysis that drive strong predictions in the outputs.
| Feature | Importance Score | Use Case |
|---|---|---|
| Structural repetition | 0.94 | Chorus generation |
| Modal verb “will” | 0.91 | Expressing determination |
| Pronoun “we” | 0.89 | Collective actions |
| Rhythmic phrase | 0.87 | Song lyrics/chants |
| Emotional energy | 0.85 | Motivational content |
| Association with “rock you” | 0.82 | Pop culture references |
Cosine Similarity to Other Phrases
Semantic closeness of related music phrases and motivational slogans, measured as the cosine of the angle between their embedding vectors. (The sine formula you may have seen, value[i] += sin(position / 10000^(i/4096)), is something different: sinusoidal positional encoding, which adds position information to each embedding dimension. A sketch of both follows the table.)
| Phrase | Cosine Similarity | Context |
|---|---|---|
| we are the champions | 0.91 | Queen, sports |
| never give up | 0.84 | Motivation |
| fight for your right | 0.79 | Rock, rebellion |
| we shall overcome | 0.77 | Anthems, struggle |
| yes we can | 0.73 | Politics, unity |
| here we go | 0.68 | Cheerleading |
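Both calculations fit in a few lines. The phrase vectors below are random stand-ins – in practice you would embed each phrase with a sentence-embedding model first – and the positional-encoding function follows the simplified sin-only formula quoted above:

```python
import numpy as np

# Cosine similarity: the normalized dot product between two vectors.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sinusoidal positional encoding per the simplified formula above
# (real transformers alternate sin and cos across dimensions).
def positional_encoding(position: int, dim: int = 4096) -> np.ndarray:
    i = np.arange(dim)
    return np.sin(position / 10000 ** (i / dim))

rng = np.random.default_rng(42)
we_will = rng.normal(size=4096)                     # hypothetical phrase vector
champions = we_will + 0.4 * rng.normal(size=4096)   # a deliberately nearby vector

print(round(cosine_similarity(we_will, champions), 2))   # high => semantically close
print(positional_encoding(1)[:4])                        # position info per dimension
```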

Summary
Tokenization and embedding of song lyrics is a nice example because you can predict the outcome with common sense. Those two steps are the most crucial part: clean, proper, well-maintained data is the MVP in all of this. We need predictions and data we can trust. Now we know a bit more about how AI transforms text into all those embeddings to derive value and understanding: from token IDs to embeddings and probability distributions, each step adds meaning and context.
Tokenization is just the beginning. In the background (backend? 🙂) the AI transforms simple text into massive vectors, applies attention algorithms and sinusoidal positional encodings, and predicts based on patterns it knows.
The LLM predicts “rock” not by chance but because of learned patterns: cultural, rhythmic, and semantic associations. Keep that in mind when you prompt!
When you prompt “Queen” you may get any queen, but “rock Queen” will be a lot more precise. This is why context and providing specific input are so crucial.


