Yet another basic AI glossary part 1
AI & Machine Learning Glossary for Beginners
This is the base of what I need to learn to better understand all that „AI” and „LLM” talk. Feel free to go through all of it and dive deeper into those subjects. Defined here are AI concepts, ideas, math functions, slang, and anything else that might be helpful in better understanding „the whole lot”.
1. Logit
A logit is the raw output we get from a model before any activation function is applied, i.e. before softmax. During classification, each logit shows the model’s unnormalized confidence about one possible output.
2. Logit Definition (Mathematical View)
In math, logits are real numbers output by an LLM’s final layer (one per possible output token, so there are many of them). After passing through the softmax function, the logits become values between 0 and 1: probabilities that sum to 1.
3. Softmax function
Converts a vector of logits into a probability distribution. It is smooth and differentiable, which makes it convenient for training. After applying the function, each number becomes a value between 0 and 1, with all probabilities adding up to 1. The transformation itself is deterministic, so we end up with a ranked table of values with the biggest probabilities (from which the next output token is chosen).
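A minimal softmax sketch in Python (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability (the result is unchanged)
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw outputs from the final layer
probs = softmax(logits)
print(probs)        # ~[0.659 0.242 0.099] -> ranked probabilities
print(probs.sum())  # 1.0
```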
4. Temperature (in AI sampling)
Temperature controls randomness (variability) in generative LLMs. Low temperature (e.g., 0.2) makes the model choose more predictable words, giving more stable but repetitive results. High temperature (e.g., 1.0 or 1.5), on the other hand, gives the model permission to choose from more distinct suggestions, creating more varied output. We humans call it „creativity”.
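The same softmax with a temperature knob, as a sketch (values are illustrative):

```python
import numpy as np

def sample_distribution(logits, temperature=1.0):
    # Dividing logits by T sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]
print(sample_distribution(logits, 0.2))  # ~[0.993 0.007 0.000] -> predictable
print(sample_distribution(logits, 1.5))  # ~[0.557 0.286 0.157] -> flatter, more „creative”
```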
5. Top-k Sampling
Top-k sampling limits the model’s choice to the k most likely tokens according to the probability distribution, introducing controlled randomness while maintaining coherence.
6. Top-p (Nucleus Sampling)
Top-p sampling (also called nucleus sampling) selects the smallest set of tokens, starting from the token with the highest probability, whose combined probability exceeds p (e.g., 0.9). This helps keep outputs diverse but contextually grounded.
When you define both limits at once, it is a matter of which threshold gets satisfied first and ends the selection of candidate tokens, as the table and the sketch below show.
| Token | Probability | Top-k=2 | Top-p=0.8 | Cumulative Probability |
|---|---|---|---|---|
| eat | 0.40 | ✅ Yes | ✅ Yes | 0.40 |
| sleep | 0.25 | ✅ Yes | ✅ Yes | 0.65 |
| play | 0.17 | ❌ No | ✅ Yes | 0.82 |
| run | 0.13 | ❌ No | ❌ No | 0.95 |
| jump | 0.05 | ❌ No | ❌ No | 1.00 |
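Both filters from the table as a minimal numpy sketch (token probabilities copied from the table above):

```python
import numpy as np

tokens = ["eat", "sleep", "play", "run", "jump"]
probs  = np.array([0.40, 0.25, 0.17, 0.13, 0.05])

def top_k(probs, k):
    # Keep the k most likely tokens, zero out the rest, renormalize
    keep = np.argsort(probs)[::-1][:k]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p(probs, p):
    # Keep the smallest top set whose cumulative probability exceeds p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # first index where cum > p
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

print(top_k(probs, 2))   # only eat/sleep survive
print(top_p(probs, 0.8)) # eat/sleep/play survive (cumulative 0.82 > 0.8)
```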
7. Random Forest
A random forest is an ensemble of multiple decision trees. Each tree votes independently and the votes are tallied: the given majority decides the classification. This approach improves accuracy while avoiding overfitting, because predictions are averaged across trees trained on different bootstrap samples of the data (see the sketch after the diagram below).
RANDOM FOREST TREE #1 (Trained on Bootstrap Sample A)
Root Node (n=150 samples)
│
├── petal length (≤ 2.45 cm?) [gini=0.667, samples=50/150]
│ │
│ ├── YES (42 samples) → petal width (≤ 1.75 cm?) [gini=0.168]
│ │ │
│ │ ├── YES (37 samples) → setosa [100%, gini=0.0] ⭐ FINAL CLASS
│ │ └── NO (5 samples) → versicolor [93%, gini=0.124]
│ │
│ └── NO (8 samples) → petal width (≤ 1.75 cm?) [gini=0.375]
│ │
│ ├── YES (4 samples) → versicolor [100%, gini=0.0]
│ └── NO (4 samples) → virginica [100%, gini=0.0]
│
└── petal length (> 2.45 cm?) [gini=0.500, samples=100/150]
│
├── petal width (≤ 1.75 cm?) [gini=0.160, samples=54/100]
│ │
│ ├── YES (48 samples) → versicolor [95%, gini=0.095]
│ └── NO (6 samples) → virginica [100%, gini=0.0]
│
└── petal width (> 1.75 cm?) [gini=0.032, samples=46/100]
│
├── petal length (≤ 4.95 cm?) [gini=0.199, samples=24/46]
│ ├── YES (12 samples) → versicolor [92%, gini=0.160]
│ └── NO (12 samples) → virginica [100%, gini=0.0]
│
└── petal length (> 4.95 cm?) → virginica [100%, samples=22]
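A minimal scikit-learn sketch of the same iris setup as the diagram (hyperparameters are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each fit on a bootstrap sample; the majority vote classifies
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy, typically ~0.97-1.0 on iris
```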
8. Euclidean Distance
Euclidean distance is the straight-line distance between two points in space. In vector math, it’s used to evaluate how far apart data points or embeddings are: the further apart they are, the less related they are in the given context.
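In numpy, with made-up vectors:

```python
import numpy as np

a = np.array([0.12, -0.05, 0.34])
b = np.array([0.10,  0.02, 0.30])
dist = np.linalg.norm(a - b)  # sqrt(sum((a_i - b_i)^2))
print(dist)  # ~0.083
```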
9. Cosine Similarity
Cosine similarity measures how similar two vectors are by calculating the cosine of the angle between them. It ranges from -1 to 1, where 1 means identical direction (perfect similarity), 0 means unrelated (orthogonal) vectors, and negative values mean the vectors point in opposing directions of the context space.
10. Dot Product
The dot product of two vectors is the sum of the products of their corresponding elements. It reflects how much two vectors point in the same direction.
Formula: A·B = |A||B|·cos(θ)
11. Dot Product and Cosine Relationship
For unit vectors (where |A| = 1 and |B| = 1), the dot product directly equals the cosine of the angle between them:
A·B = cos(θ).
In simple terms, this means the dot product measures directional similarity.
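A numpy sketch tying the dot product and cosine together (the 5D vectors are made up):

```python
import numpy as np

a = np.array([0.42, 0.15, -0.03, 0.28, 0.11])
b = np.array([-0.18, 0.07, 0.41, 0.14, -0.25])

dot = np.dot(a, b)                                   # sum of elementwise products
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cos(theta)

# Normalize to unit vectors: now the dot product IS the cosine
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(np.dot(a_u, b_u), cos))  # True
```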
12. Embedding
The embedding process transforms any kind of data into a vector of numbers that represents its „meaning”. Embeddings are key in search, recommendation systems, and semantic similarity.
„Hello world!” → [0.12, -0.05, 0.34, 0.21, -0.08]
| Word | Sample 5D Embedding | Intuition |
|---|---|---|
| „Hello” | [0.42, 0.15, -0.03, 0.28, 0.11] | Greeting vector (warm, social dim high) |
| „world” | [-0.18, 0.07, 0.41, 0.14, -0.25] | Global/universal concept (broad dim high) |
| Average | [0.12, 0.11, 0.19, 0.21, -0.07] | Combined „welcoming to all” |
cos(„Hello world!”, „Hi everyone!”) ≈ 0.92 → Very similar (greetings)
cos(„Hello world!”, „Goodbye moon”) ≈ 0.15 → Unrelated
cos(„Hello world!”, „Shut up!”) ≈ -0.23 → Opposite sentiment
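A sketch using the sentence-transformers library (the model name is just one popular choice, and real similarity scores will differ from the illustrative numbers above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model
vecs = model.encode(["Hello world!", "Hi everyone!", "Goodbye moon"])

print(util.cos_sim(vecs[0], vecs[1]))  # high:  both are greetings
print(util.cos_sim(vecs[0], vecs[2]))  # lower: different meaning
```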
13. Token
A token is a chunk of text (e.g., a word, part of a word, or punctuation) that an AI model processes. Large language models (LLMs) work by predicting one token at a time.
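A sketch with OpenAI’s tiktoken tokenizer, one common way to see tokens in practice:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
ids = enc.encode("Hello world!")
print(ids)                             # three token IDs
print([enc.decode([i]) for i in ids])  # ['Hello', ' world', '!']
```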
14. Instruction File
An instruction file defines how an AI model should behave, what tasks it should prioritize, or what tone to adopt. For example, it can set boundaries or style preferences for generation.
title: "Pinia Stores Setup Guide – Vue 3 + Cache"
description: "Complete Vue.js Pinia with localStorage caching for cart and preferences. Refresh-proof state management."
keywords: "vue 3, pinia, localstorage cache, vue store, embeddings cache, woocommerce cart"
tech: "Vue 3.5+, Pinia 2.2+, Vite, pinia-plugin-persistedstate"
category: "Frontend"
15. Fine-tuning
Fine-tuning involves training a pre-existing model on specific data to adapt it for a particular task or domain, such as customer support or law-related text generation.
16. Prompt
A prompt is the input text or command given to an AI model. Well-crafted prompts guide the model to deliver precise, high-quality outcomes.
TASK
Refactor the legacy code to use the new library version while:
- Preserving exact functionality (zero behavior change)
- Using the modern Vue 3 Composition API (no Options API)
- Adding TypeScript interfaces (strict types)
- Integrating with the Pinia store (if state-related)
- Handling errors (try/catch + user-friendly messages)
- Keeping performance tight (no memory leaks, reactive cleanup)
CONSTRAINTS
- No breaking changes to public API
- Handle edge cases from old code
- 100% backward compatible inputs/outputs
- Remove deprecated methods
- Add JSDoc comments
EXAMPLE: Lodash → Native (common case)
LEGACY:
```js
// Lodash 4.x
import _ from 'lodash'
const users = _.groupBy(data, 'status')
const active = _.filter(users.active, u => u.age > 18)
```
17. Context Window
The context window is the maximum amount of data, in tokens, a model can “remember” in a single conversation or session. Larger windows allow longer, more coherent outputs, though too much data can spoil the answers. When the context window runs out of capacity, context shortening may be triggered: instead of 45,672 tokens, the model might keep a summary of around 10,000 tokens. You might say it runs a garbage collect on unused data in the context.
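A toy sketch of that „garbage collect” idea (count_tokens is a hypothetical stand-in for a real tokenizer, and real systems summarize old turns rather than just dropping them):

```python
def trim_context(messages, max_tokens, count_tokens):
    # Drop the oldest messages until everything fits the window again
    while sum(count_tokens(m) for m in messages) > max_tokens and len(messages) > 1:
        messages.pop(0)
    return messages

count_tokens = lambda m: len(m.split())  # crude stand-in: words ~ tokens
history = ["first long message ...", "second message", "latest question"]
print(trim_context(history, max_tokens=6, count_tokens=count_tokens))
```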
| Model | Parameters | Size (FP16) | Context Window (Input Tokens) | Output Tokens | Notes |
|---|---|---|---|---|---|
| GPT-4o | ~1.76T | ~3.5 TB (cloud) | 128,000 | 4,096 | Default/main model in Copilot Chat |
| GPT-4.1 | ~1.8T | ~3.6 TB (cloud) | 128,000 | 4,096 | High-quality coding model |
| GPT-5 | 2T+ | ~4+ TB (cloud) | 128,000 | 8,192 | New flagship (preview models vary) |
| GPT-5 mini | ~100B | ~200 GB (cloud) | 128,000 | 4,096 | Fast/lightweight |
| GPT-5.1-Codex-Max | ~500B | ~1 TB (cloud) | 400,000 | 128,000 | Max context preview (code-focused) |
| Gemini 3 Pro (Preview) | 1.5T | ~3 TB (cloud) | 128,000 | 64,000 | Google model, codebase indexing |
| Gemini 2.5 Pro | ~500B | ~1 TB (cloud) | 64,000–128,000* | Varies | Often limited vs native 1M |
| Claude Sonnet 4 | 400B | ~800 GB (cloud) | 80,000–128,000* | 4,096 | Preview models capped; native up to 1M |
| o4-mini | ~50B | ~100 GB (cloud) | ~100,000 | 4,096 | Fast OpenAI variant |
| Qwen4 | 32B–72B | 64–144 GB (local) | 128K–1M | 8K–32K | Alibaba; excellent code/math; Ollama-friendly |
| DeepSeek R1 | 671B (37B active MoE) | ~1.3 TB (cloud/~74 GB active) | 128,000 | 32,768 | Reasoning/coding beast; distilled 7B–70B local |
| Ollama 3.2 (Llama 3.2 base) | 3B–405B* | 6–810 GB (local, quantized ~2–200 GB) | 128K (configurable) | Varies | Your local setup; num_ctx: 131072; vision support |
18. Inference
Inference refers to the process of running a trained model to generate predictions or responses, as distinct from training (which adjusts internal weights). No learning happens during inference: the model applies fixed knowledge to fresh data.
19. Model Parameters
Parameters are the internal values in a neural network (often millions or billions) that determine how inputs transform into outputs. They are adjusted during training, then stay fixed: when a trained model (like GPT-4o or a local Qwen model under Ollama) runs inference, it only applies the patterns encoded in these parameters.
20. Plan / Subscription Tier
Most AI platforms offer plans or tiers that define limits such as context length, token quotas (input and output), model choice, or request volume. Paid plans often allow access to more advanced models and APIs.
21. Vector Database
Vector databases store embeddings (numerical vectors derived from text or images) and enable similarity search using metrics such as cosine distance. Perfect for semantic search.
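What a vector database does conceptually, as a brute-force numpy toy (real engines such as FAISS or Chroma add indexes for speed; the data here is random):

```python
import numpy as np

# Toy in-memory "vector database": rows are stored document embeddings
db = np.random.rand(1000, 384)                   # 1000 docs, 384-dim embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize rows once

query = np.random.rand(384)
query /= np.linalg.norm(query)

scores = db @ query                  # cosine similarity (unit vectors)
top5 = np.argsort(scores)[::-1][:5]  # indices of the 5 closest documents
print(top5, scores[top5])
```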
22. Gradient Descent
Gradient descent is the core optimization algorithm that trains AI models: it iteratively updates the parameters in the direction that reduces the error, step by step, minimizing a loss function until good weights are found. Think „hill descending” in parameter space.
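A worked one-parameter example: descending the loss f(w) = (w - 3)², whose minimum sits at w = 3:

```python
w = 0.0    # starting guess
lr = 0.1   # learning rate (step size)

for step in range(50):
    grad = 2 * (w - 3)  # derivative of the loss (w - 3)^2
    w -= lr * grad      # step downhill, against the gradient

print(w)  # ~3.0, the minimum
```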
23. Loss Function
A loss function measures how far the model’s predictions deviate from the correct answers. Lower loss means better performance during training and stable output.
| Task | Loss Function | Formula | Example | When |
|---|---|---|---|---|
| Regression (numbers) | MSE | (pred - actual)² | Recipe IG: predict 45, actual 42 → loss=9 | Embeddings, prices |
| Regression (robust) | MAE | \|pred - actual\| | predict 45, actual 42 → loss=3 | Outlier-heavy data |
| Classification (binary) | Cross-Entropy | -[y*log(pred) + (1-y)*log(1-pred)] | „Is fasting?” 1→0.9: loss=0.1 | Spam/not spam |
| Classification (multi) | Categorical Cross-Entropy | Extension of binary | Recipe tags | Multiple categories |
| Margin (SVM) | Hinge | max(0, 1 - margin) | Class separation | Robust classification |
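The table’s example numbers, reproduced in a few lines of Python:

```python
import numpy as np

pred, actual = 45.0, 42.0
mse = (pred - actual) ** 2  # 9.0 -> squared error punishes big misses
mae = abs(pred - actual)    # 3.0 -> linear, more robust to outliers

# Binary cross-entropy for the „Is fasting?" row: label y=1, prediction p=0.9
y, p = 1, 0.9
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(mse, mae, round(bce, 3))  # 9.0 3.0 0.105
```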
Want more?
Read more!


