Why Log, Softmax & Likelihood
Every loss function in deep learning — cross-entropy, NLL, contrastive loss, DPO, CLIP — is secretly the same idea repeated. Once you understand why we use probability, what likelihood truly means, and why we take its log, every loss function will click immediately.
What Is Probability?
A probability is a number between 0 and 1 expressing how likely something is. 0 means impossible. 1 means certain. The golden rule: all probabilities of possible outcomes must add up to exactly 1.
In machine learning, the model outputs a probability distribution over classes — a list of probabilities that sum to 1, telling you how confident the model is about each possibility.
What Is Likelihood? — The Most Important Concept
Likelihood and probability use the same formula, but they ask completely different questions. This distinction is one of the most important in all of machine learning.
Probability asks: "Given this model, how likely is this data?"
Likelihood asks: "Given this data, how well does this model explain it?"
More formally — likelihood measures: "How good is my model at explaining the data I actually observed?"
Formal Definition
Given a dataset of N examples with inputs xᵢ and labels yᵢ, the likelihood of parameters θ is the probability the model assigns to all the observed labels at once:

L(θ) = ∏ᵢ₌₁ᴺ P(yᵢ | xᵢ; θ)
Binary Classification Example
Suppose you have a model and three data points:
| Input | True Label y | Model P(y=1|x) | Probability Used |
|---|---|---|---|
| x₁ | 1 (positive) | 0.9 | P(y=1) = 0.9 |
| x₂ | 0 (negative) | 0.2 | P(y=0) = 1−0.2 = 0.8 |
| x₃ | 1 (positive) | 0.8 | P(y=1) = 0.8 |
When true label y=1, use P(y=1|x) directly. When true label y=0, use 1−P(y=1|x). This ensures you always measure the probability of the correct label.
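The table can be checked in a few lines. A minimal sketch using exactly the three data points above:

```python
import numpy as np

# True labels and the model's predicted P(y=1|x) for the three points
y_true = np.array([1, 0, 1])
p_pos = np.array([0.9, 0.2, 0.8])  # model's P(y=1|x)

# Probability assigned to the CORRECT label: y=1 -> p, y=0 -> 1-p
p_correct = np.where(y_true == 1, p_pos, 1 - p_pos)
print(p_correct)  # [0.9 0.8 0.8]

# Likelihood of the whole dataset = product of the per-example terms
likelihood = np.prod(p_correct)
print(likelihood)  # ≈ 0.576
```

The `np.where` line is the "use P(y=1) when y=1, use 1−P(y=1) when y=0" rule from the paragraph above, vectorized.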
Probability vs Likelihood — Side by Side
Same formula. Completely different meaning depending on what you hold fixed and what you vary.
| Aspect | Probability | Likelihood |
|---|---|---|
| What's fixed? | Parameters θ | Observed data |
| What varies? | Possible outcomes | Parameters θ |
| Real-world use | Forecasting, simulation | Model fitting, training |
| In deep learning | Model's output probabilities | The objective we maximize |
The Coin Story — Probability vs Likelihood in Action
This story will make the distinction unforgettable. Follow it carefully — it's the same mental model that underlies every training loop in deep learning.
Part 1 — Using Probability (θ known, data unknown)
Suppose you already know θ = 0.7 (70% chance of Heads). Now you ask: "What is the probability of getting H, T, H in three tosses?" Each toss is independent, so the probabilities multiply: P(H, T, H) = 0.7 × 0.3 × 0.7 = 0.147.
Part 2 — Using Likelihood (data known, θ unknown)
Now the situation is reversed. You toss the coin and observe: H, H, T, H (fixed). You ask: "What value of θ best explains what I saw?" For any candidate θ, the likelihood of the observed sequence is L(θ) = θ × θ × (1−θ) × θ = θ³(1−θ). Trying values: θ=0.5 gives 0.0625, θ=0.75 gives 0.1055, θ=0.9 gives 0.0729. The maximum is at θ=0.75, exactly the fraction of heads observed.
Probability: "I know the coin, what will happen?" — Likelihood: "I saw what happened, what is the coin?"
Training a neural network is exactly this: the data is fixed (your training set), and you search over all possible parameter values θ (model weights) to find those that make the observed labels as probable as possible. That is MLE.
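That search can be made literal. A minimal sketch where grid search stands in for gradient descent: fix the data H, H, T, H, sweep θ, and pick the value with the highest likelihood:

```python
import numpy as np

# Data is fixed: H, H, T, H -> 3 heads, 1 tail
def likelihood(theta):
    return theta**3 * (1 - theta)

# Search over candidate parameter values (the "model weights" here)
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax(likelihood(thetas))]
print(best)  # ≈ 0.75, the MLE estimate (3 of 4 tosses were heads)
```

Real training replaces the grid with gradient ascent on the (log-)likelihood, but the question being answered is identical.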
Why the Log Function? — Four Deep Reasons
This is the section most courses skip. We will not. The log function is not an arbitrary choice — it's the natural measure of information and surprise.
Reason 1 — Products Become Sums
For independent events (like separate training examples), probabilities multiply together. The fundamental property of logarithms converts that multiplication into addition: log(a·b) = log(a) + log(b), so log ∏ᵢ pᵢ = Σᵢ log(pᵢ).
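The identity is easy to verify numerically. A quick sketch with made-up probabilities:

```python
import numpy as np

probs = np.array([0.9, 0.8, 0.8, 0.95])

# Multiplying probabilities...
product = np.prod(probs)

# ...is the same as summing their logs, then exponentiating
log_sum = np.sum(np.log(probs))
print(product, np.exp(log_sum))  # both ≈ 0.5472
```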
Reason 2 — Preventing Numerical Underflow
Multiplying many small probabilities drives the product below the smallest number a float can represent: 0.9 raised to a million is vastly smaller than 10⁻³⁰⁸, so the computer stores it as exactly 0 and all information is lost. The equivalent sum of logs is just a moderately sized negative number.
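You can watch the underflow happen. A sketch with 10,000 examples, each assigned probability 0.9:

```python
import math

# 10,000 examples, each with probability 0.9 for the correct label
product = 0.9 ** 10_000           # true value ≈ 10^-458: below float range
log_sum = 10_000 * math.log(0.9)  # the same quantity, kept in log space

print(product)  # 0.0  <- underflow: the information is gone
print(log_sum)  # ≈ -1053.6: perfectly representable
```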
Reason 3 — Logarithmic Perception Matches Intuition
Going from p = 0.01 to p = 0.02 feels like a big improvement; going from 0.98 to 0.99 feels marginal. A linear scale treats both as the same +0.01, while the log scale treats both as the same multiplicative change, which matches how we intuitively judge confidence.
Reason 4 — Penalizes Confident Mistakes Severely
This is the training-signal reason. The loss uses −log(p). Watch what happens as the model's probability for the correct answer changes: −log(0.99) ≈ 0.01, −log(0.5) ≈ 0.69, −log(0.1) ≈ 2.30, −log(0.01) ≈ 4.61. A confident correct prediction costs almost nothing; a confident mistake is punished severely.
Maximum Likelihood Estimation (MLE)
MLE answers one central question: "Given the data I observed, what model parameters best explain it?"
The MLE Formula
θ* = argmax_θ ∏ᵢ₌₁ᴺ P(yᵢ | xᵢ; θ) = argmax_θ Σᵢ₌₁ᴺ log P(yᵢ | xᵢ; θ)
Taking the log (Reason 1 above) turns the product into a sum without moving the maximum; in practice we minimize the negative of this sum.
When you write F.cross_entropy() or F.nll_loss() in PyTorch — that IS maximum likelihood estimation. The framework is: "find model parameters that make the observed training labels as probable as possible." Every training loop in deep learning is MLE in disguise.
What "As Probable As Possible" Means Concretely
In a 3-class problem, imagine the model predicts these probabilities for a cat image, before and after training:

| Class | Before training | After training |
|---|---|---|
| Cat (true) | 0.05 | 0.90 |
| Dog | 0.80 | 0.07 |
| Car | 0.15 | 0.03 |

MLE training pushes the model from the "before" column toward the "after" column — adjusting weights so the correct label gets higher and higher probability.
Softmax — Converting Logits to Probabilities
Neural networks output raw numbers called logits. These can be any value — positive, negative, large, small. They are not probabilities. Softmax converts them to a valid probability distribution.
Why Use e (the Exponential)?
Three reasons. First, e^x is always positive regardless of x — so we never get negative probabilities even when logits are negative. Second, it preserves order — a larger logit always produces a larger probability. Third, the exponential amplifies differences — a logit of 4 vs. 2 creates a much bigger probability gap than a linear ratio would, making the model more decisive in its predictions.
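All three properties show up immediately in code. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for numerical stability
    return exp_z / exp_z.sum()

logits = np.array([4.0, 2.0, -1.0])
probs = softmax(logits)

print(probs)                # all positive, even for the negative logit
print(probs.sum())          # 1.0
print(probs[0] / probs[1])  # e^(4-2) ≈ 7.39: differences are amplified
```

Note the ratio of two softmax outputs depends only on the logit difference: pᵢ/pⱼ = e^(zᵢ−zⱼ), which is where the amplification comes from.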
Temperature — The Hidden Confidence Knob
You may have seen a "temperature" setting in ChatGPT or LLM APIs. Temperature modifies softmax by dividing the logits by τ before exponentiation: softmaxᵢ(z/τ) = e^(zᵢ/τ) / Σⱼ e^(zⱼ/τ). With τ < 1 the distribution sharpens (more confident); with τ > 1 it flattens (more random); τ = 1 recovers standard softmax.
Cross-Entropy Loss — The Standard Loss Function
Cross-entropy loss — information theory framing: how many "bits" does it take to communicate the true label under the model's predicted distribution?
Negative Log-Likelihood (NLL) — probability framing: how unlikely was the true label under the model? High NLL = the correct answer was very unexpected (high "unlikeliness").
Maximum Likelihood Estimation (MLE) — optimization framing: find parameters that maximize the probability of the data. Minimizing NLL = maximizing likelihood. These are mathematically identical.
The Full Formula — Role of the Summation and yᵢ
You may have seen cross-entropy written as ℒ = −Σᵢ yᵢ log(pᵢ) with a summation. But when you compute it in practice, the sum disappears and you're left with just −log(correct class probability). Why?
In multi-label tasks (where multiple classes can be simultaneously correct), many yᵢ values can be 1 at once, and the full summation matters. In standard single-label classification, the one-hot structure collapses the sum to a single term — which is why the simplified ℒ = −log(pᵢ_true) version is usually shown.
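You can watch the collapse happen with the one-hot vector. A sketch using the 3-class cat example from earlier:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])     # model output: cat, dog, car
y_onehot = np.array([1.0, 0.0, 0.0])  # cat is the true class

# Full formula: the y_i = 0 terms contribute nothing to the sum
full = -np.sum(y_onehot * np.log(probs))

# Simplified formula: just the true class
simple = -np.log(probs[0])

print(full, simple)  # identical, ≈ 0.357
```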
KL Divergence — How Different Are Two Distributions?
KL divergence measures how different one probability distribution P is from another Q: KL(P ‖ Q) = Σᵢ P(i) log(P(i)/Q(i)). It is always ≥ 0 and equals 0 only when P = Q. It appears everywhere in modern deep learning.
The Three Concepts Together — Entropy, Cross-Entropy, KL
| Concept | Analogy | What changes during training |
|---|---|---|
| Entropy H(P) | How hard the exam questions are | Fixed — determined by your data |
| Cross-Entropy H(P,Q) | How many mistakes the student (model) makes | Decreases as model improves |
| KL Divergence | How far the student is from the teacher | Decreases toward 0 as model converges |
| NLL | How wrong on a single question | The per-sample loss we backprop through |
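The identity tying this table together is H(P, Q) = H(P) + KL(P ‖ Q). A sketch verifying it with made-up distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # true distribution
Q = np.array([0.5, 0.3, 0.2])  # model's distribution

entropy = -np.sum(P * np.log(P))        # H(P): fixed by the data
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q): what we minimize
kl = np.sum(P * np.log(P / Q))          # KL(P || Q): the gap to close

# Cross-entropy = entropy + KL, so minimizing one minimizes the other
print(cross_entropy, entropy + kl)  # equal
```

Since H(P) is fixed by the data, minimizing cross-entropy and minimizing KL(P ‖ Q) are the same optimization.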
VAEs: KL(encoder output ‖ Gaussian) keeps latent space organized. Without this, the latent space becomes irregular and sampling from it produces garbage.
PPO (RLHF): KL(new policy ‖ old policy) prevents the model from changing too drastically in one step.
DPO (alignment): β × KL(π ‖ π_ref) controls how far the model drifts from the base model. β is the regularization strength.
Knowledge Distillation: KL(teacher ‖ student) trains a small model to match a big model's full output distribution, not just its top prediction.
The Gradient of Cross-Entropy — The Most Elegant Result in DL
Here is one of the most beautiful results in deep learning. When you combine softmax and cross-entropy loss, the gradient with respect to the logits simplifies to something shockingly simple.
∂ℒ/∂zₖ = pₖ − yₖ i.e., gradient = prediction − truth
Let's verify this with numbers. Suppose the model outputs for a cat image:
| Class | Prediction p | True label y | Gradient (p − y) | What happens |
|---|---|---|---|---|
| Cat (true) | 0.7 | 1 | 0.7 − 1 = −0.3 | Negative → logit pushed UP ↑ |
| Dog | 0.2 | 0 | 0.2 − 0 = +0.2 | Positive → logit pushed DOWN ↓ |
| Car | 0.1 | 0 | 0.1 − 0 = +0.1 | Positive → logit pushed DOWN ↓ |
The gradient automatically knows what to do: push the correct class up, push wrong classes down — proportional to how wrong each prediction was. The model that was 70% confident about the right answer gets a gentle nudge (−0.3). If it had been 10% confident, the nudge would be −0.9 (much stronger). The learning signal is proportional to the mistake.
This also explains why this is called the "wrongness meter": each gradient term is exactly the model's prediction error. Large error → large gradient → large update. The optimizer is measuring precisely how wrong each class was and correcting accordingly.
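You don't have to take the formula on faith. A sketch (with made-up logits) that compares the closed form p − y against a numerical finite-difference gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, true_idx):
    return -np.log(softmax(z)[true_idx])

z = np.array([2.0, 1.0, 0.1])
true_idx = 0

# Closed form: gradient = p - y
p = softmax(z)
y = np.eye(3)[true_idx]
analytic = p - y

# Numerical check: central finite differences on each logit
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], true_idx)
     - loss(z - eps * np.eye(3)[k], true_idx)) / (2 * eps)
    for k in range(3)
])

print(analytic.round(6))
print(numeric.round(6))  # agree to ~6 decimals: p - y is the gradient
```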
Deriving the Gradient — Where Does p − y Come From?
Let's prove the result from the previous section. This is pure calculus using the quotient rule and chain rule. Follow step by step — every line comes from the one above.
Step 1 — Write Softmax as a Fraction
pᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
Step 2 — Apply the Quotient Rule to ∂pᵢ/∂zₖ
Differentiating the fraction gives two cases: for i = k, ∂pᵢ/∂zₖ = pₖ(1 − pₖ); for i ≠ k, ∂pᵢ/∂zₖ = −pᵢpₖ.
Step 3 — Put It Together
Both cases collapse into one expression using the Kronecker delta (δᵢₖ = 1 if i = k, else 0): ∂pᵢ/∂zₖ = pᵢ(δᵢₖ − pₖ).
Step 4 — Apply Chain Rule for the Full Gradient
With ℒ = −Σᵢ yᵢ log(pᵢ), we have ∂ℒ/∂pᵢ = −yᵢ/pᵢ. The chain rule gives:
∂ℒ/∂zₖ = Σᵢ (−yᵢ/pᵢ) · pᵢ(δᵢₖ − pₖ) = −Σᵢ yᵢ(δᵢₖ − pₖ) = −yₖ + pₖ Σᵢ yᵢ = pₖ − yₖ
This result — gradient = prediction − truth — is not just mathematically elegant. It tells you how quickly a model learns: the larger the prediction error, the larger the gradient, the larger the weight update. It also shows that softmax and cross-entropy are "made for each other" — their combination produces the cleanest possible gradient. This is why F.cross_entropy(logits, targets) in PyTorch applies softmax internally and should be used instead of applying softmax separately and then using NLL loss.
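A quick sketch (with made-up logits) confirming that the fused PyTorch call and the two-step version compute the same loss:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])  # class 0 is correct

# Recommended: one fused, numerically stable call
fused = F.cross_entropy(logits, target)

# Equivalent two-step version: log-softmax, then NLL
two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(fused.item(), two_step.item())  # identical values
```

The fused call is preferred because it works in log space throughout, avoiding the over/underflow that an explicit `softmax` followed by `log` can hit with extreme logits.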
The Complete Picture — Everything Connected
Every concept from today flows into the others. Here is the full training loop in one line:

logits z → softmax → probabilities p → pick P(y_true | x) → loss = −log(p_true) → gradient = p − y → update weights → repeat
Training a neural network is just iteratively asking: "How surprised was my model by the correct answers?" — and nudging the parameters to be less surprised next time. Log, softmax, likelihood, and cross-entropy are all different words for the same mathematical story.
Code — Everything From Scratch
```python
import numpy as np
import torch
import torch.nn.functional as F

# ─────────────────────────────────────────────────────────────
# 1. LOG OF PROBABILITIES
# ─────────────────────────────────────────────────────────────
print("Probability → −log(p) [the loss]")
for p in [0.99, 0.5, 0.1, 0.01]:
    print(f"  p = {p:.2f} → loss = {-np.log(p):.3f}")
# p=0.99 → 0.010  (model was right, tiny loss)
# p=0.01 → 4.605  (model was catastrophically wrong)

# ─────────────────────────────────────────────────────────────
# 2. LIKELIHOOD — the coin toss example
# ─────────────────────────────────────────────────────────────
# Observed: H, H, T, H (data is FIXED)
# Try different theta values to find maximum likelihood

def likelihood(theta):
    """L(theta | H,H,T,H) = theta³ × (1-theta)"""
    return theta**3 * (1 - theta)

print("\nLikelihood for H,H,T,H:")
for theta in [0.3, 0.5, 0.75, 0.8, 0.9]:
    print(f"  theta={theta} → L={likelihood(theta):.4f}")
# theta=0.75 gives maximum → that's the MLE estimate (3/4 were Heads)

# ─────────────────────────────────────────────────────────────
# 3. SOFTMAX + TEMPERATURE
# ─────────────────────────────────────────────────────────────
def softmax(z, tau=1.0):
    """Softmax with temperature τ. Standard softmax when τ=1."""
    z_scaled = z / tau
    exp_z = np.exp(z_scaled - np.max(z_scaled))  # subtract max for stability
    return exp_z / np.sum(exp_z)

logits = np.array([2.1, 1.0, -0.5])
print("\nTemperature effect on same logits [2.1, 1.0, -0.5]:")
for tau in [0.5, 1.0, 3.0]:
    probs = softmax(logits, tau)
    print(f"  τ={tau}: {probs.round(3)}")
# τ=0.5 → [0.93, 0.06, 0.01]  ← sharp/confident
# τ=1.0 → [0.71, 0.24, 0.05]  ← standard
# τ=3.0 → [0.43, 0.35, 0.22]  ← flat/creative

# ─────────────────────────────────────────────────────────────
# 4. CROSS-ENTROPY LOSS — manually, with summation and y_i
# ─────────────────────────────────────────────────────────────
def cross_entropy_full(probs, one_hot_labels):
    """
    Full formula: L = −Σᵢ yᵢ·log(pᵢ)
    Shows the summation and yᵢ explicitly.
    """
    # Each term: y_i * log(p_i). All terms with y_i=0 vanish.
    terms = one_hot_labels * np.log(probs + 1e-10)
    print(f"  Terms (y_i * log(p_i)): {terms.round(3)}")
    print("  Non-zero terms: only the true class survives!")
    return -np.sum(terms)

# 3-class: cat (true), dog, car
probs_good = np.array([0.7, 0.2, 0.1])
probs_bad = np.array([0.1, 0.8, 0.1])
y_true = np.array([1.0, 0.0, 0.0])  # one-hot: cat is correct

print("\nGood model (cat=0.7):")
loss_good = cross_entropy_full(probs_good, y_true)
print(f"  Loss = {loss_good:.3f}")  # ≈ 0.357

print("\nBad model (cat=0.1):")
loss_bad = cross_entropy_full(probs_bad, y_true)
print(f"  Loss = {loss_bad:.3f}")  # ≈ 2.303

# ─────────────────────────────────────────────────────────────
# 5. GRADIENT = prediction − truth (the beautiful result)
# ─────────────────────────────────────────────────────────────
def compute_gradient(logits, true_class_idx):
    """
    Gradient of cross-entropy loss wrt logits.
    Result: gradient[k] = p[k] - y[k]  (prediction minus truth)
    """
    probs = softmax(logits)
    y = np.zeros_like(probs)
    y[true_class_idx] = 1.0
    gradient = probs - y  # gradient = p - y
    return gradient, probs

logits_example = np.array([1.2, 0.5, 0.1])  # cat, dog, car
grad, probs = compute_gradient(logits_example, true_class_idx=0)  # cat is true

print("\nGradient computation:")
classes = ["cat (true)", "dog", "car"]
for c, p, g in zip(classes, probs, grad):
    direction = "push UP ↑" if g < 0 else "push DOWN ↓"
    print(f"  {c:12s}: p={p:.3f}  grad={g:+.3f} → {direction}")

# ─────────────────────────────────────────────────────────────
# 6. KL DIVERGENCE
# ─────────────────────────────────────────────────────────────
def kl_divergence(P, Q):
    """KL(P || Q). P = true dist, Q = model. Always >= 0."""
    P, Q = np.array(P), np.array(Q) + 1e-10
    return np.sum(P * np.log(P / Q))

true_dist = [0.7, 0.2, 0.1]
close_pred = [0.6, 0.3, 0.1]  # close to true
far_pred = [0.1, 0.1, 0.8]    # far from true

print(f"\nKL(true ‖ close): {kl_divergence(true_dist, close_pred):.4f}")
print(f"KL(true ‖ far):   {kl_divergence(true_dist, far_pred):.4f}")

# ─────────────────────────────────────────────────────────────
# 7. VERIFY WITH PYTORCH — our scratch results should match
# ─────────────────────────────────────────────────────────────
logits_t = torch.tensor([[1.2, 0.5, 0.1]])
target_t = torch.tensor([0])  # cat = class 0

pt_loss = F.cross_entropy(logits_t, target_t)
manual = -np.log(softmax(np.array([1.2, 0.5, 0.1]))[0])

print(f"\nManual loss:  {manual:.4f}")
print(f"PyTorch loss: {pt_loss.item():.4f}")
# Should match exactly — confirming our understanding is correct
```
Blog Post Summary
- Probability is between 0 and 1. A probability distribution over K classes sums to exactly 1. Models output raw logits — softmax converts them to probabilities.
- Likelihood asks: "How well does this model explain the observed data?" Probability holds θ fixed and varies data. Likelihood holds data fixed and varies θ. This single distinction is the foundation of all model training.
- The coin story makes this concrete: probability says "given θ=0.7, what sequence will I see?" Likelihood says "given I saw H,H,T,H, what θ best explains this?" MLE finds the θ with the highest likelihood.
- Log is used because: (1) it turns products into sums — critical for multiplying millions of probabilities together; (2) it prevents numerical underflow (0.9^million → zero); (3) −log(p) penalizes confident mistakes severely (p=0.01 → loss=4.6).
- Softmax converts logits to probabilities using eᶻⁱ/Σeᶻʲ. The exponential keeps outputs positive, preserves order, and amplifies differences making the model more decisive.
- Temperature τ scales logits before softmax. Low τ → sharp/confident. High τ → flat/random. This is the literal temperature knob in ChatGPT/Claude.
- Cross-entropy loss = NLL = MLE. ℒ = −log P(y_true|x). Only the true class matters; the summation Σᵢ yᵢ log pᵢ collapses to one term because yᵢ=0 for all wrong classes.
- KL divergence = H(P,Q) − H(P). Minimizing cross-entropy is equivalent to minimizing KL(true ‖ predicted). Appears in VAEs, PPO, DPO, and knowledge distillation.
- The gradient of softmax + cross-entropy = pₖ − yₖ. Prediction minus truth. Derived from quotient rule → chain rule → simplification. Correct class gets pushed up, wrong classes get pushed down — proportional to the error.