Why Log, Softmax & Likelihood
Every loss function in deep learning — cross-entropy, NLL, contrastive loss, DPO, CLIP — is secretly the same idea repeated. Once you understand why we use probability, what likelihood truly means, and why we take its log, every loss function will click immediately.
What Is Probability?
A probability is a number between 0 and 1 expressing how likely something is. 0 means impossible. 1 means certain. The golden rule: all probabilities of possible outcomes must add up to exactly 1.
In machine learning, the model outputs a probability distribution over classes — a list of probabilities that sum to 1, telling you how confident the model is about each possibility.
What Is Likelihood? — The Most Important Concept
Likelihood and probability use the same formula, but they ask completely different questions. This distinction is one of the most important in all of machine learning.
Probability asks: "Given this model, how likely is this data?"
Likelihood asks: "Given this data, how well does this model explain it?"
More formally — likelihood measures: "How good is my model at explaining the data I actually observed?"
Formal Definition
Given a dataset of N examples with inputs xᵢ and labels yᵢ, the likelihood of parameters θ is the probability the model assigns to all the observed labels at once:

L(θ) = ∏ᵢ₌₁ᴺ P(yᵢ | xᵢ; θ)
Binary Classification Example
Suppose you have a model and three data points:
| Input | True Label y | Model P(y=1|x) | Probability Used |
|---|---|---|---|
| x₁ | 1 (positive) | 0.9 | P(y=1) = 0.9 |
| x₂ | 0 (negative) | 0.2 | P(y=0) = 1−0.2 = 0.8 |
| x₃ | 1 (positive) | 0.8 | P(y=1) = 0.8 |
When true label y=1, use P(y=1|x) directly. When true label y=0, use 1−P(y=1|x). This ensures you always measure the probability of the correct label.
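The table can be checked in a few lines. A minimal sketch using exactly the three data points above:

```python
import numpy as np

# True labels and the model's predicted P(y=1|x) for the three points
y_true = np.array([1, 0, 1])
p_pos = np.array([0.9, 0.2, 0.8])  # model's P(y=1|x)

# Probability assigned to the CORRECT label: y=1 -> p, y=0 -> 1-p
p_correct = np.where(y_true == 1, p_pos, 1 - p_pos)
print(p_correct)  # [0.9 0.8 0.8]

# Likelihood of the whole dataset = product of the per-example terms
likelihood = np.prod(p_correct)
print(likelihood)  # ≈ 0.576
```

The `np.where` line is the "use P(y=1) when y=1, use 1−P(y=1) when y=0" rule from the paragraph above, vectorized.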
Probability vs Likelihood — Side by Side
Same formula. Completely different meaning depending on what you hold fixed and what you vary.
| Aspect | Probability | Likelihood |
|---|---|---|
| What's fixed? | Parameters θ | Observed data |
| What varies? | Possible outcomes | Parameters θ |
| Real-world use | Forecasting, simulation | Model fitting, training |
| In deep learning | Model's output probabilities | The objective we maximize |
The Coin Story — Probability vs Likelihood in Action
This story will make the distinction unforgettable. Follow it carefully — it's the same mental model that underlies every training loop in deep learning.
Part 1 — Using Probability (θ known, data unknown)
Suppose you already know θ = 0.7 (70% chance of Heads). Now you ask: "What is the probability of getting H, T, H in three tosses?" Each toss is independent, so the probabilities multiply: P(H, T, H) = 0.7 × 0.3 × 0.7 = 0.147.
Part 2 — Using Likelihood (data known, θ unknown)
Now the situation is reversed. You toss the coin and observe: H, H, T, H (fixed). You ask: "What value of θ best explains what I saw?" For any candidate θ, the likelihood of the observed sequence is L(θ) = θ × θ × (1−θ) × θ = θ³(1−θ). Trying values: θ=0.5 gives 0.0625, θ=0.75 gives 0.1055, θ=0.9 gives 0.0729. The maximum is at θ=0.75, exactly the fraction of heads observed.
Probability: "I know the coin, what will happen?" — Likelihood: "I saw what happened, what is the coin?"
Training a neural network is exactly this: the data is fixed (your training set), and you search over all possible parameter values θ (model weights) to find those that make the observed labels as probable as possible. That is MLE.
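That search can be made literal. A minimal sketch where grid search stands in for gradient descent: fix the data H, H, T, H, sweep θ, and pick the value with the highest likelihood:

```python
import numpy as np

# Data is fixed: H, H, T, H -> 3 heads, 1 tail
def likelihood(theta):
    return theta**3 * (1 - theta)

# Search over candidate parameter values (the "model weights" here)
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax(likelihood(thetas))]
print(best)  # ≈ 0.75, the MLE estimate (3 of 4 tosses were heads)
```

Real training replaces the grid with gradient ascent on the (log-)likelihood, but the question being answered is identical.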
Why the Log Function? — Four Deep Reasons
This is the section most courses skip. We will not. The log function is not an arbitrary choice — it's the natural measure of information and surprise.
Reason 1 — Products Become Sums
For independent events (like separate training examples), probabilities multiply together. The fundamental property of logarithms converts that multiplication into addition: log(a·b) = log(a) + log(b), so log ∏ᵢ pᵢ = Σᵢ log(pᵢ).
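The identity is easy to verify numerically. A quick sketch with made-up probabilities:

```python
import numpy as np

probs = np.array([0.9, 0.8, 0.8, 0.95])

# Multiplying probabilities...
product = np.prod(probs)

# ...is the same as summing their logs, then exponentiating
log_sum = np.sum(np.log(probs))
print(product, np.exp(log_sum))  # both ≈ 0.5472
```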
Reason 2 — Preventing Numerical Underflow
Multiplying many small probabilities drives the product below the smallest number a float can represent: 0.9 raised to a million is vastly smaller than 10⁻³⁰⁸, so the computer stores it as exactly 0 and all information is lost. The equivalent sum of logs is just a moderately sized negative number.
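You can watch the underflow happen. A sketch with 10,000 examples, each assigned probability 0.9:

```python
import math

# 10,000 examples, each with probability 0.9 for the correct label
product = 0.9 ** 10_000           # true value ≈ 10^-458: below float range
log_sum = 10_000 * math.log(0.9)  # the same quantity, kept in log space

print(product)  # 0.0  <- underflow: the information is gone
print(log_sum)  # ≈ -1053.6: perfectly representable
```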
Reason 3 — Logarithmic Perception Matches Intuition
Going from p = 0.01 to p = 0.02 feels like a big improvement; going from 0.98 to 0.99 feels marginal. A linear scale treats both as the same +0.01, while the log scale treats both as the same multiplicative change, which matches how we intuitively judge confidence.
Reason 4 — Penalizes Confident Mistakes Severely
This is the training-signal reason. The loss uses −log(p). Watch what happens as the model's probability for the correct answer changes: −log(0.99) ≈ 0.01, −log(0.5) ≈ 0.69, −log(0.1) ≈ 2.30, −log(0.01) ≈ 4.61. A confident correct prediction costs almost nothing; a confident mistake is punished severely.
Maximum Likelihood Estimation (MLE)
MLE answers one central question: "Given the data I observed, what model parameters best explain it?"
The MLE Formula
θ* = argmax_θ ∏ᵢ₌₁ᴺ P(yᵢ | xᵢ; θ) = argmax_θ Σᵢ₌₁ᴺ log P(yᵢ | xᵢ; θ)
Taking the log (Reason 1 above) turns the product into a sum without moving the maximum; in practice we minimize the negative of this sum.
When you write F.cross_entropy() or F.nll_loss() in PyTorch — that IS maximum likelihood estimation. The framework is: "find model parameters that make the observed training labels as probable as possible." Every training loop in deep learning is MLE in disguise.
What "As Probable As Possible" Means Concretely
In a 3-class problem, imagine the model predicts these probabilities for a cat image, before and after training:

| Class | Before training | After training |
|---|---|---|
| Cat (true) | 0.05 | 0.90 |
| Dog | 0.80 | 0.07 |
| Car | 0.15 | 0.03 |

MLE training pushes the model from the "before" column toward the "after" column — adjusting weights so the correct label gets higher and higher probability.
Softmax — Converting Logits to Probabilities
Neural networks output raw numbers called logits. These can be any value — positive, negative, large, small. They are not probabilities. Softmax converts them to a valid probability distribution.
Why Use e (the Exponential)?
Three reasons. First, e^x is always positive regardless of x — so we never get negative probabilities even when logits are negative. Second, it preserves order — a larger logit always produces a larger probability. Third, the exponential amplifies differences — a logit of 4 vs. 2 creates a much bigger probability gap than a linear ratio would, making the model more decisive in its predictions.
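All three properties show up immediately in code. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract max for numerical stability
    return exp_z / exp_z.sum()

logits = np.array([4.0, 2.0, -1.0])
probs = softmax(logits)

print(probs)                # all positive, even for the negative logit
print(probs.sum())          # 1.0
print(probs[0] / probs[1])  # e^(4-2) ≈ 7.39: differences are amplified
```

Note the ratio of two softmax outputs depends only on the logit difference: pᵢ/pⱼ = e^(zᵢ−zⱼ), which is where the amplification comes from.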
Temperature — The Hidden Confidence Knob
You may have seen a "temperature" setting in ChatGPT or LLM APIs. Temperature modifies softmax by dividing the logits by τ before exponentiation: softmaxᵢ(z/τ) = e^(zᵢ/τ) / Σⱼ e^(zⱼ/τ). With τ < 1 the distribution sharpens (more confident); with τ > 1 it flattens (more random); τ = 1 recovers standard softmax.
Cross-Entropy Loss — The Standard Loss Function
Cross-entropy loss — information theory framing: how many "bits" does it take to communicate the true label under the model's predicted distribution?
Negative Log-Likelihood (NLL) — probability framing: how unlikely was the true label under the model? High NLL = the correct answer was very unexpected (high "unlikeliness").
Maximum Likelihood Estimation (MLE) — optimization framing: find parameters that maximize the probability of the data. Minimizing NLL = maximizing likelihood. These are mathematically identical.
The Full Formula — Role of the Summation and yᵢ
You may have seen cross-entropy written as ℒ = −Σᵢ yᵢ log(pᵢ) with a summation. But when you compute it in practice, the sum disappears and you're left with just −log(correct class probability). Why?
In multi-label tasks (where multiple classes can be simultaneously correct), many yᵢ values can be 1 at once, and the full summation matters. In standard single-label classification, the one-hot structure collapses the sum to a single term — which is why the simplified ℒ = −log(pᵢ_true) version is usually shown.
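You can watch the collapse happen with the one-hot vector. A sketch using the 3-class cat example from earlier:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])     # model output: cat, dog, car
y_onehot = np.array([1.0, 0.0, 0.0])  # cat is the true class

# Full formula: the y_i = 0 terms contribute nothing to the sum
full = -np.sum(y_onehot * np.log(probs))

# Simplified formula: just the true class
simple = -np.log(probs[0])

print(full, simple)  # identical, ≈ 0.357
```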
KL Divergence — How Different Are Two Distributions?
KL divergence measures how different one probability distribution P is from another Q: KL(P ‖ Q) = Σᵢ P(i) log(P(i)/Q(i)). It is always ≥ 0 and equals 0 only when P = Q. It appears everywhere in modern deep learning.
The Three Concepts Together — Entropy, Cross-Entropy, KL
| Concept | Analogy | What changes during training |
|---|---|---|
| Entropy H(P) | How hard the exam questions are | Fixed — determined by your data |
| Cross-Entropy H(P,Q) | How many mistakes the student (model) makes | Decreases as model improves |
| KL Divergence | How far the student is from the teacher | Decreases toward 0 as model converges |
| NLL | How wrong on a single question | The per-sample loss we backprop through |
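The identity tying this table together is H(P, Q) = H(P) + KL(P ‖ Q). A sketch verifying it with made-up distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # true distribution
Q = np.array([0.5, 0.3, 0.2])  # model's distribution

entropy = -np.sum(P * np.log(P))        # H(P): fixed by the data
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q): what we minimize
kl = np.sum(P * np.log(P / Q))          # KL(P || Q): the gap to close

# Cross-entropy = entropy + KL, so minimizing one minimizes the other
print(cross_entropy, entropy + kl)  # equal
```

Since H(P) is fixed by the data, minimizing cross-entropy and minimizing KL(P ‖ Q) are the same optimization.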
VAEs: KL(encoder output ‖ Gaussian) keeps latent space organized. Without this, the latent space becomes irregular and sampling from it produces garbage.
PPO (RLHF): KL(new policy ‖ old policy) prevents the model from changing too drastically in one step.
DPO (alignment): β × KL(π ‖ π_ref) controls how far the model drifts from the base model. β is the regularization strength.
Knowledge Distillation: KL(teacher ‖ student) trains a small model to match a big model's full output distribution, not just its top prediction.
The Gradient of Cross-Entropy — The Most Elegant Result in DL
Here is one of the most beautiful results in deep learning. When you combine softmax and cross-entropy loss, the gradient with respect to the logits simplifies to something shockingly simple.
∂ℒ/∂zₖ = pₖ − yₖ i.e., gradient = prediction − truth
Let's verify this with numbers. Suppose the model outputs for a cat image:
| Class | Prediction p | True label y | Gradient (p − y) | What happens |
|---|---|---|---|---|
| Cat (true) | 0.7 | 1 | 0.7 − 1 = −0.3 | Negative → logit pushed UP ↑ |
| Dog | 0.2 | 0 | 0.2 − 0 = +0.2 | Positive → logit pushed DOWN ↓ |
| Car | 0.1 | 0 | 0.1 − 0 = +0.1 | Positive → logit pushed DOWN ↓ |
The gradient automatically knows what to do: push the correct class up, push wrong classes down — proportional to how wrong each prediction was. The model that was 70% confident about the right answer gets a gentle nudge (−0.3). If it had been 10% confident, the nudge would be −0.9 (much stronger). The learning signal is proportional to the mistake.
This also explains why this is called the "wrongness meter": each gradient term is exactly the model's prediction error. Large error → large gradient → large update. The optimizer is measuring precisely how wrong each class was and correcting accordingly.
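You don't have to take the formula on faith. A sketch (with made-up logits) that compares the closed form p − y against a numerical finite-difference gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, true_idx):
    return -np.log(softmax(z)[true_idx])

z = np.array([2.0, 1.0, 0.1])
true_idx = 0

# Closed form: gradient = p - y
p = softmax(z)
y = np.eye(3)[true_idx]
analytic = p - y

# Numerical check: central finite differences on each logit
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], true_idx)
     - loss(z - eps * np.eye(3)[k], true_idx)) / (2 * eps)
    for k in range(3)
])

print(analytic.round(6))
print(numeric.round(6))  # agree to ~6 decimals: p - y is the gradient
```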
Deriving the Gradient — Where Does p − y Come From?
Let's prove the result from the previous section. This is pure calculus using the quotient rule and chain rule. Follow step by step — every line comes from the one above.
Step 1 — Write Softmax as a Fraction
pᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
Step 2 — Apply the Quotient Rule to ∂pᵢ/∂zₖ
Differentiating the fraction gives two cases: for i = k, ∂pᵢ/∂zₖ = pₖ(1 − pₖ); for i ≠ k, ∂pᵢ/∂zₖ = −pᵢpₖ.
Step 3 — Put It Together
Both cases collapse into one expression using the Kronecker delta (δᵢₖ = 1 if i = k, else 0): ∂pᵢ/∂zₖ = pᵢ(δᵢₖ − pₖ).
Step 4 — Apply Chain Rule for the Full Gradient
With ℒ = −Σᵢ yᵢ log(pᵢ), we have ∂ℒ/∂pᵢ = −yᵢ/pᵢ. The chain rule gives:
∂ℒ/∂zₖ = Σᵢ (−yᵢ/pᵢ) · pᵢ(δᵢₖ − pₖ) = −Σᵢ yᵢ(δᵢₖ − pₖ) = −yₖ + pₖ Σᵢ yᵢ = pₖ − yₖ
This result — gradient = prediction − truth — is not just mathematically elegant. It tells you how quickly a model learns: the larger the prediction error, the larger the gradient, the larger the weight update. It also shows that softmax and cross-entropy are "made for each other" — their combination produces the cleanest possible gradient. This is why F.cross_entropy(logits, targets) in PyTorch applies softmax internally and should be used instead of applying softmax separately and then using NLL loss.
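A quick sketch (with made-up logits) confirming that the fused PyTorch call and the two-step version compute the same loss:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])  # class 0 is correct

# Recommended: one fused, numerically stable call
fused = F.cross_entropy(logits, target)

# Equivalent two-step version: log-softmax, then NLL
two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(fused.item(), two_step.item())  # identical values
```

The fused call is preferred because it works in log space throughout, avoiding the over/underflow that an explicit `softmax` followed by `log` can hit with extreme logits.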
The Complete Picture — Everything Connected
Every concept from today flows into the others. Here is the full training loop in one line:

logits z → softmax → probabilities p → pick P(y_true | x) → loss = −log(p_true) → gradient = p − y → update weights → repeat
Training a neural network is just iteratively asking: "How surprised was my model by the correct answers?" — and nudging the parameters to be less surprised next time. Log, softmax, likelihood, and cross-entropy are all different words for the same mathematical story.
Code — Everything From Scratch
```python
import numpy as np
import torch
import torch.nn.functional as F

# ─────────────────────────────────────────────────────────────
# 1. LOG OF PROBABILITIES
# ─────────────────────────────────────────────────────────────
print("Probability → −log(p) [the loss]")
for p in [0.99, 0.5, 0.1, 0.01]:
    print(f"  p = {p:.2f} → loss = {-np.log(p):.3f}")
# p=0.99 → 0.010  (model was right, tiny loss)
# p=0.01 → 4.605  (model was catastrophically wrong)

# ─────────────────────────────────────────────────────────────
# 2. LIKELIHOOD — the coin toss example
# ─────────────────────────────────────────────────────────────
# Observed: H, H, T, H (data is FIXED)
# Try different theta values to find maximum likelihood

def likelihood(theta):
    """L(theta | H,H,T,H) = theta³ × (1-theta)"""
    return theta**3 * (1 - theta)

print("\nLikelihood for H,H,T,H:")
for theta in [0.3, 0.5, 0.75, 0.8, 0.9]:
    print(f"  theta={theta} → L={likelihood(theta):.4f}")
# theta=0.75 gives maximum → that's the MLE estimate (3/4 were Heads)

# ─────────────────────────────────────────────────────────────
# 3. SOFTMAX + TEMPERATURE
# ─────────────────────────────────────────────────────────────
def softmax(z, tau=1.0):
    """Softmax with temperature τ. Standard softmax when τ=1."""
    z_scaled = z / tau
    exp_z = np.exp(z_scaled - np.max(z_scaled))  # subtract max for stability
    return exp_z / np.sum(exp_z)

logits = np.array([2.1, 1.0, -0.5])
print("\nTemperature effect on same logits [2.1, 1.0, -0.5]:")
for tau in [0.5, 1.0, 3.0]:
    probs = softmax(logits, tau)
    print(f"  τ={tau}: {probs.round(3)}")
# τ=0.5 → [0.93, 0.06, 0.01]  ← sharp/confident
# τ=1.0 → [0.71, 0.24, 0.05]  ← standard
# τ=3.0 → [0.43, 0.35, 0.22]  ← flat/creative

# ─────────────────────────────────────────────────────────────
# 4. CROSS-ENTROPY LOSS — manually, with summation and y_i
# ─────────────────────────────────────────────────────────────
def cross_entropy_full(probs, one_hot_labels):
    """
    Full formula: L = −Σᵢ yᵢ·log(pᵢ)
    Shows the summation and yᵢ explicitly.
    """
    # Each term: y_i * log(p_i). All terms with y_i=0 vanish.
    terms = one_hot_labels * np.log(probs + 1e-10)
    print(f"  Terms (y_i * log(p_i)): {terms.round(3)}")
    print("  Non-zero terms: only the true class survives!")
    return -np.sum(terms)

# 3-class: cat (true), dog, car
probs_good = np.array([0.7, 0.2, 0.1])
probs_bad = np.array([0.1, 0.8, 0.1])
y_true = np.array([1.0, 0.0, 0.0])  # one-hot: cat is correct

print("\nGood model (cat=0.7):")
loss_good = cross_entropy_full(probs_good, y_true)
print(f"  Loss = {loss_good:.3f}")  # ≈ 0.357

print("\nBad model (cat=0.1):")
loss_bad = cross_entropy_full(probs_bad, y_true)
print(f"  Loss = {loss_bad:.3f}")  # ≈ 2.303

# ─────────────────────────────────────────────────────────────
# 5. GRADIENT = prediction − truth (the beautiful result)
# ─────────────────────────────────────────────────────────────
def compute_gradient(logits, true_class_idx):
    """
    Gradient of cross-entropy loss wrt logits.
    Result: gradient[k] = p[k] - y[k]  (prediction minus truth)
    """
    probs = softmax(logits)
    y = np.zeros_like(probs)
    y[true_class_idx] = 1.0
    gradient = probs - y  # gradient = p - y
    return gradient, probs

logits_example = np.array([1.2, 0.5, 0.1])  # cat, dog, car
grad, probs = compute_gradient(logits_example, true_class_idx=0)  # cat is true

print("\nGradient computation:")
classes = ["cat (true)", "dog", "car"]
for c, p, g in zip(classes, probs, grad):
    direction = "push UP ↑" if g < 0 else "push DOWN ↓"
    print(f"  {c:12s}: p={p:.3f}  grad={g:+.3f} → {direction}")

# ─────────────────────────────────────────────────────────────
# 6. KL DIVERGENCE
# ─────────────────────────────────────────────────────────────
def kl_divergence(P, Q):
    """KL(P || Q). P = true dist, Q = model. Always >= 0."""
    P, Q = np.array(P), np.array(Q) + 1e-10
    return np.sum(P * np.log(P / Q))

true_dist = [0.7, 0.2, 0.1]
close_pred = [0.6, 0.3, 0.1]  # close to true
far_pred = [0.1, 0.1, 0.8]    # far from true

print(f"\nKL(true ‖ close): {kl_divergence(true_dist, close_pred):.4f}")
print(f"KL(true ‖ far):   {kl_divergence(true_dist, far_pred):.4f}")

# ─────────────────────────────────────────────────────────────
# 7. VERIFY WITH PYTORCH — our scratch results should match
# ─────────────────────────────────────────────────────────────
logits_t = torch.tensor([[1.2, 0.5, 0.1]])
target_t = torch.tensor([0])  # cat = class 0

pt_loss = F.cross_entropy(logits_t, target_t)
manual = -np.log(softmax(np.array([1.2, 0.5, 0.1]))[0])

print(f"\nManual loss:  {manual:.4f}")
print(f"PyTorch loss: {pt_loss.item():.4f}")
# Should match exactly — confirming our understanding is correct
```
Blog Post Summary
- Probability is between 0 and 1. A probability distribution over K classes sums to exactly 1. Models output raw logits — softmax converts them to probabilities.
- Likelihood asks: "How well does this model explain the observed data?" Probability holds θ fixed and varies data. Likelihood holds data fixed and varies θ. This single distinction is the foundation of all model training.
- The coin story makes this concrete: probability says "given θ=0.7, what sequence will I see?" Likelihood says "given I saw H,H,T,H, what θ best explains this?" MLE finds the θ with the highest likelihood.
- Log is used because: (1) it turns products into sums — critical for multiplying millions of probabilities together; (2) it prevents numerical underflow (0.9^million → zero); (3) −log(p) penalizes confident mistakes severely (p=0.01 → loss=4.6).
- Softmax converts logits to probabilities using eᶻⁱ/Σeᶻʲ. The exponential keeps outputs positive, preserves order, and amplifies differences making the model more decisive.
- Temperature τ scales logits before softmax. Low τ → sharp/confident. High τ → flat/random. This is the literal temperature knob in ChatGPT/Claude.
- Cross-entropy loss = NLL = MLE. ℒ = −log P(y_true|x). Only the true class matters; the summation Σᵢ yᵢ log pᵢ collapses to one term because yᵢ=0 for all wrong classes.
- KL divergence = H(P,Q) − H(P). Minimizing cross-entropy is equivalent to minimizing KL(true ‖ predicted). Appears in VAEs, PPO, DPO, and knowledge distillation.
- The gradient of softmax + cross-entropy = pₖ − yₖ. Prediction minus truth. Derived from quotient rule → chain rule → simplification. Correct class gets pushed up, wrong classes get pushed down — proportional to the error.