Loss Functions in Artificial Neural Networks: Types, Math & PyTorch

A complete reference covering every major loss function: mathematical formulas, gradient behaviour, outlier sensitivity, class imbalance handling, PyTorch implementations with real code snippets, custom loss function patterns, and a decision guide for picking the right loss for your task.

By Bimal Ghimire • Published December 12, 2025 • Updated February 26, 2026 • 20 min read

What Is a Loss Function?

A loss function (also called a cost function or objective function) is a mathematical function that measures the discrepancy between a neural network's predictions and the ground truth labels. During training, an optimiser (SGD, Adam, RMSProp) minimises this scalar value by computing its gradient with respect to every trainable parameter via backpropagation.

The general form of the loss over a dataset is:

General Loss Function
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f(x_i;\,\theta),\; y_i\bigr)$$

Where $x_i$ is the input, $y_i$ the ground truth, $\hat{y}_i = f(x_i;\theta)$ the model prediction parameterised by weights $\theta$, $\ell$ the per-sample loss, and $N$ the number of samples. The optimiser updates $\theta$ using the gradient $\nabla_\theta \mathcal{L}$.

MSE / MAE: regression defaults
Cross-Entropy: classification default
Focal: class imbalance
Triplet: metric / embedding learning

The choice of loss function is not cosmetic: it defines what the model is optimising for. A model trained with MSE learns to minimise average squared error and will produce conditional mean predictions. A model trained with MAE learns conditional median predictions. A model trained with cross-entropy maximises predicted probability at the correct class. Getting this choice wrong can lead to models that are technically converging but learning the wrong objective entirely.

Loss function vs metric: The loss function must be differentiable (for gradient computation) and often smooth. The evaluation metric (accuracy, F1, AUC, RMSE) is what you actually care about but may not be differentiable. Common pattern: train with cross-entropy loss, evaluate with accuracy or F1.

Regression Loss Functions

Mean Squared Error (MSE / L2 Loss)
Regression Outlier Sensitive
$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

MSE is the most widely used regression loss. Squaring the residual has two effects: it makes all terms non-negative (so positive and negative errors cannot cancel), and it assigns disproportionately high penalty to large errors (the loss grows quadratically, so the gradient scales linearly with the residual magnitude). A prediction that is 3 units off contributes 9x more loss than a prediction 1 unit off.

Gradient: $\frac{\partial\mathcal{L}}{\partial\hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$. Smooth everywhere, enabling efficient gradient descent convergence. Models trained with MSE produce conditional mean predictions. PyTorch: nn.MSELoss().
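The linear scaling of the gradient can be verified directly with autograd (a small sketch; the tensor values are illustrative):

```python
import torch
import torch.nn as nn

# Two predictions with requires_grad so autograd tracks them
preds = torch.tensor([1.0, 3.0], requires_grad=True)
targets = torch.tensor([0.0, 0.0])

loss = nn.MSELoss()(preds, targets)  # mean of squared errors = (1 + 9) / 2 = 5
loss.backward()

# dL/dy_hat_i = (2/N)(y_hat_i - y_i); here N = 2
print(preds.grad)  # tensor([1., 3.]) -- gradient scales linearly with the residual
```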

Mean Absolute Error (MAE / L1 Loss)
Regression Outlier Robust
$$\mathcal{L}_{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$$

MAE treats all residuals equally regardless of magnitude, making it outlier-robust. A 3-unit error contributes only 3x more than a 1-unit error. Models trained with MAE produce conditional median predictions, which is often more representative for skewed data distributions. The gradient is constant ($\pm 1/N$) everywhere except at zero, which can cause oscillation near the optimum. PyTorch: nn.L1Loss().

Huber Loss (Smooth L1)
Regression Outlier Balanced
$$\mathcal{L}_{\delta}(y,\hat{y}) = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & |y-\hat{y}| \leq \delta \\ \delta\left(|y-\hat{y}| - \frac{\delta}{2}\right) & |y-\hat{y}| > \delta \end{cases}$$

Huber loss is quadratic for small residuals (like MSE, with smooth gradients near the optimum) and linear for large residuals (like MAE, ignoring outliers). The hyperparameter $\delta$ controls the transition threshold. It is the loss function of choice for robust regression tasks and is widely used in reinforcement learning (Deep Q-Networks) and object detection regression heads. PyTorch: nn.HuberLoss(delta=1.0) or nn.SmoothL1Loss(beta=1.0); the two coincide at delta = beta = 1, but for other values SmoothL1Loss equals the Huber loss divided by beta.

Log-Cosh Loss
Regression Twice Differentiable
$$\mathcal{L} = \sum_{i=1}^{N}\log\bigl(\cosh(\hat{y}_i - y_i)\bigr)$$

Log-Cosh behaves approximately like $\frac{1}{2}e^2$ for small errors and like $|e| - \log 2$ for large errors. Critically, it is twice differentiable everywhere (unlike MAE, which has a non-differentiable kink at zero and unlike Huber, which has a discontinuous second derivative at $\delta$). This makes it particularly suitable for second-order optimisers (L-BFGS). Not natively in PyTorch; trivially implemented as a custom function.

Mean Absolute Percentage Error (MAPE)
Regression Undefined at y=0
$$\mathcal{L}_{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

MAPE expresses error as a percentage of the true value, making it scale-independent and interpretable across datasets with different units. Major limitations: it is undefined when $y_i = 0$, and it penalises over- and under-predictions asymmetrically (for positive targets, an under-prediction error is capped at 100% while an over-prediction error is unbounded). sMAPE (symmetric MAPE) addresses the asymmetry. Used primarily as an evaluation metric rather than a training loss in neural networks due to gradient instability near zero.
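Neither MAPE nor sMAPE ships with PyTorch; a minimal sketch of both (the `eps` guard against division by zero is an assumption of this sketch, not part of the textbook definition):

```python
import torch

def mape_loss(pred, target, eps=1e-8):
    """MAPE as a differentiable tensor op; eps guards against division by zero."""
    return 100.0 * torch.mean(torch.abs((target - pred) / (target.abs() + eps)))

def smape_loss(pred, target, eps=1e-8):
    """Symmetric MAPE: the denominator averages |y| and |y_hat|."""
    denom = (target.abs() + pred.abs()) / 2 + eps
    return 100.0 * torch.mean(torch.abs(target - pred) / denom)

pred = torch.tensor([90.0, 110.0])
target = torch.tensor([100.0, 100.0])
print(mape_loss(pred, target))  # 10% error on both samples -> 10.0
```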

Classification Loss Functions

Binary Cross-Entropy (BCE) Loss
Binary Classification
$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\bigr]$$

BCE (negative log-likelihood for Bernoulli distribution) is the standard loss for binary classification. $\hat{y}_i \in (0,1)$ is a sigmoid-activated probability. The log terms penalise confident wrong predictions extremely heavily: if the model outputs 0.99 probability for class 0 when the true class is 1, the loss is $-\log(0.01) \approx 4.6$. PyTorch: nn.BCELoss() (expects sigmoid-activated output) or nn.BCEWithLogitsLoss() (numerically stable, accepts raw logits).

Numerical stability: Always prefer BCEWithLogitsLoss over manually applying sigmoid then BCELoss. The combined version uses the log-sum-exp trick internally, preventing overflow/underflow for extreme logit values.
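The failure mode is easy to reproduce (a small sketch; the logit value 20 is chosen so that sigmoid saturates in float32):

```python
import torch
import torch.nn as nn

logit = torch.tensor([20.0])   # confidently positive raw score...
target = torch.tensor([0.0])   # ...but the true class is 0

# Stable: computed from the raw logit via log-sum-exp; true loss = softplus(20) ~ 20
stable = nn.BCEWithLogitsLoss()(logit, target)

# Naive: sigmoid(20) rounds to exactly 1.0 in float32, so log(1 - p) = log(0),
# which BCELoss clamps to -100 -- the loss comes out as 100 instead of ~20
naive = nn.BCELoss()(torch.sigmoid(logit), target)

print(stable.item())  # ~20.0 (correct)
print(naive.item())   # 100.0 (clamping artifact)
```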

Categorical Cross-Entropy (Softmax + NLL)
Multi-class Classification
$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log(\hat{y}_{i,c})$$
With softmax: $\hat{y}_{i,c} = \dfrac{e^{z_{i,c}}}{\sum_{k=1}^{C}e^{z_{i,k}}}$

For $C$ mutually exclusive classes. With one-hot labels, this simplifies to $-\log(\hat{y}_{i,c^*})$ where $c^*$ is the true class. PyTorch nn.CrossEntropyLoss() internally combines log-softmax and NLLLoss, so pass raw logits (no softmax applied in the model). This is the most common loss for image classification, NLP token prediction, and any fixed-class output head.

Focal Loss
Class Imbalance Object Detection
$$\mathcal{L}_{FL} = -\frac{1}{N}\sum_{i=1}^{N}\alpha_t(1-\hat{y}_{i,t})^\gamma\,\log(\hat{y}_{i,t})$$

Introduced by Lin et al. (Facebook AI Research, 2017) for RetinaNet, Focal Loss addresses extreme class imbalance in object detection where easy negatives dominate. The modulating factor $(1-p_t)^\gamma$ down-weights the loss for well-classified easy examples (high $p_t$, factor near 0) and focuses training on hard misclassified examples (low $p_t$, factor near 1). $\gamma = 2$ and $\alpha = 0.25$ are the paper's recommended defaults. Not natively in PyTorch; widely available via torchvision.ops.sigmoid_focal_loss (since torchvision 0.11).

Cross-Entropy with Label Smoothing
Classification Regularisation
$$y_{i,c}^{smooth} = \begin{cases} 1 - \varepsilon & c = c^* \\ \varepsilon/(C-1) & c \neq c^* \end{cases}$$

Label smoothing distributes a small probability mass $\varepsilon$ (typically 0.05 to 0.1) uniformly across incorrect classes instead of using hard 0/1 targets. This prevents the model from becoming overconfident (logits approaching $\pm\infty$), regularises the output distribution, and significantly improves calibration. Widely used in transformer models (ViT, BERT, GPT fine-tuning). PyTorch: nn.CrossEntropyLoss(label_smoothing=0.1).

Hinge Loss (SVM-style)
Binary Classification
$$\mathcal{L}_{hinge} = \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0,\;1 - y_i \cdot \hat{y}_i\bigr)$$
where $y_i \in \{-1, +1\}$, $\hat{y}_i$ is the raw score (logit)

Hinge loss maximises the decision margin between classes. Examples classified correctly with confidence exceeding 1 contribute zero loss; only misclassified or low-confidence examples contribute. The sparsity of the gradient (zero for confidently correct predictions) gives it implicit regularisation properties similar to L1. PyTorch has no direct implementation of this exact formula: nn.HingeEmbeddingLoss() implements a related distance-based variant, nn.MultiMarginLoss() covers the multi-class hinge, and the binary form above is a one-line custom implementation.
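A minimal sketch of the binary hinge formula above (labels in {-1, +1}, scores are raw logits; the example values are illustrative):

```python
import torch

def hinge_loss(scores, labels):
    """Classic SVM hinge: mean of max(0, 1 - y * score), labels in {-1, +1}."""
    return torch.clamp(1 - labels * scores, min=0).mean()

scores = torch.tensor([2.0, 0.5, -1.0])
labels = torch.tensor([1.0, 1.0, 1.0])
# Per-sample terms: max(0, 1-2)=0, max(0, 1-0.5)=0.5, max(0, 1+1)=2.0
print(hinge_loss(scores, labels))  # (0 + 0.5 + 2.0) / 3 = 0.8333
```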

KL Divergence Loss
Distribution Matching VAE / Knowledge Distillation
$$D_{KL}(P \| Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}$$

KL Divergence measures how much a probability distribution $Q$ (model prediction) diverges from a reference distribution $P$ (target). It is asymmetric: $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$. Used in: Variational Autoencoders (regularising the latent space distribution toward a Gaussian prior); knowledge distillation (matching student softmax output to teacher soft labels); and reinforcement learning (PPO policy update constraint). PyTorch: nn.KLDivLoss() (expects log-probabilities as input).
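A distillation-style use can be sketched as follows (the temperature T = 4 and the T² gradient rescaling follow the standard Hinton recipe; the random logits are placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_kl(student_logits, teacher_logits, T=4.0):
    """KL between temperature-softened teacher and student distributions.
    F.kl_div expects log-probabilities as input and probabilities as target;
    the T*T factor keeps gradient magnitudes comparable to the hard-label loss."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T * T)

student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
print(distillation_kl(student, teacher))  # non-negative scalar
```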

Ranking, Contrastive & Metric Learning Losses

Triplet Loss

$$\mathcal{L}_{triplet} = \frac{1}{N}\sum_{i=1}^{N}\max\!\bigl(0,\; \|f(a_i)-f(p_i)\|_2^2 - \|f(a_i)-f(n_i)\|_2^2 + \alpha\bigr)$$
$a$ = anchor, $p$ = positive (same class), $n$ = negative (different class), $\alpha$ = margin

Triplet loss learns an embedding space where same-class samples are pulled together and different-class samples are pushed apart by at least margin $\alpha$. Used in face recognition (FaceNet), image retrieval, and person re-identification. Efficient mining of hard triplets (hard negatives close to the anchor) is critical for convergence. PyTorch: nn.TripletMarginLoss(margin=1.0).
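A minimal usage sketch (the embeddings are random placeholders; note that nn.TripletMarginLoss uses the unsquared p-norm distance, whereas the FaceNet formulation above squares it):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0, p=2)

# Embeddings for anchor / positive / negative (batch of 4, dim 128)
anchor = torch.randn(4, 128)
positive = anchor + 0.05 * torch.randn(4, 128)   # close to the anchor
negative = torch.randn(4, 128)                   # unrelated

loss = triplet(anchor, positive, negative)
print(loss.item())  # near zero once d(a, p) + margin < d(a, n)
```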

Contrastive Loss (SimCLR / Self-Supervised)

$$\mathcal{L}_{i,j} = -\log\frac{\exp(\text{sim}(z_i,z_j)/\tau)}{\sum_{k=1}^{2N}\mathbf{1}_{[k\neq i]}\exp(\text{sim}(z_i,z_k)/\tau)}$$
NT-Xent (Normalised Temperature-scaled Cross-Entropy) loss used in SimCLR (Chen et al., Google Brain 2020)

NT-Xent pulls together augmented views of the same image (positive pairs) while pushing apart all other images in the batch (negatives). This self-supervised contrastive learning approach enabled SimCLR to achieve supervised-competitive performance on ImageNet without labels. The temperature $\tau$ controls how sharply the loss focuses on hard negatives (lower $\tau$ = harder).

InfoNCE Loss (used in CLIP, DALL-E)

$$\mathcal{L}_{InfoNCE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(q_i \cdot k_i / \tau)}{\sum_{j=1}^{N}\exp(q_i \cdot k_j / \tau)}$$

InfoNCE (van den Oord et al., DeepMind 2018) maximises mutual information between query $q$ and its matching key $k$ (e.g., image and caption in CLIP). It is the foundational loss for multi-modal contrastive learning. CLIP (OpenAI) trains bidirectional InfoNCE loss on 400M image-text pairs, creating the powerful cross-modal embeddings used in DALL-E, Stable Diffusion, and countless downstream tasks.
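The batch structure makes InfoNCE compact to implement: the similarity matrix's diagonal holds the positives, so the loss reduces to cross-entropy against the identity labelling. A sketch (the symmetric two-direction average mirrors CLIP's recipe; the random embeddings are placeholders):

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, tau=0.07):
    """Symmetric InfoNCE: matched (q_i, k_i) pairs are positives,
    every other pairing in the batch serves as a negative."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau           # (N, N) cosine-similarity matrix
    labels = torch.arange(q.size(0))   # positives sit on the diagonal
    # CLIP-style: average the query->key and key->query directions
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

queries = torch.randn(16, 64)
keys = torch.randn(16, 64)
print(info_nce(queries, keys))
```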

Full Loss Function Comparison Table

| Loss Function | Task | Outlier Sensitivity | Gradient at 0 | Class Imbalance | PyTorch Class | When to Use |
| --- | --- | --- | --- | --- | --- | --- |
| MSE (L2) | Regression | High | 0 (smooth) | N/A | nn.MSELoss() | Default regression; clean data; smooth gradient needed |
| MAE (L1) | Regression | Low | Undefined (subgradient ±1) | N/A | nn.L1Loss() | Noisy/outlier-heavy data; median prediction desired |
| Huber / Smooth L1 | Regression | Low-Moderate | 0 (smooth) | N/A | nn.HuberLoss() | Best of both worlds; object detection bbox regression |
| Log-Cosh | Regression | Low | 0 (twice smooth) | N/A | Custom | When twice-differentiability required (2nd order optim.) |
| BCE | Binary classif. | N/A | - | Handles via pos_weight | nn.BCEWithLogitsLoss() | Binary classification; sigmoid output |
| Categorical CE | Multi-class | N/A | - | Handles via weight param | nn.CrossEntropyLoss() | Default multi-class; always pass raw logits |
| CE + Label Smoothing | Multi-class | N/A | - | Moderate | nn.CrossEntropyLoss(label_smoothing=0.1) | Preventing overconfidence; transformers; calibration |
| Focal Loss | Binary/multi-class | N/A | - | Excellent | torchvision.ops.sigmoid_focal_loss() | Severe class imbalance; dense object detection |
| Hinge Loss | Binary classif. | N/A | - | Poor | nn.HingeEmbeddingLoss() | SVM-style margin maximisation; structured prediction |
| KL Divergence | Distribution match | N/A | - | N/A | nn.KLDivLoss() | VAE latent regularisation; knowledge distillation |
| Triplet Loss | Metric learning | N/A | - | N/A | nn.TripletMarginLoss() | Face recognition; image retrieval; re-ID |
| NT-Xent / InfoNCE | Self-supervised | N/A | - | N/A | Custom | Contrastive self-supervised learning; CLIP-style multi-modal |

PyTorch Implementation Examples

1. Regression with MSE and Huber Loss

import torch
import torch.nn as nn

# Sample regression data
targets = torch.tensor([1.0, 2.0, 3.0, 100.0])   # note the outlier 100.0
predictions = torch.tensor([1.2, 1.8, 3.1, 4.0])

mse_loss = nn.MSELoss()
mae_loss = nn.L1Loss()
huber_loss = nn.HuberLoss(delta=1.0)

print(f"MSE: {mse_loss(predictions, targets):.4f}")      # ~2304 (outlier dominates)
print(f"MAE: {mae_loss(predictions, targets):.4f}")      # ~24.1 (robust to outlier)
print(f"Huber: {huber_loss(predictions, targets):.4f}")  # ~23.9 (linear for the large error)

2. Binary Classification with BCEWithLogitsLoss and class weighting

import torch
import torch.nn as nn

# Imbalanced dataset: 90% class 0, 10% class 1 -> pos_weight = 9
pos_weight = torch.tensor([9.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Raw logits (pre-sigmoid) and binary labels
logits = torch.tensor([0.5, -1.2, 2.1, -0.3])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss = criterion(logits, targets)
print(f"Weighted BCE loss: {loss.item():.4f}")

3. Multi-class Classification with CrossEntropyLoss and Label Smoothing

import torch
import torch.nn as nn

# 2 samples, 3 classes; raw logits (no softmax applied in model)
logits = torch.tensor([[0.1, 0.2, 0.7], [0.3, 0.5, 0.2]])
targets = torch.tensor([2, 1])  # class indices

# Standard CE
ce = nn.CrossEntropyLoss()
# CE with 10% label smoothing
ce_ls = nn.CrossEntropyLoss(label_smoothing=0.1)
# CE with class weights (e.g. class 2 is rare)
weights = torch.tensor([1.0, 1.0, 5.0])
ce_w = nn.CrossEntropyLoss(weight=weights)

print(f"CE: {ce(logits, targets):.4f}")                 # ~0.854
print(f"CE + smoothing: {ce_ls(logits, targets):.4f}")  # ~0.881 (mass shifted off the true class)
print(f"CE + weighting: {ce_w(logits, targets):.4f}")   # ~0.797 (weighted mean; the class-2 sample dominates)

4. Complete Training Loop (MNIST classification)

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)  # raw logits, no softmax
        )

    def forward(self, x):
        return self.net(x.view(-1, 784))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleNN().to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

train_loader = DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128, shuffle=True
)

for epoch in range(10):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={running_loss/len(train_loader):.4f}")

Custom Loss Functions in PyTorch

Custom loss functions in PyTorch are implemented by subclassing nn.Module or simply as plain Python functions using PyTorch tensor operations (which automatically build the autograd graph). The key constraint is that all operations must use PyTorch tensor ops to ensure gradients flow correctly.

Pattern 1: Functional custom loss (recommended for simple cases)

import math
import torch
import torch.nn.functional as F

def log_cosh_loss(predictions: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Log-Cosh loss: smooth, twice-differentiable, outlier-robust.
    Uses the identity log(cosh(e)) = |e| + softplus(-2|e|) - log(2),
    which avoids overflow of cosh for large errors."""
    error = (predictions - targets).abs()
    return torch.mean(error + F.softplus(-2.0 * error) - math.log(2.0))

def focal_loss_manual(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for binary classification."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    pt = torch.exp(-bce)  # model's probability for the true class
    focal = alpha * (1 - pt) ** gamma * bce
    return focal.mean()

Pattern 2: nn.Module subclass (recommended for stateful or parameterised losses)

import torch
import torch.nn as nn

class WeightedCombinedLoss(nn.Module):
    """Combine MSE and MAE with learnable or fixed weighting."""
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.mse = nn.MSELoss()
        self.mae = nn.L1Loss()

    def forward(self, pred, target):
        return self.alpha * self.mse(pred, target) + (1 - self.alpha) * self.mae(pred, target)

class DiceLoss(nn.Module):
    """Dice loss for binary segmentation (class imbalance robust)."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, pred, target):
        pred = torch.sigmoid(pred).view(-1)
        target = target.view(-1)
        intersection = (pred * target).sum()
        dice = (2 * intersection + self.smooth) / (pred.sum() + target.sum() + self.smooth)
        return 1 - dice

Gradient checking your custom loss: Always verify your custom loss produces correct gradients using torch.autograd.gradcheck(your_loss_fn, inputs) with double-precision inputs. This numerically estimates gradients via finite differences and compares to your autograd-computed gradients, catching implementation errors before they silently degrade training.
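For example, the Log-Cosh loss can be checked like this (gradcheck requires double precision and raises, or returns False, when the analytic and numeric gradients disagree):

```python
import torch

def log_cosh_loss(pred, target):
    """Simple Log-Cosh loss used as the function under test."""
    return torch.mean(torch.log(torch.cosh(pred - target)))

# gradcheck needs double-precision inputs; gradients are checked for
# every input tensor that has requires_grad=True
pred = torch.randn(5, dtype=torch.double, requires_grad=True)
target = torch.randn(5, dtype=torch.double)

ok = torch.autograd.gradcheck(log_cosh_loss, (pred, target))
print(ok)  # True: autograd gradients match the finite-difference estimate
```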

How to Choose the Right Loss Function

Regression, clean data

MSE (nn.MSELoss). Smooth gradients everywhere. Penalises large errors quadratically, encouraging tight predictions.

Regression, noisy or outlier-heavy data

Huber Loss (nn.HuberLoss) or MAE (nn.L1Loss). Huber is usually preferred for being smooth near the optimum while remaining linear for large residuals.

Binary classification, balanced classes

BCEWithLogitsLoss. Always use the logits version for numerical stability. Never apply sigmoid before passing to this loss.

Binary classification, severe imbalance

BCEWithLogitsLoss with pos_weight parameter, or Focal Loss (torchvision.ops). Focal Loss is preferred when positive class frequency is below 10%.

Multi-class classification (mutually exclusive)

CrossEntropyLoss. Pass raw logits. Add label_smoothing=0.1 if the model is overconfident or if you are fine-tuning a large pre-trained model.

Multi-label classification (multiple classes per sample)

BCEWithLogitsLoss applied per class independently. Not CrossEntropyLoss (which assumes mutual exclusivity via softmax).
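A multi-label sketch with multi-hot targets (the logit and target values are illustrative):

```python
import torch
import torch.nn as nn

# 3 samples, 4 labels; each sample can have any number of active labels
logits = torch.tensor([[ 2.0, -1.0,  0.5, -2.0],
                       [-0.5,  1.5,  1.0, -1.0],
                       [ 0.0,  0.0, -3.0,  2.5]])
targets = torch.tensor([[1., 0., 1., 0.],   # multi-hot targets, not class indices
                        [0., 1., 1., 0.],
                        [0., 0., 0., 1.]])

criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per label
loss = criterion(logits, targets)
print(loss.item())

# Inference: threshold each sigmoid independently
preds = (torch.sigmoid(logits) > 0.5).float()
```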

Image segmentation

Dice Loss + BCE combined (often 0.5*Dice + 0.5*BCE). Dice Loss is inherently insensitive to class imbalance, critical for medical imaging where foreground (organ/lesion) is rare.
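The combination can be sketched as a single module (the 0.5/0.5 weighting follows the recipe above; the toy masks are placeholders):

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """0.5 * Dice + 0.5 * BCE for binary segmentation."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        prob = torch.sigmoid(logits).view(-1)
        tgt = target.view(-1)
        intersection = (prob * tgt).sum()
        dice = (2 * intersection + self.smooth) / (prob.sum() + tgt.sum() + self.smooth)
        return 0.5 * (1 - dice) + 0.5 * bce

# Toy segmentation masks: (batch, 1, H, W) with rare foreground
logits = torch.randn(2, 1, 8, 8)
masks = (torch.rand(2, 1, 8, 8) > 0.9).float()
print(DiceBCELoss()(logits, masks).item())
```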

Self-supervised or contrastive learning

NT-Xent / InfoNCE. Use large batch sizes (256 to 4096) for sufficient negatives per anchor. Temperature $\tau = 0.07$ is a common starting point.

Face recognition or image retrieval

Triplet Margin Loss with hard negative mining. ArcFace loss (angular margin softmax) is the current state of the art for face recognition as of 2025.

Knowledge distillation

KL Divergence between teacher soft labels (with temperature T) and student output. Typical recipe: 0.9 * KLDiv(soft) + 0.1 * CE(hard).

Frequently Asked Questions

1. What is the primary role of a loss function in neural network training?

The loss function quantifies the error between the model's predictions and the ground truth targets, producing a scalar that the optimiser minimises via backpropagation. Every weight update in gradient descent is computed as the gradient of the loss with respect to that weight. The choice of loss function defines what the model is fundamentally optimising for.

2. Why is MSE sensitive to outliers?

MSE squares the residual $(y - \hat{y})^2$, so a prediction that is 10 units off contributes 100x more loss than one that is 1 unit off (not 10x). In a training batch containing a single extreme outlier, the squared error from that sample can dominate the entire gradient update, pulling all weights toward reducing that one error at the expense of correctly predicting all other samples.

3. When should I use cross-entropy loss vs MSE for classification?

Always use cross-entropy for classification. MSE applied to classification outputs (probabilities or one-hot labels) creates a non-convex loss surface with vanishing gradients for confidently wrong predictions. Cross-entropy with sigmoid/softmax activation is a proper scoring rule that penalises confident wrong predictions asymptotically ($-\log(p)$ grows without bound as $p$ approaches 0), creating strong gradient signal throughout training.

4. What is the difference between BCE and categorical cross-entropy?

BCE (Binary Cross-Entropy) is for single-output binary classification where the model outputs one sigmoid-activated probability. Categorical cross-entropy is for multi-class problems with C mutually exclusive classes where the model outputs C logits (softmax over the class dimension). For multi-label classification (where multiple classes can be true simultaneously), use BCE per class independently.

5. Why use BCEWithLogitsLoss instead of BCELoss in PyTorch?

BCEWithLogitsLoss accepts raw logits (pre-sigmoid values) and internally computes sigmoid + binary cross-entropy using the log-sum-exp trick for numerical stability. If you manually apply sigmoid then pass to BCELoss, you risk overflow/underflow for very large or very small logit values, producing NaN losses. BCEWithLogitsLoss avoids this entirely.

6. What is Focal Loss and when should I use it?

Focal Loss (Lin et al., Facebook AI Research 2017) adds a modulating factor $(1-p_t)^\gamma$ to the standard cross-entropy that reduces the loss contribution of well-classified easy examples. It was designed for dense object detection (RetinaNet) where there are ~100,000 background proposals (easy negatives) versus a handful of objects. Use Focal Loss when your positive class frequency is below 5 to 10% and standard weighted BCE is not sufficient.

7. Can I create custom loss functions in PyTorch?

Yes. Implement them as plain Python functions using PyTorch tensor operations (which automatically build the autograd computational graph), or as nn.Module subclasses for stateful or parameterised losses. Verify gradients with torch.autograd.gradcheck() using double-precision inputs.

8. What is the role of the loss function in backpropagation?

Backpropagation computes the gradient of the loss scalar with respect to every parameter by applying the chain rule backwards through the network. The loss function is the root of this chain: its gradient with respect to the final layer output initiates the backward pass. Without a differentiable loss, there is no gradient signal and no learning.

9. Is MAE always better than MSE for noisy regression data?

Not always. MAE is more robust to outliers but its gradient is a constant $\pm 1/N$ everywhere, which causes oscillation near the optimum (the gradient doesn't decrease as the model improves). MSE has a gradient proportional to the error magnitude, providing strong updates early in training and fine-grained updates near convergence. Huber Loss combines the advantages of both and is usually the better practical choice for noisy regression.

10. What is Label Smoothing and when should I use it?

Label smoothing replaces hard one-hot targets with soft targets that assign small probability mass $\varepsilon/(C-1)$ to each incorrect class. It prevents the model from becoming overconfident (pushing logits to $\pm\infty$), acts as a regulariser, and significantly improves model calibration. Use it when fine-tuning large transformer models (ViT, BERT, GPT), when your model is consistently overconfident on the training set, or when calibration quality matters.

11. What is KL Divergence loss used for in neural networks?

KL Divergence measures distribution mismatch. In Variational Autoencoders (VAE), the KL term in the ELBO regularises the encoder's latent distribution toward the prior (typically standard Gaussian). In knowledge distillation, KL between teacher soft labels (at temperature T) and student output transfers the teacher's class similarity structure to the student. In RL (PPO), KL divergence constraints prevent policy updates that are too large.

12. How does Triplet Loss work for face recognition?

Triplet Loss trains on anchor-positive-negative triplets: (anchor face, same-person face, different-person face). It minimises distance between anchor and positive while maximising distance to the negative by at least margin $\alpha$. The critical implementation detail is hard negative mining: naively random triplets are often trivially satisfied (negative already far from anchor). Hard negative mining selects negatives closer to the anchor, providing more informative gradients. ArcFace (Deng et al. 2019) improved further by using an angular margin in the softmax space.

13. What is InfoNCE loss and how is it used in CLIP?

InfoNCE (Noise Contrastive Estimation) maximises mutual information between query and matching key by treating all other keys in the batch as negatives. In CLIP (OpenAI), image embeddings and their paired text embeddings are the positive pairs; all other image-text combinations in the batch are negatives. Training on 400M image-text pairs with bidirectional InfoNCE creates powerful cross-modal embeddings that enable zero-shot image classification.

14. How do I handle class imbalance in loss functions?

Options in increasing order of sophistication: (1) class_weight parameter in CrossEntropyLoss or pos_weight in BCEWithLogitsLoss (inverse frequency weighting); (2) Focal Loss (down-weights easy examples, focuses on hard misclassified ones); (3) oversampling the minority class (SMOTE, ROS) before training; (4) Dice Loss for segmentation (inherently imbalance-robust). Monitor both training loss and class-specific metrics (precision, recall per class) to assess effectiveness.

15. Why monitor validation loss during training?

Validation loss on a held-out dataset estimates generalisation performance. If training loss decreases while validation loss plateaus or increases, the model is overfitting (memorising training data rather than learning general patterns). This is the primary signal for early stopping. Compare also training and validation accuracy or the target metric; if both training loss and metric are improving but validation metric is not, check for data leakage or distribution mismatch.
