Advanced Activation Functions

This library implements several advanced activation functions used in modern neural networks. These functions are particularly important in transformer architectures, where smooth and gated activations often outperform traditional choices such as ReLU.

Overview

The activation functions in this library are implemented with a focus on:

  • Numerical stability
  • Performance optimization
  • Memory efficiency
  • Clear mathematical foundations

Function Implementations

Swish

Swish is an activation function introduced by researchers at Google Brain that often outperforms ReLU. Its smoothness yields better-behaved gradients and improved information flow through deep networks.

swish(x) = x * sigmoid(β * x)

Where β is typically set to 1.0; with β = 1 the function coincides with the SiLU (sigmoid-weighted linear unit).

Key properties:

  • Non-monotonic function
  • Smooth gradient
  • Bounded below but unbounded above
  • Learns the degree of non-linearity needed when β is treated as a trainable parameter
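
For reference, here is a minimal NumPy sketch of the formula above. It is illustrative only and does not reflect this library's actual API; the function name, signature, and use of NumPy are assumptions.

```python
import numpy as np

def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    # swish(x) = x * sigmoid(beta * x); beta defaults to 1.0 as noted above
    return x * (1.0 / (1.0 + np.exp(-beta * x)))
```

Note that for large negative inputs the naive np.exp call can overflow and emit warnings; a more careful formulation of the sigmoid is sketched under Implementation Details below.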

GLU (Gated Linear Unit)

GLU was introduced in "Language Modeling with Gated Convolutional Networks" and is used in many transformer architectures. It provides a multiplicative interaction between the input and a gating mechanism.

GLU(x, W, V, b, c) = (x·W + b) ⊗ σ(x·V + c)

Where:

  • ⊗ is element-wise multiplication
  • σ is the sigmoid function
  • W, V are weight matrices
  • b, c are bias vectors

Key properties:

  • Provides controlled information flow
  • Helps mitigate vanishing gradient problem
  • Effective for sequence modeling tasks
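
To make the shapes concrete, a minimal NumPy sketch of the GLU formula might look as follows, assuming x has shape (batch, d_in), W and V have shape (d_in, d_out), and b and c have shape (d_out,). This is an illustration under those assumptions, not the library's actual implementation.

```python
import numpy as np

def glu(x: np.ndarray, W: np.ndarray, V: np.ndarray,
        b: np.ndarray, c: np.ndarray) -> np.ndarray:
    # GLU(x, W, V, b, c) = (x·W + b) ⊗ sigmoid(x·V + c)
    value = x @ W + b                          # linear "value" path
    gate = 1.0 / (1.0 + np.exp(-(x @ V + c)))  # sigmoid "gate" path in (0, 1)
    return value * gate                        # element-wise product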

SwiGLU (Swish Gated Linear Unit)

SwiGLU is a variant of GLU, introduced in "GLU Variants Improve Transformer", that uses the Swish activation function in place of the sigmoid gate. This combination has shown improved performance in transformer feed-forward layers.

SwiGLU(x, W, V, b, c) = (x·W + b) ⊗ swish(x·V + c)

Key properties:

  • Combines benefits of Swish and gating mechanisms
  • Better gradient flow than standard GLU
  • Particularly effective in deep transformer networks
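
Continuing the same illustrative setup as the GLU sketch above (same hypothetical shapes for x, W, V, b, and c), SwiGLU only changes the gate's activation from sigmoid to Swish:

```python
import numpy as np

def swiglu(x, W, V, b, c, beta=1.0):
    # SwiGLU(x, W, V, b, c) = (x·W + b) ⊗ swish(x·V + c)
    gate_pre = x @ V + c
    gate = gate_pre * (1.0 / (1.0 + np.exp(-beta * gate_pre)))  # Swish-activated gate
    return (x @ W + b) * gate
```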

Implementation Details

In our implementation, these activation functions are:

  1. Vectorized for performance
  2. Numerically stable through careful handling of edge cases
  3. Memory efficient with in-place operations where possible
  4. Thoroughly tested with comprehensive unit tests
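
As an example of the kind of edge-case handling point 2 refers to: the sigmoid used by Swish and GLU can overflow when exp is evaluated on large positive arguments (which happens for large negative inputs in the naive 1 / (1 + exp(-x)) form). One common remedy, shown here as a sketch rather than this library's actual code, is to split on the sign of the input so exp only ever sees non-positive values.

```python
import numpy as np

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    # Evaluate exp only on non-positive arguments to avoid overflow.
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp(-x) is safe when x >= 0
    exp_x = np.exp(x[~pos])                    # exp(x) is safe when x < 0
    out[~pos] = exp_x / (1.0 + exp_x)
    return out
```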