This library implements several advanced activation functions used in modern neural networks. These functions are especially common in transformer architectures, where they often outperform traditional activations such as ReLU and sigmoid.
The activation functions in this library are implemented with a focus on:
- Numerical stability
- Performance optimization
- Memory efficiency
- Clear mathematical foundations
Swish is an activation function introduced by Google Brain that often outperforms ReLU. Because it is smooth, it yields well-behaved gradients and better information flow through deep networks.
swish(x) = x * sigmoid(β * x)
Where β is typically set to 1.0; with β = 1, Swish is identical to the SiLU (sigmoid-weighted linear unit).
Key properties:
- Non-monotonic function
- Smooth gradient
- Bounded below but unbounded above
- Can adapt the degree of non-linearity when β is treated as a learnable parameter
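As a concrete illustration of the formula above, here is a minimal NumPy sketch (not this library's actual implementation); it uses SciPy's expit for a numerically stable sigmoid:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Swish: x * sigmoid(beta * x). With beta = 1 this coincides with SiLU."""
    return x * expit(beta * x)

# Swish is non-monotonic: it dips slightly below zero for moderate negative inputs
x = np.linspace(-5.0, 5.0, 11)
print(swish(x))
```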
GLU was introduced in "Language Modeling with Gated Convolutional Networks" and is used in many transformer architectures. It provides a multiplicative interaction between the input and a gating mechanism.
GLU(x, W, V, b, c) = (x·W + b) ⊗ σ(x·V + c)
Where:
- ⊗ is element-wise multiplication
- σ is the sigmoid function
- W, V are weight matrices
- b, c are bias vectors
Key properties:
- Provides controlled information flow
- Helps mitigate the vanishing gradient problem
- Effective for sequence modeling tasks
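A minimal NumPy sketch of the definition above; the names glu, W, V, b, and c mirror the equation and are purely illustrative, not this library's API:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def glu(x: np.ndarray, W: np.ndarray, V: np.ndarray,
        b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """GLU: (x·W + b) gated element-wise by sigmoid(x·V + c)."""
    return (x @ W + b) * expit(x @ V + c)

# Example: a batch of 4 vectors of dimension 8 projected and gated to dimension 16
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
b, c = np.zeros(16), np.zeros(16)
print(glu(x, W, V, b, c).shape)  # (4, 16)
```

One projection carries the content while the other produces data-dependent gates, which is what gives GLU its controlled information flow.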
SwiGLU is a variant of GLU that replaces the sigmoid gate with the Swish activation. This combination has shown improved performance in transformer language models.
SwiGLU(x, W, V, b, c) = (x·W + b) ⊗ swish(x·V + c)
Key properties:
- Combines benefits of Swish and gating mechanisms
- Better gradient flow than standard GLU
- Particularly effective in deep transformer networks
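Following the same conventions as the GLU example, here is an illustrative sketch of SwiGLU (again an assumption-laden example, not this library's API; the swish helper matches the earlier Swish sketch):

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def swish(z: np.ndarray, beta: float = 1.0) -> np.ndarray:
    return z * expit(beta * z)

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray,
           b: np.ndarray, c: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """SwiGLU: (x·W + b) gated element-wise by swish(x·V + c)."""
    return (x @ W + b) * swish(x @ V + c, beta)
```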
In our implementation, these activation functions are:
- Vectorized for performance
- Numerically stable through careful handling of edge cases (see the sketch after this list)
- Memory efficient with in-place operations where possible
- Thoroughly tested with comprehensive unit tests
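To illustrate the numerical-stability point, here is one common way to compute a sigmoid without overflow for large-magnitude inputs. This is a sketch of the general technique, not necessarily how this library handles edge cases internally:

```python
import numpy as np

def stable_sigmoid(z: np.ndarray) -> np.ndarray:
    """Sigmoid computed so that exp() only ever sees non-positive arguments."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe: -z[pos] <= 0
    exp_neg = np.exp(z[~pos])                  # safe: z[~pos] < 0
    out[~pos] = exp_neg / (1.0 + exp_neg)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ] with no overflow
```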