KAN: Is this the end of feedforward networks?

blog preview

Multi-Layer Perceptrons (MLPs), also known as fully-connected feedforward networks, are the backbone of the most powerful AI models available today. They are used from machine learning tasks (like feature extraction, regression, classification) up to many advanced deep learning models such as Convolutional Neural Networks (CNNs) and Transformers.

But are MLPs truly the ultimate solution? A recent paper has shed light on a rediscovered type of network: Kolmogorov-Arnold Networks (KANs). KANs use spline functions instead of static weights to achieve their incredible accuracy and interpretability at very small network sizes. But could these networks represent the next big leap in AI, potentially replacing MLPs?

In this blog post, we will explore the differences between these two model architectures and highlight the advantages and potentials of KANs.

Difference between MLP and KAN

MLP vs. KANMLP vs. KAN

Multi-Layer Perceptrons (MLPs)

MLPs, or Multi-Layer Perceptrons, are a sophisticated type of artificial neural network. These networks are designed with an input layer, multiple hidden layers, and an output layer. Each layer is comprised of nodes, typically referred to as 'neurons', which are interconnected with nodes in the layers directly before and after them.

The operational flow in an MLP begins at the input layer and progresses through the hidden layers to reach the output layer. Each node performs a series of operations: the input values are first multiplied by learned weights (predetermined from the model learning), these products are then summed, and the result is passed through an fixed activation function. This activation function is critical as it introduces non-linearity into the system, a necessary feature for dealing with complex, non-linear problems.

Through these mechanisms, MLPs are capable of modeling intricate relationships between input features and outputs. This ability is grounded in the Universal Approximation Theorem, which posits that a neural network with at least one hidden layer can approximate any continuous function, given sufficient neurons and the right set of weights. This theorem underpins the versatility and power of MLPs, making them particularly useful in fields like image recognition, speech processing, and predictive analytics, where they can identify patterns and relationships in data that are not immediately obvious.

Kolmogorov-Arnold Networks (KANs)

Unlike MLPs, which are based on the "Universal Approximation Theorem," KANs are founded on the "Kolmogorov-Arnold Representation Theorem". The Kolmogorov-Arnold Theorem, also known as the Superposition Theorem, was developed by Andrey Kolmogorov and Vladimir Arnold. Kolmogorov's work on representing continuous functions by superpositions of functions with fewer variables was published in 1956, while Arnold's work on functions of three variables, closely related to the theorem, was published in 1957. These works established that any continuous multivariate function can be represented as a superposition of continuous univariate functions and addition.

Similar to MLPs, KANs also consist of an input layer, an output layer, and one or more hidden layers in between. Unlike MLPs, the 'nodes' in KANs only perform a summation of the incoming 'signals' without a fixed activation function, and the 'edges' use univariate functions instead of static weights. These univariate functions are learned during the training process, much like the weights in MLPs. In the experiments presented in the paper, these functions are B-splines. Comparable to learning the weights in MLPs, the "knots" of the splines in KANs are learned. B-splines have significant advantages for this application in KANs.

Advantages of B-splines:

  • B-splines, or splines in general, face challenges in approximating high-dimensional functions, known as the curse of dimensionality (COD) problem, due to their inability to exploit compositional structures. However, for univariate functions, as required here, splines can provide very accurate results.
  • They allow easy adjustment of the resolution of functions, and thereby the overall accuracy of the model, by increasing or decreasing the grid size.
  • Additionally, the local independence of the knots in B-splines positively impacts the learning process, particularly in continuous learning.

To accurately learn a function, which is the fundamental goal of any model, a model must capture both the compositional structure (external degrees of freedom) and approximate the univariate functions well (internal degrees of freedom). KANs achieve this by combining the strengths of both MLPs and splines. They incorporate an MLP-like structure on the outside and splines on the inside. This allows KANs not only to learn features (due to their external similarity to MLPs) but also to optimize these learned features with high accuracy (thanks to their internal similarity to splines).

KANs offer several significant advantages:

  • Accuracy: Due to the use of different learned functions at the edges, KANs can model the influences of individual features very well and achieve high accuracy even with small networks.

  • Efficiency: Although the KANs in the experiments described in the paper required significantly longer training times compared to MLPs with the same number of parameters, KANs are considered much more efficient. This is because their high accuracy allows for a disproportionately reduction in network size and number of parameters to achieve the same accuracy as an MLP. For example, the authors statet that "for PDE solving, a 2-Layer width-10 KAN is 100 times more accurate than a 4-Layer width-100 MLP (10^−7 vs 10^−5 MSE) and 100 times more parameter efficient (10^2 vs 10^4 parameters)."

  • Interpretability: The use of functions at the edges makes the influences of individual features clear and easily interpretable. In MLPs, these relationships can only be represented over large parts of the network and through numerous nodes and edges, making interpretability much poorer. The straightforward interpretability of KANs means they could also be useful in scientific fields like physics and mathematics.

  • Continuous Learning: Due to the local independence of the B-spline functions used, these networks can apparently support "Continuous Learning", where training different parts of the network does not lead to "catastrophic forgetting" in the already learned sections, unlike in MLPs.

Despite their apparent innovation, basing artificial neural networks on the Kolmogorov-Arnold representation theorem is not entirely new. Some studies in the past have used this approach but stuck with the original depth-2 width-(2n + 1) representation and did not leverage more modern techniques, such as back propagation, to train the networks. The groundbreaking aspect of the recently presented paper is the generalization of this concept to arbitrary network sizes and the use of modern techniques like back propagation.

Results

The paper presented various tests demonstrating the potential and influencing factors of these networks. Tests were conducted with different grid sizes for the spline functions, and a variety of combined and specialized mathematical functions were attempted to be learned. Special attention was given to the internal representation in the networks and their interpretability. The continuous learning capability was also demonstrated, along with a special pruning technique for the network and an alternative to symbolic regression. Additionally, the scaling laws of these networks were illustrated, and examples were provided showing how these networks can be used in supervised and unsupervised learning scenarios.

However, covering all these findings is beyond the scope of this article. Instead, we will focus on two specific points: grid extension and interpretability.

Grid extension

In principle, a spline can achieve high accuracy by making the grid finer, a feature that KANs inherit. Unlike MLPs, which lack the concept of "fine-graining," KANs can improve accuracy by refining their spline grids without needing to retrain the model from scratch. Although increasing the width and depth of MLPs can enhance performance, this process is slow and costly. With KANs, one can start with fewer parameters and later extend to more parameters by simply refining the spline grids.

Grid extension of univariate B-spline functionsGrid extension of univariate B-spline functions

Using the mathematical function f(x, y) = exp(sin(πx) + y^2) as toy example, the research team illustrate the effect of grid extension on KAN performance and therefore on the training and test RMSE of the model.

Effect of extending the spline grid on training and test RMSEEffect of extending the spline grid on training and test RMSE

Starting with a small number of grid points and increasing in steps up to 1000 points, it shows that training loss consistently and immediately decreases as the grid is extended. Except at the highest point, due to poor optimization, the RMSE increases again. This U-shape of the loss function highlights the bias-variance tradeoff (underfitting vs. overfitting). Optimal test loss occurs when the number of parameters matches the number of data points, aligning with the expected interpolation threshold. For the given example, this means the following: Given that there are 1000 training samples and the total parameters of a [2, 5, 1] KAN amount to 15G (where G represents the number of grid intervals), the interpolation threshold is expected to be G = 1000/15 ≈ 67. This approximately matches the experimentally observed value of G ∼ 50.

With knowledge of the function being learned, the ideal solution would be a [2, 1, 1] KAN instead of the chosen [2, 5, 1] KAN. This means a smaller network with fewer nodes which would require less computation and training effort would also achieve even better results, as shown in the given graph with training and test RMSE.

NOTE: The notation '[2, 5, 1]' used in the paper describes the structure of the KANs, indicating the layers of the KANs and the number of nodes in each layer. In this example, there are 2 nodes in the input layer, 5 nodes in the (single) hidden layer, and 1 node in the output layer. An illustration of this network can be found in the next section, see Figure "Symbolic Regression with KAN".

To prevent such unnecessary and problematic oversizing of the network in practice, the paper also introduced a technique to adjust/reduce the KAN structure according to the given requirements.

Simplifying KANs and making them interactive

As already described, reducing the nodes of a KAN and thus approximating the actual/ideal structure for a given problem can lead to both a reduction in training and computational efforts and an improvement in the model's accuracy. Since this "ideal" structure is not known in practice, an automated process is necessary to decide which nodes can or should be removed.

The paper introduces a pruning technique to adjust the KAN structure dynamically. This process involves evaluating the importance of each node and removing those that contribute least to the network's performance. The pruning algorithm helps in systematically reducing the network size while maintaining or even enhancing accuracy. This method ensures that the network remains efficient and well-suited to the specific problem at hand, optimizing both resource usage and model performance.

Symbolic Regression with KANSymbolic Regression with KAN

To perform this sparsification, a sparsification penalty is introduced during training. This penalty considers the average magnitude of an activation function over all inputs, the sum of these values for a given layer, and the layer's entropy. After this special training, the network is pruned at the node level by removing nodes with low incoming and outgoing scores.

Given that the relationships of the individual features in the model are directly evident and highly interpretable through the spline functions, the researchers went a step further. They visualized the network with its splines and presented it as an interactive graph. This approach allowed the learned functions to be replaced with symbolic functions (with some automatic adjustments for shifts and scalings). As a result, KANs provide an excellent alternative to traditional methods of symbolic regression and can be used in various mathematical and physical fields to help identify internal structures.

Conclusion

Despite their sophisticated mathematical foundation, KANs are essentially combinations of splines and MLPs, leveraging their respective strengths while avoiding their respective weaknesses.

Will KANs totally replace MLPs? Probably not anytime soon, especially given the significantly faster training times of MLPs. However, we are likely to see substantial improvements and an increasing number of applications for KANs.

The authors of the paper are very confident about the possibilities and potential of KANs, as demonstrated by their decision tree "Should I use KANs or MLPs?" They see the only drawback in the slow training process and also note, "we did not try hard to optimize KANs’ efficiency though, so we deem KANs’ slow training more as an engineering problem to be improved in the future rather than a fundamental limitation."

KAN vs. MLP decision tree provided by paper authorsKAN vs. MLP decision tree provided by paper authors

But also in this area of KAN training, there are already advances. Just a few weeks after the papers publication the freely available GitHub repository "efficient-kan" popped up which provides an improved implementation.

Only time will tell if KANs will ultimately replace MLPs as the authors predict. One thing is certain: we are eager to witness the progress and innovations that will unfold in this exciting field in the near future.

Further Reading

More information on our managed RAG solution?
To Pondhouse AI
More tips and tricks on how to work with AI?
To our Blog