Neuton Neural Network Framework

Building an Efficient Self-Organizing Neural Network

The Neuton Neural Network Framework is based on a patented machine learning algorithm that forgoes error backpropagation and stochastic gradient descent. It grows the network structure automatically, neuron by neuron, and produces minimum-size models with excellent generalization capability and no loss of accuracy.

Traditional approaches to building neural networks

Neural networks created today contain more and more coefficients and neurons and require ever-increasing processing power. Networks with hundreds of thousands of parameters have long ceased to be surprising or unique. However, it is now clear that this approach cannot scale indefinitely and will soon run into hard limits on hardware capacity.

Building a neural network structure is generally a highly manual and somewhat random process. One simply has to adjust too many variables simultaneously to arrive at a model that is optimal in both size and accuracy, including, but not limited to:

Random Seed

Number of Neurons

Number of Layers

Activation Function (Sigmoid, ReLU, etc.)

Learning Rate

Number of Epochs

Cross Validation Folds

Dropout
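To give a feel for how quickly these variables multiply, the following minimal sketch counts the configurations in a small grid. The candidate values for each variable are hypothetical, chosen only for illustration; real search spaces are usually larger.

```python
import math

# Hypothetical candidate values for each variable listed above.
search_space = {
    "random_seed": [0, 1, 2],
    "neurons_per_layer": [16, 32, 64, 128],
    "layers": [1, 2, 3, 4],
    "activation": ["sigmoid", "relu", "tanh"],
    "learning_rate": [1e-1, 1e-2, 1e-3],
    "epochs": [10, 50, 100],
    "cv_folds": [3, 5],
    "dropout": [0.0, 0.2, 0.5],
}

# Each combination of values is one complete training configuration,
# so the grid size is the product of the number of options per variable.
total_configurations = math.prod(len(values) for values in search_space.values())
print(total_configurations)  # 7776
```

Even with only two to four options per variable, an exhaustive sweep would require thousands of full training runs.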

An overwhelming majority of modern neural networks combine an architecture (structure) fixed in advance by the researcher with stochastic gradient descent, in one of its minor variations, for parametric identification. Only the neuron parameters are optimized; the architecture itself stays constant throughout training. This is the main cause of the unnecessary growth in network size, which in turn raises prediction costs because a fixed network performs redundant calculations.

The scientific community worldwide is actively working on this problem. Two main approaches to reducing network size can currently be distinguished:

1. Optimizing the structures of already-trained networks

2. Automated neural architecture search (NAS)

The methods and algorithms that implement the first approach mainly come down to discarding "ineffective" neurons and connections, selected by certain criteria, from an already-trained network. The inevitable trade-off for reduced network size is a loss of accuracy. Furthermore, the large network still has to be trained first. In other words, this approach addresses the size problem only at the operational (inference) stage, not during training.
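As one concrete illustration of the first approach, the sketch below prunes connections by weight magnitude. The magnitude criterion and the function name prune_by_magnitude are assumptions made for this example; the text above does not single out a specific criterion.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out roughly `fraction` of the weights, smallest absolute values first.

    `weights` is a list of rows (one row per neuron). Ties at the threshold
    are all pruned, so slightly more than `fraction` may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Toy weights standing in for an already-trained layer.
trained = [[0.8, -0.05, 0.3], [0.01, -0.9, 0.1]]
pruned = prune_by_magnitude(trained, 0.5)
print(pruned)  # [[0.8, 0.0, 0.3], [0.0, -0.9, 0.0]]
```

Note that the full-size layer has to exist and be trained before anything can be removed, which is exactly the limitation described above.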

The second, and undoubtedly more promising, approach allows for the generation of optimized network architectures that match or exceed the performance of manually created ones. In practice, attempts to include automated structure definition in the optimization loop mostly come down to intelligent enumeration of finished architectures, parametric identification of each, and selection of the best option. Each time, the model is fully trained using the candidate architecture. Building a network this way is therefore a loop: propose a candidate architecture, train it in full, evaluate it, and keep the best candidate found so far.

Note that this is a very extensive and resource-intensive process, so in a real-life scenario the search space of architecture combinations has to be severely restricted. We are forced to perform a “highly discretized” enumeration and, as a result, end up with a non-ideal architecture. It is also worth noting that, in order to achieve a consistent outcome, the approach needs to be combined with model cross-validation, which multiplies the already-high overhead several times over.
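The cost structure of such an enumeration can be sketched as follows. The function train_and_score is a hypothetical stand-in for one full training run, and its toy scoring formula exists only to make the loop runnable; the candidate grid and fold count are likewise assumptions.

```python
from itertools import product


def train_and_score(layers, neurons, fold):
    """Hypothetical stand-in for fully training one candidate on one fold."""
    # Toy score that peaks at 2 layers x 32 neurons; not a real training run.
    return -abs(layers - 2) - abs(neurons - 32) / 32 - 0.01 * fold


def search(layer_options, neuron_options, folds):
    """Enumerate every candidate architecture and cross-validate each one."""
    trainings = 0
    best_arch, best_score = None, float("-inf")
    for layers, neurons in product(layer_options, neuron_options):
        scores = []
        for fold in range(folds):
            scores.append(train_and_score(layers, neurons, fold))
            trainings += 1  # every fold is one more full training run
        mean_score = sum(scores) / folds
        if mean_score > best_score:
            best_arch, best_score = (layers, neurons), mean_score
    return best_arch, trainings


best_arch, trainings = search([1, 2, 3, 4], [16, 32, 64, 128], folds=5)
print(best_arch, trainings)  # (2, 32) 80
```

A grid of only 16 candidates already costs 80 full training runs under 5-fold cross-validation, and the search can never return an architecture that lies between grid points, such as a layer of 24 neurons.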

Another obstacle to obtaining an efficiently sized, highly accurate model is the choice of optimization algorithm. The well-known problem of local extremes and plateaus significantly reduces the efficiency of stochastic gradient descent for this purpose. In addition, the wide range of possible hyperparameter values (learning rate, batch size, weight initialization technique) and the complex, ambiguous question of when to stop training add many unknowns to the process, increasing the cost of each step and making the process as a whole inefficient.

Let us list the main problems that arise with the use of local gradient optimization methods in modern neural network frameworks:

Getting stuck in local minima or at saddle points. Due to the complex landscape of the target function, plateau regions alternate with regions of strong non-linearity. The derivative on a plateau is almost zero, whereas a sudden drop can carry the search algorithm too far from the desired optimum.

Non-uniform parameter updates. Certain parameters are updated much less frequently than others, especially where the data contains informative but rare attributes. This degrades the network's ability to generalize from such attributes. That said, assigning too much importance to all rare attributes can lead to overfitting.

Undetermined learning rate. A learning rate that is too low makes convergence very slow and leaves the algorithm prone to getting stuck in local minima. Conversely, a very high learning rate leads to skipping of preferred minima or even to divergence.

The issue of vanishing and exploding gradients. The presence of a large number of successive layers in a neural network leads to an uncontrolled decrease or increase in the error gradient as weight correction progresses from the network output to the input. This impairs the learning efficiency of the layers located far from the output.
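Three of the problems listed above can be reproduced on one-dimensional toy functions. The sketch below is illustrative only: the test functions, starting points, step counts and the scalar "chain of layers" are all assumptions chosen to keep the example small.

```python
import math


def gradient_descent(grad, x, lr, steps):
    """Plain gradient descent on a scalar function given its derivative."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# 1. Local minima: f(x) = x^4 - 3x^2 + x has two minima, and the starting
#    point alone decides which one gradient descent ends up in.
def f(x):
    return x**4 - 3 * x**2 + x


def f_grad(x):
    return 4 * x**3 - 6 * x + 1


from_left = gradient_descent(f_grad, -2.0, lr=0.01, steps=1000)  # deeper minimum
from_right = gradient_descent(f_grad, 2.0, lr=0.01, steps=1000)  # shallower minimum

# 2. Learning rate on f(x) = x^2 (derivative 2x), starting from x = 1.
def quad_grad(x):
    return 2 * x


too_low = gradient_descent(quad_grad, 1.0, lr=0.001, steps=50)  # barely moves
moderate = gradient_descent(quad_grad, 1.0, lr=0.1, steps=50)   # approaches 0
too_high = gradient_descent(quad_grad, 1.0, lr=1.1, steps=50)   # diverges


# 3. Vanishing/exploding gradients: through a chain of scalar layers the
#    gradient is a product of one factor, weight * activation'(z), per layer.
def chain_gradient(weight, act, act_deriv, depth, x=0.5):
    grad = 1.0
    for _ in range(depth):
        z = weight * x
        grad *= weight * act_deriv(z)
        x = act(z)
    return grad


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def sigmoid_deriv(z):
    return sigmoid(z) * (1 - sigmoid(z))


vanishing = chain_gradient(1.0, sigmoid, sigmoid_deriv, depth=20)
exploding = chain_gradient(1.5, lambda z: z, lambda z: 1.0, depth=20)

print(f(from_left) < f(from_right))   # True: the right-hand start is trapped
print(too_low, moderate, too_high)
print(vanishing, exploding)
```

In the third case the sigmoid derivative never exceeds 0.25, so twenty layers shrink the gradient by at least a factor of 0.25^20, while a linear chain with weight 1.5 grows it by 1.5^20.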

The major modifications of stochastic gradient descent use various heuristics in an attempt to address these challenges. The most popular are the idea of accumulating momentum when moving along the gradient and the idea of weaker weight updates for typical attributes. A whole series of algorithms has been built on these ideas: Momentum, Nesterov Accelerated Gradient, Adagrad, RMSProp, Adadelta, Adam, Adamax. However, even this many algorithms cannot guarantee a high-quality solution to all of the problems mentioned above; their sheer number shows that the scientific community is still searching intensively in this direction.
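The two heuristics can be written down in a few lines. The sketch below uses the textbook scalar forms of classical momentum and Adagrad; the hyperparameter values and the toy "frequent versus rare" gradient pattern are assumptions for illustration.

```python
import math


def momentum_step(x, velocity, grad, lr=0.1, beta=0.9):
    """Classical momentum: step along a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    return x - lr * velocity, velocity


def adagrad_step(x, grad_sq_sum, grad, lr=0.1, eps=1e-8):
    """Adagrad: scale the step down by the root of accumulated squared gradients."""
    grad_sq_sum += grad * grad
    return x - lr * grad / (math.sqrt(grad_sq_sum) + eps), grad_sq_sum


# Momentum on f(x) = x^2 (derivative 2x), starting from x = 5.
x, velocity = 5.0, 0.0
for _ in range(200):
    x, velocity = momentum_step(x, velocity, 2 * x)

# Adagrad: a "frequent" parameter receives a gradient on every step,
# a "rare" parameter only on every 10th step.
frequent_sum = rare_sum = 0.0
for step in range(100):
    _, frequent_sum = adagrad_step(0.0, frequent_sum, 1.0)
    if step % 10 == 0:
        _, rare_sum = adagrad_step(0.0, rare_sum, 1.0)

effective_lr_frequent = 0.1 / math.sqrt(frequent_sum)
effective_lr_rare = 0.1 / math.sqrt(rare_sum)
print(effective_lr_frequent, effective_lr_rare)
```

After 100 steps the rarely updated parameter keeps a larger effective step size than the frequently updated one, which is the "weaker updates for typical attributes" idea in its simplest form.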

To sum up, successfully implementing an algorithm that creates a neural network with an ideal structure requires a drastic change in the approach to building neural networks. In particular, it calls for a solution to two main problems: the inefficiency of the training algorithm and the discreteness of architecture selection.

Neuton

After analyzing and summarizing the experience of the world’s scientific communities, we designed a completely different approach to creating perceptron neural networks with an optimal architecture, one that is free from the aforementioned flaws.

Unlike most NAS methods, which are based on intelligent enumeration of predetermined neural network structures, we use neuron-by-neuron network structure growth, with the minimum structural unit involved in the optimization process being a single neuron input. This minimizes the “discretization” of the architecture search and produces minimum-size neural networks with no loss of accuracy.

We then created and patented our own highly effective global optimization algorithm as an efficient solution to the problems of local extremes and plateaus. Using this algorithm to identify network parameters significantly improves the efficiency of each neuron in the network and, as a result, reduces the network’s size. The patented algorithm also has enormous potential for parallelization (multiple hosts, multiple GPUs) without loss of accuracy, which makes it possible to run cross-validation while training a neural network within an acceptable time.

We have named the neural network framework “Neuton”. It is based on a process of automatic neuron-by-neuron network growth with overfitting control. This approach allows dynamic growth of the neural network until it achieves its maximum generalization ability. The use of our own global optimization algorithm when learning the parameters of each neuron allows for a significant reduction in the volume of the network, while maintaining its accuracy characteristics.
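The general idea of growing a network one neuron at a time under overfitting control can be illustrated with a generic constructive loop. This is not Neuton's patented algorithm, whose details are not given here: the random search over candidate neurons, the tanh units, the patience rule and the toy sine data are all assumptions made for this sketch.

```python
import math
import random


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def predict(xs, neurons):
    # Network output: sum of w * tanh(a * x + b) over all neurons added so far.
    return [sum(w * math.tanh(a * x + b) for a, b, w in neurons) for x in xs]


def best_new_neuron(xs, residual, rng, candidates):
    # Random search (a stand-in optimizer) for the unit that best fits the residual.
    best_err, best_unit = float("inf"), None
    for _ in range(candidates):
        a, b = rng.uniform(-5, 5), rng.uniform(-5, 5)
        h = [math.tanh(a * x + b) for x in xs]
        # Closed-form least-squares output weight for this hidden unit.
        w = sum(hi * r for hi, r in zip(h, residual)) / sum(hi * hi for hi in h)
        err = mse([w * hi for hi in h], residual)
        if err < best_err:
            best_err, best_unit = err, (a, b, w)
    return best_unit


def grow_network(train, val, max_neurons=20, candidates=200, patience=3, seed=0):
    rng = random.Random(seed)
    (tx, ty), (vx, vy) = train, val
    neurons, best_size = [], 0
    best_val = mse(predict(vx, neurons), vy)
    # Keep adding neurons until validation error has not improved for
    # `patience` consecutive additions (the overfitting control).
    while len(neurons) < max_neurons and len(neurons) - best_size < patience:
        residual = [y - p for y, p in zip(ty, predict(tx, neurons))]
        neurons.append(best_new_neuron(tx, residual, rng, candidates))
        val_err = mse(predict(vx, neurons), vy)
        if val_err < best_val:
            best_val, best_size = val_err, len(neurons)
    return neurons[:best_size], best_val


# Toy data: y = sin(3x) on [-1, 1], with validation points between training points.
train_x = [i / 20 - 1 for i in range(41)]
val_x = [x + 0.025 for x in train_x[:-1]]
train = (train_x, [math.sin(3 * x) for x in train_x])
val = (val_x, [math.sin(3 * x) for x in val_x])

baseline = mse([0.0] * len(val_x), val[1])
network, val_error = grow_network(train, val)
print(len(network), baseline, val_error)
```

The network size is an output of the procedure rather than an input: growth stops on its own once additional neurons no longer improve the validation error, and the returned network is truncated to the best size seen.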

The key differences between Neuton and traditional neural network frameworks are:

Fully automatic creation of a neural network structure without a data scientist’s involvement

Built-in cross-validation algorithm

Built-in overfitting control

Application of global optimization methods

High parallelizability with no loss of accuracy

The smallest possible trained neural network with no loss of accuracy

Maximum prediction speed

It is important to note that none of the above-mentioned benefits of Neuton are separate algorithm settings; all of them are implemented automatically, by default. The data, a target variable and a metric name are fed to the algorithm, and the entire process of training, validation and production of the best model happens without the need for a data scientist.

Experiments (https://neuton.ai/frmwrk#benchmarks) demonstrate that neural networks created with Neuton possess a high level of accuracy, at minimal model size, relative to alternative solutions.