Artificial Neural Networks – Part 2: MLP Implementation for XOr


As promised in part one, this second part details a java implementation of a multilayer perceptron (MLP) for the XOr problem.  Actually, as you will see, the core classes are designed to implement any MLP implementation with a single hidden layer.

First, it will help to introduce a quick overview of how MLP networks can be used to make predictions for the XOr problem.  For a more detailed explanation, please review part one of this post.

The image at the top of this article depicts the architecture for a multilayer perceptron network designed specifically to solve the XOr problem.  It contains integer inputs that can each hold the value of 1 or 0, a hidden layer with two units (not counting the dashed bias units which we will ignore for for now for the sake of simplicity) and a single output unit.  Once the network is trained, the output unit should predict the output you would expect if you were to parse the input values through an XOr logic gate.  That is, if the two input values are not equal (1,0 or 0, 1), the output should be 1, otherwise the output should be 0.

Forward propagation is completed first, which parses the input values through the network and eventually to the output unit, making a prediction of what the output should be, given the input.  This is then compared to the expected output and the prediction error is calculated.  Part of the process of passing through the network involves multiplying the values by small weights.  If the weights are correctly configured, the output prediction will be correct.

This configuration is not a manual process, but rather something that occurs automatically through a process known as backward propagation.  Backward propagation considers the error of the prediction made by forward propagation and parses those values backwards through the network, adjusting the weights to values that slightly improve prediction accuracy.

Forward and backward propagation are repeated for each training example in the dataset many times until the weights of the network are tuned to values that result in forward propagation producing accurate output.

The following sections will provide more detail on how the process works by detailing a fully functional Java implementation.

Class Structure and Dependencies

The implementation consists of four classes.  These include ExecuteXOr, which is the highest level class containing all of the XOr specific code; MultiLayerPerceptron, which contains all of the code used to implement a generic multilayer perceptron with with a single hidden layer; iActivation, the activation interface and SigmoidActivation, which is a concrete activation class specifically implementing the sigmoid activation function.

MLP Implementation UML Class Diagram Showing Dependencies and Methods

Class: ExecuteXOR

The first class we will review is ExecuteXOr.  This is the highest level class, containing the main method used to kick off the the experiment.  Here the user defines the dimensions of the input and output values, instantiates the activation class and MultiLayerPerceptron class.  Note that, for the XOr problem, we have two input units, two hidden units and one output unit.  As such, X is a two dimensional array, y is a one dimensional array and the first three input parameters for the MultilayerPerceptron class denote the dimensions of the neural network. These are set to 2 (input layer), 2 (hidden layer) and 1 (output layer).

package Experiments;

import NeuralNets.MultilayerPerceptron;
import Activation.SigmoidActivation;

public class ExecuteXOr {
	public static double[][] X; // Matrix to contain input values
	public static double[] y; // Vector to contain expected values
	public static MultilayerPerceptron NN; // Neural Network

The main method instantiates the dimension of the X and y arrays, along with the multilayerPerceptron class, before kicking off the methods used to initialise the dataset, train and test the network. We have chosen sigmoid as the activation function in this case.  However, as implied by the existence of the Activation interface, there are many other possible activation functions.

The first three parameters of the MultilayerPerceptron constructor define the dimensions of the network. In this case we have defined two input units, two hidden units and one output unit, as is required for this architecture.

The input parameters for trainNetwork define the maximum number of iterations to complete during training and the target error rate. In this case, no more than 200 thousand iterations (epochs) will be completed and training will also stop if the error rate drops to 0.01 or less.

	public static void main(String[] args) {
		X=new double[4][2]; // Input values
		y=new double[4]; // Target values
		// Instantiate the neural network class
		NN = new MultilayerPerceptron(2,2,1,new SigmoidActivation());

The initialiseDataSet method provides the values for the X and y arrays where X contains all possible input values and y contains the expected output, given each set of input values. See part one of this post for more details on the structure of the XOr problem to understand why these values are as they are, but it is important to understand that this is an exhaustive list of possible inputs and outputs, rather than a sample of possible values.  Because we’re not using a sample of possible values we need not be concerned about overfitting and thus do not require separate training and testing datasets.

	private static void initialiseDataSet(){
		X[0][0] = 0;
		X[0][1] = 0;
		y[0] = 0;
		X[1][0] = 0;
		X[1][1] = 1;
		y[1] = 1;
		X[2][0] = 1;
		X[2][1] = 0;
		y[2] = 1;

		X[3][0] = 1;
		X[3][1] = 1;
		y[3] = 0;

For each iteration (epoch), the trainNetwork method iterates over every training example, completing a forward propagation and backward propagation step for every training example before updating the weights. Weights are updated with a learning rate of 0.1, which slows the rate at which the weights of the network are updated. This helps to prevent the system from overshooting the ideal weight setup during updates.

Another potential issue to rectify is the tendency, depending on how the weights are initialised, for the algorithm to occasionally enter a search space that is difficult to exit, resulting in very slow progress in training. To avoid these scenarios, I have borrowed a method from combinatorial optimisation known as random restart. After every 500th epoch, a check is performed to see if the error score is improving too slowly. If progress is extremely slow, the weights are reinitialised, effectively kicking the incumbent solution into a new neighbourhood from which we will hopefully achieve better training performance.

At each epoch a check is also performed to see if the current error exceeds the target. If so, weights we have identified are close enough to the optimal configuration to produce highly accurate predictions, so the method is exited.

	private static void trainNetwork(int epochs, double targetError){
		double error=0;
		double baseError=Double.POSITIVE_INFINITY;

		// Iterate over the number of epochs to be completed
		for(int epoch = 0; epoch<epochs; epoch++){
			// Iterate over all training examples
			for(int i = 0; i<X.length; i++){
				// Feed forward
				// Run backpropagation and add the squared error to the sum of error for this epoch
				// update the weights
			// Every 500th epoch check whether progress is too slow and if so, reset the weights
			if(epoch % 500==0) {
				// If not
				if(baseError-error<0.00001) {
					NN.kick(); // Kick the candidate solution into a new neighbourhood with random restart
					baseError=error; // Record the base error
			// Print the sum of squared error for the current epoch to the console
			System.out.println("Epoch: " + epoch + " - Sum of Squared Error: " + error);
			// If the error is smaller than 0.01 stop training

Once the network has been trained, we can test its accuracy.   The testNetwork method iterates over each of the examples in X and runs forward propagation, which returns the neural network’s prediction for what the output should be. The prediction is in the form of a value falling between something very close to zero and something very close to one. Any value of 0.5 or higher is deemed to predict 1, while anything lower than 0.5 is deemed to predict 0. This is compared to the actual expected output and the proportion of correct predictions (prediction accuracy) is returned to the console. If the network is working correctly, the returned value should always be 1.0.

	private static void testNetwork(){
		double correct=0;
		double incorrect=0;
		double output=0;
		// Iterate over the testing examples (which happen to double as training examples)
		for(int i = 0; i<X.length; i++){ output=NN.forwardProp(X[i])[0]; // Feed forward to get the output for the current example // If the output is >= 0.5, we deem the output to be 1.0
			else // Otherwise it is 0.0
			// If the output value matches the target
				correct++; // Increment the number of successful classifications
				incorrect++; // Increment the number of unsuccessful classifications
		// Print the test accuracy to the console
		System.out.println("Test Accuracy: " + String.valueOf((correct/(incorrect+correct))));

Interface: iActivation

The iActivation interface defines the methods that must be implemented by any activation class.  The only two methods required are getActivation and getDerivative.

An implementation of the getActivation method takes as input, the sum of values input into a single unit and outputs the activation value for the unit.  This is used in forward propagation to compress the sum of values received by  a unit, usually into a value of 0 or 1, or something very close to 0 or 1.  However, there are exceptions such as the tanh activation function which compresses the input value into a value of between -1 and 1.

The getDerivative method represents the computed derivative of the activation function.  This is used by backward propagation to influence how the weights are to be updated in order to change them to a value that slightly improves the accuracy of the network.  The function determines the direction in which the weights should be changed.  How the partial derivative is computed is beyond the scope of this post, but anyone interested may wish to review this report.

package Activation;

public interface iActivation {
	double getDerivative(double x);
	double getActivation(double x);

Class: SigmoidActivation

The SigmoidActivation class is an implementation of iActivation, implementing the activation and derivative Sigmoid functions.

The Sigmoid activation function is g(x)=\frac{1}{1+e^{-x}}, while its derivative is {g'(x)=x(1-x)}.

Plotting the Sigmoid function clearly shows how it converts almost all presented values to something very close to 0 or 1.  The value 0 is converted to 0.5, but any values larger or smaller quickly approach the plateaus toward the left and the right of the plot.

Sigmoid Plot

The derivative calculates the direction of the slope of this line (gradient), making it possible to force the weight updates into a direction that improves the accuracy of the classification process.  Because different activation functions compress their input values in different ways, they require different derivative functions to calculate their respective gradients.  The use of polymorphism here allows for the potential implementation of any activation function to be encapsulated alongside its own derivative function.

package Activation;

public class SigmoidActivation implements iActivation {

	public double getDerivative(double x) {
		return (x * (1.0-x));

	public double getActivation(double x) {
		double expVal=Math.exp(-x);
		return 1.0/(1.0+expVal);

Class: MultilayerPerceptron

The MultilayerPerceptron class encapsulates all of the data for the neural network and the methods required to intialise and propagate the network.

Data held by the class include the dimensions of the network (InputCount, hiddenCount and outputCount), the first and second layer of weights (W1 and W2), The first and second layer of delta weights (DW1 and DW2), the pre-activation values for each of the units (Z1 and Z2) the input values (InputValues) and the activation values output by each subsequent layer of the network (HiddenValues, OutputValues).

package NeuralNets;

import java.util.Random;
import Activation.iActivation;

public class MultilayerPerceptron {
	int inputCount; // Number of input units
	int hiddenCount; // Number of hidden units
	int outputCount; // Number of output units
	// Weight matrices
	double[][] W1; // First Layer of Weights
	double[][] W2; // Second Layer of Weights
	// Delta weight matrices
	double[][] DW1; // First Layer of Delta Weights
	double[][] DW2; // Second Layer of Delta Weights
	// The values at the time of activation (the values input to the hidden and output units)
	double[] Z1; // First Layer pre-activation values
	double[] Z2; // Second Layer pre-activation values
	iActivation activation=null; // Activation Class

	// The values actually output by the units in each layer after being squashed by the activation function
	double[] InputValues; // Inputs layer Values
	double[] HiddenValues; // Hidden layer Values
	double[] OutputValues; // Output layer Values

The constructor method for the class receives the dimensions of the network, along with the activation function, as input and proceeds to use that information to initialise the variables that are global to the class.

	public MultilayerPerceptron(int inputCount, int hiddenCount ,int outputCount, iActivation activation){
		// Initialise the first layer weight and delta weight matrices (accounting for bias unit)
		W1 = initialiseWeights(new double[hiddenCount][inputCount+1]);
		DW1 =initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);
		// Initialise the second layer weight and delta weight matrices (accounting for bias unit)
		W2 = initialiseWeights(new double[outputCount][hiddenCount+1]);
		DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);
		// Initialise the activation vectors
		Z1 = initialiseActivations(new double[hiddenCount]);
		Z2 = initialiseActivations(new double[outputCount]);
		// Initialise the hidden and output value vectors (same dimensions as activation vectors)

The initialisation methods randomly initialise the weight arrays (W1 and W2) within a reasonable range of small values relative to the size of the InputValues array. All other arrays are initialised with zero values, while the kick method exists to accommodate the random restart process, essentially reinitialising the weight arrays with random values. This is called when the network is stuck in a search neighbourhood where training occurs at an extremely slow rate.

	private double[][] initialiseWeights(double w[][]){
		Random rn = new Random();

		double offset = 1/(Math.sqrt(inputCount));
		for(int i=0; i<w.length; i++){
			for(int j=0; j<w[i].length;j++){ // No bias unit

		return w;

	private double[][] initialiseDeltaWeights(double w[][]){
		for(int i=0; i<w.length; i++){
			for(int j=0; j<w[i].length;j++){ 
		return w;
	private double[] initialiseDelta(double d[]){
		for(int i=0; i<d.length; i++){
		return d;

	private double[] initialiseActivations(double z[]){
		for(int i=0; i<z.length; i++){
		return z;

	public void kick(){
		// Kick the candidate solution into a new neighbourhood (random restart)
		W2 = initialiseWeights(new double[outputCount][hiddenCount+1]); // Account for bias unit
		W1 = initialiseWeights(new double[hiddenCount][inputCount+1]); // Account for bias unit

The forwardProp method completes forward propagation. First a bias unit is added to the input values and hidden unit values. The non-bias pre-activation values (Z values) are calculated by summing the input values after multiplying them by their weights. The hidden unit activations are then calculated by parsing the resulting Z values through the activation function.

The hidden layer activations then serve as input to the output layer and the process is repeated for that layer with the activation for the output unit serving as the prediction.

	public double[] forwardProp(double[] inputs){
		this.InputValues = new double[(inputs.length+1)];
		// Add bias unit to inputs
		for(int i = 1; i<InputValues.length; i++){
		HiddenValues = new double[hiddenCount+1];
		HiddenValues[0]=1; // Add bias unit
		// Get hidden layer activations
		for(int i = 0; i<Z1.length;i++){
			for(int j = 0; j<InputValues.length; j++){
				// Hidden Layer Activation
				Z1[i]+=W1[i][j] * InputValues[j]; 

			// Hidden Layer Output value
			HiddenValues[i+1]= activation.getActivation(Z1[i]);
		// Get output layer activations
		for(int i = 0; i<Z2.length;i++){
			for(int j=0; j<HiddenValues.length; j++){
				// Get output layer Activation
				Z2[i] += W2[i][j] * HiddenValues[j]; 
			// Get output layer output Value

		return OutputValues;

There are two backProp methods for completing backward propagation, the first accommodating the use of a single target value and the second accommodating an array of target values, for architectures containing more than one output unit. Delta weights are the values by which the weights can be updated to improve the performance of the network. The outputs of the backward propagation process include the delta weights for the first and second layer of weights in the network and the half of the sum of the outputs of the cost function, otherwise known as the error.

As the name suggests, backward propagation begins with the output layer and propagates backwards through the network. The error is calculated using the classic squared error cost function, in which half of the squared sum of the difference between the target values and the actual predictions is calculated. Squaring the error ensures that all error values are both positive in value and are magnified, exaggerating the degree of error.

The deltas for the second layer of weights are calculated by taking the difference between the target values and predictions, multiplying that by the activation derivative of the predictions.

Once the second layer delta weights have been calculated, the first layer delta weights are also updated. These are calculated by multiplying the second layer deltas by the second layer of weights and multiplying that by the activation derivative of the hidden layer activation values. These values are then multiplied by the input layer values.

Worth noting, is that the backProp method does not actually update the weights, it merely provides the delta weight values that can later be used to perform the update.

	// Support a single double value as the target
	public double backProp(double targets){
		double[] t = new double[1];
		return backProp(t);
	public double backProp(double[] targets){
		double error=0;
		double errorSum=0;
		double[] D1; // First Layer Deltas
		double[] D2; // Second Layer Deltas
		D1 = initialiseDelta(new double[hiddenCount]);
		D2 = initialiseDelta(new double[outputCount]);
		// Calculate Deltas for the second layer and the error
		for(int i = 0;i<D2.length; i++){
			D2[i]=(targets[i]-OutputValues[i]) * activation.getDerivative(OutputValues[i]);
			errorSum+= Math.pow((targets[i]-OutputValues[i]),2); // Squared Error
		error = errorSum / 2;
		// Update Delta Weights for second layer
		for(int i = 0; i<outputCount; i++){
			for(int j = 0; j<hiddenCount+1; j++){ 
				DW2[i][j] += D2[i] * HiddenValues[j];
		// Calculate Deltas for first layer of weights
		for(int j = 0; j<hiddenCount; j++){
			for(int k = 0; k<outputCount; k++){
				D1[j] += (D2[k] * W2[k][j+1]) * activation.getDerivative(HiddenValues[j+1]);
		// Update first layer deltas
		for(int i=0; i<hiddenCount;i++){
			for(int j=0; j<inputCount+1; j++){ // Account for bias unit
				DW1[i][j] += D1[i] * InputValues[j];

		return error;

In this particular implementation of ExecuteXOr, the weights are updated in every epoch. However, in some circumstances it can be advantageous to update the weights less regularly. The updateWeights method allows for updating of the weights to occur at any interval as directed by the calling method. Each time the weights are updated, the DW1 and DW2 arrays are reset to zero values. For as long as the weights are not updated, the delta weights accumulate.

The only input parameter for the method is the learningRate. When learningRate is set to 1, the entire delta weight is used in the update. When it is set to a fractional value, the rate at which the weights are updated is reduced. ExecuteXor is using a learning rate of 0.1, ensuring that only 10% of the delta weight value is used to update the weights at each epoch.

	public void updateWeights(double learningRate){
		// Update first layer of weights
		for(int i = 0; i<W2.length; i++){
			for(int j = 0; j<W2[i].length; j++){
				W2[i][j] += learningRate * DW2[i][j];
		// Update second layer of weights
		for(int i = 0; i<W1.length; i++){
			for(int j = 0; j<W1[i].length; j++){
				W1[i][j] += learningRate * DW1[i][j];
		// Reset delta weights
		DW1 = initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);
		DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);



This post is the second part of an article on multilayer perceptron (MLP) artificial neural networks for the XOr problem.  It has demonstrated a Java implementation of an MLP.  The article began with a brief recap of the XOr problem and a summary of the processes used to train an MLP, including a high-level discussion of forward propagation and backward propagation.

The class structure and dependencies for the implementation were then detailed before stepping through each class in the implementation, complete with Java code and detailed explanations of each of the methods.

Artificial Neural Networks – Part 1: The XOr Problem

This is the first in a series of posts exploring artificial neural network (ANN) implementations.  The purpose of the article is to help the reader to gain an intuition of the basic concepts prior to moving on to the algorithmic implementations that will follow.

[Update: Part two of this article is now available here]

No prior knowledge is assumed, although, in the interests of brevity, not all of the terminology is explained in the article.  Instead hyperlinks are provided to Wikipedia and other sources where additional reading may be required.

This is a big topic. ANNs have a wide variety of applications and can be used for supervised, unsupervised, semi-supervised and reinforcement learning. That’s before you get into problem-specific architectures within those categories. But we have to start somewhere, so in order to narrow the scope, we’ll begin with the application of ANNs to a simple problem.

The XOr Problem
The XOr, or “exclusive or”, problem is a classic problem in ANN research. It is the problem of using a neural network to predict the outputs of XOr logic gates given two binary inputs. An XOr function should return a true value if the two inputs are not equal and a false value if they are equal. All possible inputs and predicted outputs are shown in figure 1.

Figure 1: XOr Inputs and Expected Outputs

XOr is a classification problem and one for which the expected outputs are known in advance.  It is therefore appropriate to use a supervised learning approach.

On the surface, XOr appears to be a very simple problem, however, Minksy and Papert (1969) showed that this was a big problem for neural network architectures of the 1960s, known as perceptrons.

Like all ANNs, the perceptron is composed of a network of units, which are analagous to biological neurons.  A unit can receive an input from other units.  On doing so, it takes the sum of all values received and decides whether it is going to forward a signal on to other units to which it is connected.  This is called activation.  The activation function uses some means or other to reduce the sum of input values to a 1 or a 0 (or a value very close to a 1 or 0) in order to represent activation or lack thereof.  Another form of unit, known as a bias unit, always activates, typically sending a hard coded 1 to all units to which it is connected.

Perceptrons include a single layer of input units — including one bias unit  — and a single output unit (see figure 2).  Here a bias unit is depicted by a dashed circle, while other units are shown as blue circles. There are two non-bias input units representing the two binary input values for XOr.  Any number of input units can be included.

Figure 2: Single Layer Perceptron Network

The perceptron is a type of feed-forward network, which means the process of generating an output — known as forward propagation — flows  in one direction from the input layer to the output layer.  There are no connections between units in the input layer.  Instead, all units in the input layer are connected directly to the output unit.

A simplified explanation of the forward propagation process is that the input values X1 and X2, along with the bias value of 1, are multiplied by their respective weights W0..W2,  and parsed to the output unit.  The output unit takes the sum of those values and employs an activation function — typically the Heaviside step function — to convert the resulting value to a 0 or 1, thus classifying the input values as 0 or 1.

It is the setting of the weight variables that gives the network’s author control over the process of converting input values to an output value.  It is the weights that determine where the classification line, the line that separates data points into classification groups,  is drawn.  If all data points on one side of a classification line are assigned the class of 0, all others are classified as 1.

A limitation of this architecture is that it is only capable of separating data points with a single line. This is unfortunate because the XOr inputs are not linearly separable. This is particularly visible if you plot the XOr input values to a graph. As shown in figure 3, there is no way to separate the 1 and 0 predictions with a single classification line.

Figure 3: Plotted XOr Inputs with Colour Coded Expected Outputs (red=0; green=1)

Multilayer Perceptrons
The solution to this problem is to expand beyond the single-layer architecture by adding an additional layer of units without any direct access to the outside world, known as a hidden layer.  This kind of architecture — shown in Figure 4 — is another feed-forward network known as a multilayer perceptron (MLP).

Figure 4: Multilayer Pereceptron Architecture for XOr

It is worth noting that an MLP can have any number of units in its input, hidden and output layers.  There can also be any number of hidden layers.  The architecture used here is designed specifically for the XOr problem.

Similar to the classic perceptron, forward propagation begins with the input values and bias unit from the input layer being multiplied by their  respective weights, however, in this case there is a weight for each combination of input (including the input layer’s bias unit) and hidden unit (excluding the hidden layer’s bias unit).  The products of the input layer values and their respective weights are parsed as input to the non-bias units in the hidden layer.  Each non-bias hidden unit invokes an activation function — usually the classic sigmoid function in the case of the XOr problem — to squash the sum of their input values down to a value that falls between 0 and 1 (usually a value very close to either 0 or 1), or in the case of tanh, a value close to either -1 or 1.  The outputs of each hidden layer unit, including the bias unit, are then multiplied by another set of respective weights and parsed to an output unit.  The output unit also parses the sum of its input values through an activation function — again, the sigmoid function is appropriate here — to return an output value falling between 0 and 1.  This is the predicted output.

This architecture, while more complex than that of the classic perceptron network, is capable of achieving non-linear separation.  Thus, with the right set of weight values, it can provide the necessary separation to accurately classify the XOr inputs.

Non-linear Separation Made Possible by MLP Architecture
Non-linear Separation Made Possible by MLP Architecture

The elephant in the room, of course, is how one might come up with a set of weight values that ensure the network produces the expected output.  In practice, trying to find an acceptable set of weights for an MLP network manually would be an incredibly laborious task.  In fact, it is NP-complete (Blum and Rivest, 1992).  However, it is fortunately possible to learn a good set of weight values automatically through a process known as backpropagation.  This was first demonstrated to work well for the XOr problem by Rumelhart et al. (1985).

The backpropagation algorithm begins by comparing the actual value output by the forward propagation process to the expected value and then moves backward through the network, slightly adjusting each of the weights in a direction that reduces the size of the error by a small degree.  Both forward and back propagation are re-run thousands of times on each input combination until the network can accurately predict the expected output of the possible inputs using forward propagation.

For the xOr problem, 100% of possible data examples are available to use in the training process. We can therefore expect the trained network to be 100% accurate in its predictions and there is no need to be concerned with issues such as bias and variance in the resulting model.

In this post, the classic ANN XOr problem was explored. The problem itself was described in detail, along with the fact that the inputs for XOr are not linearly separable into their correct classification categories. A non-linear solution — involving an MLP architecture — was explored at a high level, along with the forward propagation algorithm used to generate an output value from the network and the backpropagation algorithm, which is used to train the network.

Part two of this post features a Java implementation of the MLP architecture described here, including all of the components necessary to train the network to act as an XOr logic gate.

Blum, A. Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117-127.

Minsky, M. Papert, S. (1969). Perceptron: an introduction to computational geometry. The MIT Press, Cambridge, expanded edition, 19(88), 2.

Rumelhart, D. Hinton, G. Williams, R. (1985). Learning internal representations by error propagation (No. ICS-8506). California University San Diego LA Jolla Inst. for Cognitive Science.

[Update: Part two of this article is now available here]