Artificial Neural Networks – Part 2: MLP Implementation for XOr


As promised in part one, this second part details a java implementation of a multilayer perceptron (MLP) for the XOr problem.  Actually, as you will see, the core classes are designed to implement any MLP implementation with a single hidden layer.

First, it will help to introduce a quick overview of how MLP networks can be used to make predictions for the XOr problem.  For a more detailed explanation, please review part one of this post.

The image at the top of this article depicts the architecture for a multilayer perceptron network designed specifically to solve the XOr problem.  It contains integer inputs that can each hold the value of 1 or 0, a hidden layer with two units (not counting the dashed bias units which we will ignore for for now for the sake of simplicity) and a single output unit.  Once the network is trained, the output unit should predict the output you would expect if you were to parse the input values through an XOr logic gate.  That is, if the two input values are not equal (1,0 or 0, 1), the output should be 1, otherwise the output should be 0.

Forward propagation is completed first, which parses the input values through the network and eventually to the output unit, making a prediction of what the output should be, given the input.  This is then compared to the expected output and the prediction error is calculated.  Part of the process of passing through the network involves multiplying the values by small weights.  If the weights are correctly configured, the output prediction will be correct.

This configuration is not a manual process, but rather something that occurs automatically through a process known as backward propagation.  Backward propagation considers the error of the prediction made by forward propagation and parses those values backwards through the network, adjusting the weights to values that slightly improve prediction accuracy.

Forward and backward propagation are repeated for each training example in the dataset many times until the weights of the network are tuned to values that result in forward propagation producing accurate output.

The following sections will provide more detail on how the process works by detailing a fully functional Java implementation.

Class Structure and Dependencies

The implementation consists of four classes.  These include ExecuteXOr, which is the highest level class containing all of the XOr specific code; MultiLayerPerceptron, which contains all of the code used to implement a generic multilayer perceptron with with a single hidden layer; iActivation, the activation interface and SigmoidActivation, which is a concrete activation class specifically implementing the sigmoid activation function.

MLP Implementation UML Class Diagram Showing Dependencies and Methods

Class: ExecuteXOR

The first class we will review is ExecuteXOr.  This is the highest level class, containing the main method used to kick off the the experiment.  Here the user defines the dimensions of the input and output values, instantiates the activation class and MultiLayerPerceptron class.  Note that, for the XOr problem, we have two input units, two hidden units and one output unit.  As such, X is a two dimensional array, y is a one dimensional array and the first three input parameters for the MultilayerPerceptron class denote the dimensions of the neural network. These are set to 2 (input layer), 2 (hidden layer) and 1 (output layer).

package Experiments;

import NeuralNets.MultilayerPerceptron;
import Activation.SigmoidActivation;

public class ExecuteXOr {
	public static double[][] X; // Matrix to contain input values
	public static double[] y; // Vector to contain expected values
	public static MultilayerPerceptron NN; // Neural Network

The main method instantiates the dimension of the X and y arrays, along with the multilayerPerceptron class, before kicking off the methods used to initialise the dataset, train and test the network. We have chosen sigmoid as the activation function in this case.  However, as implied by the existence of the Activation interface, there are many other possible activation functions.

The first three parameters of the MultilayerPerceptron constructor define the dimensions of the network. In this case we have defined two input units, two hidden units and one output unit, as is required for this architecture.

The input parameters for trainNetwork define the maximum number of iterations to complete during training and the target error rate. In this case, no more than 200 thousand iterations (epochs) will be completed and training will also stop if the error rate drops to 0.01 or less.

	public static void main(String[] args) {
		X=new double[4][2]; // Input values
		y=new double[4]; // Target values
		// Instantiate the neural network class
		NN = new MultilayerPerceptron(2,2,1,new SigmoidActivation());

The initialiseDataSet method provides the values for the X and y arrays where X contains all possible input values and y contains the expected output, given each set of input values. See part one of this post for more details on the structure of the XOr problem to understand why these values are as they are, but it is important to understand that this is an exhaustive list of possible inputs and outputs, rather than a sample of possible values.  Because we’re not using a sample of possible values we need not be concerned about overfitting and thus do not require separate training and testing datasets.

	private static void initialiseDataSet(){
		X[0][0] = 0;
		X[0][1] = 0;
		y[0] = 0;
		X[1][0] = 0;
		X[1][1] = 1;
		y[1] = 1;
		X[2][0] = 1;
		X[2][1] = 0;
		y[2] = 1;

		X[3][0] = 1;
		X[3][1] = 1;
		y[3] = 0;

For each iteration (epoch), the trainNetwork method iterates over every training example, completing a forward propagation and backward propagation step for every training example before updating the weights. Weights are updated with a learning rate of 0.1, which slows the rate at which the weights of the network are updated. This helps to prevent the system from overshooting the ideal weight setup during updates.

Another potential issue to rectify is the tendency, depending on how the weights are initialised, for the algorithm to occasionally enter a search space that is difficult to exit, resulting in very slow progress in training. To avoid these scenarios, I have borrowed a method from combinatorial optimisation known as random restart. After every 500th epoch, a check is performed to see if the error score is improving too slowly. If progress is extremely slow, the weights are reinitialised, effectively kicking the incumbent solution into a new neighbourhood from which we will hopefully achieve better training performance.

At each epoch a check is also performed to see if the current error exceeds the target. If so, weights we have identified are close enough to the optimal configuration to produce highly accurate predictions, so the method is exited.

	private static void trainNetwork(int epochs, double targetError){
		double error=0;
		double baseError=Double.POSITIVE_INFINITY;

		// Iterate over the number of epochs to be completed
		for(int epoch = 0; epoch<epochs; epoch++){
			// Iterate over all training examples
			for(int i = 0; i<X.length; i++){
				// Feed forward
				// Run backpropagation and add the squared error to the sum of error for this epoch
				// update the weights
			// Every 500th epoch check whether progress is too slow and if so, reset the weights
			if(epoch % 500==0) {
				// If not
				if(baseError-error<0.00001) {
					NN.kick(); // Kick the candidate solution into a new neighbourhood with random restart
					baseError=error; // Record the base error
			// Print the sum of squared error for the current epoch to the console
			System.out.println("Epoch: " + epoch + " - Sum of Squared Error: " + error);
			// If the error is smaller than 0.01 stop training

Once the network has been trained, we can test its accuracy.   The testNetwork method iterates over each of the examples in X and runs forward propagation, which returns the neural network’s prediction for what the output should be. The prediction is in the form of a value falling between something very close to zero and something very close to one. Any value of 0.5 or higher is deemed to predict 1, while anything lower than 0.5 is deemed to predict 0. This is compared to the actual expected output and the proportion of correct predictions (prediction accuracy) is returned to the console. If the network is working correctly, the returned value should always be 1.0.

	private static void testNetwork(){
		double correct=0;
		double incorrect=0;
		double output=0;
		// Iterate over the testing examples (which happen to double as training examples)
		for(int i = 0; i<X.length; i++){ output=NN.forwardProp(X[i])[0]; // Feed forward to get the output for the current example // If the output is >= 0.5, we deem the output to be 1.0
			else // Otherwise it is 0.0
			// If the output value matches the target
				correct++; // Increment the number of successful classifications
				incorrect++; // Increment the number of unsuccessful classifications
		// Print the test accuracy to the console
		System.out.println("Test Accuracy: " + String.valueOf((correct/(incorrect+correct))));

Interface: iActivation

The iActivation interface defines the methods that must be implemented by any activation class.  The only two methods required are getActivation and getDerivative.

An implementation of the getActivation method takes as input, the sum of values input into a single unit and outputs the activation value for the unit.  This is used in forward propagation to compress the sum of values received by  a unit, usually into a value of 0 or 1, or something very close to 0 or 1.  However, there are exceptions such as the tanh activation function which compresses the input value into a value of between -1 and 1.

The getDerivative method represents the computed derivative of the activation function.  This is used by backward propagation to influence how the weights are to be updated in order to change them to a value that slightly improves the accuracy of the network.  The function determines the direction in which the weights should be changed.  How the partial derivative is computed is beyond the scope of this post, but anyone interested may wish to review this report.

package Activation;

public interface iActivation {
	double getDerivative(double x);
	double getActivation(double x);

Class: SigmoidActivation

The SigmoidActivation class is an implementation of iActivation, implementing the activation and derivative Sigmoid functions.

The Sigmoid activation function is g(x)=\frac{1}{1+e^{-x}}, while its derivative is {g'(x)=x(1-x)}.

Plotting the Sigmoid function clearly shows how it converts almost all presented values to something very close to 0 or 1.  The value 0 is converted to 0.5, but any values larger or smaller quickly approach the plateaus toward the left and the right of the plot.

Sigmoid Plot

The derivative calculates the direction of the slope of this line (gradient), making it possible to force the weight updates into a direction that improves the accuracy of the classification process.  Because different activation functions compress their input values in different ways, they require different derivative functions to calculate their respective gradients.  The use of polymorphism here allows for the potential implementation of any activation function to be encapsulated alongside its own derivative function.

package Activation;

public class SigmoidActivation implements iActivation {

	public double getDerivative(double x) {
		return (x * (1.0-x));

	public double getActivation(double x) {
		double expVal=Math.exp(-x);
		return 1.0/(1.0+expVal);

Class: MultilayerPerceptron

The MultilayerPerceptron class encapsulates all of the data for the neural network and the methods required to intialise and propagate the network.

Data held by the class include the dimensions of the network (InputCount, hiddenCount and outputCount), the first and second layer of weights (W1 and W2), The first and second layer of delta weights (DW1 and DW2), the pre-activation values for each of the units (Z1 and Z2) the input values (InputValues) and the activation values output by each subsequent layer of the network (HiddenValues, OutputValues).

package NeuralNets;

import java.util.Random;
import Activation.iActivation;

public class MultilayerPerceptron {
	int inputCount; // Number of input units
	int hiddenCount; // Number of hidden units
	int outputCount; // Number of output units
	// Weight matrices
	double[][] W1; // First Layer of Weights
	double[][] W2; // Second Layer of Weights
	// Delta weight matrices
	double[][] DW1; // First Layer of Delta Weights
	double[][] DW2; // Second Layer of Delta Weights
	// The values at the time of activation (the values input to the hidden and output units)
	double[] Z1; // First Layer pre-activation values
	double[] Z2; // Second Layer pre-activation values
	iActivation activation=null; // Activation Class

	// The values actually output by the units in each layer after being squashed by the activation function
	double[] InputValues; // Inputs layer Values
	double[] HiddenValues; // Hidden layer Values
	double[] OutputValues; // Output layer Values

The constructor method for the class receives the dimensions of the network, along with the activation function, as input and proceeds to use that information to initialise the variables that are global to the class.

	public MultilayerPerceptron(int inputCount, int hiddenCount ,int outputCount, iActivation activation){
		// Initialise the first layer weight and delta weight matrices (accounting for bias unit)
		W1 = initialiseWeights(new double[hiddenCount][inputCount+1]);
		DW1 =initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);
		// Initialise the second layer weight and delta weight matrices (accounting for bias unit)
		W2 = initialiseWeights(new double[outputCount][hiddenCount+1]);
		DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);
		// Initialise the activation vectors
		Z1 = initialiseActivations(new double[hiddenCount]);
		Z2 = initialiseActivations(new double[outputCount]);
		// Initialise the hidden and output value vectors (same dimensions as activation vectors)

The initialisation methods randomly initialise the weight arrays (W1 and W2) within a reasonable range of small values relative to the size of the InputValues array. All other arrays are initialised with zero values, while the kick method exists to accommodate the random restart process, essentially reinitialising the weight arrays with random values. This is called when the network is stuck in a search neighbourhood where training occurs at an extremely slow rate.

	private double[][] initialiseWeights(double w[][]){
		Random rn = new Random();

		double offset = 1/(Math.sqrt(inputCount));
		for(int i=0; i<w.length; i++){
			for(int j=0; j<w[i].length;j++){ // No bias unit

		return w;

	private double[][] initialiseDeltaWeights(double w[][]){
		for(int i=0; i<w.length; i++){
			for(int j=0; j<w[i].length;j++){ 
		return w;
	private double[] initialiseDelta(double d[]){
		for(int i=0; i<d.length; i++){
		return d;

	private double[] initialiseActivations(double z[]){
		for(int i=0; i<z.length; i++){
		return z;

	public void kick(){
		// Kick the candidate solution into a new neighbourhood (random restart)
		W2 = initialiseWeights(new double[outputCount][hiddenCount+1]); // Account for bias unit
		W1 = initialiseWeights(new double[hiddenCount][inputCount+1]); // Account for bias unit

The forwardProp method completes forward propagation. First a bias unit is added to the input values and hidden unit values. The non-bias pre-activation values (Z values) are calculated by summing the input values after multiplying them by their weights. The hidden unit activations are then calculated by parsing the resulting Z values through the activation function.

The hidden layer activations then serve as input to the output layer and the process is repeated for that layer with the activation for the output unit serving as the prediction.

	public double[] forwardProp(double[] inputs){
		this.InputValues = new double[(inputs.length+1)];
		// Add bias unit to inputs
		for(int i = 1; i<InputValues.length; i++){
		HiddenValues = new double[hiddenCount+1];
		HiddenValues[0]=1; // Add bias unit
		// Get hidden layer activations
		for(int i = 0; i<Z1.length;i++){
			for(int j = 0; j<InputValues.length; j++){
				// Hidden Layer Activation
				Z1[i]+=W1[i][j] * InputValues[j]; 

			// Hidden Layer Output value
			HiddenValues[i+1]= activation.getActivation(Z1[i]);
		// Get output layer activations
		for(int i = 0; i<Z2.length;i++){
			for(int j=0; j<HiddenValues.length; j++){
				// Get output layer Activation
				Z2[i] += W2[i][j] * HiddenValues[j]; 
			// Get output layer output Value

		return OutputValues;

There are two backProp methods for completing backward propagation, the first accommodating the use of a single target value and the second accommodating an array of target values, for architectures containing more than one output unit. Delta weights are the values by which the weights can be updated to improve the performance of the network. The outputs of the backward propagation process include the delta weights for the first and second layer of weights in the network and the half of the sum of the outputs of the cost function, otherwise known as the error.

As the name suggests, backward propagation begins with the output layer and propagates backwards through the network. The error is calculated using the classic squared error cost function, in which half of the squared sum of the difference between the target values and the actual predictions is calculated. Squaring the error ensures that all error values are both positive in value and are magnified, exaggerating the degree of error.

The deltas for the second layer of weights are calculated by taking the difference between the target values and predictions, multiplying that by the activation derivative of the predictions.

Once the second layer delta weights have been calculated, the first layer delta weights are also updated. These are calculated by multiplying the second layer deltas by the second layer of weights and multiplying that by the activation derivative of the hidden layer activation values. These values are then multiplied by the input layer values.

Worth noting, is that the backProp method does not actually update the weights, it merely provides the delta weight values that can later be used to perform the update.

	// Support a single double value as the target
	public double backProp(double targets){
		double[] t = new double[1];
		return backProp(t);
	public double backProp(double[] targets){
		double error=0;
		double errorSum=0;
		double[] D1; // First Layer Deltas
		double[] D2; // Second Layer Deltas
		D1 = initialiseDelta(new double[hiddenCount]);
		D2 = initialiseDelta(new double[outputCount]);
		// Calculate Deltas for the second layer and the error
		for(int i = 0;i<D2.length; i++){
			D2[i]=(targets[i]-OutputValues[i]) * activation.getDerivative(OutputValues[i]);
			errorSum+= Math.pow((targets[i]-OutputValues[i]),2); // Squared Error
		error = errorSum / 2;
		// Update Delta Weights for second layer
		for(int i = 0; i<outputCount; i++){
			for(int j = 0; j<hiddenCount+1; j++){ 
				DW2[i][j] += D2[i] * HiddenValues[j];
		// Calculate Deltas for first layer of weights
		for(int j = 0; j<hiddenCount; j++){
			for(int k = 0; k<outputCount; k++){
				D1[j] += (D2[k] * W2[k][j+1]) * activation.getDerivative(HiddenValues[j+1]);
		// Update first layer deltas
		for(int i=0; i<hiddenCount;i++){
			for(int j=0; j<inputCount+1; j++){ // Account for bias unit
				DW1[i][j] += D1[i] * InputValues[j];

		return error;

In this particular implementation of ExecuteXOr, the weights are updated in every epoch. However, in some circumstances it can be advantageous to update the weights less regularly. The updateWeights method allows for updating of the weights to occur at any interval as directed by the calling method. Each time the weights are updated, the DW1 and DW2 arrays are reset to zero values. For as long as the weights are not updated, the delta weights accumulate.

The only input parameter for the method is the learningRate. When learningRate is set to 1, the entire delta weight is used in the update. When it is set to a fractional value, the rate at which the weights are updated is reduced. ExecuteXor is using a learning rate of 0.1, ensuring that only 10% of the delta weight value is used to update the weights at each epoch.

	public void updateWeights(double learningRate){
		// Update first layer of weights
		for(int i = 0; i<W2.length; i++){
			for(int j = 0; j<W2[i].length; j++){
				W2[i][j] += learningRate * DW2[i][j];
		// Update second layer of weights
		for(int i = 0; i<W1.length; i++){
			for(int j = 0; j<W1[i].length; j++){
				W1[i][j] += learningRate * DW1[i][j];
		// Reset delta weights
		DW1 = initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);
		DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);



This post is the second part of an article on multilayer perceptron (MLP) artificial neural networks for the XOr problem.  It has demonstrated a Java implementation of an MLP.  The article began with a brief recap of the XOr problem and a summary of the processes used to train an MLP, including a high-level discussion of forward propagation and backward propagation.

The class structure and dependencies for the implementation were then detailed before stepping through each class in the implementation, complete with Java code and detailed explanations of each of the methods.