Codility Ferrum 2018 Challenge

This post isn’t really on topic, but anyone with a general interest in algorithms might find it useful.

The following is my entry for Codility’s LongestNonNegativeSumSlice challenge.  The temporal and spatial complexity are both O(N). The solution uses a combination of dynamic programming and prefix summation.

The problem definition is that for a given array input parameter — a sequence of integers with possible elemental values of -1, 0 and 1 — return the size of the longest sequence in the array with a sum larger than or equal to zero. For example, the longest sequence in [-1, -1, 0, 0, 1] is four, because the sum of the second to fifth elements is zero.

This problem really highlights the importance of considering the specific details of the problem carefully before implementing a solution. It’s very easy to implement a brute force solution with temporal complexity on the order of $N^2$ or even chase red herrings by looking at the solutions for other similar looking problems, such as those looking for the size of the slice with the maximum sum or the same problem allowing input values from a wider range of values.

Ultimately, the key to the problem lies in the restricted range of possible array element values. Because the smallest and largest values that can be held in the array are -1 and 1 respectively, the sum can only change in either direction by a maximum of 1 in each iteration. Therefore, if you encounter a particular negative value more than once in the prefix sum, the distance between the element following the index recorded the first time you encountered that value and that of any subsequent time you encounter it, necessarily has a sum of 0.

If the input values are [-1,-1, 0, 0, 0, 1], the prefix sum sequence runs as -1, -2, -2, -2, -2, -1. From this sequence, you can see that the element following the first -1 prefix-sum value is the second and the element presenting the subsequent -1 prefix-sum value is the sixth. The size of the sub-array (slice) contributing to this sum is 5 (the number of elements between the second and sixth inclusive). We can also see that the sum of the values from the third to fourth and third to fifth are zero, using the same technique.

There is, of course, an edge case where the prefix-sum value does not fall below zero before the largest sub-array with a sum >= 0 is found. For example: [ 0, 0, 0, 0, 0,-1], [ 0, 0, 0, 0, 0, 0] or [ 1,-1, 1,-1, 1,-1,-1]. For this reason, we also need to keep track of the edge case where the prefix sum is >= 0.

The solution — written in Java here — is relatively simple. A hash map keeps track of the first index location in which each of the negative prefix-sum values are encountered. Each time we re-encounter a previously recorded negative prefix-sum value, we know we have found a sum of zero, so we check whether the size of the sub-array that produces the zero sum value is greater than the incumbent maximum slice and if it is, it becomes the incumbent maximum slice. If the prefix sum is >= 0, then the current index (plus one if the array is zero indexed) becomes the maximum slice.

import java.util.HashMap;
import java.util.Map;
class Solution {
public int solution(int[] A) {
int sum = 0;
int maxslice = 0;
Map<Integer,Integer> sumindex = new HashMap<Integer,Integer>();

for(int i = 0; i<A.length; i++) {
sum += A[i];
if(sum >= 0)
maxslice = i + 1;
else if(sumindex.containsKey(sum))
maxslice = Math.max(maxslice, i - sumindex.get(sum));
else
sumindex.put(sum,i);
}
return maxslice;
}
}

This implementation iterates over the list of array elements in A once (N iterations). The containsKey and get methods of the HashMap have O(1) temporal complexity, so the temporal complexity of this implementation is therefore O(N) despite the challenge statement predicting worst-case complexity of O(N log N). Spatial complexity is O(N), which is in line with Codility’s expectation for worst-case space complexity. The entry received a score of 100% for correctness and scalability.

Post Edit – March 28, 2018:
Beck has pointed out that, since the number of elements in sumindex can’t exceed the size of the input array, we could replace the HashMap with a regular array. In doing so, we just have to index by the absolute value of the sum minus one (to account for the zero indexation of the array) rather than the sum itself. The actual sum will be negative any time we want to use the array.

The following is an updated implementation:

class Solution {
public int solution(int[] A) {
int sum = 0;
int maxslice = 0;
int[] sumindex = new int[A.length];

for(int i = 0; i<A.length; i++) {
sum += A[i];
int idx=Math.abs(sum)-1;
if(sum >= 0)
maxslice = i + 1;
else if(sumindex[idx]!=0)
maxslice = Math.max(maxslice, i - sumindex[idx]);
else
sumindex[idx]=i;
}
return maxslice;
}
}

Post Edit – April 16, 2018:
Alternatively, to remove the need to calculate the idx variable, negate the sum and use that for indexation:

class Solution {
public int solution(int[] A) {
int sum = 0;
int maxslice = 0;
int[] sumindex = new int[A.length];

for(int i = 0; i<A.length; i++) {
sum -= A[i];
if(sum <= 0)
maxslice = i + 1;
else if(sumindex[sum-1]!=0)
maxslice = Math.max(maxslice, i - sumindex[sum-1]);
else
sumindex[sum-1]=i;
}
return maxslice;
}
}

Introduction

As promised in part one, this second part details a java implementation of a multilayer perceptron (MLP) for the XOr problem.  Actually, as you will see, the core classes are designed to implement any MLP implementation with a single hidden layer.

First, it will help to introduce a quick overview of how MLP networks can be used to make predictions for the XOr problem.  For a more detailed explanation, please review part one of this post.

The image at the top of this article depicts the architecture for a multilayer perceptron network designed specifically to solve the XOr problem.  It contains integer inputs that can each hold the value of 1 or 0, a hidden layer with two units (not counting the dashed bias units which we will ignore for for now for the sake of simplicity) and a single output unit.  Once the network is trained, the output unit should predict the output you would expect if you were to parse the input values through an XOr logic gate.  That is, if the two input values are not equal (1,0 or 0, 1), the output should be 1, otherwise the output should be 0.

Forward propagation is completed first, which parses the input values through the network and eventually to the output unit, making a prediction of what the output should be, given the input.  This is then compared to the expected output and the prediction error is calculated.  Part of the process of passing through the network involves multiplying the values by small weights.  If the weights are correctly configured, the output prediction will be correct.

This configuration is not a manual process, but rather something that occurs automatically through a process known as backward propagation.  Backward propagation considers the error of the prediction made by forward propagation and parses those values backwards through the network, adjusting the weights to values that slightly improve prediction accuracy.

Forward and backward propagation are repeated for each training example in the dataset many times until the weights of the network are tuned to values that result in forward propagation producing accurate output.

The following sections will provide more detail on how the process works by detailing a fully functional Java implementation.

Class Structure and Dependencies

The implementation consists of four classes.  These include ExecuteXOr, which is the highest level class containing all of the XOr specific code; MultiLayerPerceptron, which contains all of the code used to implement a generic multilayer perceptron with with a single hidden layer; iActivation, the activation interface and SigmoidActivation, which is a concrete activation class specifically implementing the sigmoid activation function. MLP Implementation UML Class Diagram Showing Dependencies and Methods

Class: ExecuteXOR

The first class we will review is ExecuteXOr.  This is the highest level class, containing the main method used to kick off the the experiment.  Here the user defines the dimensions of the input and output values, instantiates the activation class and MultiLayerPerceptron class.  Note that, for the XOr problem, we have two input units, two hidden units and one output unit.  As such, X is a two dimensional array, y is a one dimensional array and the first three input parameters for the MultilayerPerceptron class denote the dimensions of the neural network. These are set to 2 (input layer), 2 (hidden layer) and 1 (output layer).

package Experiments;

import NeuralNets.MultilayerPerceptron;
import Activation.SigmoidActivation;

public class ExecuteXOr {
public static double[][] X; // Matrix to contain input values
public static double[] y; // Vector to contain expected values
public static MultilayerPerceptron NN; // Neural Network

The main method instantiates the dimension of the X and y arrays, along with the multilayerPerceptron class, before kicking off the methods used to initialise the dataset, train and test the network. We have chosen sigmoid as the activation function in this case.  However, as implied by the existence of the Activation interface, there are many other possible activation functions.

The first three parameters of the MultilayerPerceptron constructor define the dimensions of the network. In this case we have defined two input units, two hidden units and one output unit, as is required for this architecture.

The input parameters for trainNetwork define the maximum number of iterations to complete during training and the target error rate. In this case, no more than 200 thousand iterations (epochs) will be completed and training will also stop if the error rate drops to 0.01 or less.

public static void main(String[] args) {
X=new double; // Input values
y=new double; // Target values

// Instantiate the neural network class
NN = new MultilayerPerceptron(2,2,1,new SigmoidActivation());

initialiseDataSet();
trainNetwork(200000,0.01);
testNetwork();
}

The initialiseDataSet method provides the values for the X and y arrays where X contains all possible input values and y contains the expected output, given each set of input values. See part one of this post for more details on the structure of the XOr problem to understand why these values are as they are, but it is important to understand that this is an exhaustive list of possible inputs and outputs, rather than a sample of possible values.  Because we’re not using a sample of possible values we need not be concerned about overfitting and thus do not require separate training and testing datasets.

private static void initialiseDataSet(){
X = 0;
X = 0;
y = 0;

X = 0;
X = 1;
y = 1;

X = 1;
X = 0;
y = 1;

X = 1;
X = 1;
y = 0;
}

For each iteration (epoch), the trainNetwork method iterates over every training example, completing a forward propagation and backward propagation step for every training example before updating the weights. Weights are updated with a learning rate of 0.1, which slows the rate at which the weights of the network are updated. This helps to prevent the system from overshooting the ideal weight setup during updates.

Another potential issue to rectify is the tendency, depending on how the weights are initialised, for the algorithm to occasionally enter a search space that is difficult to exit, resulting in very slow progress in training. To avoid these scenarios, I have borrowed a method from combinatorial optimisation known as random restart. After every 500th epoch, a check is performed to see if the error score is improving too slowly. If progress is extremely slow, the weights are reinitialised, effectively kicking the incumbent solution into a new neighbourhood from which we will hopefully achieve better training performance.

At each epoch a check is also performed to see if the current error exceeds the target. If so, weights we have identified are close enough to the optimal configuration to produce highly accurate predictions, so the method is exited.

private static void trainNetwork(int epochs, double targetError){
double error=0;
double baseError=Double.POSITIVE_INFINITY;

// Iterate over the number of epochs to be completed
for(int epoch = 0; epoch&amp;amp;lt;epochs; epoch++){
error=0;

// Iterate over all training examples
for(int i = 0; i&amp;amp;lt;X.length; i++){
// Feed forward
NN.forwardProp(X[i]);

// Run backpropagation and add the squared error to the sum of error for this epoch
error+=NN.backProp(y[i]);

// update the weights
NN.updateWeights(0.1);
}

// Every 500th epoch check whether progress is too slow and if so, reset the weights
if(epoch % 500==0) {
// If not
if(baseError-error&amp;amp;lt;0.00001) {
NN.kick(); // Kick the candidate solution into a new neighbourhood with random restart
baseError=Double.POSITIVE_INFINITY;
}
else
baseError=error; // Record the base error
}

// Print the sum of squared error for the current epoch to the console
System.out.println(&amp;amp;quot;Epoch: &amp;amp;quot; + epoch + &amp;amp;quot; - Sum of Squared Error: &amp;amp;quot; + error);

// If the error is smaller than 0.01 stop training
if(error&amp;amp;lt;targetError)
break;
}

}

Once the network has been trained, we can test its accuracy.   The testNetwork method iterates over each of the examples in X and runs forward propagation, which returns the neural network’s prediction for what the output should be. The prediction is in the form of a value falling between something very close to zero and something very close to one. Any value of 0.5 or higher is deemed to predict 1, while anything lower than 0.5 is deemed to predict 0. This is compared to the actual expected output and the proportion of correct predictions (prediction accuracy) is returned to the console. If the network is working correctly, the returned value should always be 1.0.

private static void testNetwork(){
double correct=0;
double incorrect=0;
double output=0;

// Iterate over the testing examples (which happen to double as training examples)
for(int i = 0; i&amp;amp;lt;X.length; i++){ output=NN.forwardProp(X[i]); // Feed forward to get the output for the current example // If the output is &amp;amp;gt;= 0.5, we deem the output to be 1.0
if(output&amp;amp;gt;=0.5)
output=1.0;
else // Otherwise it is 0.0
output=0.0;

// If the output value matches the target
if(output==y[i])
correct++; // Increment the number of successful classifications
else
incorrect++; // Increment the number of unsuccessful classifications
}

// Print the test accuracy to the console
System.out.println(&amp;amp;quot;Test Accuracy: &amp;amp;quot; + String.valueOf((correct/(incorrect+correct))));
}
}

Interface: iActivation

The iActivation interface defines the methods that must be implemented by any activation class.  The only two methods required are getActivation and getDerivative.

An implementation of the getActivation method takes as input, the sum of values input into a single unit and outputs the activation value for the unit.  This is used in forward propagation to compress the sum of values received by  a unit, usually into a value of 0 or 1, or something very close to 0 or 1.  However, there are exceptions such as the tanh activation function which compresses the input value into a value of between -1 and 1.

The getDerivative method represents the computed derivative of the activation function.  This is used by backward propagation to influence how the weights are to be updated in order to change them to a value that slightly improves the accuracy of the network.  The function determines the direction in which the weights should be changed.  How the partial derivative is computed is beyond the scope of this post, but anyone interested may wish to review this report.

package Activation;

public interface iActivation {
double getDerivative(double x);
double getActivation(double x);
}

Class: SigmoidActivation

The SigmoidActivation class is an implementation of iActivation, implementing the activation and derivative Sigmoid functions.

The Sigmoid activation function is $g(x)=\frac{1}{1+e^{-x}}$, while its derivative is ${g'(x)=x(1-x)}$.

Plotting the Sigmoid function clearly shows how it converts almost all presented values to something very close to 0 or 1.  The value 0 is converted to 0.5, but any values larger or smaller quickly approach the plateaus toward the left and the right of the plot.

The derivative calculates the direction of the slope of this line (gradient), making it possible to force the weight updates into a direction that improves the accuracy of the classification process.  Because different activation functions compress their input values in different ways, they require different derivative functions to calculate their respective gradients.  The use of polymorphism here allows for the potential implementation of any activation function to be encapsulated alongside its own derivative function.

package Activation;

public class SigmoidActivation implements iActivation {

@Override
public double getDerivative(double x) {

return (x * (1.0-x));
}

@Override
public double getActivation(double x) {
double expVal=Math.exp(-x);

return 1.0/(1.0+expVal);
}

}

Class: MultilayerPerceptron

The MultilayerPerceptron class encapsulates all of the data for the neural network and the methods required to intialise and propagate the network.

Data held by the class include the dimensions of the network (InputCount, hiddenCount and outputCount), the first and second layer of weights (W1 and W2), The first and second layer of delta weights (DW1 and DW2), the pre-activation values for each of the units (Z1 and Z2) the input values (InputValues) and the activation values output by each subsequent layer of the network (HiddenValues, OutputValues).

package NeuralNets;

import java.util.Random;
import Activation.iActivation;

public class MultilayerPerceptron {
int inputCount; // Number of input units
int hiddenCount; // Number of hidden units
int outputCount; // Number of output units

// Weight matrices
double[][] W1; // First Layer of Weights
double[][] W2; // Second Layer of Weights

// Delta weight matrices
double[][] DW1; // First Layer of Delta Weights
double[][] DW2; // Second Layer of Delta Weights

// The values at the time of activation (the values input to the hidden and output units)
double[] Z1; // First Layer pre-activation values
double[] Z2; // Second Layer pre-activation values

iActivation activation=null; // Activation Class

// The values actually output by the units in each layer after being squashed by the activation function
double[] InputValues; // Inputs layer Values
double[] HiddenValues; // Hidden layer Values
double[] OutputValues; // Output layer Values

The constructor method for the class receives the dimensions of the network, along with the activation function, as input and proceeds to use that information to initialise the variables that are global to the class.

public MultilayerPerceptron(int inputCount, int hiddenCount ,int outputCount, iActivation activation){
this.inputCount=inputCount;
this.hiddenCount=hiddenCount;
this.outputCount=outputCount;
this.activation=activation;

// Initialise the first layer weight and delta weight matrices (accounting for bias unit)
W1 = initialiseWeights(new double[hiddenCount][inputCount+1]);
DW1 =initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);

// Initialise the second layer weight and delta weight matrices (accounting for bias unit)
W2 = initialiseWeights(new double[outputCount][hiddenCount+1]);
DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);

// Initialise the activation vectors
Z1 = initialiseActivations(new double[hiddenCount]);
Z2 = initialiseActivations(new double[outputCount]);

// Initialise the hidden and output value vectors (same dimensions as activation vectors)
OutputValues=Z2.clone();
HiddenValues=Z1.clone();
}

The initialisation methods randomly initialise the weight arrays (W1 and W2) within a reasonable range of small values relative to the size of the InputValues array. All other arrays are initialised with zero values, while the kick method exists to accommodate the random restart process, essentially reinitialising the weight arrays with random values. This is called when the network is stuck in a search neighbourhood where training occurs at an extremely slow rate.

private double[][] initialiseWeights(double w[][]){
Random rn = new Random();

double offset = 1/(Math.sqrt(inputCount));

for(int i=0; i&amp;lt;w.length; i++){
for(int j=0; j&amp;lt;w[i].length;j++){ // No bias unit
w[i][j]=offset-rn.nextDouble();

}
}

return w;
}

private double[][] initialiseDeltaWeights(double w[][]){

for(int i=0; i&amp;lt;w.length; i++){
for(int j=0; j&amp;lt;w[i].length;j++){

w[i][j]=0;
}
}

return w;
}

private double[] initialiseDelta(double d[]){
for(int i=0; i&amp;lt;d.length; i++){
d[i]=0;
}

return d;
}

private double[] initialiseActivations(double z[]){
for(int i=0; i&amp;lt;z.length; i++){
z[i]=0;
}
return z;
}

public void kick(){
// Kick the candidate solution into a new neighbourhood (random restart)
W2 = initialiseWeights(new double[outputCount][hiddenCount+1]); // Account for bias unit
W1 = initialiseWeights(new double[hiddenCount][inputCount+1]); // Account for bias unit
}

The forwardProp method completes forward propagation. First a bias unit is added to the input values and hidden unit values. The non-bias pre-activation values (Z values) are calculated by summing the input values after multiplying them by their weights. The hidden unit activations are then calculated by parsing the resulting Z values through the activation function.

The hidden layer activations then serve as input to the output layer and the process is repeated for that layer with the activation for the output unit serving as the prediction.

public double[] forwardProp(double[] inputs){
this.InputValues = new double[(inputs.length+1)];

// Add bias unit to inputs
InputValues=1;
for(int i = 1; i&amp;lt;InputValues.length; i++){
this.InputValues[i]=inputs[i-1];
}

HiddenValues = new double[hiddenCount+1];
HiddenValues=1; // Add bias unit

// Get hidden layer activations
for(int i = 0; i&amp;lt;Z1.length;i++){
Z1[i]=0;
for(int j = 0; j&amp;lt;InputValues.length; j++){
// Hidden Layer Activation
Z1[i]+=W1[i][j] * InputValues[j];
}

// Hidden Layer Output value
HiddenValues[i+1]= activation.getActivation(Z1[i]);
}

// Get output layer activations
for(int i = 0; i&amp;lt;Z2.length;i++){
Z2[i]=0;
for(int j=0; j&amp;lt;HiddenValues.length; j++){
// Get output layer Activation
Z2[i] += W2[i][j] * HiddenValues[j];
}
// Get output layer output Value
OutputValues[i]=activation.getActivation(Z2[i]);
}

return OutputValues;
}

There are two backProp methods for completing backward propagation, the first accommodating the use of a single target value and the second accommodating an array of target values, for architectures containing more than one output unit. Delta weights are the values by which the weights can be updated to improve the performance of the network. The outputs of the backward propagation process include the delta weights for the first and second layer of weights in the network and the half of the sum of the outputs of the cost function, otherwise known as the error.

As the name suggests, backward propagation begins with the output layer and propagates backwards through the network. The error is calculated using the classic squared error cost function, in which half of the squared sum of the difference between the target values and the actual predictions is calculated. Squaring the error ensures that all error values are both positive in value and are magnified, exaggerating the degree of error.

The deltas for the second layer of weights are calculated by taking the difference between the target values and predictions, multiplying that by the activation derivative of the predictions.

Once the second layer delta weights have been calculated, the first layer delta weights are also updated. These are calculated by multiplying the second layer deltas by the second layer of weights and multiplying that by the activation derivative of the hidden layer activation values. These values are then multiplied by the input layer values.

Worth noting, is that the backProp method does not actually update the weights, it merely provides the delta weight values that can later be used to perform the update.

// Support a single double value as the target
public double backProp(double targets){
double[] t = new double;
t=targets;
return backProp(t);
}

public double backProp(double[] targets){
double error=0;
double errorSum=0;
double[] D1; // First Layer Deltas
double[] D2; // Second Layer Deltas
D1 = initialiseDelta(new double[hiddenCount]);
D2 = initialiseDelta(new double[outputCount]);

// Calculate Deltas for the second layer and the error
for(int i = 0;i&amp;lt;D2.length; i++){
D2[i]=(targets[i]-OutputValues[i]) * activation.getDerivative(OutputValues[i]);

errorSum+= Math.pow((targets[i]-OutputValues[i]),2); // Squared Error
}
error = errorSum / 2;

// Update Delta Weights for second layer
for(int i = 0; i&amp;lt;outputCount; i++){
for(int j = 0; j&amp;lt;hiddenCount+1; j++){
DW2[i][j] += D2[i] * HiddenValues[j];
}
}

// Calculate Deltas for first layer of weights
for(int j = 0; j&amp;lt;hiddenCount; j++){
for(int k = 0; k&amp;lt;outputCount; k++){
D1[j] += (D2[k] * W2[k][j+1]) * activation.getDerivative(HiddenValues[j+1]);
}
}

// Update first layer deltas
for(int i=0; i&amp;lt;hiddenCount;i++){
for(int j=0; j&amp;lt;inputCount+1; j++){ // Account for bias unit
DW1[i][j] += D1[i] * InputValues[j];

}
}
return error;
}

In this particular implementation of ExecuteXOr, the weights are updated in every epoch. However, in some circumstances it can be advantageous to update the weights less regularly. The updateWeights method allows for updating of the weights to occur at any interval as directed by the calling method. Each time the weights are updated, the DW1 and DW2 arrays are reset to zero values. For as long as the weights are not updated, the delta weights accumulate.

The only input parameter for the method is the learningRate. When learningRate is set to 1, the entire delta weight is used in the update. When it is set to a fractional value, the rate at which the weights are updated is reduced. ExecuteXor is using a learning rate of 0.1, ensuring that only 10% of the delta weight value is used to update the weights at each epoch.

public void updateWeights(double learningRate){
// Update first layer of weights
for(int i = 0; i&amp;lt;W2.length; i++){
for(int j = 0; j&amp;lt;W2[i].length; j++){
W2[i][j] += learningRate * DW2[i][j];
}
}

// Update second layer of weights
for(int i = 0; i&amp;lt;W1.length; i++){
for(int j = 0; j&amp;lt;W1[i].length; j++){
W1[i][j] += learningRate * DW1[i][j];
}
}

// Reset delta weights
DW1 = initialiseDeltaWeights(new double[hiddenCount][inputCount+1]);
DW2 = initialiseDeltaWeights(new double[outputCount][hiddenCount+1]);
}

}

Conclusion

This post is the second part of an article on multilayer perceptron (MLP) artificial neural networks for the XOr problem.  It has demonstrated a Java implementation of an MLP.  The article began with a brief recap of the XOr problem and a summary of the processes used to train an MLP, including a high-level discussion of forward propagation and backward propagation.

The class structure and dependencies for the implementation were then detailed before stepping through each class in the implementation, complete with Java code and detailed explanations of each of the methods.

Approaches to Big Combinatorial Optimisation Problems

Combinatorial optimisation is a problem category in which the goal is to find an optimal combination of entities.  A classic example is the travelling salesman problem, in which the goal is to plot the most efficient route a salesman can take to visit all of the towns in scope, given the locations of towns and distances between them.

Problems such as this can be solved to global optimality using deterministic methods only when the search space involved is very small.  However, it is almost always the case that any real-world problem instance will have a search space much too large to expect to find global optimum (Talbi et al, 2006).  So before a viable approach can be postulated for a combinatorial optimisation problem, it is important to understand the nature of the problem complexity through the lens of computational complexity theory.

For any given problem under consideration, computational complexity theory asks whether there is an algorithm that can effectively solve it (Ogihara, 1999).  Attempts are then made to categorise the problem on the basis of its computational difficulty relative to other problems (Hromkovic, 2010).

Temporal complexity (or time complexity) is a measure of complexity based on the relationship between an algorithm’s running time and the size of its input.  In other words, an algorithm’s temporal complexity is a function of its input-size (Hromkovic, 2010).  ‘Time’ in this instance refers to the number of steps that the algorithm must take in order to run to completion (Mayordomo, 2004).  In order for an algorithmic solution to be considered viable, its temporal complexity must be polynomial (Papadimitriou et al, 1998).  In such cases, the temporal-complexity is said to be “polynomial-time” or P (Demaine, 2011).

For many problems, there is no known algorithm in P, but individual candidate solutions can nonetheless be verified in polynomial-time (Mayordomo, 2004).  That is, although a non-deterministic process might be used to come up with potential sub-optimal solutions, the process of scoring each individual candidate for comparison can be completed deterministically.  This is known as deterministic verification (Hromkovic, 2010).  When formulated as decision problems — by ensuring the output is binary — such problems are regarded as nondeterministic-polynomial–time problems or NP (Mayordomo, 2004).

All P problems also fall into NP because they too have nondeterministic solutions (Demaine, 2011).  At present, it is not known whether the reverse is the case and all NP problems have solutions in P (Mayordomo, 2004).  The question of whether P=NP is one of the most important problems in mathematics and computer science today (Sipser, 1992).  Rather than attempt to answer this millennium prize question, in most contexts it makes sense to assume that P≠NP.  This approach assumes that not all problems in NP have polynomial-time solutions waiting to be discovered.

Two additional temporal complexity classes to consider are the NP-Hard and NP-Complete classes.  NP-Hard is the set of all problems that are at least as difficult as the hardest problems in NP (Demaine, 2011).  This includes problems that are not in NP, such as combinatorial optimisation problems.  The NP-Complete class on the other hand, is the set of all NP-Hard problems that are also in NP and thus are formulated as decision problems (Demaine, 2011). Relationships between the relevant classes of temporal-complexity (assuming P≠NP)

All NP-Complete problems can be solved by exhaustively checking every candidate solution in the search space.  However, the search spaces tend to be far too large for this approach to be considered practical (Woeginger, 2003).  NP-Hard and NP-Complete heuristics must therefore use more efficient methods to explore the solution space.

The first thing to go is the idea of finding the globally optimal solution.  Instead we search for any high-quality sub-optimal solution.   Furthermore, with no known polynomial-time algorithm that can solve them, approaches to these problems tend to use deterministic verification.   Most commonly by applying a standard metaheuristic.

Metaheuristics are generalised (high-level) heuristics.  They fall into the major categories of local search and evolutionary algorithms (Talbi et al, 2006).  Examples of local search-based metaheuristics include local search itself, tabu search and simulated annealingGenetic algorithms on the other hand, are examples of evolutionary algorithms (Talbi et al, 2006).

Local search algorithms explore the search space one solution at a time, improving on the incumbent solution by making a single change at each iteration until the locally optimal solution is found.  This creates the problem of the heuristic getting stuck when it converges on the locally optimal solution.  For this reason, some local search metaheuristics include means by which to break away from the local optima, allowing them to explore more than one locally optimal solution.  These tend to produce better results.

The evolutionary approach involves evaluating many different candidate solutions all at once and includes some means of excluding poor solutions and splitting and re-combining good solutions in the hope of improving on them in the next generation.  Genetic algorithms are the most common example.

Metaheuristics have the advantage of applicability to a variety of combinatorial optimisation problems.  For this reason, the metaheuristic approach is viewed as a generic approach to problem solving.  Furthermore, metaheuristic development is relatively simple and considerably cheaper than bespoke development (Hromkovic, 2010), though the cost saving comes at the expense of performance (Dowsland, 1998; Talbi, 2006).

In recent years a new category of metaheuristic has emerged in an attempt to create a heuristic that is both generalisable and does not suffer from the deminished performance of classic metaheuristics.  These are known as hyper-heuristics.

The term ‘hyper-heuristic’ was coined to describe metaheuristics that invoke other heuristics (Cowling et al, 2001), a concept that has existed since the 1960s (Ross, 2005), however, the definition was recently expanded to include algorithms that generate heuristics (Burke et al, 2010).

A hyper-heuristic algorithm approaches a problem by calling on a series of low-level heuristics to generate solutions (Burke et al, 2009). While a metaheuristic search space is made up of problem solutions, the hyper-heuristic search space is composed of heuristic algorithms (Burke et al, 2009).  Hyper-heuristics have been found to exhibit generality well beyond that of being able to provide state-of-the-art results for multiple problem instances within a given problem category.  In fact, there are examples of individual hyper-heuristic implementations that can actually solve many distinct problem types.  That is, the same implementation deployed to schedule employees to rosters can be deployed to solve the travelling salesman, knapsack, vehicle routing and many other problem types.

Future posts in this series will explore each of the most commonly used metaheuristics in detail.

Sources

Burke, E. K. Hyde, M. Kendall, G. Ochoa, G. Ozcan, E. Qu, R. (2009). A survey of hyper-heuristics. Computer Science Technical Report No. NOTTCS-TR-SUB-0906241418-2747, School of Computer Science and Information Technology, University of Nottingham.

Burke, E. K. Hyde, M. Kendall G., Ochoa, G. Özcan, E. Woodward J. R. (2010). A Classification of Hyper-heuristic Approaches. In Handbook of Metaheuristics, International Series in Operations Research & Management Science, 146, 449-468.

Cowling, P. Kendall, G. Soubeiga, E. (2001). A hyper-heuristic approach to scheduling a sales summit. In Practice and Theory of Automated Timetabling III, 176-190.

Demaine, E. (2011). Introduction to Algorithms: Lecture 23 – Computational Complexity. Accessible from <http://www.youtube.com/watch?v=moPtwq_cVH8> [Accessed on 22 June 2013]. Massachusetts Institute of Technology.

Dowsland, K. A. (1998). Off-the-Peg or Made-to-Measure? Timetabling and Scheduling with SA and TS. In Practice and Theory of Automated Timetabling II, 37-52

Hromkovic, J. (2010). Algorithmics for Hard Problems: Introduction to Combinatorial Optimization, Randomization, Approximation and Heuristics. Zurich Switzerland, Springer.

Mayordomo, E. (2004). P versus NP. Monografías de la Real Academia de Ciencias Exactas, Físicas, Químicas y Naturales de Zaragoza, (26), 57-68.

Papadimitriou, C. H. Steiglitz, K. (1998). Combinatorial Optimization: Algorithms and Complexity, New York, Dover Publications.

Ross, P. (Ed). (2005). Hyper-heuristics. In Search methodologies, 529-556.

Sipser, M. (1992). The history and status of the P versus NP question. In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, 603-618.

Talbi, E. (Ed). (2006). Parallel Combinatorial Optimization, Villeneuve d’Ascq, France, Wiley

Ogihara, M 1999, ‘COMPUTATIONAL COMPLEXITY THEORY’, Encyclopaedia Of Electrical & Electronics Engineering, 3, 618-628.

Woeginger, G. J. (2003). Exact algorithms for NP-hard problems: A survey. In Combinatorial Optimization—Eureka, You Shrink! (pp. 185-207).

Irish Electric Vehicle Charge Point Status Datasets

I recently completed a minor thesis as a partial requirement for an MSc in computer science, a course with heavy leanings towards machine learning and data analytics.  The thesis explored the question of whether predictive analytics can be used to predict electric vehicle (EV) charge point availability in Ireland.

Contention for charge points is of increasing concern to Irish EV owners as the ratio of plugin EVs to charge points is set to rapidly increase in the near future, despite there being no plans to increase investment in the infrastructure.  A key motivation for the research was the idea that an algorithm that can make better-than-chance predictions about the availability and reliability of charge points from historical data, can potentially be used to inform a vehicle routing algorithm of charging stations to avoid when making route decisions for electric vehicles.

Unfortunately there were no datasets available to us lowly MSc students, so a big part of the workload involved building my own from live charge point status data provided in Ireland by ESB E-Cars, the organisation which at the time of writing, is responsible for maintaining the publicly funded charge-point networks in the Republic of Ireland and Northern Ireland. The live status data is available on the E-Cars charge point map and E-Car Connect app.

While the thesis only considered data from November 2016 to June 2017, I have continued to collect data and have made it accessible to EV drivers through a web site (www.cpinfo.ie).  Although it is a work in progress and currently has limited search capabilities, CPInfo has proven to be a useful tool in evaluating the reliability and potential availability of charge points.  EV drivers use this information when planning their routes as it can help to identify charge points that are frequently out of service or occupied by other drivers.

Open Data Licence

I am now making the raw datasets available for anyone to use under a creative commons attribution 4.0 international public licence.  Licencing is necessary because considerable pre-prossing has been conducted on the source data to produce the datasets and this work attracts copyright protection implicitly. Explicitly offering an open data licence provides clarity on how the data can be used. Essentially, the licence allows anyone to use the data for any purpose on the condition that the source is correctly attributed. To attribute the source of the datasets, simply provide a hyperlink or citation referencing this blog post.

The licencing of the formatted datasets does not undermine the copyright that may be held by ESB E-Cars, who might own aspects of the datasets by virtue of owning the source data.  A representative of ESB E-cars has described this data as “publicly available and free to use”.

Dataset Details

The datasets take the form of monthly tab-delimited text files. Each line includes the date, time, charge point Id, charge point type {StandardType2, CHAdeMO, CCS, FastAC}, latitudinal and logitudinal coordinates, status {OOS, OOC, Part, Occ, Unknown} and address of a single charge point.  The data represent a snapshot of the status of the charge point network taken at five-minute intervals.

StandardType2 is a slow charge point of up to 22kw AC, while the CHAdeMO, CCS and FastAC are different types of DC and AC rapid charge points.  A standard type two charge point represented in the datasets typically has two available connections that can be used simultaneously, while a rapid charge point can have either one, two or three connections, each with a different connection type (CHAdeMO, CCS or FastAC).

The available status has been omitted as it can be implied by the absence of a record. The other statuses include out of service (OOS), out of contact (OOC), partially occupied (Part), fully occupied (Occ) and Unknown.  An unknown status exists where the status data was either not available or otherwise not polled due to connection issues at the time interval in question.  Where the status is unknown for all charge points at the interval, a single record exists with a charge point id of unknown and a status of unknown.

There are also a number of nuances in the data that must be considered.

First of all, charge points are sometimes moved from one location to another and/or replaced by another charge point. When this happens, the charge point Id at a given location changes in the dataset.  Furthermore, the charge point removed from one location, can appear again at another. The charge point Id therefore cannot be relied upon as a means to track the charging activity at a particular location. Instead, the latitudinal and longitudinal coordinates should be used. However, these can also change slightly (by a matter of meters) and thus can’t be used directly as unique identifiers. To get the full history of activity at a given location, a search should include charge points within a tight range of latitudinal and longitudinal values rather than by matching the exact values.

Another nuance is that the charge point map sometimes goes through periods, of up to several days, where the status data are not updated. This is reflected in the datasets by statuses that don’t change for unusually long periods of time.  As it stands, the datasets reflect the state of the charge point maps during those periods rather than the statuses of the charge points in reality.  These periods can be identified via a checksum calculated over all rows at each time interval or by manually checking the statuses at the busiest charge points.

Rapid chargers never hold a status of partly occupied despite the fact that in many cases it is possible to charge on the FastAC connection at the same time as CCS or CHAdeMO connection.  Furthermore, an occupied status on the CCS or CHAdeMO connection does not necessarily imply that that specific connection was in use.  This is because only one of the two can be used at any given time and thus if the CHAdeMO connection is in use, then CCS is also unavailable and vice versa.

At the time of writing there are no examples of multiple rapid charge points in the same vicinity, however, there are a small number of cases where there are multiple standard type two charge points at the same location.  Dundrum town centre and the Stillorgan Luas station are two examples.  On the map, these are represented slightly differently to other charge points in that the charge point Ids and statuses are grouped together.  If there are two charge points, and therefore four connections, only one icon appears on the map listing both charge point ids and the status will only show as fully occupied if all four connections are in use.  Consequently, these were omitted from the dataset.  I intend to rectify this in a future script update.

Any updates to the datasets will be expressed by updating this blog post.  The datasets are available for download here, where they will be updated monthly.

Why are there so many fake data scientists and machine learning engineers? The title of this post is a question I answered recently on Quora, a post that seems to have gathered some interest, so I thought it might be worthwhile expanding on it here.

In my response, I pointed out that in recent months I have encountered a number of software engineers who seem to believe that machine learning libraries, such as Tensorflow, can sufficiently abstract away the need for machine learning knowledge in much the same way that high level programming languages in most industrial areas of application have abstracted away the need for knowledge of low-level programming.

I should point that I have nothing at all against the use of machine learning libraries and I am in no way advocating for the coding of machine learning algorithms from scratch in industrial practice.  Where I have advocated for coding machine learning algorithms from scratch in the past, it has always been for the purpose of education.  The point I have attempted to make in my two paragraph post on Quora, and in these two previous blog posts, is that there is a range of knowledge that is required for machine learning development regardless of whether you are personally coding the algorithms or referencing software libraries.

Machine learning engineers and data scientists need to understand what kind of data needs to be gathered or found from the start of any project.  They need to understand how to pre-process that data, perform feature selection, cross validation for both model selection and parameter tuning of the selected model, all while being careful to avoid overfitting.  They have to understand what tools are available to them, when it is appropriate to use them and how to set their parameters.  They have to be able to design full machine learning pipelines, possibly with multiple machine learning algorithms interacting.  Without this knowledge, expect a lot of time wasted through unnecessary trial-and-error experimentation, or worse, models that fail to make accurate predictions in the wild.

Automating machine learning libraries so they can complete some of this work without user knowledge is a long-standing goal of many machine learning researchers in industry and academia.  The so-called “democratisation of machine learning” isn’t a new concept and varying degrees of success have been achieved in the automation of some of the algorithmic and statistical knowledge required to do machine learning (Centinsoy et al., 2016; IBM Analytics, 2016) or otherwise lower the barrier to entry for machine learning practitioners (Chen et al., 2016; Guo et al., 2016; Patel, 2010; 2016).  But we’re not yet at a point where a software engineer can jump into machine learning development without some kind of introductory training or mentorship.  Those who do are involved in machine learning black magic.

Furthermore, there is the question of the degree of capability of a software engineer who has little knowledge of machine learning.  In a previous post I quoted former Kaggle chief scientist, Jeremy Howard, suggesting that there is a non-linear disparity in capability between the best and average machine learning developers and that the best of the best learned their trade by understanding the mathematics behind the algorithms.  Howard was not talking about people coding machine learning algorithms from scratch.  He was talking about Kaggle competition Entrants, who almost universally use libraries.  The fact of the matter is that the people who understand the algorithms put the libraries to better use and perform better in Kaggle competitions by orders of magnitude.

Lest the reader assume I am against the idea of software engineers working on machine learning projects, nothing could be further from the truth.  In my view, the whole field of machine learning development is in dire need of the sort of software-engineer thinking that brought the SOLID principles and software design patterns to object oriented software development.  As Sculley et al. (2014) have pointed out, machine learning implementations bring with them a whole slew of new ways to generate technical debt in a software project.  Despite this, very little guidance has been proposed in the way of best practices or design patterns for machine learning implementations.  The sum of existing work basically amounts to the aforementioned paper (Sculley et al., 2014) and another rejected conference paper on the topic of design patterns for deep convolutional neural networks (Smith & Topin, 2016). Moving machine learning into the hands of more industrial software engineers who care about the practical implications it will have on their projects can only be good for the field.  I’m merely advising caution.

References

Cetinsoy, A., Martin, F. J., Ortega, J. A., Petersen, P. (2016). The Past, Present, and Future of Machine Learning APIs. In Proceedings of The 2nd International Conference on Predictive APIs and Apps (pp. 43-49).

Chen, D., Bellamy, R. K., Malkin, P. K., & Erickson, T. (2016). Diagnostic visualization for non-expert machine learning practitioners: A design study. In Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on (pp. 87-95). IEEE.

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.

IBM Analytics. (2016). The democratization of Machine Learning: Apache Spark opens up the door for the rest of us. IBM White Paper. Accessed on May 17, 2017, from: https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=CDW12360USEN

Patel, K. (2010). Lowering the barrier to applying machine learning. In Adjunct proceedings of the 23nd annual ACM symposium on User interface software and technology (pp. 355-358). ACM.

Patel, K. D. (2013). Lowering the barrier to applying machine learning (Doctoral dissertation). University of Washington. Accessed on March 25, 2017, from http://www.cc.gatech.edu/~stasko/8001/heer06.pdf

Sculley, D., Phillips, T., Ebner, D., Chaudhary, V., Young, M. (2014). Machine learning: The high-interest credit card of technical debt. In SE4ML:Software Engineering for Machine Learning (NIPS 2014 Workshop). Accessed on March 25, 2017, from http://www.eecs.tufts.edu/~dsculley/papers/technical-debt.pdf

Smith, L. N., & Topin, N. (2016). Deep convolutional neural network design patterns. arXiv preprint arXiv:1611.00847. Submitted to the International Conference on Learning Representations (ICLR) and rejected, 2017. Accessed on March 06, 2017, from https://pdfs.semanticscholar.org/8863/9a6e21a8a8989e6d25e44119a90ba0b27628.pdf

Artificial Neural Networks – Part 1: The XOr Problem

Introduction
This is the first in a series of posts exploring artificial neural network (ANN) implementations.  The purpose of the article is to help the reader to gain an intuition of the basic concepts prior to moving on to the algorithmic implementations that will follow.

No prior knowledge is assumed, although, in the interests of brevity, not all of the terminology is explained in the article.  Instead hyperlinks are provided to Wikipedia and other sources where additional reading may be required.

This is a big topic. ANNs have a wide variety of applications and can be used for supervised, unsupervised, semi-supervised and reinforcement learning. That’s before you get into problem-specific architectures within those categories. But we have to start somewhere, so in order to narrow the scope, we’ll begin with the application of ANNs to a simple problem.

The XOr Problem
The XOr, or “exclusive or”, problem is a classic problem in ANN research. It is the problem of using a neural network to predict the outputs of XOr logic gates given two binary inputs. An XOr function should return a true value if the two inputs are not equal and a false value if they are equal. All possible inputs and predicted outputs are shown in figure 1.

XOr is a classification problem and one for which the expected outputs are known in advance.  It is therefore appropriate to use a supervised learning approach.

On the surface, XOr appears to be a very simple problem, however, Minksy and Papert (1969) showed that this was a big problem for neural network architectures of the 1960s, known as perceptrons.

Perceptrons
Like all ANNs, the perceptron is composed of a network of units, which are analagous to biological neurons.  A unit can receive an input from other units.  On doing so, it takes the sum of all values received and decides whether it is going to forward a signal on to other units to which it is connected.  This is called activation.  The activation function uses some means or other to reduce the sum of input values to a 1 or a 0 (or a value very close to a 1 or 0) in order to represent activation or lack thereof.  Another form of unit, known as a bias unit, always activates, typically sending a hard coded 1 to all units to which it is connected.

Perceptrons include a single layer of input units — including one bias unit  — and a single output unit (see figure 2).  Here a bias unit is depicted by a dashed circle, while other units are shown as blue circles. There are two non-bias input units representing the two binary input values for XOr.  Any number of input units can be included.

The perceptron is a type of feed-forward network, which means the process of generating an output — known as forward propagation — flows  in one direction from the input layer to the output layer.  There are no connections between units in the input layer.  Instead, all units in the input layer are connected directly to the output unit.

A simplified explanation of the forward propagation process is that the input values X1 and X2, along with the bias value of 1, are multiplied by their respective weights W0..W2,  and parsed to the output unit.  The output unit takes the sum of those values and employs an activation function — typically the Heaviside step function — to convert the resulting value to a 0 or 1, thus classifying the input values as 0 or 1.

It is the setting of the weight variables that gives the network’s author control over the process of converting input values to an output value.  It is the weights that determine where the classification line, the line that separates data points into classification groups,  is drawn.  If all data points on one side of a classification line are assigned the class of 0, all others are classified as 1.

A limitation of this architecture is that it is only capable of separating data points with a single line. This is unfortunate because the XOr inputs are not linearly separable. This is particularly visible if you plot the XOr input values to a graph. As shown in figure 3, there is no way to separate the 1 and 0 predictions with a single classification line. Figure 3: Plotted XOr Inputs with Colour Coded Expected Outputs (red=0; green=1)

Multilayer Perceptrons
The solution to this problem is to expand beyond the single-layer architecture by adding an additional layer of units without any direct access to the outside world, known as a hidden layer.  This kind of architecture — shown in Figure 4 — is another feed-forward network known as a multilayer perceptron (MLP).

It is worth noting that an MLP can have any number of units in its input, hidden and output layers.  There can also be any number of hidden layers.  The architecture used here is designed specifically for the XOr problem.

Similar to the classic perceptron, forward propagation begins with the input values and bias unit from the input layer being multiplied by their  respective weights, however, in this case there is a weight for each combination of input (including the input layer’s bias unit) and hidden unit (excluding the hidden layer’s bias unit).  The products of the input layer values and their respective weights are parsed as input to the non-bias units in the hidden layer.  Each non-bias hidden unit invokes an activation function — usually the classic sigmoid function in the case of the XOr problem — to squash the sum of their input values down to a value that falls between 0 and 1 (usually a value very close to either 0 or 1), or in the case of tanh, a value close to either -1 or 1.  The outputs of each hidden layer unit, including the bias unit, are then multiplied by another set of respective weights and parsed to an output unit.  The output unit also parses the sum of its input values through an activation function — again, the sigmoid function is appropriate here — to return an output value falling between 0 and 1.  This is the predicted output.

This architecture, while more complex than that of the classic perceptron network, is capable of achieving non-linear separation.  Thus, with the right set of weight values, it can provide the necessary separation to accurately classify the XOr inputs.

Backpropagation
The elephant in the room, of course, is how one might come up with a set of weight values that ensure the network produces the expected output.  In practice, trying to find an acceptable set of weights for an MLP network manually would be an incredibly laborious task.  In fact, it is NP-complete (Blum and Rivest, 1992).  However, it is fortunately possible to learn a good set of weight values automatically through a process known as backpropagation.  This was first demonstrated to work well for the XOr problem by Rumelhart et al. (1985).

The backpropagation algorithm begins by comparing the actual value output by the forward propagation process to the expected value and then moves backward through the network, slightly adjusting each of the weights in a direction that reduces the size of the error by a small degree.  Both forward and back propagation are re-run thousands of times on each input combination until the network can accurately predict the expected output of the possible inputs using forward propagation.

For the xOr problem, 100% of possible data examples are available to use in the training process. We can therefore expect the trained network to be 100% accurate in its predictions and there is no need to be concerned with issues such as bias and variance in the resulting model.

Conclusion
In this post, the classic ANN XOr problem was explored. The problem itself was described in detail, along with the fact that the inputs for XOr are not linearly separable into their correct classification categories. A non-linear solution — involving an MLP architecture — was explored at a high level, along with the forward propagation algorithm used to generate an output value from the network and the backpropagation algorithm, which is used to train the network.

Part two of this post features a Java implementation of the MLP architecture described here, including all of the components necessary to train the network to act as an XOr logic gate.

References
Blum, A. Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117-127.

Minsky, M. Papert, S. (1969). Perceptron: an introduction to computational geometry. The MIT Press, Cambridge, expanded edition, 19(88), 2.

Rumelhart, D. Hinton, G. Williams, R. (1985). Learning internal representations by error propagation (No. ICS-8506). California University San Diego LA Jolla Inst. for Cognitive Science.

Discussion: Why Machine Learning Beginners Shouldn’t Avoid the Math

In a post I published yesterday, I argued that it is important for students of machine learning to understand the algorithms and underlying mathematics prior to using tools or libraries that black box the code. I suggested that to do so is likely to result in a lot of “time-wasting confusion” due to students not having the necessary understanding to configure parameters or interpret results. One of the examples I provided for the opposing view was this blog post from BigML, which argues that beginners don’t need courses such as those provided by Coursera if they use their tool.

Francisco J. Martin, CEO of Big ML, has tweeted in response. So Kids shouldn’t avoid assembler, automata, and compilers when learning to code?

This is a very good question and one that grants us an opportunity to dig deeper into the issue. I am responding here because I don’t believe it’s a question I can answer in 140 characters.

The short answer is no, I’m perfectly ok with beginner programmers starting out in high-level languages and working their way down, or even stopping there and not working their way down. But this is not analagous to machine learning.

I see three big differences.

First of all, learning a high-level language is actually a constructive step towards learning lower level languages. If that’s the goal, and you started with something like Java, you could potentially learn quite a lot about programming in general. Then trying C++ would help to fill in blanks with resect to some of the aspects of programming the Java glosses over. Likewise, Assembler could take you a step further.

If playing with the parameters of black-boxed algorithms offers a path at all towards becoming proficient at machine learning, it’s an incredibly innefficient one. It’s an awfully big search space to approach by trial and error when you consider the combinations of parameters, feature selection and the question of whether you have enough or appropriate data examples.

The second difference is that to do high-level programming does not require an understanding of low-level programming. I can do anything that Java or c# will let me do without knowing anything about assembly language.  In comparison, a machine learning tool requires me to know how to set appropriate values of parameters that are parsed into the hidden algorithms. They also require me to understand whether or not I have an appropriate (representative) dataset with appropriate features. Then when it finishes I need to be able to interpret the results and take appropriate actions. Better outcomes come from more informed decisions.

The third difference relates to the potential benefits of exploring the low-level languages. There are some exceptions to this, but generally speaking, writing more efficient algorithms in low-level languages comes at such great expense in comparison to the constantly falling cost of computation, that it just isn’t worthwhile.

In my last post I cited Kaggle’s chief scientist, Jeremy Howard, who said there was a massive difference in capability between good and average data scientists. I take this to indicate that in machine learning, more knowledge leads to exponentially better outcomes.  Unlike low-level programming, there is a huge benefit to having a detailed knowledge of machine learning.

I have come across some arguments suggesting that as Moore’s law reaches its limit, low-level coding will become much more sought after. If that happens I’ll revisit my position on low-level coding, but for now I’m betting that specialist processors like GPUs will help to bridge the gap before the next paradigm of computation comes along to keep the gravy train of exponential price-performance improvement going.

The Self-reinforcing Myth of Hard-wired Math Inability

There is a commonly held belief that some people have brains that are pre-wired for mathematical excellence, while everyone else is doomed to struggle with the subject. This toxic myth needs to be put deep in the ground and buried in molten lead. It is as destructive as it is self-fulfilling.

The myth equally encourages people who are good at math to falsely believe (Murayama et al., 2012) they are more intelligent than those who are not, and leaves everyone else inclined to believe they can never improve. This is despite the fact that math ability has very little to do with intelligence (Blair et al., 2007).

The reason this myth exists is well understood. School students who were well prepared by their parents in math prior to starting school find themselves separated in ability from their classmates who were not. The latter group consider the seemingly unachievable abilities of their peers and quickly lose confidence in their own abilities. Once that self-confidence is lost, any attempt at completing a math problem leads to math anxiety (Ashcraft et al., 2002; Devine et al., 2012), where thoughts of self-doubt cloud the mind and make it difficult to concentrate on the task at hand.

Mathematics, like computer programming, is a discipline that requires concentration. The student needs to be able to follow a train of thought where A leads to B leads to C etc. A student who lacks self-confidence struggles to maintain the necessary train of thought due to being repeatedly interrupted by negative thoughts about their abilities.  This results in poor performance and reinforces the idea that they are incapable of learning the subject.

It is interesting to see this belief so prevalent among software developers who are perfectly capable of writing an algorithm in a programming language, but suddenly feel that it is impossible to grasp the same algorithm represented by a set of mathematical symbols. There is simply no reason that this should be the case. I’ve yet to meet an experienced programmer who would tell me they find it near-impossible to learn the syntax of a new programming language and yet that is precisely what is entailed in learning how to express an algorithm using linear algebra.

A common point of confusion for many who haven’t done a lot of math since secondary school is in the use of mathematics as a language rather than a set of equations to be solved. In academic computer science, linear algebra, as it is used to express algorithms, is not something to be solved, but rather a language used to describe an algorithm.

Understanding the language of academic computer science is becoming increasingly important as the traditional staples of academia, such as machine learning, increasingly find use in industry.  After all, even if a software developer manages to avoid the math in their work, how can they expect to keep up with the latest developments in this fast-moving field without an ability to understand the academic literature?  Yet this is precisely what some software developers are attempting to do.

Math inability is not hard wired and software developers are already well practiced in the mental skills required.  We use the skill of stepping through a problem and visualising the state changes that occur at each step, every time we read or write a piece of code.  Anyone who can do that is capable of becoming proficient enough in mathematics to understand the mathematical components of the computer science literature.

References

Ashcraft, M. H. (2002). Math anxiety: Personal, educational, and cognitive consequences. Current directions in psychological science, 11(5), 181-185.

Blair, C., & Razza, R. P. (2007). Relating effortful control, executive function, and false belief understanding to emerging math and literacy ability in kindergarten. Child development, 78(2), 647-663.

Devine, A., Fawcett, K., Szűcs, D., & Dowker, A. (2012). Gender differences in mathematics anxiety and the relation to mathematics performance while controlling for test anxiety. Behavioral and brain functions, 8(1), 1.

Murayama, K., Pekrun, R., Lichtenfeld, S., & Vom Hofe, R. (2012). Predicting long‐term growth in students’ mathematics achievement: The unique contributions of motivation and cognitive strategies. Child development, 84(4), 1475-1490.

Andreescu, T., Gallian, J. A., Kane, J. M., & Mertz, J. E. (2008). Cross-cultural analysis of students with exceptional talent in mathematical problem solving. Notices of the AMS, 55(10), 1248-1260.

Berger, A., Tzur, G., & Posner, M. I. (2006). Infant brains detect arithmetic errors. Proceedings of the National Academy of Sciences, 103(33), 12649-12653.

Post Edits

13/07/2016 – Added references and see also sections.  Updated inline references to show primary sources rather than just linking to secondary sources.

14/07/2016 – Corrected typo in final paragraph “Math ability is not hard wired…” changed to “Math inability is not hard wired”.

Why Learn Machine Learning and Optimisation?

In this post I hope to convince the reader that machine learning and optimisation are worthwhile fields for a software developer in industry to engage with.

I explain the purpose of this blog and argue that we are in the midst of a machine-learning revolution.

______________________________________________________________

When I first started coding as a teenager in the early 1990s, the future looked certain to be shaped by artificial intelligence. We were told that we’d soon have “fifth generation” languages that would allow for the creation of complex software applications without the need for human programmers. Expert systems would replace human experts in every walk of life and we’d talk to our machines in much the same way Gene Roddenberry imagined we should. Unfortunately, this model of reality didn’t quite go to plan. After many years of enormous research and development expense — mainly focused in Japan — we entered another AI winter. The future was left in the hands of a handful of diehard academics, while the software industry mostly ignored AI research.

The good news is that the AI winter is now well and truly over.  The technology has been slowly but surely increasing its influence on mainstream software development and data analytics for at least a decade and 2015 has been billed as a breakthrough year by media sources such as Bloomberg and and Wired magazine.

Whether we realise it or not, most of us use AI every day. In fact, AI is responsible for all of the coolest software innovations you’ve heard of in recent years. It is the basis for autonomous helicopters, autonomous cars, big data analytics, google search, automatic language translation, targeted advertising, optical character recognition, speech recognition, facial recognition, anomaly detection, news-article clustering, vehicle routing and product recommendation, just to list the few examples I could name at the time of writing.

As a field, artificial intelligence has been deeply rooted in academia for decades, but it is quickly becoming prevalent in industry.  We are at the dawn of the AI revolution and there has never been a better time to start sciencing up your skill set.

This blog, is here to help and, as its name suggests, will focus on two important and complementary sub-fields of AI: Machine Learning and Optimisation. The intention is to explain both topics in a language that software developers in industry can easily understand, with or without a background in hard computer science.

I believe this is an important addition to the discourse on these topics because most of the sources you’re likely to come across assume a strong existing knowledge of linear algebra, calculus, statistics, probability, information theory and computational complexity theory: the language of academic computer science.  This is unsurprising, given that the techniques were mostly developed by computer scientists, mathematicians and statisticians, but it can unfortunately be a barrier to a lot of people getting started.

The intention here is to remove that barrier by describing the various techniques using familiar, medium-level programming languages.  The posts that follow will not shy away from the theory, but no assumptions will be made with respect to prior understanding of mathematics or computer science and code snipets will accompany any mathematical descriptions.