Why are there so many fake data scientists and machine learning engineers?

The title of this post is a question I answered recently on Quora, an answer that seems to have gathered some interest, so I thought it might be worthwhile expanding on it here.

In my response, I pointed out that in recent months I have encountered a number of software engineers who seem to believe that machine learning libraries, such as TensorFlow, can sufficiently abstract away the need for machine learning knowledge, in much the same way that high-level programming languages have, in most industrial areas of application, abstracted away the need for knowledge of low-level programming.

I should point out that I have nothing at all against the use of machine learning libraries, and I am in no way advocating for the coding of machine learning algorithms from scratch in industrial practice.  Where I have advocated for coding machine learning algorithms from scratch in the past, it has always been for the purpose of education.  The point I have attempted to make in my two-paragraph post on Quora, and in these two previous blog posts, is that there is a range of knowledge required for machine learning development regardless of whether you are personally coding the algorithms or relying on software libraries.

Machine learning engineers and data scientists need to understand what kind of data needs to be gathered or found at the start of any project.  They need to understand how to pre-process that data, perform feature selection, and use cross-validation for both model selection and parameter tuning of the chosen model, all while being careful to avoid overfitting.  They have to understand what tools are available to them, when it is appropriate to use them and how to set their parameters.  They have to be able to design full machine learning pipelines, possibly with multiple machine learning algorithms interacting.  Without this knowledge, expect a lot of time wasted on unnecessary trial-and-error experimentation, or worse, models that fail to make accurate predictions in the wild.
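To make the scope of that knowledge concrete, the sketch below strings a few of those steps together.  It is purely illustrative and is not drawn from the Quora answer: it assumes Python with scikit-learn, and the dataset (the library's built-in breast cancer data), the choice of a support vector machine and the parameter grid are all placeholders.

```python
# A minimal sketch of a supervised learning workflow: pre-processing, model
# selection and parameter tuning via cross-validation. The dataset, model and
# parameter grid are placeholders chosen for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Pre-processing and the model live in one pipeline so that cross-validation
# never leaks information from the held-out folds into the scaler.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Parameter tuning: every combination is evaluated with 5-fold cross-validation.
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}, cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("held-out accuracy:", grid.score(X_test, y_test))
```

Every value in that grid, the number of folds, and the decision to scale the features at all are exactly the kinds of choices that require machine learning knowledge, library or no library.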

Automating machine learning libraries so they can complete some of this work without user knowledge is a long-standing goal of many machine learning researchers in industry and academia.  The so-called “democratisation of machine learning” isn’t a new concept, and varying degrees of success have been achieved in automating some of the algorithmic and statistical knowledge required to do machine learning (Cetinsoy et al., 2016; IBM Analytics, 2016) or otherwise lowering the barrier to entry for machine learning practitioners (Chen et al., 2016; Guo et al., 2016; Patel, 2010; 2013).  But we are not yet at a point where a software engineer can jump into machine learning development without some kind of introductory training or mentorship.  Those who do are practising machine learning black magic.

Furthermore, there is the question of the degree of capability of a software engineer who has little knowledge of machine learning.  In a previous post I quoted former Kaggle chief scientist Jeremy Howard, who suggested that there is a non-linear disparity in capability between the best and average machine learning developers, and that the best of the best learned their trade by understanding the mathematics behind the algorithms.  Howard was not talking about people coding machine learning algorithms from scratch.  He was talking about Kaggle competition entrants, who almost universally use libraries.  The fact of the matter is that the people who understand the algorithms put the libraries to better use and perform better in Kaggle competitions, by orders of magnitude.

Lest the reader assume I am against the idea of software engineers working on machine learning projects, nothing could be further from the truth.  In my view, the whole field of machine learning development is in dire need of the sort of software-engineering thinking that brought the SOLID principles and software design patterns to object-oriented development.  As Sculley et al. (2014) have pointed out, machine learning implementations bring with them a whole slew of new ways to generate technical debt in a software project.  Despite this, very little has been proposed in the way of best practices or design patterns for machine learning implementations.  The sum of existing work basically amounts to the aforementioned paper (Sculley et al., 2014) and one rejected conference paper on the topic of design patterns for deep convolutional neural networks (Smith & Topin, 2016).  Moving machine learning into the hands of more industrial software engineers who care about the practical implications it will have on their projects can only be good for the field.  I’m merely advising caution.

References

Cetinsoy, A., Martin, F. J., Ortega, J. A., & Petersen, P. (2016). The Past, Present, and Future of Machine Learning APIs. In Proceedings of the 2nd International Conference on Predictive APIs and Apps (pp. 43-49).

Chen, D., Bellamy, R. K., Malkin, P. K., & Erickson, T. (2016). Diagnostic visualization for non-expert machine learning practitioners: A design study. In Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on (pp. 87-95). IEEE.

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.

IBM Analytics. (2016). The democratization of Machine Learning: Apache Spark opens up the door for the rest of us. IBM White Paper. Accessed on May 17, 2017, from: https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=CDW12360USEN

Patel, K. (2010). Lowering the barrier to applying machine learning. In Adjunct proceedings of the 23rd annual ACM symposium on User interface software and technology (pp. 355-358). ACM.

Patel, K. D. (2013). Lowering the barrier to applying machine learning (Doctoral dissertation). University of Washington. Accessed on March 25, 2017, from http://www.cc.gatech.edu/~stasko/8001/heer06.pdf

Sculley, D., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2014). Machine learning: The high-interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). Accessed on March 25, 2017, from http://www.eecs.tufts.edu/~dsculley/papers/technical-debt.pdf

Smith, L. N., & Topin, N. (2016). Deep convolutional neural network design patterns. arXiv preprint arXiv:1611.00847. Submitted to the International Conference on Learning Representations (ICLR) and rejected, 2017. Accessed on March 06, 2017, from https://pdfs.semanticscholar.org/8863/9a6e21a8a8989e6d25e44119a90ba0b27628.pdf

Artificial Neural Networks – Part 1: The XOr Problem

Introduction
This is the first in a series of posts exploring artificial neural network (ANN) implementations.  The purpose of the article is to help the reader to gain an intuition of the basic concepts prior to moving on to the algorithmic implementations that will follow.

No prior knowledge is assumed, although, in the interests of brevity, not all of the terminology is explained in the article.  Instead hyperlinks are provided to Wikipedia and other sources where additional reading may be required.

This is a big topic. ANNs have a wide variety of applications and can be used for supervised, unsupervised, semi-supervised and reinforcement learning. That’s before you get into problem-specific architectures within those categories. But we have to start somewhere, so in order to narrow the scope, we’ll begin with the application of ANNs to a simple problem.

The XOr Problem
The XOr, or “exclusive or”, problem is a classic problem in ANN research. It is the problem of using a neural network to predict the output of an XOr logic gate given two binary inputs. An XOr function should return a true value if the two inputs are not equal and a false value if they are equal. All possible inputs and expected outputs are shown in figure 1.

Figure 1: XOr Inputs and Expected Outputs

XOr is a classification problem and one for which the expected outputs are known in advance.  It is therefore appropriate to use a supervised learning approach.
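For readers who prefer to see the training data spelled out, the complete set can be written down in a few lines. This is a convenience added here, assuming Python with NumPy, and is not part of the original post.

```python
import numpy as np

# The complete XOr training set: every possible input pair and its expected output.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([0, 1, 1, 0])  # 1 when the inputs differ, 0 when they are equal
```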

On the surface, XOr appears to be a very simple problem.  However, Minsky and Papert (1969) showed that it cannot be solved by the single-layer neural network architectures of the 1960s, known as perceptrons.

Perceptrons
Like all ANNs, the perceptron is composed of a network of units, which are analogous to biological neurons.  A unit can receive input from other units.  On doing so, it takes the sum of all values received and decides whether it is going to forward a signal on to other units to which it is connected.  This is called activation.  The activation function reduces the sum of the input values to a 1 or a 0 (or a value very close to 1 or 0) in order to represent activation or the lack thereof.  Another kind of unit, known as a bias unit, always activates, typically sending a hard-coded 1 to all units to which it is connected.

Perceptrons include a single layer of input units — including one bias unit  — and a single output unit (see figure 2).  Here a bias unit is depicted by a dashed circle, while other units are shown as blue circles. There are two non-bias input units representing the two binary input values for XOr.  Any number of input units can be included.

Figure 2: Single Layer Perceptron Network

The perceptron is a type of feed-forward network, which means the process of generating an output — known as forward propagation — flows  in one direction from the input layer to the output layer.  There are no connections between units in the input layer.  Instead, all units in the input layer are connected directly to the output unit.

A simplified explanation of the forward propagation process is that the input values X1 and X2, along with the bias value of 1, are multiplied by their respective weights W0..W2, and passed to the output unit.  The output unit takes the sum of those values and applies an activation function — typically the Heaviside step function — to convert the resulting value to a 0 or 1, thus classifying the input values as 0 or 1.
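A minimal sketch of that forward pass, assuming Python, is shown below.  The weight values are arbitrary placeholders chosen for illustration; this particular setting happens to compute a logical OR rather than XOr.

```python
def heaviside(z):
    """Step activation: outputs 1 if the weighted sum is non-negative, otherwise 0."""
    return 1 if z >= 0 else 0

def perceptron_forward(x1, x2, w):
    # w[0] multiplies the bias input of 1; w[1] and w[2] multiply the two binary inputs.
    weighted_sum = w[0] * 1 + w[1] * x1 + w[2] * x2
    return heaviside(weighted_sum)

# Arbitrary example weights: with these values the network computes a logical OR.
w = [-0.5, 1.0, 1.0]
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron_forward(x1, x2, w))
```

Changing the three weight values moves the classification line described next; as the rest of this section explains, no setting of them will reproduce XOr.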

It is the setting of the weight variables that gives the network’s author control over the process of converting input values to an output value.  It is the weights that determine where the classification line, the line that separates data points into classification groups, is drawn.  All data points on one side of the classification line are assigned the class of 0; all those on the other side are assigned the class of 1.

A limitation of this architecture is that it is only capable of separating data points with a single line. This is unfortunate because the XOr inputs are not linearly separable. This becomes particularly obvious if you plot the XOr input values on a graph. As shown in figure 3, there is no way to separate the 1 and 0 predictions with a single classification line.

Figure 3: Plotted XOr Inputs with Colour Coded Expected Outputs (red=0; green=1)

Multilayer Perceptrons
The solution to this problem is to expand beyond the single-layer architecture by adding an additional layer of units without any direct access to the outside world, known as a hidden layer.  This kind of architecture — shown in Figure 4 — is another feed-forward network known as a multilayer perceptron (MLP).

Figure 4: Multilayer Perceptron Architecture for XOr

It is worth noting that an MLP can have any number of units in its input, hidden and output layers.  There can also be any number of hidden layers.  The architecture used here is designed specifically for the XOr problem.

Similar to the classic perceptron, forward propagation begins with the input values and bias unit from the input layer being multiplied by their respective weights.  In this case, however, there is a weight for each combination of input unit (including the input layer’s bias unit) and hidden unit (excluding the hidden layer’s bias unit).  The products of the input layer values and their respective weights are passed as input to the non-bias units in the hidden layer.  Each non-bias hidden unit invokes an activation function — usually the classic sigmoid function in the case of the XOr problem — to squash the sum of its input values down to a value that falls between 0 and 1 (usually a value very close to either 0 or 1).  The outputs of each hidden layer unit, including the bias unit, are then multiplied by another set of respective weights and passed to the output unit.  The output unit also passes the sum of its input values through an activation function — again, the sigmoid function is appropriate here — to return an output value falling between 0 and 1.  This is the predicted output.
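The following sketch, again in Python and not part of the original post, traces that computation for a 2-input, 2-hidden-unit, 1-output network.  The weight values are hand-picked for illustration; a trained network would arrive at different, but similarly effective, values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_hidden, w_output):
    # Hidden layer: each non-bias hidden unit sums the bias input (1) and the
    # two inputs, each multiplied by its own weight, then applies the sigmoid.
    hidden_out = sigmoid(w_hidden @ np.concatenate(([1.0], x)))
    # Output layer: the hidden activations, plus the hidden bias of 1, are
    # weighted, summed and squashed again to give the prediction.
    return sigmoid(w_output @ np.concatenate(([1.0], hidden_out)))

# Hand-picked weights that approximate XOr: the first hidden unit behaves like
# OR, the second like AND, and the output unit fires for "OR and not AND".
w_hidden = np.array([[-10.0, 20.0, 20.0],    # OR-like unit (bias, x1, x2 weights)
                     [-30.0, 20.0, 20.0]])   # AND-like unit
w_output = np.array([-10.0, 20.0, -20.0])    # bias, OR-unit, AND-unit weights

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", round(float(mlp_forward(np.array(x, dtype=float), w_hidden, w_output)), 3))
```

Run over the four possible inputs, the predictions land very close to 0, 1, 1 and 0 respectively.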

This architecture, while more complex than that of the classic perceptron network, is capable of achieving non-linear separation.  Thus, with the right set of weight values, it can provide the necessary separation to accurately classify the XOr inputs.

Non-linear Separation Made Possible by MLP Architecture

Backpropagation
The elephant in the room, of course, is how one might come up with a set of weight values that ensure the network produces the expected output.  In practice, trying to find an acceptable set of weights for an MLP network by hand would be an incredibly laborious task; indeed, the underlying training problem is NP-complete even for very small networks (Blum and Rivest, 1992).  Fortunately, it is possible to learn a good set of weight values automatically through a process known as backpropagation.  This was first demonstrated to work well for the XOr problem by Rumelhart et al. (1985).

The backpropagation algorithm begins by comparing the actual value output by the forward propagation process to the expected value, and then moves backward through the network, slightly adjusting each of the weights in a direction that reduces the size of the error.  Forward and back propagation are re-run thousands of times over each input combination until the network can accurately predict the expected output for all of the possible inputs.
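A compact sketch of this training loop is given below.  It is not the Java implementation promised for the next post; it assumes Python with NumPy, uses the squared error as the quantity being reduced, and its learning rate, epoch count and random starting weights are arbitrary.  Because the error surface has local minima, a given run may need a different seed or learning rate to reach an accurate network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOr training data (bias inputs are added inside the training loop).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Small random starting weights for a 2-input, 2-hidden-unit, 1-output network.
w_hidden = rng.normal(size=(2, 3))   # rows: hidden units; columns: bias + 2 inputs
w_output = rng.normal(size=3)        # bias + 2 hidden activations

learning_rate = 0.5
for epoch in range(10000):
    for x, target in zip(X, y):
        # Forward pass.
        h_in = np.concatenate(([1.0], x))
        h_out = sigmoid(w_hidden @ h_in)
        o_in = np.concatenate(([1.0], h_out))
        prediction = sigmoid(w_output @ o_in)

        # Backward pass: the error signal at each unit is the derivative of the
        # squared error with respect to that unit's summed input.
        output_delta = (prediction - target) * prediction * (1.0 - prediction)
        hidden_delta = output_delta * w_output[1:] * h_out * (1.0 - h_out)

        # Each weight moves a small step against its error gradient.
        w_output -= learning_rate * output_delta * o_in
        w_hidden -= learning_rate * np.outer(hidden_delta, h_in)

# Check the trained network against all four inputs.
for x in X:
    h = sigmoid(w_hidden @ np.concatenate(([1.0], x)))
    print(x, "->", sigmoid(w_output @ np.concatenate(([1.0], h))))
```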

For the XOr problem, 100% of possible data examples are available to use in the training process. We can therefore expect the trained network to be 100% accurate in its predictions, and there is no need to be concerned with issues such as bias and variance in the resulting model.

Conclusion
In this post, the classic ANN XOr problem was explored. The problem itself was described in detail, along with the fact that the inputs for XOr are not linearly separable into their correct classification categories. A non-linear solution — involving an MLP architecture — was explored at a high level, along with the forward propagation algorithm used to generate an output value from the network and the backpropagation algorithm, which is used to train the network.

The next post in this series will feature a Java implementation of the MLP architecture described here, including all of the components necessary to train the network to act as an XOr logic gate.

References
Blum, A., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117-127.

Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge, MA: The MIT Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (Report No. ICS-8506). La Jolla, CA: University of California, San Diego, Institute for Cognitive Science.

Discussion: Why Machine Learning Beginners Shouldn’t Avoid the Math

In a post I published yesterday, I argued that it is important for students of machine learning to understand the algorithms and underlying mathematics prior to using tools or libraries that black box the code. I suggested that to do so is likely to result in a lot of “time-wasting confusion” due to students not having the necessary understanding to configure parameters or interpret results. One of the examples I provided for the opposing view was this blog post from BigML, which argues that beginners don’t need courses such as those provided by Coursera if they use their tool.

Francisco J. Martin, CEO of BigML, has tweeted in response:


So Kids shouldn’t avoid assembler, automata, and compilers when learning to code?

This is a very good question and one that grants us an opportunity to dig deeper into the issue. I am responding here because I don’t believe it’s a question I can answer in 140 characters.

The short answer is no: I’m perfectly OK with beginner programmers starting out in high-level languages and working their way down, or even stopping there and never working their way down. But this is not analogous to machine learning.

I see three big differences.

First of all, learning a high-level language is actually a constructive step towards learning lower-level languages. If that is the goal and you started with something like Java, you could potentially learn quite a lot about programming in general. Trying C++ would then help to fill in the blanks with respect to some of the aspects of programming that Java glosses over. Likewise, assembler could take you a step further.

By contrast, if playing with the parameters of black-boxed algorithms offers a path at all towards becoming proficient at machine learning, it is an incredibly inefficient one. It is an awfully big search space to approach by trial and error when you consider the combinations of parameters, feature selection and the question of whether you have enough, or appropriate, data examples.

The second difference is that high-level programming does not require an understanding of low-level programming. I can do anything that Java or C# will let me do without knowing anything about assembly language.  In comparison, a machine learning tool requires me to know how to set appropriate values for the parameters that are passed into the hidden algorithms. It also requires me to understand whether or not I have an appropriate (representative) dataset with appropriate features. Then, when training finishes, I need to be able to interpret the results and take appropriate actions. Better outcomes come from more informed decisions.
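To make that concrete, here is the kind of call a library user writes. The snippet assumes Python with scikit-learn and the values shown are purely illustrative, but every argument is a decision the caller has to make, and each one is rooted in machine learning theory rather than software engineering.

```python
# Even with the algorithm hidden behind a single library call, the caller still
# has to supply values whose meanings come from machine learning theory.
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(10,),   # how large a network does this problem need?
    activation="relu",          # which non-linearity suits the data?
    alpha=1e-4,                 # how much regularisation guards against overfitting?
    learning_rate_init=0.001,   # how quickly should the weights be adjusted?
    max_iter=500,               # how long should training run before stopping?
)
```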

The third difference relates to the potential benefits of exploring the low-level languages. There are some exceptions, but generally speaking, writing more efficient algorithms in low-level languages comes at such great expense, in comparison to the constantly falling cost of computation, that it just isn’t worthwhile.

In my last post I cited Kaggle’s chief scientist, Jeremy Howard, who said there was a massive difference in capability between good and average data scientists. I take this to indicate that in machine learning, more knowledge leads to exponentially better outcomes.  Unlike low-level programming, there is a huge benefit to having a detailed knowledge of machine learning.

I have come across some arguments suggesting that as Moore’s law reaches its limit, low-level coding will become much more sought after. If that happens I’ll revisit my position on low-level coding, but for now I’m betting that specialist processors like GPUs will help to bridge the gap before the next paradigm of computation comes along to keep the gravy train of exponential price-performance improvement going.

The Self-reinforcing Myth of Hard-wired Math Inability

There is a commonly held belief that some people have brains that are pre-wired for mathematical excellence, while everyone else is doomed to struggle with the subject. This toxic myth needs to be put deep in the ground and buried in molten lead. It is as destructive as it is self-fulfilling.

The myth equally encourages people who are good at math to falsely believe (Murayama et al., 2012) they are more intelligent than those who are not, and leaves everyone else inclined to believe they can never improve. This is despite the fact that math ability has very little to do with intelligence (Blair & Razza, 2007).

The reason this myth exists is well understood. Students who were well prepared in math by their parents prior to starting school find themselves separated in ability from their classmates who were not. The latter group consider the seemingly unachievable abilities of their peers and quickly lose confidence in their own abilities. Once that self-confidence is lost, any attempt at completing a math problem triggers math anxiety (Ashcraft, 2002; Devine et al., 2012), where thoughts of self-doubt cloud the mind and make it difficult to concentrate on the task at hand.

Mathematics, like computer programming, is a discipline that requires concentration. The student needs to be able to follow a train of thought where A leads to B leads to C etc. A student who lacks self-confidence struggles to maintain the necessary train of thought due to being repeatedly interrupted by negative thoughts about their abilities.  This results in poor performance and reinforces the idea that they are incapable of learning the subject.

It is interesting to see this belief so prevalent among software developers who are perfectly capable of writing an algorithm in a programming language, but suddenly feel that it is impossible to grasp the same algorithm represented by a set of mathematical symbols. There is simply no reason that this should be the case. I’ve yet to meet an experienced programmer who would tell me they find it near-impossible to learn the syntax of a new programming language and yet that is precisely what is entailed in learning how to express an algorithm using linear algebra.

A common point of confusion for many who haven’t done a lot of math since secondary school is the use of mathematics as a language rather than as a set of equations to be solved.  In academic computer science, linear algebra, as it is used to express algorithms, is not something to be solved, but rather a language used to describe an algorithm.
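As an illustration (my own, not drawn from any particular paper), consider the expression ŷ = σ(Wx + b), a common way of writing one layer of a neural network. Read as a description rather than something to be solved, it translates almost symbol for symbol into code, here assuming Python with NumPy; the numbers are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# y_hat = sigmoid(W x + b), read as a recipe: multiply the input vector by a
# weight matrix, add a bias vector, then squash each element with the sigmoid.
def predict(W, x, b):
    return sigmoid(W @ x + b)

W = np.array([[0.5, -0.2],
              [0.1,  0.4]])     # an illustrative weight matrix
b = np.array([0.0, 0.1])        # an illustrative bias vector
x = np.array([1.0, 2.0])        # an input vector
print(predict(W, x, b))
```

Learning to read the notation is a matter of mapping symbols to familiar programming constructs, not of acquiring a new kind of intelligence.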

Understanding the language of academic computer science is becoming increasingly important as the traditional staples of academia, such as machine learning, increasingly find use in industry.  After all, even if a software developer manages to avoid the math in their work, how can they expect to keep up with the latest developments in this fast-moving field without an ability to understand the academic literature?  Yet this is precisely what some software developers are attempting to do.

Math inability is not hard wired and software developers are already well practiced in the mental skills required.  We use the skill of stepping through a problem and visualising the state changes that occur at each step, every time we read or write a piece of code.  Anyone who can do that is capable of becoming proficient enough in mathematics to understand the mathematical components of the computer science literature.

References

Ashcraft, M. H. (2002). Math anxiety: Personal, educational, and cognitive consequences. Current directions in psychological science, 11(5), 181-185.

Blair, C., & Razza, R. P. (2007). Relating effortful control, executive function, and false belief understanding to emerging math and literacy ability in kindergarten. Child development, 78(2), 647-663.

Devine, A., Fawcett, K., Szűcs, D., & Dowker, A. (2012). Gender differences in mathematics anxiety and the relation to mathematics performance while controlling for test anxiety. Behavioral and brain functions, 8(1), 1.

Murayama, K., Pekrun, R., Lichtenfeld, S., & Vom Hofe, R. (2012). Predicting long‐term growth in students’ mathematics achievement: The unique contributions of motivation and cognitive strategies. Child development, 84(4), 1475-1490.

See Also

Andreescu, T., Gallian, J. A., Kane, J. M., & Mertz, J. E. (2008). Cross-cultural analysis of students with exceptional talent in mathematical problem solving. Notices of the AMS, 55(10), 1248-1260.

Berger, A., Tzur, G., & Posner, M. I. (2006). Infant brains detect arithmetic errors. Proceedings of the National Academy of Sciences, 103(33), 12649-12653.

Post Edits

13/07/2016 – Added references and see also sections.  Updated inline references to show primary sources rather than just linking to secondary sources.

14/07/2016 – Corrected typo in final paragraph “Math ability is not hard wired…” changed to “Math inability is not hard wired”.


Why Learn Machine Learning and Optimisation?

In this post I hope to convince the reader that machine learning and optimisation are worthwhile fields for a software developer in industry to engage with.

I explain the purpose of this blog and argue that we are in the midst of a machine-learning revolution.

______________________________________________________________

When I first started coding as a teenager in the early 1990s, the future looked certain to be shaped by artificial intelligence. We were told that we’d soon have “fifth generation” languages that would allow for the creation of complex software applications without the need for human programmers. Expert systems would replace human experts in every walk of life and we’d talk to our machines in much the same way Gene Roddenberry imagined we should.


Unfortunately, this model of reality didn’t quite go to plan. After many years of enormous research and development expense — mainly focused in Japan — we entered another AI winter. The future was left in the hands of a handful of diehard academics, while the software industry mostly ignored AI research.

The good news is that the AI winter is now well and truly over.  The technology has been slowly but surely increasing its influence on mainstream software development and data analytics for at least a decade, and 2015 has been billed as a breakthrough year by media sources such as Bloomberg and Wired magazine.

Whether we realise it or not, most of us use AI every day. In fact, AI is responsible for many of the coolest software innovations you’ve heard of in recent years. It is the basis for autonomous helicopters, autonomous cars, big data analytics, Google search, automatic language translation, targeted advertising, optical character recognition, speech recognition, facial recognition, anomaly detection, news-article clustering, vehicle routing and product recommendation, just to list the few examples I could name at the time of writing.

As a field, artificial intelligence has been deeply rooted in academia for decades, but it is quickly becoming prevalent in industry.  We are at the dawn of the AI revolution and there has never been a better time to start sciencing up your skill set.

This blog is here to help and, as its name suggests, will focus on two important and complementary sub-fields of AI: Machine Learning and Optimisation. The intention is to explain both topics in a language that software developers in industry can easily understand, with or without a background in hard computer science.

I believe this is an important addition to the discourse on these topics because most of the sources you’re likely to come across assume a strong existing knowledge of linear algebra, calculus, statistics, probability, information theory and computational complexity theory: the language of academic computer science.  This is unsurprising, given that the techniques were mostly developed by computer scientists, mathematicians and statisticians, but it can unfortunately be a barrier to a lot of people getting started.

The intention here is to remove that barrier by describing the various techniques using familiar, medium-level programming languages.  The posts that follow will not shy away from the theory, but no assumptions will be made with respect to prior understanding of mathematics or computer science, and code snippets will accompany any mathematical descriptions.