Why are there so many fake data scientists and machine learning engineers?

The title of this post is a question I answered recently on Quora, a post that seems to have gathered some interest, so I thought it might be worthwhile expanding on it here.

In my response, I pointed out that in recent months I have encountered a number of software engineers who seem to believe that machine learning libraries, such as TensorFlow, can sufficiently abstract away the need for machine learning knowledge, in much the same way that high-level programming languages have abstracted away the need for knowledge of low-level programming in most industrial areas of application.

I should point out that I have nothing at all against the use of machine learning libraries, and I am in no way advocating the coding of machine learning algorithms from scratch in industrial practice.  Where I have advocated coding machine learning algorithms from scratch in the past, it has always been for the purpose of education.  The point I have attempted to make in my two-paragraph post on Quora, and in these two previous blog posts, is that there is a range of knowledge required for machine learning development regardless of whether you are personally coding the algorithms or using software libraries.

Machine learning engineers and data scientists need to understand what kind of data must be gathered or found at the start of any project.  They need to know how to pre-process that data, perform feature selection, and use cross-validation for both model selection and parameter tuning of the selected model, all while being careful to avoid overfitting.  They have to understand what tools are available to them, when it is appropriate to use them, and how to set their parameters.  They have to be able to design full machine learning pipelines, possibly with multiple machine learning algorithms interacting.  Without this knowledge, expect a lot of time wasted on unnecessary trial-and-error experimentation, or worse, models that fail to make accurate predictions in the wild.
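To make the above concrete, here is a minimal sketch of such a pipeline using scikit-learn. The library, the dataset, and every value in the parameter grid are illustrative choices of mine, not something prescribed by the post; the point is only that pre-processing, feature selection, and cross-validated tuning are explicit decisions the practitioner has to make.

```python
# A minimal sketch: pre-processing, feature selection, and
# cross-validated parameter tuning combined in one pipeline.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Hold out a test set so the final evaluation is not biased by tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),         # pre-processing
    ("select", SelectKBest(f_classif)),  # feature selection
    ("clf", SVC()),                      # the model itself
])

# Cross-validation searches the parameter combinations on the
# training set only, guarding against overfitting the held-out data.
grid = GridSearchCV(pipe, {
    "select__k": [2, 3, 4],
    "clf__C": [0.1, 1.0, 10.0],
}, cv=5)
grid.fit(X_train, y_train)
test_score = grid.score(X_test, y_test)
```

None of these steps happen automatically: an uninformed user can run the library, but choosing the grid, the scoring scheme, and the held-out evaluation is exactly the knowledge the paragraph above describes.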

Automating machine learning libraries so they can complete some of this work without user knowledge is a long-standing goal of many machine learning researchers in industry and academia.  The so-called “democratisation of machine learning” isn’t a new concept, and varying degrees of success have been achieved in automating some of the algorithmic and statistical knowledge required to do machine learning (Cetinsoy et al., 2016; IBM Analytics, 2016) or otherwise lowering the barrier to entry for machine learning practitioners (Chen et al., 2016; Guo et al., 2016; Patel, 2010; 2013).  But we’re not yet at a point where a software engineer can jump into machine learning development without some kind of introductory training or mentorship.  Those who try anyway are practising machine learning black magic.

Furthermore, there is the question of the degree of capability of a software engineer who has little knowledge of machine learning.  In a previous post I quoted former Kaggle chief scientist, Jeremy Howard, suggesting that there is a non-linear disparity in capability between the best and average machine learning developers, and that the best of the best learned their trade by understanding the mathematics behind the algorithms.  Howard was not talking about people coding machine learning algorithms from scratch.  He was talking about Kaggle competition entrants, who almost universally use libraries.  The fact of the matter is that the people who understand the algorithms put the libraries to better use and outperform the rest in Kaggle competitions by orders of magnitude.

Lest the reader assume I am against the idea of software engineers working on machine learning projects, nothing could be further from the truth.  In my view, the whole field of machine learning development is in dire need of the sort of software-engineering thinking that brought the SOLID principles and software design patterns to object-oriented software development.  As Sculley et al. (2014) have pointed out, machine learning implementations bring with them a whole slew of new ways to generate technical debt in a software project.  Despite this, very little guidance has been offered in the way of best practices or design patterns for machine learning implementations.  The sum of existing work basically amounts to the aforementioned paper (Sculley et al., 2014) and another rejected conference paper on the topic of design patterns for deep convolutional neural networks (Smith & Topin, 2016).  Moving machine learning into the hands of more industrial software engineers who care about the practical implications it will have on their projects can only be good for the field.  I’m merely advising caution.


Cetinsoy, A., Martin, F. J., Ortega, J. A., & Petersen, P. (2016). The Past, Present, and Future of Machine Learning APIs. In Proceedings of the 2nd International Conference on Predictive APIs and Apps (pp. 43-49).

Chen, D., Bellamy, R. K., Malkin, P. K., & Erickson, T. (2016). Diagnostic visualization for non-expert machine learning practitioners: A design study. In Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on (pp. 87-95). IEEE.

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.

IBM Analytics. (2016). The democratization of Machine Learning: Apache Spark opens up the door for the rest of us. IBM White Paper. Accessed on May 17, 2017, from: https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=CDW12360USEN

Patel, K. (2010). Lowering the barrier to applying machine learning. In Adjunct proceedings of the 23rd annual ACM symposium on User interface software and technology (pp. 355-358). ACM.

Patel, K. D. (2013). Lowering the barrier to applying machine learning (Doctoral dissertation). University of Washington. Accessed on March 25, 2017, from http://www.cc.gatech.edu/~stasko/8001/heer06.pdf

Sculley, D., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2014). Machine learning: The high-interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). Accessed on March 25, 2017, from http://www.eecs.tufts.edu/~dsculley/papers/technical-debt.pdf

Smith, L. N., & Topin, N. (2016). Deep convolutional neural network design patterns. arXiv preprint arXiv:1611.00847. Submitted to the International Conference on Learning Representations (ICLR) and rejected, 2017. Accessed on March 06, 2017, from https://pdfs.semanticscholar.org/8863/9a6e21a8a8989e6d25e44119a90ba0b27628.pdf

Discussion: Why Machine Learning Beginners Shouldn’t Avoid the Math

In a post I published yesterday, I argued that it is important for students of machine learning to understand the algorithms and underlying mathematics prior to using tools or libraries that black-box the code. I suggested that skipping this step is likely to result in a lot of “time-wasting confusion” due to students not having the necessary understanding to configure parameters or interpret results. One of the examples I provided for the opposing view was this blog post from BigML, which argues that beginners don’t need courses such as those provided by Coursera if they use their tool.

Francisco J. Martin, CEO of BigML, has tweeted in response.


So Kids shouldn’t avoid assembler, automata, and compilers when learning to code?

This is a very good question and one that grants us an opportunity to dig deeper into the issue. I am responding here because I don’t believe it’s a question I can answer in 140 characters.

The short answer is no: I’m perfectly OK with beginner programmers starting out in high-level languages and working their way down, or even stopping there and never working their way down. But this is not analogous to machine learning.

I see three big differences.

First of all, learning a high-level language is actually a constructive step towards learning lower-level languages. If that’s the goal, and you started with something like Java, you could potentially learn quite a lot about programming in general. Trying C++ next would help fill in the blanks with respect to some of the aspects of programming that Java glosses over. Likewise, assembler could take you a step further.

If playing with the parameters of black-boxed algorithms offers a path at all towards becoming proficient at machine learning, it’s an incredibly inefficient one. It’s an awfully big search space to approach by trial and error when you consider the combinations of parameters, feature selection, and the question of whether you have enough or appropriate data examples.
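A rough back-of-the-envelope calculation illustrates how quickly this search space grows. The particular parameters and grid values below are hypothetical, chosen only to show the combinatorics:

```python
# Why blind trial and error scales badly: even a modest grid of
# hypothetical options multiplies out to tens of millions of runs.
from itertools import product

learning_rates = [0.001, 0.01, 0.1, 1.0]
regularisation = [0.0, 0.01, 0.1, 1.0]
hidden_units = [16, 32, 64, 128]
n_features = 20  # candidate features to include or exclude

# All combinations of the three parameter lists: 4 * 4 * 4 = 64.
param_combos = len(list(product(learning_rates, regularisation, hidden_units)))

# Every possible subset of 20 candidate features: 2**20 = 1,048,576.
feature_subsets = 2 ** n_features

total = param_combos * feature_subsets
print(total)  # 64 * 1,048,576 = 67,108,864 configurations
```

Even before asking whether the data itself is adequate, an uninformed search faces tens of millions of configurations; knowledge of the algorithms is what prunes this space to something tractable.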

The second difference is that high-level programming does not require an understanding of low-level programming. I can do anything that Java or C# will let me do without knowing anything about assembly language.  By contrast, a machine learning tool requires me to know how to set appropriate values for the parameters that are passed into the hidden algorithms. It also requires me to understand whether or not I have an appropriate (representative) dataset with appropriate features. Then, when it finishes, I need to be able to interpret the results and take appropriate actions. Better outcomes come from more informed decisions.
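As a small illustration of that parameter-setting point, consider the same library call made with two different values of a single parameter. The library (scikit-learn), the synthetic dataset, and the parameter values are my own illustrative choices, not anything prescribed by the post:

```python
# Sketch: one library call, two parameter choices. Without knowing
# what C (inverse regularisation strength) does, a user has no
# principled way to choose between them.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small synthetic dataset: 200 examples, 30 features,
# only 5 of which actually carry signal.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# The "right" amount of regularisation depends on how noisy and
# how plentiful the data is -- knowledge the tool cannot supply.
weak_reg = cross_val_score(
    LogisticRegression(C=1000.0, max_iter=5000), X, y, cv=5).mean()
strong_reg = cross_val_score(
    LogisticRegression(C=0.01, max_iter=5000), X, y, cv=5).mean()
```

Both calls succeed and both return a score, so nothing in the tool itself tells the user which configuration was the informed one; that judgement has to come from the practitioner.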

The third difference relates to the potential benefits of exploring low-level languages. There are some exceptions, but generally speaking, writing more efficient algorithms in low-level languages comes at such great expense, compared with the constantly falling cost of computation, that it just isn’t worthwhile.

In my last post I cited Kaggle’s chief scientist, Jeremy Howard, who said there was a massive difference in capability between good and average data scientists. I take this to indicate that in machine learning, more knowledge leads to exponentially better outcomes.  Unlike low-level programming, there is a huge benefit to having a detailed knowledge of machine learning.

I have come across some arguments suggesting that as Moore’s law reaches its limit, low-level coding will become much more sought after. If that happens I’ll revisit my position on low-level coding, but for now I’m betting that specialist processors like GPUs will help to bridge the gap before the next paradigm of computation comes along to keep the gravy train of exponential price-performance improvement going.

The Self-reinforcing Myth of Hard-wired Math Inability

There is a commonly held belief that some people have brains that are pre-wired for mathematical excellence, while everyone else is doomed to struggle with the subject. This toxic myth needs to be put deep in the ground and buried in molten lead. It is as destructive as it is self-fulfilling.

The myth encourages people who are good at math to falsely believe (Murayama et al., 2012) they are more intelligent than those who are not, while leaving everyone else inclined to believe they can never improve. This is despite the fact that math ability has very little to do with intelligence (Blair & Razza, 2007).

The reason this myth exists is well understood. Students who were well prepared in math by their parents prior to starting school find themselves separated in ability from their classmates who were not. The latter group consider the seemingly unachievable abilities of their peers and quickly lose confidence in their own abilities. Once that self-confidence is lost, any attempt at completing a math problem leads to math anxiety (Ashcraft, 2002; Devine et al., 2012), where thoughts of self-doubt cloud the mind and make it difficult to concentrate on the task at hand.

Mathematics, like computer programming, is a discipline that requires concentration. The student needs to be able to follow a train of thought where A leads to B leads to C etc. A student who lacks self-confidence struggles to maintain the necessary train of thought due to being repeatedly interrupted by negative thoughts about their abilities.  This results in poor performance and reinforces the idea that they are incapable of learning the subject.

It is interesting to see this belief so prevalent among software developers who are perfectly capable of writing an algorithm in a programming language, but suddenly feel that it is impossible to grasp the same algorithm represented by a set of mathematical symbols. There is simply no reason that this should be the case. I’ve yet to meet an experienced programmer who would tell me they find it near-impossible to learn the syntax of a new programming language and yet that is precisely what is entailed in learning how to express an algorithm using linear algebra.

A common point of confusion for many who haven’t done much math since secondary school is the use of mathematics as a language rather than as a set of equations to be solved. In academic computer science, linear algebra, as it is used to express algorithms, is not something to be solved, but rather a language used to describe an algorithm.
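To illustrate the point, a matrix-vector product like ŷ = Xw is not a puzzle to be solved; it is a compact description of a pair of loops any programmer already knows how to write. A small sketch in Python with NumPy (the values here are arbitrary, chosen only for illustration):

```python
# The linear-algebra expression  y_hat = X w  read as a description
# of an algorithm rather than an equation to be solved.
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # three examples, two features each
w = np.array([0.5, -1.0])    # one weight per feature

# The matrix-vector product written out as the loops it denotes:
# for each example i, sum X[i][j] * w[j] over the features j.
y_loop = [sum(X[i][j] * w[j] for j in range(2)) for i in range(3)]

# The same algorithm written in the notation itself:
y_vec = X @ w

print(y_loop)          # [-1.5, -2.5, -3.5]
print(y_vec.tolist())  # [-1.5, -2.5, -3.5]
```

Reading the notation this way, learning the linear algebra in a machine learning paper is much closer to learning the syntax of a new programming language than to sitting a math exam.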

Understanding the language of academic computer science is becoming increasingly important as the traditional staples of academia, such as machine learning, increasingly find use in industry.  After all, even if a software developer manages to avoid the math in their work, how can they expect to keep up with the latest developments in this fast-moving field without an ability to understand the academic literature?  Yet this is precisely what some software developers are attempting to do.

Math inability is not hard wired and software developers are already well practiced in the mental skills required.  We use the skill of stepping through a problem and visualising the state changes that occur at each step, every time we read or write a piece of code.  Anyone who can do that is capable of becoming proficient enough in mathematics to understand the mathematical components of the computer science literature.


Ashcraft, M. H. (2002). Math anxiety: Personal, educational, and cognitive consequences. Current directions in psychological science, 11(5), 181-185.

Blair, C., & Razza, R. P. (2007). Relating effortful control, executive function, and false belief understanding to emerging math and literacy ability in kindergarten. Child development, 78(2), 647-663.

Devine, A., Fawcett, K., Szűcs, D., & Dowker, A. (2012). Gender differences in mathematics anxiety and the relation to mathematics performance while controlling for test anxiety. Behavioral and brain functions, 8(1), 1.

Murayama, K., Pekrun, R., Lichtenfeld, S., & Vom Hofe, R. (2012). Predicting long‐term growth in students’ mathematics achievement: The unique contributions of motivation and cognitive strategies. Child development, 84(4), 1475-1490.

See Also

Andreescu, T., Gallian, J. A., Kane, J. M., & Mertz, J. E. (2008). Cross-cultural analysis of students with exceptional talent in mathematical problem solving. Notices of the AMS, 55(10), 1248-1260.

Berger, A., Tzur, G., & Posner, M. I. (2006). Infant brains detect arithmetic errors. Proceedings of the National Academy of Sciences, 103(33), 12649-12653.

Post Edits

13/07/2016 – Added references and see also sections.  Updated inline references to show primary sources rather than just linking to secondary sources.

14/07/2016 – Corrected typo in final paragraph “Math ability is not hard wired…” changed to “Math inability is not hard wired”.

Why Learn Machine Learning and Optimisation?

In this post I hope to convince the reader that machine learning and optimisation are worthwhile fields for a software developer in industry to engage with.

I explain the purpose of this blog and argue that we are in the midst of a machine-learning revolution.


When I first started coding as a teenager in the early 1990s, the future looked certain to be shaped by artificial intelligence. We were told that we’d soon have “fifth generation” languages that would allow for the creation of complex software applications without the need for human programmers. Expert systems would replace human experts in every walk of life and we’d talk to our machines in much the same way Gene Roddenberry imagined we should.


Unfortunately, this model of reality didn’t quite go to plan. After many years of enormous research and development expense — mainly focused in Japan — we entered another AI winter. The future was left in the hands of a handful of diehard academics, while the software industry mostly ignored AI research.

The good news is that the AI winter is now well and truly over.  The technology has been slowly but surely increasing its influence on mainstream software development and data analytics for at least a decade, and 2015 has been billed as a breakthrough year by media sources such as Bloomberg and Wired magazine.

Whether we realise it or not, most of us use AI every day. In fact, AI is responsible for all of the coolest software innovations you’ve heard of in recent years. It is the basis for autonomous helicopters, autonomous cars, big data analytics, Google search, automatic language translation, targeted advertising, optical character recognition, speech recognition, facial recognition, anomaly detection, news-article clustering, vehicle routing and product recommendation, just to list the few examples I could name at the time of writing.

As a field, artificial intelligence has been deeply rooted in academia for decades, but it is quickly becoming prevalent in industry.  We are at the dawn of the AI revolution and there has never been a better time to start sciencing up your skill set.

This blog is here to help and, as its name suggests, will focus on two important and complementary sub-fields of AI: machine learning and optimisation. The intention is to explain both topics in a language that software developers in industry can easily understand, with or without a background in hard computer science.

I believe this is an important addition to the discourse on these topics because most of the sources you’re likely to come across assume a strong existing knowledge of linear algebra, calculus, statistics, probability, information theory and computational complexity theory: the language of academic computer science.  This is unsurprising, given that the techniques were mostly developed by computer scientists, mathematicians and statisticians, but it can unfortunately be a barrier to a lot of people getting started.

The intention here is to remove that barrier by describing the various techniques using familiar, medium-level programming languages.  The posts that follow will not shy away from the theory, but no assumptions will be made with respect to prior understanding of mathematics or computer science, and code snippets will accompany any mathematical descriptions.