Why are there so many fake data scientists and machine learning engineers?

The title of this post is a question I answered recently on Quora, a post that seems to have gathered some interest, so I thought it might be worthwhile expanding on it here.

In my response, I pointed out that in recent months I have encountered a number of software engineers who seem to believe that machine learning libraries, such as TensorFlow, can abstract away the need for machine learning knowledge in much the same way that high-level programming languages have, in most industrial areas of application, abstracted away the need for knowledge of low-level programming.

I should point out that I have nothing at all against the use of machine learning libraries and I am in no way advocating for the coding of machine learning algorithms from scratch in industrial practice.  Where I have advocated for coding machine learning algorithms from scratch in the past, it has always been for the purpose of education.  The point I have attempted to make in my two-paragraph post on Quora, and in these two previous blog posts, is that there is a range of knowledge that is required for machine learning development regardless of whether you are personally coding the algorithms or referencing software libraries.

Machine learning engineers and data scientists need to understand what kind of data needs to be gathered or found from the start of any project.  They need to understand how to pre-process that data, perform feature selection, and use cross-validation for both model selection and parameter tuning of the selected model, all while being careful to avoid overfitting.  They have to understand what tools are available to them, when it is appropriate to use them and how to set their parameters.  They have to be able to design full machine learning pipelines, possibly with multiple machine learning algorithms interacting.  Without this knowledge, expect a lot of time wasted through unnecessary trial-and-error experimentation, or worse, models that fail to make accurate predictions in the wild.
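To make two of those steps concrete, here is a minimal sketch, using only the Python standard library, of tuning one parameter by cross-validation and then scoring the chosen model once on data that played no part in the tuning.  The synthetic data set, the k-nearest-neighbour classifier and every function name below are assumptions made purely for illustration; in industrial practice a library would do this work, but the same questions about what gets held out, and when, still have to be answered by the developer.

```python
import random
from collections import Counter


def make_data(n, seed):
    """Synthetic two-class data around (0, 0) and (2, 2); an illustrative assumption."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 2.0 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    rng.shuffle(data)
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    def squared_distance(row):
        (x, y), _ = row
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of `test` points that a k-NN model built on `train` labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def cross_val_accuracy(data, k, folds=5):
    """Mean accuracy over `folds` splits; each fold is held out exactly once."""
    scores = []
    for f in range(folds):
        held_out = data[f::folds]
        training = [row for i, row in enumerate(data) if i % folds != f]
        scores.append(accuracy(training, held_out, k))
    return sum(scores) / folds


data = make_data(200, seed=0)
dev, test = data[:150], data[150:]  # the test split is set aside before any tuning

# Parameter tuning: compare odd candidate values of k using the development data only.
candidates = [1, 3, 5, 9, 15]
cv_scores = {k: cross_val_accuracy(dev, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Scored once, after selection; choosing k on `test` would give an optimistic estimate.
test_accuracy = accuracy(dev, test, best_k)
print(best_k, round(test_accuracy, 3))
```

The detail that matters here is that `test` is never consulted while `best_k` is being chosen, so the final figure is an honest estimate of how the model might behave on unseen data.  A library call can hide that bookkeeping, but it cannot decide for you where the split belongs.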

Automating machine learning libraries so they can complete some of this work without user knowledge is a long-standing goal of many machine learning researchers in industry and academia.  The so-called “democratisation of machine learning” isn’t a new concept, and varying degrees of success have been achieved in automating some of the algorithmic and statistical knowledge required to do machine learning (Cetinsoy et al., 2016; IBM Analytics, 2016) or in otherwise lowering the barrier to entry for machine learning practitioners (Chen et al., 2016; Guo et al., 2016; Patel, 2010; 2013).  But we’re not yet at a point where a software engineer can jump into machine learning development without some kind of introductory training or mentorship.  Those who do are involved in machine learning black magic.

Furthermore, there is the question of the degree of capability of a software engineer who has little knowledge of machine learning.  In a previous post I quoted former Kaggle chief scientist, Jeremy Howard, suggesting that there is a non-linear disparity in capability between the best and average machine learning developers and that the best of the best learned their trade by understanding the mathematics behind the algorithms.  Howard was not talking about people coding machine learning algorithms from scratch.  He was talking about Kaggle competition entrants, who almost universally use libraries.  The fact of the matter is that the people who understand the algorithms put the libraries to better use and perform better in Kaggle competitions by orders of magnitude.

Lest the reader assume I am against the idea of software engineers working on machine learning projects, nothing could be further from the truth.  In my view, the whole field of machine learning development is in dire need of the sort of software-engineer thinking that brought the SOLID principles and software design patterns to object-oriented software development.  As Sculley et al. (2014) have pointed out, machine learning implementations bring with them a whole slew of new ways to generate technical debt in a software project.  Despite this, very little guidance has been proposed in the way of best practices or design patterns for machine learning implementations.  The sum of existing work basically amounts to the aforementioned paper (Sculley et al., 2014) and one other paper, submitted to a conference and rejected, on the topic of design patterns for deep convolutional neural networks (Smith & Topin, 2016).  Moving machine learning into the hands of more industrial software engineers who care about the practical implications it will have on their projects can only be good for the field.  I’m merely advising caution.

References

Cetinsoy, A., Martin, F. J., Ortega, J. A., & Petersen, P. (2016). The Past, Present, and Future of Machine Learning APIs. In Proceedings of The 2nd International Conference on Predictive APIs and Apps (pp. 43-49).

Chen, D., Bellamy, R. K., Malkin, P. K., & Erickson, T. (2016). Diagnostic visualization for non-expert machine learning practitioners: A design study. In Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on (pp. 87-95). IEEE.

Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.

IBM Analytics. (2016). The democratization of Machine Learning: Apache Spark opens up the door for the rest of us. IBM White Paper. Accessed on May 17, 2017, from: https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=CDW12360USEN

Patel, K. (2010). Lowering the barrier to applying machine learning. In Adjunct proceedings of the 23rd annual ACM symposium on User interface software and technology (pp. 355-358). ACM.

Patel, K. D. (2013). Lowering the barrier to applying machine learning (Doctoral dissertation). University of Washington. Accessed on March 25, 2017, from http://www.cc.gatech.edu/~stasko/8001/heer06.pdf

Sculley, D., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2014). Machine learning: The high-interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). Accessed on March 25, 2017, from http://www.eecs.tufts.edu/~dsculley/papers/technical-debt.pdf

Smith, L. N., & Topin, N. (2016). Deep convolutional neural network design patterns. arXiv preprint arXiv:1611.00847. Submitted to the International Conference on Learning Representations (ICLR) and rejected, 2017. Accessed on March 06, 2017, from https://pdfs.semanticscholar.org/8863/9a6e21a8a8989e6d25e44119a90ba0b27628.pdf

Published by

James Burkill

Veteran software engineer and student of all things AI.

LinkedIn: https://ie.linkedin.com/in/james-burkill-459a1513

2 thoughts on “Why are there so many fake data scientists and machine learning engineers?”

    1. That’s actually a point I hadn’t considered and it’s yet another example of a problem that has much more significance for software developers, who incorporate machine learning into a wider project, than it does for data scientists and academic researchers.

      So little thought has been given to how this tsunami of machine learning functionality is going to affect software projects.
