The number of books, articles, YouTube videos and online courses on data science is growing almost exponentially. Every major publishing house, university, author and hobby video maker or writer is putting out resources describing data science. A good fraction of these emphasize Machine Learning as an important tool to have in one's toolkit. There is no denying that Machine Learning and techniques for handling Big Data are vital, but there are, in my opinion, some important elements of any data science project which simply cannot be taught!
In a typical data science or Machine Learning problem, the workflow runs from collecting raw data, through preprocessing it, to training and evaluating a model.
Let's now talk about the preprocessing phase. Why do you need this phase, and what is it that you do there? Let's see some examples of preprocessing to get a feel for this.
In recent work I was involved in, images of galaxies needed to be fed into a Deep Learning network so that it could be trained to identify special structures in galaxies called 'bars'. By their inherent nature, deep networks require immense amounts of data. Further, every sample image is required to be of the same size. To address the former problem, one can augment the dataset by creating rotated variants of each image. One also needs to worry about the scaling of the images and so on.
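A minimal sketch of these two steps, using NumPy only (real pipelines would reach for something like scipy.ndimage.rotate or an image library for arbitrary rotation angles; the function names and sizes here are illustrative, not from the project described):

```python
import numpy as np

def augment_by_rotation(image):
    """Return the image plus its 90-, 180- and 270-degree rotations."""
    return [np.rot90(image, k) for k in range(4)]

def pad_to_square(image, size):
    """Zero-pad a 2-D image so every sample ends up the same shape."""
    h, w = image.shape
    out = np.zeros((size, size), dtype=image.dtype)
    out[:h, :w] = image
    return out

# A stand-in "galaxy" image of arbitrary shape.
galaxy = np.random.rand(40, 60)

# Four rotated variants, all brought to a common 64x64 size.
samples = [pad_to_square(img, 64) for img in augment_by_rotation(galaxy)]
```

Rotating by multiples of 90 degrees keeps the pixel grid exact; rotations by other angles require interpolation, which is one more choice the practitioner has to make.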
Another example of preprocessing is as follows. Suppose the information we need to extract is stored in a NoSQL database such as MongoDB. This means the information is stored in JSON-like structures, and such data cannot be fed into a Machine Learning routine as is. You need to do some work to get it into a form which the algorithm accepts. Sometimes the data is not even in this semi-structured form and needs even more processing before it is ready to be fed into a machine learning routine.
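As a sketch of that flattening step, pandas can turn nested JSON-like documents into a 2-D table. The documents below are hypothetical; in practice they would come out of MongoDB via something like pymongo's `collection.find()`:

```python
import pandas as pd

# Hypothetical documents as they might be stored in a MongoDB collection.
docs = [
    {"_id": 1, "galaxy": "NGC 1300", "props": {"type": "barred spiral", "mag": 11.4}},
    {"_id": 2, "galaxy": "M51", "props": {"type": "spiral", "mag": 8.4}},
]

# Flatten the nested structure; nested keys become dotted column names.
df = pd.json_normalize(docs)
print(df.columns.tolist())
# ['_id', 'galaxy', 'props.type', 'props.mag']
```

Once the data sits in a flat table, the usual cleaning, encoding and scaling steps can follow.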
Yet another process is feature engineering. You may have ten features, but it is possible that combining a subset of them with a mathematical function produces a new feature with better discriminatory power in distinguishing one class from another. Or it may be that an existing feature is in a form (plain text, for example) not suitable for a learning routine.
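A toy illustration of the first case, on synthetic data I have made up for the purpose: two raw features that individually overlap between classes, whose ratio separates them well. The "quick statistical test" here is a crude effect-size measure; in practice one might use a proper test from scipy.stats instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: x1 is identically distributed in both classes,
# x2 differs, and the ratio x1/x2 separates the classes even more cleanly.
n = 200
x1 = np.concatenate([rng.normal(10, 2, n), rng.normal(10, 2, n)])
x2 = np.concatenate([rng.normal(5, 1, n), rng.normal(2.5, 0.5, n)])
y = np.array([0] * n + [1] * n)

ratio = x1 / x2  # the engineered feature

def effect_size(feature, labels):
    """Class separation as |mean difference| / pooled standard deviation."""
    a, b = feature[labels == 0], feature[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

print(effect_size(x1, y), effect_size(ratio, y))
```

Which combinations are worth trying, however, is exactly the kind of question that domain knowledge answers and tools do not.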
Then there are standard processes such as dimensionality reduction as well as standardization or scaling of data.
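Both of these standard steps fit in a few lines. The sketch below uses plain NumPy (scikit-learn's StandardScaler and PCA are the usual tools; this hand-rolled version just makes the mechanics visible):

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca(X, n_components):
    """Project standardized data onto its top principal components."""
    Xs = standardize(X)
    # The principal axes are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ vt[:n_components].T

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))   # 100 samples, 10 features
X_reduced = pca(X, 3)            # reduced to 3 dimensions
print(X_reduced.shape)           # (100, 3)
```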
In all of these processes or examples, there are two aspects. One aspect is that of the tool or the technique. To rotate images, you need some idea of tools that can perform such image transformations. To transform unstructured data, you again need tools to do this for you. To engineer features, you need libraries that make these data transformations easier and let you run quick statistical tests to see whether a given way of combining features is the best. For standardizing data and reducing dimensionality, again you need routines or libraries. This aspect can be taught.
The second aspect in all these examples is related to the questions of when to do what and how best to do it! For example, we know how to reduce dimensionality using some routine, but how many dimensions should we insist on? We know how to manipulate tabular data for the purpose of constructing new features, but what features, if any, should we engineer? We know how to transform data from one form to another, but how do we figure out the best way to move from raw (completely senseless to the computer) data to an initial form?
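Take the dimensionality question. One common heuristic (by no means the only one, and not always the right one) is to keep the smallest number of components that explain some fraction of the total variance:

```python
import numpy as np

def n_components_for(X, variance_kept=0.9):
    """Smallest number of principal components that together explain
    the given fraction of total variance -- one common heuristic."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    explained = (s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(explained), variance_kept) + 1)

rng = np.random.default_rng(2)
# Synthetic data that is essentially 2-dimensional, with a little
# noise spread across 8 more axes.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
print(n_components_for(X, 0.9))  # small, since the data is low-rank
```

The routine is trivial; deciding whether 90% is the right threshold for *this* problem is not, and that judgment comes from the domain, not the library.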
Here, we have to invoke domain knowledge and expertise. A colleague of mine was recently describing her experiences as a data science intern. She is well trained in statistics and Machine Learning. She is computer savvy enough to play around with all the processing tools. But she estimated that she spent a good deal of her time, about 80%, simply understanding the data itself. Without this step, it was simply not possible for her to even begin to bring the data into a form a Machine Learning routine could understand. The 80% figure may vary, but I'd be surprised if it ever fell to as low as 20%.
This is the element of data science that cannot be taught. Especially if professionals have gotten used to working in a shell where they merely use tools without paying much attention to the domain in which the tools are employed, the prospect of researching the source of the data can be daunting. How do you teach a person to explore new kinds of data, learn how they were collected, and spend time understanding how the data can or cannot help in solving the problem at hand?
It is extremely important for a data scientist to be able to sit with experts, guides and books to learn about the domain in which they are trying to solve a data science problem. This requires a certain approach, a fierce independence and a childlike curiosity. So it does not come as a surprise to me that a lot of data science job descriptions say "PhDs preferred". Assuming students do their PhD at a reasonably good place, they will be quite well trained at playing around with new kinds of knowledge and data, and even at suggesting and testing new methods. After all, what's the guarantee that those tools and algorithms will work for every problem out there?