The Unteachable Parts of Data Science

Thursday, 22 February, 2018

The number of books, articles, YouTube videos and online courses on data science is increasing almost exponentially. Every major publishing house, university, author and hobby video maker or writer is putting out resources describing data science. A good fraction of these courses emphasize Machine Learning as an important tool to have in one's toolkit. There is no denying that Machine Learning and techniques for handling Big Data are vital, but there are some important elements of any data science project which simply, in my opinion, cannot be taught!

In a typical data science or Machine Learning problem, the workflow roughly goes as follows: acquire raw data, preprocess it into a usable form, train a model on it, and then evaluate and deploy the results.

Let's now talk about the processing phase. Why do you need this phase, and what is it that you do there? Let's look at some examples of preprocessing to get a feel for it: rotating images to augment a training set, converting unstructured data into a structured table, engineering new features from existing ones, standardizing values, and reducing dimensionality.

In all of these processes or examples, there are two aspects. One aspect is that of the tool or the technique. To rotate images, you need some idea of the tools that can perform such image transformations. To transform unstructured data, you again need tools to do it for you. To engineer features, you need libraries that make these data transformations easier and let you run quick statistical tests to see whether a given way of combining features is the best. For standardizing data and reducing dimensionality, you again need routines or libraries. This aspect can be taught.
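To make the "tool" aspect concrete, here is a minimal sketch using scikit-learn's StandardScaler and PCA. The data matrix and the choice of three components are purely illustrative stand-ins, not recommendations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for a real numeric feature matrix; in practice this would come
# out of your own data-cleaning pipeline.
X = np.random.RandomState(0).normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)             # standardize each column
X_reduced = PCA(n_components=3).fit_transform(X_scaled)  # reduce to 3 dimensions
print(X_reduced.shape)                                   # (200, 3)
```

Learning to call these routines is the easy, teachable part; nothing in the snippet tells you whether three dimensions is the right choice for your problem.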

The second aspect in all these examples concerns the questions of when to do what, and how best to do it! For example, we know how to reduce dimensionality using some routine, but how many dimensions should we insist on? We know how to manipulate tabular data for the purpose of constructing new features, but which features, if any, should we engineer? We know how to transform data from one form to another, but how do we figure out the best way to move from raw data (completely senseless to the computer) to an initial usable form?
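For the "how many dimensions?" question specifically, a common rule of thumb is to keep enough principal components to explain most of the variance. This is only a heuristic sketch; the 95% threshold below is an assumption of mine, not a universal answer, and picking it sensibly already requires knowing the domain.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(200, 10))  # stand-in data

# Fit PCA with all components and compute the cumulative explained variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components explaining at least 95% of the
# variance. The threshold itself is a judgment call, not a law.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components)
```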

Here, we have to invoke domain knowledge and expertise. A colleague of mine was recently describing her experiences as a data science intern. She is well trained in statistics and Machine Learning, and computer-savvy enough to play around with all the processing tools. Yet she estimated that she spent a good deal of her time, about 80%, simply understanding the data itself. Without this step, it was simply not possible for her even to begin bringing the data into a form a Machine Learning routine could understand. The 80% figure may vary, but I'd be surprised if it ever fell to as low as 20%.

This is the element of data science that cannot be taught. Especially if professionals have gotten used to working in a shell, merely using tools without paying much attention to the domain in which the tools will be employed, the prospect of doing the research to find out where the data comes from can be daunting. How do you teach a person to explore new kinds of data, learn how they were collected and spend time understanding how the data can or cannot help in solving the problem at hand?

It is extremely important for a data scientist to be able to sit with experts, guides and books to learn about the domain in which they are trying to solve a data science problem. This requires a certain approach, a fierce independence and a childlike curiosity. So it does not come as a surprise to me that a lot of data science job descriptions keep saying "PhDs preferred". Assuming students do their PhD at a reasonably good place, they will be quite well trained at playing around with new kinds of knowledge and data, and even at suggesting and testing new methods. After all, what's the guarantee that those tools and algorithms will work for every problem out there?



