Data Science Journey of Manu Joseph, The Creator of PyTorch Tabular – Analytics India Magazine

“I thrive in situations where I have to get things done or create new systems and new modules. I like to satisfy my curiosity and maker trait,” said the creator of PyTorch Tabular, GATE and LAMA-Net, Manu Joseph. He said that he is fascinated with math, data science, and machine learning, particularly deep learning, because of its flexibility and scalability. 

Joseph currently heads applied research at Thoucentric, a niche management company. There, he leads a group of researchers in productionising cutting-edge technology to add value for real-world customers, primarily in causality, predictive maintenance, time series forecasting and NLP. Prior to this, he worked with companies including Philips, Entercoms, Schneider Electric and Cognizant Technology Solutions.

In an exclusive interview with Analytics India Magazine, Joseph talks about his journey into data science, alongside some of his passion projects, tips for people entering data science for better career opportunities, and more. 

A Self-taught Data Scientist 

From starting his career in industrial engineering to working in the IT industry, and later moving to the data science and analytics field, and currently leading the research initiatives, Joseph’s journey has been truly inspirational.  

“Transitioning from a STEM role, say, engineering, to data science is relatively easier than other areas,” said Joseph. He said that whatever branch you study in engineering changes the way your brain is wired. “I think that is actually helpful in all of these things,” he added. 

However, he said when shifting domains to areas like machine learning, statistics, or computer science, you have to be comfortable with programming. “There’s no way around it,” he added. 

He said that you can learn all the machine learning theory you like, but at the end of the day, for any of it to be useful, you need to convert it into code. “In today’s scenario, nobody will do it for you. So you have to do it yourself,” he added, saying that a few years ago there was the luxury of leaving the coding to someone else, but now, with the industry growing rapidly, there is no option but to learn.

Further, Joseph said that you should not be afraid of Math. “It is not going to come in the beginning. You can get away without Math early on, but eventually, it will come and then it will make a lot of difference,” he added, saying that it is a lot easier to communicate concepts in Math than in English. “Understanding what’s happening is actually very important. Otherwise, you will be able to build a model; you will be able to predict and get results out of it. But, the problem starts when you are stuck in between and do not know what to do,” said Joseph.

Lastly, he said that people should start looking at interesting problems, create datasets, participate in hackathons, and develop models to make them more useful. “Move away from your standard Titanic datasets and solve something interesting that makes your resume stand out. It is very easy to identify people who have gone the extra mile,” he added. 

Origin of PyTorch Tabular 

An industrial engineer turned data scientist, Joseph said that when you are working on a business problem, about 90 per cent of the data is tabular, that is, stored in tables, and classical machine learning methods are what practitioners always use on it. However, these methods are just a small portion of the work, because everybody wants to do other things as well.

“That is where we started looking at deep learning. During my research, I found out that there was not a lot of work happening in that area,” recalled Joseph, saying that until then, people were still using standard feedforward networks, with something added on top, as their tabular model.

“Since I was interested in the field, I kept tabs on what was happening. That’s when models like TabNet and a few other models came out. So I did see an acceleration in the space like more and more people were looking at how to do things,” added Joseph. 

Further, he said that when all these models came out and people tried to apply them to their own data, it was a lot of hassle. “Because apart from TabNet, which has a very good library, all the other models were mostly codebases. Making it work was extremely cumbersome,” he added.

That was the start of PyTorch Tabular, a framework for deep learning with tabular data. The framework is built on top of PyTorch and PyTorch Lightning and works on pandas data frames directly. It also brings state-of-the-art (SOTA) models such as NODE and TabNet under a unified API.

“I started this as an internal project. At the time, it did not even have a name. The idea, however, was to unify all of that so that you can switch between different models, just like a Scikit-learn setup,” said Joseph. He said that once the data pipeline is ready, switching to a new model is just a matter of changing one line of code. That was the guiding principle behind the development of PyTorch Tabular. Soon, he open-sourced the library for others to use and contribute to. It is one of the most liked and talked-about ML libraries on GitHub.
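
The "change one line" principle Joseph describes can be illustrated with a small sketch. This is not PyTorch Tabular's actual API: the class names `MeanBaseline` and `NearestNeighbour` and the helper `run_pipeline` are hypothetical, and plain tuples stand in for pandas data frames. The only point is that when every model shares the same fit/predict interface, the rest of the pipeline never has to change.

```python
from statistics import mean


class MeanBaseline:
    """Hypothetical model 1: always predicts the mean of the training targets."""

    def fit(self, rows, targets):
        self.value = mean(targets)
        return self

    def predict(self, rows):
        return [self.value for _ in rows]


class NearestNeighbour:
    """Hypothetical model 2: copies the target of the closest training row."""

    def fit(self, rows, targets):
        self.rows, self.targets = list(rows), list(targets)
        return self

    def predict(self, rows):
        def closest_target(row):
            # Squared Euclidean distance to every stored training row.
            dists = [sum((a - b) ** 2 for a, b in zip(row, r)) for r in self.rows]
            return self.targets[dists.index(min(dists))]

        return [closest_target(row) for row in rows]


def run_pipeline(model, train_rows, train_targets, test_rows):
    """The pipeline only relies on the shared fit/predict interface."""
    return model.fit(train_rows, train_targets).predict(test_rows)


train_rows = [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0)]
train_targets = [10.0, 20.0, 30.0]
test_rows = [(2.9, 0.1)]

model = MeanBaseline()  # swapping models means changing only this line
print(run_pipeline(model, train_rows, train_targets, test_rows))
```

Replacing `MeanBaseline()` with `NearestNeighbour()` is the only edit needed to try the second model, which is the Scikit-learn-style convenience the quote refers to.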

Enter GATE

One thing led to another; Joseph and his colleague Harsh Raj later released GATE (Gated Additive Tree Ensemble), a novel high-performance, parameter-efficient and computationally efficient deep learning architecture for tabular data. Inspired by GRU, GATE uses a gating mechanism as a feature representation learning unit with an in-built feature selection mechanism. It also uses an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention, to predict the desired output.

Joseph said that GATE is a competitive alternative to SOTA methods like GBDTs, NODE and FT Transformers, based on experiments on several public datasets covering both classification and regression. The code has not yet been open-sourced.

LAMA-Net 

At Thoucentric, Joseph, alongside Varchita Lalwani, recently developed LAMA-Net, a new encoder-decoder (Transformer) based model with an induced bottleneck, latent alignment using maximum mean discrepancy and manifold learning to tackle the problem of unsupervised homogeneous domain adaptation for remaining useful life (RUL) prediction. 

Citing predictive maintenance in manufacturing, Joseph said this is essentially a domain adaptation technique, where the focus is on how training data with shifting data distributions can be used to train a robust model that predicts remaining useful life.

“In a real-world implementation, it is really difficult to get the data needed to train these models. You will need to have data for multiple failures in the past, and failures are usually a rare event. So, getting the data is difficult,” said Joseph, saying that with domain adaptation, a model trained on existing datasets can now be adapted to a new dataset without any labels.
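
For readers unfamiliar with maximum mean discrepancy (MMD), the measure LAMA-Net is described as using for latent alignment, here is a minimal sketch of how it can be estimated between two small samples. It is not the LAMA-Net implementation: the RBF kernel, the `bandwidth` value and the toy points are assumptions chosen for illustration, and the function names are hypothetical. The idea is that a small MMD means the two sets of latent points look like draws from the same distribution, which is what a domain adaptation model tries to achieve between source and target data.

```python
import math
from itertools import product


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples."""
    return (
        mean_kernel(source, source, bandwidth)
        + mean_kernel(target, target, bandwidth)
        - 2.0 * mean_kernel(source, target, bandwidth)
    )


# Toy "latent" points: one target set close to the source, one far away.
source = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
aligned = [(0.05, 0.0), (0.0, -0.05), (-0.05, 0.05)]
shifted = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]

print("aligned:", mmd_squared(source, aligned))
print("shifted:", mmd_squared(source, shifted))
```

The shifted sample yields a much larger value than the aligned one. Using a quantity like this as a training penalty pushes the latent representations of the two domains together without needing labels for the target domain.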

What next? 

To date, Joseph has worked on more than 20 AI/ML projects professionally, and on more than ten in a personal capacity. At Thoucentric, he is currently building a team of data scientists who will work on new-age technologies to solve customer problems. The team is working on four different projects and is planning to publish three papers in the coming months.

Joseph told AIM that he would continue developing new methods and technologies in areas that do not use a lot of training data and build domain-agnostic models. “Because, having worked in the industry for some time now, I know that training data is very difficult to come by. That too, like annotated training data, is very, very difficult to come by,” said Joseph. He said that is why he is interested in areas like transfer learning, self-supervised learning, etc. 

Go-to Resources Curated by Manu Joseph

Data science resources:

Newsletters: 

AI/ML Courses: 

Must-read research papers 
