The Critical Data Science Skill You Were Never Taught - Problem Formulation

Problem formulation for data science in the AI world

The interview

Picture this: you are in the middle of a data science interview, and the interviewer describes a specific business problem. They say, “We are seeing a lot of customers dropping off the platform. Churn is high, and we are losing subscribers from our guided meditation app. If you were a data scientist with us, how would you solve this problem?” You’re delighted, because you know this answer! “This is straightforward,” you say confidently. “I will build a churn prediction model. Customers with a high predicted churn probability will be targeted with a retention intervention, perhaps a discount or a loyalty offer.”

The interviewer nods and probes further. “What kind of model would you use?” You are pleased, because now you can talk about the modelling techniques you know. You say, “Well, logistic regression is too simple for this use case. For tabular data, XGBoost is the best model. In fact, I have seen some new research showing that foundation models trained on tabular data can outperform even XGBoost when given the right context. So I would definitely try XGBoost or one of the new tabular foundation models, which might give you really good accuracy.”

That went well, you think while heading back home, hoping for a positive email the next day. Surprise, surprise! You failed, and it wasn’t even a tough call for the interviewer: this is exactly the kind of answer that leads to an immediate rejection. In particular, you failed on one key dimension: problem formulation.

You gave an answer that sounded right, yet you still failed the interview. What went wrong?

Problem formulation

Problem formulation is the way you translate a business problem into a data science problem. It involves many decisions:

  • whether to take a supervised or an unsupervised approach
  • what kind of inputs will go into the model
  • what the outcome will be (what is the Y variable?)

Specifying all of these is formulating a data science problem. To be concrete, the business problem is that customers are churning off the platform and we need to retain them. The data science problem could then be a supervised classification formulation in which the outcome (Y) is whether the customer churned, and the features (X variables) include the customer’s activity on the platform and their demographics. This was the formulation you chose. Whether it was a good formulation is beside the point, and you could not have known that from the information you had. The way you arrived at it was the real deal breaker.
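
One way to make these decisions explicit is to write the formulation down as a small record before touching any model. The sketch below is purely illustrative: the `ProblemFormulation` class, its field names, and the example assumptions are hypothetical choices made for this article, not a standard API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProblemFormulation:
    """A written-down data science formulation (hypothetical structure)."""
    business_problem: str
    approach: str                 # e.g. "supervised classification"
    target: str                   # the Y variable
    features: List[str]           # the X variables
    # Maps each assumption to whether it has been confirmed with the business.
    assumptions: Dict[str, bool] = field(default_factory=dict)

    def open_assumptions(self) -> List[str]:
        """Return the assumptions that nobody has confirmed yet."""
        return [a for a, confirmed in self.assumptions.items() if not confirmed]


# The formulation from the interview answer, with illustrative assumptions.
churn_formulation = ProblemFormulation(
    business_problem="Subscribers are leaving the guided meditation app",
    approach="supervised classification",
    target="whether the customer churned",
    features=["platform activity", "demographics"],
    assumptions={
        "churn has an agreed definition": False,
        "enough historical records exist": False,
        "interpretability is not required": False,
    },
)

for assumption in churn_formulation.open_assumptions():
    print("Unconfirmed:", assumption)
```

Every entry still marked `False` is a question to ask before committing to the formulation.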

What did you assume?

To understand this, first ask yourself: when you chose this formulation, what did you assume? In fact, why don’t you list all the assumptions you made? I can tell you there are quite a few. Here are some of my questions to you:

  1. Why did you assume that you have enough data records for XGBoost? 
  2. What definition of churn did you assume? In reality, different companies have different definitions and churn is hard to measure. 
  3. Do you even know whether they have been recording data for long enough to train a model?
  4. Do you know how the model will be used later? Will it be served on the platform? Or will a human look at these outputs?
  5. Do you know whether they need interpretability? If they do, a complex model like XGBoost may be out of the question.
  6. How do you know logistic regression is too simple for this situation?
  7. Will this be a one-time model, or will it need to be updated frequently?
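
To see why the churn definition in question 2 matters, here is a minimal sketch that labels the same users under two different inactivity windows. Everything in it is an assumption for illustration: the `label_churn` helper, the 30-day and 90-day thresholds, and the made-up activity dates. A real company may define churn quite differently, for example by subscription cancellation.

```python
from datetime import date


def label_churn(last_active, as_of, inactivity_days):
    """Label a user as churned if inactive for at least `inactivity_days` days."""
    return {
        user: (as_of - last_seen).days >= inactivity_days
        for user, last_seen in last_active.items()
    }


# Made-up last-activity dates for three hypothetical users.
last_active = {
    "user_a": date(2024, 5, 28),
    "user_b": date(2024, 4, 20),
    "user_c": date(2024, 2, 10),
}
as_of = date(2024, 6, 1)

churn_30 = label_churn(last_active, as_of, inactivity_days=30)
churn_90 = label_churn(last_active, as_of, inactivity_days=90)

print(sum(churn_30.values()), "users churned under the 30-day definition")
print(sum(churn_90.values()), "users churned under the 90-day definition")
```

With this toy data, two of the three users count as churned under the 30-day rule but only one under the 90-day rule. The Y variable changes before any model is chosen, which is why the definition has to be agreed with the business first.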

Real-world problem solving

You see, you made quite a lot of assumptions. Surfacing these assumptions and checking them, instead of silently making them, is exactly what problem formulation is about.

Problem formulation is the art and science of converting a business problem into a specified data science problem.

A good specification is aware of the business context, the constraints, and the requirements of the system it feeds into. You’re not here to make a model; you are here to solve a problem.

But on Kaggle, you were handed a dataset and told to optimize accuracy. The real world isn’t Kaggle, of course. Data is extremely precious, and the right data is elusive. Practically all data projects make do with what they have. There is no golden dataset. I have seen firsthand a project stall after four months because the right data was not available: the formulation the team was betting on was not feasible without a certain amount of data.

The better response

So what would have been a better way to respond to that interview question? The best first response is always to ask clarifying questions. Understand the business context. Understand the constraints, which are really important: business constraints are the constraints on your solution. Understand how the output will be used. Perhaps you do not need a model at all. Perhaps you need a thorough manual analysis whose output is a well-made visualization that gives people a valuable insight to act on. The key aspects to understand are implicit in the list of questions I asked above.

I trust this clarifies what problem formulation is and why it is needed. Remember, you are not here to build a high-accuracy predictive model. You are here to solve business problems. A solution to a business problem must respect business constraints and create business impact. It doesn’t matter whether that takes a foundation model, XGBoost, or a simple pivot table in an Excel file. Understanding problem formulation will help you decide what is needed for the job and make you immensely successful.
