Cost-Benefit Analysis of a ML classifier.#

In the previous notebook, we trained a LogisticRegression classifier to predict the number in a handwritten digit image. In this notebook, we will use the results to perform a cost-benefit analysis in a business setting.

For this notebook, let’s asumme the accuracy of our classifier was 97% over the test set.

Automatic mail redirection in a postal service 📩#

Suppose we want to deploy our classifier in a postal service hub, to automatize the ZIP code reading of handwritten addresses. The classifier will be used to read the ZIP code of the address, and then the mail will be automatically redirected to the corresponding ZIP code.

Assuming a ZIP code is 5 digits long, what is the probability that the classifier will read the 5 digits correctly? And incorrectly?

\[P(zip\_ok) = (0.97)^5 = 0.86\]
\[P(zip\_not\_ok) = 1 - P(zip\_ok) = 0.14\]
p_zip_ok = 0.97 ** 5
p_zip_not_ok = 1 - p_zip_ok

p_zip_ok, p_zip_not_ok
(0.8587340256999999, 0.1412659743000001)

These are the probabilities of correctly routing the mail to the correct ZIP code, and incorrectly routing the mail to the wrong ZIP code, respectively.

Now, we need to asign those two events some economic value (cost or benefit).

  1. If the mail is incorrectly routed, the mail will arrive later to its destination. This will cause a customer complaint, and the postal service will have to reimburse the customer with the 1€ fee he has paid before for early delivery.

  2. For the mails that are correctly routed, we need to estimate the savings that the company will have. As an example, suppose that with this ML system, we automatize the work of some postal worker. A postal worker can process 200 mails per hour, and the postal service pays him minimum wage (8.45 €/hour). If using the ML system, the worker can do other tasks, so the company will save the cost of mail processing.

The cost of (manually) processing a mail is thus \(8.45 / 200 = 0.04€\) per mail.

Notice that now, we have two different costs: the cost of misclassification, and the cost (savings) of correct classification:

cost_miss = -1. # we use negative number to indicate it's bad (a cost)
cost_correct = +0.04  # we use positive number to indicate it's good (a saving)

Now, to compute the Expected Value of the classifier, we just need to compute the expected cost, which is the sum of the cost of misclassification and the cost of correct classification, multiplied by the probability of each event:

\[ EV = P(zip\_ok) * cost(zip\_ok) + P(zip\_not\_ok) * cost(zip\_not\_ok) \]
\[ EV = 0.86 * 0.04€ + 0.14 * (-1€) = -0.11 € \]
EV = p_zip_ok * cost_correct + p_zip_not_ok * cost_miss
EV
-0.1069166132720001

This value is negative, so the company would lose money if they deploy the ML system!!

Since we have computed an expectation, on average, the company will lose 0.11€ per mail.

This is because the cost of misclassification is much higher than the cost of correct classification.

Alternative scenario: using a better ML model#

Let’s assume now we develop a better model, that has an accuracy of 99.91% over the test set. What is the expected value of this model for the ZIP redirection task?

p_zip_ok = 0.9991 ** 5
p_zip_not_ok = 1 - p_zip_ok

p_zip_ok, p_zip_not_ok
(0.9955080927132799, 0.004491907286720109)
EV = p_zip_ok * cost_correct + p_zip_not_ok * cost_miss
EV
0.03532841642181109

Now, the EV is positive, so the company will save money if they use this newer ML model.

When using a ML model in a business setting, it is important to not only consider metrics such as accuracy, but also other economic factors, such as the cost of misclassification, or the cost of correct classification.