**When building predictive models, you obviously need to pay close attention to their performance. That is essentially what it is all about: getting the prediction right. Especially if you are working for paying clients, you need to prove that the performance of your models is good enough for their business. Fortunately, there is a whole range of statistical metrics and tools at hand for assessing a model's performance.**

In my experience, performance metrics for (especially binary) classification tasks, such as the confusion matrix and the metrics derived from it, are naturally understood by almost anyone. The situation is a bit more problematic for regression and time series. For example, when you want to predict future sales or derive income from other parameters, you need to show how close your prediction is to the observed reality.

I will not write about (adjusted) R-squared, the F-test and other statistical measures. Instead, I want to focus on performance metrics that represent a more intuitive concept of performance, as I believe they can help you sell your work much better. These are:

- mean absolute error
- median absolute deviation
- root mean squared error
- mean absolute percentage error
- mean percentage error

Mean absolute error (MAE) takes the average absolute difference between your prediction and the observed reality. The metric treats positive and negative errors equally. Plus, the measure uses the same scale as the underlying data.

Instead of mathematical formulas, which you can easily find on Wikipedia, let me show you the R code.

```r
#' mean absolute error
f_calculate_mae <- function(observed, predicted, decimals=2){
  error <- observed - predicted
  return(round(mean(abs(error)), decimals))
}
```

When your errors are skewed, e.g. when a few errors are significantly larger than the others, the mean gets dragged away, and the median (half of the errors are below it, half above) is a better measure of what the errors typically look like. If we replace the mean by the median, we get the median absolute deviation (MAD).

```r
#' median absolute deviation
f_calculate_mad <- function(observed, predicted, decimals=2){
  error <- observed - predicted
  return(round(median(abs(error)), decimals))
}
```

So MAD plays down the role of very large errors. Sometimes we need to increase their influence instead, because these huge errors can hit us badly. This is when root mean squared error (RMSE) comes into play. By taking the root of the mean of squared errors, you increase the importance of larger errors.

```r
#' root mean squared error
f_calculate_rmse <- function(observed, predicted, decimals=2){
  error <- observed - predicted
  return(round(sqrt(mean(error^2)), decimals))
}
```
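To get a feeling for how these three metrics differ, here is a small sketch (the numbers are my own toy example, not from the simulations below) on a set of errors where one observation is badly off:

```r
# toy example: four small errors and one large outlier
observed  <- c(101, 99, 101, 99, 110)
predicted <- c(100, 100, 100, 100, 100)
error <- observed - predicted          # 1, -1, 1, -1, 10

mae  <- mean(abs(error))               # (1+1+1+1+10)/5 = 2.8
mad  <- median(abs(error))             # 1 -> ignores the outlier
rmse <- sqrt(mean(error^2))            # sqrt(104/5) ~ 4.56 -> amplifies it
```

MAD is blind to the single large error, MAE averages it in, and RMSE inflates its influence.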

Sometimes you also want to evaluate the error in relative terms. One of the common metrics is mean absolute percentage error (MAPE), which is almost like MAE, but you take relative errors and multiply the resulting number by 100 to get a score in percent.

```r
#' mean absolute percentage error
f_calculate_mape <- function(observed, predicted, decimals=2){
  mean_observed <- mean(observed) # trick to deal w/ zeros in observed
  pct_error <- (mean_observed - predicted) / mean_observed
  return(round(100 * mean(abs(pct_error)), decimals))
}
```

When taking absolute errors, all the errors add up. Sometimes you prefer that positive errors and negative errors cancel each other out. For that, there is mean percentage error (MPE).

```r
#' mean percentage error
f_calculate_mpe <- function(observed, predicted, decimals=2){
  mean_observed <- mean(observed) # trick to deal w/ zeros in observed
  pct_error <- (mean_observed - predicted) / mean_observed
  return(round(100 * mean(pct_error), decimals))
}
```
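A quick toy illustration (my own numbers) of the difference, using the same formula as the functions above: positive and negative errors add up in MAPE but cancel out in MPE:

```r
# toy example: prediction overshoots and undershoots a flat target of 100
observed  <- c(100, 100, 100, 100)
predicted <- c(90, 110, 95, 105)

mean_observed <- mean(observed)                    # 100
pct_error <- (mean_observed - predicted) / mean_observed

mape <- 100 * mean(abs(pct_error))                 # 7.5 -> errors add up
mpe  <- 100 * mean(pct_error)                      # 0   -> errors cancel out
```

A MAPE of 7.5% signals noticeable per-period errors, while the MPE of 0% says the prediction is unbiased overall.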

The best way to understand abstract mathematical concepts (in this case they are, to be honest, not that abstract) is to understand their behavior.

For this purpose, let's consider the following example: we take a normally distributed target with mean of 100 and standard deviation of 1, and we want to see what happens to our performance measures if the prediction has a different mean or a different standard deviation.

The following chart shows four examples of such scenarios.

```r
library(ggplot2)
library(reshape2)

#' simulate some data
set.seed(42)
my_df <- rbind(
  data.frame(time=c(1:24),
             observed = rnorm(n=24, mean=100, sd=1),
             predicted = rnorm(n=24, mean=100, sd=1),
             scenario = 'normally distributed prediction with mean=100, sd=1'),
  data.frame(time=c(1:24),
             observed = rnorm(n=24, mean=100, sd=1),
             predicted = rnorm(n=24, mean=100, sd=5),
             scenario = 'normally distributed prediction with mean=100, sd=5'),
  data.frame(time=c(1:24),
             observed = rnorm(n=24, mean=100, sd=1),
             predicted = rnorm(n=24, mean=101, sd=1),
             scenario = 'normally distributed prediction with mean=101, sd=1'),
  data.frame(time=c(1:24),
             observed = rnorm(n=24, mean=100, sd=1),
             predicted = rnorm(n=24, mean=101, sd=5),
             scenario = 'normally distributed prediction with mean=101, sd=5'))

my_df <- reshape2::melt(my_df, id.vars=c('time', 'scenario'))

ggplot2::ggplot(my_df, aes(x=time, y=value, color=variable)) +
  geom_line() +
  facet_wrap( ~ scenario, ncol=1) +
  theme(legend.position='bottom') +
  ggtitle('Normally distributed observed target\n(with mean=100, sd=1)')
```

Now let's see what different means and different standard deviations do to our performance metrics.

Let's consider samples of 12, 24, 36 and 100 observations (the first three give us an idea of how a monthly prediction can behave 1, 2, and 3 years ahead). We also consider the mean of the prediction to be the same as the target's and then higher by 1, 5, and 10. Similarly, the standard deviation of the prediction will be the same as the target's (1) and then 5 and 10. Lastly, we'll simulate 100 situations for each scenario, where a scenario is a combination of number of observations, prediction mean and prediction standard deviation. This leaves us with 4,800 simulation runs overall.

```r
#' simulation scenarios
for_df <- expand.grid(my_iteration = seq(1, 100),
                      my_n = c(12, 24, 36, 100),
                      my_mean = c(100, 101, 105, 110),
                      my_sd = c(1, 5, 10))

#' simulation loop
set.seed(24)
for (i in c(1:nrow(for_df))){
  #' simulate data (the observed target stays fixed; the prediction's mean and sd vary)
  observed <- rnorm(n=for_df$my_n[i], mean=100, sd=1)
  predicted <- rnorm(n=for_df$my_n[i], mean=for_df$my_mean[i], sd=for_df$my_sd[i])
  #' add performance metrics
  for_df$rmse[i] <- f_calculate_rmse(observed, predicted)
  for_df$mae[i] <- f_calculate_mae(observed, predicted)
  for_df$mad[i] <- f_calculate_mad(observed, predicted)
  for_df$mape[i] <- f_calculate_mape(observed, predicted)
  for_df$mpe[i] <- f_calculate_mpe(observed, predicted)
}

#' melt for plotting
for_df <- reshape2::melt(for_df, id.vars=c('my_iteration', 'my_n', 'my_mean', 'my_sd'))
```

Now, let’s have a look at the performance metrics for our scenarios.

```r
#' rmse, mae, mad
ggplot(for_df[for_df$variable %in% c('rmse', 'mae', 'mad'), ],
       aes(x=as.factor(my_mean), y=value, fill=as.factor(my_sd))) +
  geom_boxplot() +
  facet_wrap(~ as.factor(variable), ncol=3) +
  xlab('Prediction mean') +
  guides(fill=guide_legend(title='Prediction standard deviation')) +
  theme(legend.position='bottom') +
  ggtitle('Influence of mean and standard deviation of prediction')
```

As expected, the performance metrics increase with the difference between the prediction mean and the observed mean (box plots of the same color increase from left to right). The increase is steepest for RMSE and least steep for MAD.

We can also see that the higher the standard deviation, the higher the error (blue box plots are higher than red ones). This is especially true for RMSE, which penalizes larger errors more.

What is good to see is that all these metrics are on the same scale as the original data, so they are directly comparable. Especially with RMSE, it is tricky to visualize in your head what the root of mean squared errors looks like.

```r
#' mape, mpe
ggplot(for_df[!for_df$variable %in% c('rmse', 'mae', 'mad'), ],
       aes(x=as.factor(my_mean), y=value, fill=as.factor(my_sd))) +
  geom_boxplot() +
  facet_wrap(~ as.factor(variable), ncol=2) +
  xlab('Prediction mean') +
  guides(fill=guide_legend(title='Prediction standard deviation')) +
  theme(legend.position='bottom') +
  ggtitle('Influence of mean and standard deviation of prediction')
```

In the case of MAPE and MPE we can see that, by definition, the box plots are situated around the prediction mean, and their spread increases with the standard deviation. What is also interesting to note is the relation between MAPE and MPE for small versus large prediction means. For the lower values, MPE is much lower, wider and also negative. The reason is that these scenarios may produce predictions lying both above and below the observed target, so errors of opposite sign partially cancel. For higher means, MAPE and MPE are effectively equal in magnitude, because it is rare for the prediction to fall below the target. So for a significantly biased prediction, MAPE and the absolute value of MPE will be the same.

I hope that by looking at these simple charts you get some feeling for these performance metrics. That should make the choice of the right performance metric for each task easier.

Sometimes you care about the fit for every observation; sometimes an overall fit is good enough. In some cases you are vulnerable to extreme prediction errors, and sometimes you are not. These are all factors to take into consideration when choosing performance metrics. You should also think about creating your own performance metric if you need to cover a specific business situation. For example, when ordering durable goods based on a 6-month prediction, you care about overall performance, and the errors in the individual months are not that important, so MPE might be a good measure. But at the same time, you may prefer your prediction to be higher than the target (a gap that can be closed by a targeted marketing campaign) rather than lower (which is just lost business). So it would be handy to adjust the MPE measure accordingly.
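As a sketch of such an adjustment (entirely hypothetical: the function name and the weight are my own, not an established metric), one could weight the periods where the prediction falls short of the target more heavily than those where it overshoots:

```r
#' hypothetical asymmetric MPE: under-prediction (lost business) is
#' penalized more heavily than over-prediction (fixable by marketing)
f_calculate_asymmetric_mpe <- function(observed, predicted, under_weight=2, decimals=2){
  mean_observed <- mean(observed) # same zero-guard trick as above
  pct_error <- (mean_observed - predicted) / mean_observed
  # pct_error > 0 means the prediction is below the (mean) target
  weights <- ifelse(pct_error > 0, under_weight, 1)
  return(round(100 * mean(weights * pct_error), decimals))
}

# with a flat target of 100 and symmetric errors, plain MPE would be 0,
# but the weighted version flags the under-prediction:
f_calculate_asymmetric_mpe(rep(100, 4), c(90, 110, 95, 105))  # -> 3.75
```

A positive score now signals that, after weighting, the prediction tends to fall short of the target; setting `under_weight=1` recovers the plain MPE.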