Evaluating Crowd Density Estimation Models

14 May, 2019

This post is an extension of the crowd density estimation post. The models used here were trained with almost no hyperparameter tuning, so please ignore their actual performance. It should be possible to improve performance by carefully selecting the image augmentation(s), learning rate, learning rate schedule, and optimizer. The objective of this post is to visualise the predictions across different variables to understand model performance and identify areas that can be improved.

The Vega plots displayed on this page are pulled from GitHub and might take some time to load, depending on your network connection.


In this section, we are going to plot the predictions of models based on two different architectures: VGG16 Baseline and VGG16+Decoder model. These predictions are generated by 4 different models:

  • VGG16 Baseline: trained on 224px images
  • VGG16 Baseline: trained on 448px images
  • VGG16+Decoder: trained on 224px images
  • VGG16+Decoder: trained on 448px images

Each plot displays two overall metrics: Mean Squared Error and Mean Absolute Error. We are using two different plots in this section.

1: Scatter Plot

2: Scatter + Heatmap

We start with the baseline architecture and then present the decoder-based architecture. Each sub-section poses a question, and the plots provide a potential answer, if any.
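As a reminder of what the two overall metrics measure, here is a minimal sketch in plain Python. The function names and the sample counts are made up for illustration; they are not the code or the data behind the plots.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted counts."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted counts."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical crowd counts for four images (not taken from the post).
true_counts = [120, 450, 800, 60]
pred_counts = [100, 500, 650, 70]

mse = mean_squared_error(true_counts, pred_counts)
mae = mean_absolute_error(true_counts, pred_counts)
print(f"MSE={mse:.1f}  MAE={mae:.1f}")
```

Because the errors are squared, MSE is dominated by the few images with large misses, while MAE weighs every image equally. Reading the two together hints at whether the error is spread evenly or concentrated in a handful of hard images.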



In this sub-section, we plot the predictions generated by the baseline model on the training set. We are looking for a few things here: How well does the model overfit the training data? Are there any significant differences between the performance of the baseline model trained on 224px images vs. 448px images?

The checkbox presents the option to display true vs. predicted scatter plots for three different input sizes.

Interactive Plot: hover over the points or use the mouse to zoom in/out


We repeat the same for the VGG16+Decoder architecture. Note that we are still not making any direct comparison between the two architectures.


The questions remain the same: How well does the model overfit the training data? Are there any significant differences between the performance of the decoder model trained on 224px images vs. 448px images?

Interactive Plot: hover over the points or use the mouse to zoom in/out


We now move to a direct comparison between two different architectures. We found above that increasing the input size of training data improves the model performance. We use the predictions from the models trained on 448px images from now on.

We continue plotting predictions on images of three different input sizes, but this time we plot the train and test sets together. The plot on the left shows the baseline predictions, and the one on the right shows those of the decoder. Here, you can highlight a section of the plot and make a direct comparison between the baseline and decoder models.

Left: VGG16 Baseline, Right: VGG16 + Decoder

Does the decoder seem to generate tighter predictions (i.e., closer to the diagonal line) for larger images? Some images are really hard for both models, aren't they? Are there any similarities between these images 🤔? Let's cluster the images and try to find out.
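One way to make "hard for both models" concrete is to flag the images that both models miss by more than some relative tolerance. The sketch below is only an illustration: the helper name, the 25% tolerance, and the sample numbers are my own assumptions, not something derived from the plots above.

```python
def hard_for_both(true_counts, baseline_preds, decoder_preds, rel_tol=0.25):
    """Return indices of images that both models miss by more than rel_tol * true count."""
    hard = []
    for i, (t, b, d) in enumerate(zip(true_counts, baseline_preds, decoder_preds)):
        limit = rel_tol * t  # allowed deviation from the diagonal for this image
        if abs(b - t) > limit and abs(d - t) > limit:
            hard.append(i)
    return hard


# Hypothetical counts for four images (illustrative only).
true_counts = [100, 400, 900, 50]
baseline_preds = [110, 250, 500, 55]
decoder_preds = [95, 380, 600, 52]

hard = hard_for_both(true_counts, baseline_preds, decoder_preds)
print("Hard for both models:", hard)
```

A relative tolerance is used here because an error of 50 people means something very different for a crowd of 60 than for a crowd of 900; an absolute threshold would be an equally valid choice.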


And in the final section, we use the power of cross-filtering to review specific predictions across different models as well as the image clusters. We use the dimensionality reduction algorithm, t-SNE in this case, to cluster the images.

The plots below are generated using the baseline and decoder-based models trained on 448px images. We showed model predictions on 224px, 448px, and 600px images in the sections above. Here, however, I use only the predictions on 448px images, mainly because the goal is to highlight the approach. We could do the same for other input sizes and datasets if required.

You can select a part of one plot and see the other two plots highlight the relevant points/images. Go ahead, give it a try.
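Under the hood, cross-filtering boils down to sharing a set of selected image ids between views. The following is a minimal sketch of that idea in plain Python, not the Vega specification used on this page; the function name, image ids, and coordinates are all hypothetical.

```python
def brush_select(points, x_range, y_range):
    """Return the ids of points that fall inside a rectangular brush (bounds inclusive)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {
        pid
        for pid, (x, y) in points.items()
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    }


# View 1: image id -> (true count, predicted count) in the scatter plot.
scatter = {
    "img_a": (120, 100),
    "img_b": (700, 450),
    "img_c": (850, 500),
    "img_d": (60, 70),
}
# View 2: image id -> 2-D embedding coordinates (e.g. from t-SNE).
embedding = {
    "img_a": (-3.1, 0.4),
    "img_b": (2.0, 5.5),
    "img_c": (-7.2, -1.3),
    "img_d": (-2.8, 0.9),
}

# Brush over the region with true counts of 600-1000 and predictions below 600.
selected = brush_select(scatter, x_range=(600, 1000), y_range=(0, 600))

# The same ids are then highlighted in the second view.
highlighted = {pid: embedding[pid] for pid in sorted(selected)}
print(highlighted)
```

Keying every view by the same image id is the design choice that makes this work: a selection made in any one plot can be applied to all the others without knowing anything about their axes.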

Train set: Use the mouse to select the area

The models were trained on 448px images, and we see that both can overfit the training data, which is something you want to confirm before you start regularising a model. Now, let's review the test data.

Test set: Use the mouse to select the area

Clearly, we have some work to do to get the model to generalise better. We also need to review the images that have a crowd density between 600 and 1000 (see x-axis). These images do not come from the same cluster.

Although there aren't nice, clear clusters in this particular dataset, the idea seems quite useful to me. It can help me answer questions such as: do certain clusters perform worse than others? If so, I can then review the images in that cluster manually.
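The "do certain clusters perform worse" question can also be checked numerically once every image carries a cluster label. Here is a minimal sketch, assuming such labels already exist (how they are derived from the embedding is left open); the function name and the numbers are illustrative only.

```python
from collections import defaultdict


def mae_by_cluster(cluster_ids, true_counts, pred_counts):
    """Group absolute errors by cluster label and return the mean per cluster."""
    errors = defaultdict(list)
    for c, t, p in zip(cluster_ids, true_counts, pred_counts):
        errors[c].append(abs(t - p))
    return {c: sum(e) / len(e) for c, e in errors.items()}


# Hypothetical labels and counts for five images (illustrative only).
cluster_ids = [0, 0, 1, 1, 1]
true_counts = [100, 200, 700, 900, 800]
pred_counts = [110, 190, 500, 600, 650]

per_cluster = mae_by_cluster(cluster_ids, true_counts, pred_counts)
worst = max(per_cluster, key=per_cluster.get)
print(per_cluster, "worst cluster:", worst)
```

The cluster with the highest mean absolute error is the natural place to start a manual review of the images.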


And that concludes the post. I intended to share how I use different plots to visualise and compare model predictions and use that to drive my investigations. And hopefully, I was able to convey that.