Use case: Crowd Counting

Application Areas[2]:

  1. Video Surveillance
  2. Event Planning and Space Design: Crowd counting can be applied at public rallies, sports events, and similar gatherings to measure the density of participants. This information is valuable for future event planning and space design.
  3. Extended Applications: The methods used here can also be applied to
    1. Counting cells or bacteria in microscopic images,
    2. Estimating animal populations in wildlife sanctuaries, or
    3. Estimating the number of vehicles at transportation hubs or in traffic jams, etc.

Context-Aware Crowd Counting[3]

The authors [Liu et al.] introduced an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location, which accounts for potentially rapid scale changes. This approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. According to the authors, this yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.

The following diagram represents the architecture of the Context-Aware Network:

Figure 1: Context-Aware Network (Image credit: Authors[3])
Given an image \(I\), the first 10 layers of the VGG-16 network, represented by \(\mathcal{F}_{vgg}\), output features of the form:
\[\textbf{f}_v = \mathcal{F}_{vgg}(I) \tag{1}\]
The authors use these \(\textit{base features}\) to build \(\textit{scale-aware}\) ones. The limitation of \(\mathcal{F}_{vgg}\) is that it encodes the same \(\textit{receptive field}\) over the entire image. To remedy this, the authors compute scale-aware features \(\textbf{s}_j\) by performing \(\textit{Spatial Pyramid Pooling}\) [4] to extract \(\textit{multi-scale context information}\) from the VGG base features of Equation \((1)\):
\begin{equation} \textbf{s}_j = U_{bi}(\mathcal{F}_j(P_{ave}(\textbf{f}_v, j), \theta_j)) \tag{2} \end{equation}
where, for each scale \(j\), \(P_{ave}(\cdot, j)\) averages the VGG features into \(k(j) \times k(j)\) blocks, and \(\mathcal{F}_j\) is a convolutional network with kernel size 1 that combines the context features across channels without changing their dimensions. The authors do this because SPP keeps each feature channel independent, which limits the representation power. They verified that performance drops without this step. \(U_{bi}\) represents bilinear interpolation, which up-samples the array of contextual features to the same size as \(\textbf{f}_v\). The authors use \(S = 4\) different scales, with corresponding block sizes \(k(j) \in \{1, 2, 3, 6\}\), since this setting performed better than the others they tried.
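Equation (2) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names (`adaptive_avg_pool`, `conv1x1`, `bilinear_upsample`, `scale_aware_features`) are hypothetical, the 1×1 convolution weights are stand-ins (identity matrices here), and the bilinear interpolation uses a corner-aligned sampling grid, which is an assumption.

```python
import numpy as np

def adaptive_avg_pool(f, k):
    """P_ave: average a (C, H, W) feature map into k x k blocks -> (C, k, k)."""
    C, H, W = f.shape
    out = np.empty((C, k, k))
    for a in range(k):
        r0, r1 = (a * H) // k, -((-(a + 1) * H) // k)  # floor start, ceil end
        for b in range(k):
            c0, c1 = (b * W) // k, -((-(b + 1) * W) // k)
            out[:, a, b] = f[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def conv1x1(x, theta):
    """F_j: a 1x1 convolution mixes channels at every spatial location."""
    return np.einsum("oc,chw->ohw", theta, x)

def bilinear_upsample(x, H, W):
    """U_bi: bilinear interpolation of (C, h, w) up to (C, H, W)."""
    C, h, w = x.shape
    ys, xs = np.linspace(0, h - 1, H), np.linspace(0, w - 1, W)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def scale_aware_features(f_v, thetas, block_sizes=(1, 2, 3, 6)):
    """Return one scale-aware feature map s_j per block size k(j)."""
    C, H, W = f_v.shape
    return [
        bilinear_upsample(conv1x1(adaptive_avg_pool(f_v, k), theta), H, W)
        for k, theta in zip(block_sizes, thetas)
    ]

rng = np.random.default_rng(0)
f_v = rng.standard_normal((8, 12, 16))   # stand-in for the VGG base features
thetas = [np.eye(8) for _ in range(4)]   # placeholder 1x1 conv weights
s = scale_aware_features(f_v, thetas)
print([s_j.shape for s_j in s])
```

Every \(\textbf{s}_j\) has the same shape as \(\textbf{f}_v\), which is what makes the subtraction in the contrast features possible. With \(k(j) = 1\) the pooled map is a single global average per channel, so the up-sampled result is constant over the image.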

Contrast features, which capture the differences between the features at a specific location and those in its neighborhood, are defined as:

\[c_j = s_j - f_v \tag{3}\]

This also provides an important visual cue that denotes saliency.

These contrast features are passed as input to auxiliary networks with weights \( \theta^j_{sa} \), which compute the weights \(\omega_j\) assigned to each of the \(S\) scales. Each such network outputs a scale-specific weight map of the form:
\[ \omega_j = F^j_{sa}(c_j, \theta^j_{sa}) \tag{4}\]
\(F^j_{sa}\) is a 1×1 convolutional layer followed by a sigmoid function, which keeps the weights strictly positive and so avoids division by zero in Equation \((5)\). These weights are then employed to compute the final contextual features as:
\[ \textbf{f}_I = \Bigg[\textbf{f}_v | \frac{\sum_{j=1}^{S}\omega_j \odot \textbf{s}_j}{\sum_{j=1}^{S}\omega _j} \Bigg] \tag{5} \]
where \([.\lvert.]\) denotes the channel-wise concatenation operation and \(\odot\) is the element-wise product between a weight map and a feature map.

\(\textbf{Geometry Guided Context Learning:}\) This variant exploits geometry information when it is available. The authors represent the \(\textit{scene geometry}\) of image \(I_i\) with a perspective map \(M_i\). The base features of Equation \((1)\) are then replaced by geometry-guided context features \(\textbf{f}_g\), defined as:
\[ \textbf{f}_g = \mathcal{F}^\prime _{vgg}(M_i, \theta_g) \tag{6} \]
where \(\mathcal{F}^\prime _{vgg}\) is a modified VGG-16 network with a single input channel. The weight map for each scale, previously given by Equation \((4)\), is then computed as:
\begin{equation} \omega_j = \mathcal{F}^j_{gc}([\textbf{c}_j | \textbf{f}_g], \theta^j_{gc}) \tag{7} \end{equation}
These weight maps are then used as in Equation \((5)\).
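The chain from contrast features to the final contextual features (Equations (3) to (5), without the geometry branch) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: `contextual_features` is a hypothetical name, the 1×1 convolution weights are random stand-ins, and each weight map is assumed to have as many channels as the feature maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contextual_features(f_v, s_list, theta_sa):
    """Fuse scale-aware features s_j into the contextual features f_I.

    f_v      : (C, H, W) base features
    s_list   : S arrays of shape (C, H, W), the scale-aware features
    theta_sa : S arrays of shape (C, C), assumed 1x1 convolution weights
    """
    num = np.zeros_like(f_v)
    den = np.zeros_like(f_v)
    for s_j, theta in zip(s_list, theta_sa):
        c_j = s_j - f_v                                       # Eq. (3)
        w_j = sigmoid(np.einsum("oc,chw->ohw", theta, c_j))   # Eq. (4)
        num += w_j * s_j                                      # element-wise product
        den += w_j                                            # > 0 thanks to the sigmoid
    fused = num / den
    return np.concatenate([f_v, fused], axis=0)               # Eq. (5)

rng = np.random.default_rng(0)
C, H, W, S = 4, 6, 8, 4
f_v = rng.standard_normal((C, H, W))
s_list = [rng.standard_normal((C, H, W)) for _ in range(S)]
theta_sa = [rng.standard_normal((C, C)) for _ in range(S)]
f_I = contextual_features(f_v, s_list, theta_sa)
print(f_I.shape)
```

The fused half of the output is a per-pixel weighted average of the \(\textbf{s}_j\), so if all scales carried identical features the fusion would return them unchanged, whatever the weights.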

The following diagram represents the architecture of geometry-guided context learning:

Figure 3: Expanded Context-Aware Network (Image Credit: Authors[3])

Cross-scene Crowd Counting via Deep Convolutional Neural Networks[1]

Challenges mentioned by authors:

  1. The performance of most existing crowd counting methods drops significantly when they are applied to an unseen scene.
  2. Severe occlusions
  3. Scene perspective distortion
  4. Diverse crowd distribution

The authors proposed a framework for cross-scene crowd counting that needs no extra annotations. They fine-tune a pre-trained CNN model for unseen scenes: when a test image from a new scene is given, they choose similar training data, based on the perspective information and similarity in the density map, and use it to fine-tune the pre-trained network.

The density map regression ground truth is commonly defined as a sum of Gaussian kernels centered on the locations of objects. This kind of density map is suitable for characterizing the density distribution of circle-like objects such as cells and bacteria. For pedestrian crowds, the authors instead created the crowd density map by combining several distributions with perspective normalization.
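The basic sum-of-Gaussian-kernels ground truth can be sketched as below. This is a simplified illustration only: it uses one fixed-width isotropic Gaussian per annotated point and omits the perspective normalization and the combination of several distributions described above. The function name `density_map`, the value of `sigma`, and the example coordinates are all assumptions.

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Build a density map from (x, y) point annotations.

    Each point contributes one Gaussian that is normalized to sum to 1
    inside the image, so the map sums to the number of annotated objects.
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    density = np.zeros(shape, dtype=float)
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        density += g / g.sum()
    return density

heads = [(10, 12), (30, 25), (31, 27)]   # hypothetical annotations
dmap = density_map(heads, (48, 64))
print(dmap.shape, round(dmap.sum(), 6))
```

Normalizing each kernel inside the image is one design choice among several; it guarantees that integrating the map recovers the annotated count even for points near the border.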

They proposed a switchable training scheme with two related learning objectives: density map estimation and global count estimation. The two related tasks assist each other and achieve a lower loss. The method requires perspective maps for both the training scenes and the test scene.

Single-Image Crowd Counting via Multi-Column Convolutional Neural Network[2]

The authors proposed a method that can accurately estimate the crowd count from a single image with arbitrary crowd density and arbitrary perspective.

Image -> Multi-column Convolutional Neural Network (MCNN) -> Crowd density map (integral gives overall crowd count).

Reason for selecting a multi-column architecture: the three columns correspond to filters with receptive fields of different sizes (large, medium, small), so that the features learned by each column CNN are adaptive to (and hence the overall network is robust to) large variations in people/head size caused by perspective effects or by differing image resolutions.

They replaced the fully connected layer with a convolution layer whose filter size is 1 × 1. As a result, the model accepts input images of arbitrary size, which avoids the distortion caused by resizing.
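The reason a 1 × 1 convolution removes the fixed-size constraint can be seen in a short NumPy sketch: the same weight vector is applied independently at every pixel, so nothing in the layer depends on the height or width of the feature map. The channel count of 30, the random weights, and the name `conv1x1` are assumptions made for illustration.

```python
import numpy as np

def conv1x1(features, weight):
    """Apply a 1x1 convolution: (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, features)

rng = np.random.default_rng(0)
weight = rng.standard_normal((1, 30))   # 30 feature channels -> 1 density channel

shapes = []
for h, w in [(24, 32), (57, 91)]:       # two different input resolutions
    feats = rng.standard_normal((30, h, w))
    density = conv1x1(feats, weight)
    shapes.append(density.shape)

print(shapes)
```

A fully connected layer would instead need a weight matrix whose size is tied to one particular `h * w`, which is why it forces every input to be resized to a fixed resolution.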

The output could have been the crowd count directly instead of the density map. The authors chose a density map because it gives additional information about the spatial distribution of the crowd in the given image. For example, if the density in a small region is much higher than in other regions, it may indicate that something abnormal is happening there.

When the density map is learned via a CNN, the learned filters are more adapted to heads of different sizes, and hence more suitable for arbitrary inputs whose perspective effect varies significantly. The filters are thus more semantically meaningful, which improves the accuracy of crowd counting.

The true density map is computed accurately using geometry-adaptive kernels, which do not require knowing the perspective map of the input image.
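A minimal sketch of the idea behind geometry-adaptive kernels: the spread of each head's Gaussian is tied to the average distance to its nearest annotated neighbours, so crowded regions get narrow kernels and sparse regions get wide ones. The factor `beta = 0.3` follows the value reported in the MCNN paper; the choice `k = 3`, the function name `adaptive_sigmas`, and the example points are assumptions.

```python
import numpy as np

def adaptive_sigmas(points, k=3, beta=0.3):
    """Per-point Gaussian spread: beta times the mean distance to the
    k nearest neighbouring annotations. Needs at least two points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # ignore distance to itself
    k = min(k, len(pts) - 1)
    nearest = np.sort(dist, axis=1)[:, :k]    # k smallest distances per point
    return beta * nearest.mean(axis=1)

heads = [(0, 0), (3, 0), (0, 4), (10, 10)]    # hypothetical head positions
sigmas = adaptive_sigmas(heads, k=2)
print(np.round(sigmas, 3))
```

The isolated head at (10, 10) receives a larger spread than the three clustered heads, which mimics the effect of perspective without requiring a perspective map.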

Once trained on one dataset, the model can be readily transferred to a new dataset. The crowd count is estimated without segmenting the foreground.

ShanghaiTech dataset: Nearly 1,200 images with around 330,000 accurately labeled heads. No two images in this dataset are taken from the same viewpoint. The dataset consists of two parts, Part A and Part B. Images in Part A are randomly crawled from the Internet, and most of them contain a large number of people. Images in Part B are taken on the busy streets of metropolitan areas in Shanghai.

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition[4]

Prevalent CNNs require a fixed input image size (e.g., 224 × 224), which limits both the aspect ratio and the scale of the input image.

Image Credit: Authors

To adapt the deep network to images of arbitrary sizes, the authors replaced the last pooling layer with a spatial pyramid pooling layer.

[Image Credit: Authors] A network structure with a spatial pyramid pooling layer. Here 256 is the filter number of the conv5 layer, and conv5 is the last convolutional layer.
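How a spatial pyramid pooling layer turns feature maps of any size into a fixed-length vector can be sketched in NumPy. The pyramid levels (1 × 1, 2 × 2, 4 × 4), the use of max pooling inside each bin, and the function name `spp` are assumptions chosen for illustration; only the 256 channels come from the figure description above.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Pool a (C, H, W) map into n x n bins for each pyramid level and
    concatenate the results into one vector of length C * sum(n * n)."""
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        for a in range(n):
            r0, r1 = (a * H) // n, -((-(a + 1) * H) // n)  # floor start, ceil end
            for b in range(n):
                c0, c1 = (b * W) // n, -((-(b + 1) * W) // n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
conv5_a = rng.standard_normal((256, 13, 13))   # feature map from one image size
conv5_b = rng.standard_normal((256, 10, 17))   # feature map from another size
vec_a, vec_b = spp(conv5_a), spp(conv5_b)
print(vec_a.shape, vec_b.shape)
```

Both vectors have length 256 × (1 + 4 + 16) = 5376, so the fully connected layers that follow see the same input size regardless of the image resolution or aspect ratio.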

A nice explanation of SPP is given by Cogneethi[5].


References

  1. [2015] Cross-scene Crowd Counting via Deep Convolutional Neural Networks by Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang
  2. [2016] Single-Image Crowd Counting via Multi-Column Convolutional Neural Network by Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma
  3. [2019] Context-Aware Crowd Counting by Weizhe Liu, Mathieu Salzmann, and Pascal Fua
  4. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
  5. Cogneethi