Application Areas[2]:
- Video Surveillance
- Event Planning and Space Design: Crowd counting can be applied at public rallies, sports events, and similar gatherings to estimate the density of participants. This information can be crucial for future event planning and space design.
- Extended Applications: Methods used here can also be applied to:
  - counting cells or bacteria in microscopic images,
  - estimating animal crowds in wildlife sanctuaries, or
  - estimating the number of vehicles at transportation hubs or in traffic jams.
Context-Aware Crowd Counting[3]
The authors (Liu et al.) introduced an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location, which accounts for potentially rapid scale changes. The approach adaptively encodes the scale of the contextual information required to predict crowd density accurately. According to the authors, the resulting algorithm outperforms the state-of-the-art crowd counting methods of the time, especially when perspective effects are strong.
The following diagram represents the architecture of the Context-Aware Network:

Contrast features, which capture the differences between the features at a specific location and those in its neighborhood, are defined as:
They also provide an important visual cue that denotes saliency.
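As recalled from the paper [3] (so treat the notation as an assumption), the contrast features at scale j are the difference c_j = s_j - f_v, where f_v are the base features and s_j are scale-aware features obtained by average-pooling f_v into a k x k grid and upsampling back to the original size. The sketch below illustrates that idea on a single-channel map. It is a simplification, not the authors' implementation: it skips the learned 1 × 1 convolution and uses nearest-neighbour instead of bilinear upsampling, and the function names are hypothetical.

```python
def scale_aware_features(feature, k):
    """Average-pool a 2D map into a k x k grid, then upsample back to the
    input size by repeating each block average (nearest-neighbour).
    Assumes k is no larger than the map's height and width."""
    h, w = len(feature), len(feature[0])
    pooled = [[0.0] * w for _ in range(h)]
    for bi in range(k):
        r0, r1 = bi * h // k, (bi + 1) * h // k
        for bj in range(k):
            c0, c1 = bj * w // k, (bj + 1) * w // k
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            mean = sum(cells) / len(cells)
            for r in range(r0, r1):
                for c in range(c0, c1):
                    pooled[r][c] = mean
    return pooled


def contrast_features(feature, k):
    """c = s - f: neighbourhood average minus the local feature value."""
    scale_aware = scale_aware_features(feature, k)
    return [[s - f for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(scale_aware, feature)]


# A flat map with one salient location at (1, 1).
feature = [[1.0] * 4 for _ in range(4)]
feature[1][1] = 9.0
contrast = contrast_features(feature, k=1)
```

The salient location gets a large-magnitude contrast value while the flat background stays near zero, which is the saliency cue mentioned above.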
The following diagram represents the architecture of geometry-guided context learning:

Cross-scene Crowd Counting via Deep Convolutional Neural Networks[1]
Challenges mentioned by the authors:
- The performance of most existing crowd counting methods drops significantly when they are applied to unseen scenes.
- Severe occlusions
- Scene perspective distortion
- Diverse crowd distribution
The authors proposed a framework for cross-scene crowd counting that needs no extra annotations. They fine-tune a pre-trained CNN model for new, unseen target scenes. Given a test image from a new scene, they select similar training data, based on perspective information and similarity of the density maps, and use it to fine-tune the pre-trained network.
The ground truth for density map regression is commonly defined as a sum of Gaussian kernels centered on the locations of objects. This kind of density map is suitable for characterizing the density distribution of circle-like objects such as cells and bacteria. Pedestrians seen from a typical camera angle are not circle-like, so the authors created the crowd density map by combining several distributions with perspective normalization.
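A minimal sketch of the plain sum-of-Gaussians ground truth may help here. It places one normalised Gaussian per annotated position, so the map sums to the number of annotations. The fixed `sigma` and the function name are assumptions made for illustration, and the perspective normalization described above is not modelled.

```python
import math


def gaussian_density_map(height, width, heads, sigma=2.0):
    """Sum one normalised 2D Gaussian per annotated head position.

    heads is a list of (row, col) tuples; sigma is an assumed fixed spread.
    """
    density = [[0.0] * width for _ in range(height)]
    for hr, hc in heads:
        kernel = [[math.exp(-((r - hr) ** 2 + (c - hc) ** 2) / (2 * sigma ** 2))
                   for c in range(width)]
                  for r in range(height)]
        total = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                # Normalise inside the image so each person contributes 1.
                density[r][c] += kernel[r][c] / total
    return density


density = gaussian_density_map(32, 32, heads=[(8, 8), (10, 20), (25, 5)])
estimated_count = sum(map(sum, density))  # integrates to the head count
```

Because every kernel is normalised to sum to one, integrating the map recovers the count regardless of how the heads are distributed.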
They proposed a switchable training scheme with two related learning objectives: estimating the density map and estimating the global count. The two tasks assist each other and achieve a lower loss. The method requires perspective maps for both the training scenes and the test scene.
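The two objectives can be sketched as two loss functions over the same predicted map. This is only an illustration: the fixed-interval switching in `active_objective` is an assumed schedule, not the authors' switching criterion, and all names are hypothetical.

```python
def density_loss(pred, target):
    """Mean squared error between predicted and ground-truth density maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n


def count_loss(pred, target):
    """Squared error between the global counts (sums of the two maps)."""
    return (sum(map(sum, pred)) - sum(map(sum, target))) ** 2


def active_objective(step, switch_every=100):
    """Alternate between the objectives every `switch_every` steps
    (an assumed schedule, for illustration only)."""
    return density_loss if (step // switch_every) % 2 == 0 else count_loss


# The count is right but the mass is in the wrong place:
pred = [[0.5, 0.5], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]
```

In this example the count loss is zero while the density loss is not, which shows why the two objectives carry complementary information.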
Single-Image Crowd Counting via Multi-Column Convolutional Neural Network[2]
The authors proposed a method that can accurately estimate the crowd count from a single image with arbitrary crowd density and arbitrary perspective, taken from any viewpoint.
Image -> Multi-column Convolutional Neural Network (MCNN) -> Crowd density map (integral gives overall crowd count).
Reason for selecting a multi-column architecture: the three columns correspond to filters with receptive fields of different sizes (large, medium, small), so that the features learned by each column CNN are adaptive to large variations in people/head size caused by perspective effects or by differences in image resolution. The overall network is therefore robust to such variations.
They replaced the fully connected layer with a convolution layer whose filter size is 1 × 1. Because the network is then fully convolutional, the input image can be of arbitrary size, which avoids the distortion caused by resizing.
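To see why a 1 × 1 convolution does not constrain the input size, note that it is just a per-pixel weighted sum over channels. The following sketch (single output channel, hypothetical function name, plain lists instead of a tensor library) applies the same weights to feature maps of any height and width.

```python
def conv1x1(features, weights, bias=0.0):
    """Apply a 1x1 convolution with one output channel.

    features: list of C channel maps, each H x W (nested lists).
    weights:  list of C floats, one per input channel.
    """
    height, width = len(features[0]), len(features[0][0])
    return [[bias + sum(w * channel[r][c] for w, channel in zip(weights, features))
             for c in range(width)]
            for r in range(height)]


# Two channels of size 2 x 3; any other H x W works with the same weights.
features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
output = conv1x1(features, weights=[0.5, 2.0])
```

A fully connected layer, by contrast, has one weight per input position, so its parameter count fixes the input size.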
The network could have output the crowd count directly instead of a density map. The authors chose a density map because it also captures the spatial distribution of the crowd in the image. For example, if the density in a small region is much higher than in other regions, it may indicate that something abnormal is happening there.
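As a small illustration of the extra information a density map carries, the sketch below sums a map over non-overlapping blocks to get per-region counts alongside the overall count. The block size and the function name are assumptions, and deciding what counts as "abnormal" is left open.

```python
def region_counts(density, cell):
    """Sum a density map over non-overlapping cell x cell blocks."""
    h, w = len(density), len(density[0])
    return [[sum(density[r][c]
                 for r in range(r0, min(r0 + cell, h))
                 for c in range(c0, min(c0 + cell, w)))
             for c0 in range(0, w, cell)]
            for r0 in range(0, h, cell)]


density = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 1.0],
    [0.0, 0.0, 1.1, 1.0],
]
counts = region_counts(density, cell=2)   # one count per 2 x 2 region
total = sum(map(sum, counts))             # overall crowd count
```

Here almost all of the estimated crowd sits in the bottom-right region, a fact that a single scalar count would hide.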
When the density map is learned via a CNN, the learned filters adapt to heads of different sizes and are therefore better suited to arbitrary inputs whose perspective effect varies significantly. The filters are thus semantically more meaningful, which in turn improves the accuracy of crowd counting.
The ground-truth density map is computed accurately using geometry-adaptive kernels, which do not require knowledge of the perspective map of the input image.
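The idea behind geometry-adaptive kernels is that, in a dense crowd, the distance from a head to its neighbours is a proxy for its apparent size, so each head gets a Gaussian spread proportional to the mean distance to its k nearest neighbours. The sketch below computes those per-head spreads. The factor `beta=0.3` is the value reported in the paper [2]; `k=3`, the `fallback` spread for an isolated head, and the function name are assumptions made here.

```python
import math


def adaptive_sigmas(heads, k=3, beta=0.3, fallback=4.0):
    """Return one Gaussian spread per head: beta times the mean distance
    to its k nearest annotated neighbours."""
    sigmas = []
    for i, (r, c) in enumerate(heads):
        dists = sorted(math.hypot(r - r2, c - c2)
                       for j, (r2, c2) in enumerate(heads) if j != i)
        nearest = dists[:k]
        if nearest:
            sigmas.append(beta * sum(nearest) / len(nearest))
        else:
            sigmas.append(fallback)  # lone head: no neighbours to measure
    return sigmas


# Three heads close together and one far away from the group.
heads = [(0, 0), (0, 10), (10, 0), (100, 100)]
sigmas = adaptive_sigmas(heads, k=2)
```

Heads in the tight cluster receive small spreads and the isolated head a much larger one, without any perspective map being consulted.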
Once trained on one dataset, the model can be readily transferred to a new dataset. The crowd count is estimated without segmenting the foreground.
ShanghaiTech dataset: nearly 1,200 images with around 330,000 accurately labeled heads. No two images in this dataset are taken from the same viewpoint. The dataset consists of two parts, Part A and Part B. Images in Part A are randomly crawled from the Internet, and most of them contain a large number of people. Images in Part B are taken from the busy streets of metropolitan areas in Shanghai.
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition[4]
Prevalent CNNs require a fixed input image size (e.g., 224 × 224), which limits both the aspect ratio and the scale of the input image.

To adapt the deep network to images of arbitrary sizes, the authors replaced the last pooling layer with a spatial pyramid pooling layer.
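The key property of spatial pyramid pooling is that the output length depends only on the pyramid levels, not on the input size. The sketch below max-pools a single-channel feature map into n x n bins for each level and concatenates the results. The levels (1, 2, 4), the bin-boundary rule, and the function name are assumptions chosen for illustration.

```python
import math


def spatial_pyramid_pool(feature, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level
    and concatenate everything into one fixed-length vector."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor/ceil boundaries keep every bin non-empty.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(feature[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[r * 7 + c for c in range(7)] for r in range(5)]     # 5 x 7 map
large = [[r * 9 + c for c in range(9)] for r in range(13)]    # 13 x 9 map
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Both inputs produce a vector of 1 + 4 + 16 = 21 values per channel, which is what lets the subsequent fully connected layers accept images of any size.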

Cogneethi[5] gives a nice explanation of SPP.
References:
- [2015] Cross-scene Crowd Counting via Deep Convolutional Neural Networks by Cong Zhang, Hongsheng Li, Xiaogang Wang, Xiaokang Yang - Link
- [2016] Single-Image Crowd Counting via Multi-Column Convolutional Neural Network by Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, Yi Ma - Link
- [2019] Context-Aware Crowd Counting by Weizhe Liu, Mathieu Salzmann, Pascal Fua - Link
- Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun - Link
- Cogneethi - Link
- https://github.com/gjy3035/Awesome-Crowd-Counting