
ZSSR Reimplementation

Originally, ZSSR[1] was implemented in an older version of TensorFlow that is no longer supported by the TF team. We therefore decided to re-implement ZSSR in PyTorch, since it is the most prevalent ML framework among researchers.

ZSSR vs ESRGAN

We compared the ZSSR[1] method with the popular ESRGAN approach proposed in [2] to see how internal and external learning-based methods perform on image SR tasks across images from different categories, to study their pros and cons, and to identify use cases where one benefits more than the other.

Gradual Super-Resolution in ZSSR

After reimplementing ZSSR, we focused on experimenting further with it, in particular on gradual super-resolution. For instance, if we want to rescale an image from 32x32 to 256x256, we can do so in multiple passes: first we pass the 32x32 image to the network to generate a 64x64 output, then we use this output to generate 128x128, and finally we pass that back to the same network to generate the desired 256x256 resolution. This matters because rescaling an image, say from 240x240 to 1080x1080, in a single pass is not a good approach; doing so results in interpolation artifacts.
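A minimal sketch of this gradual scheme in Python is shown below. The helper zssr_upscale is hypothetical and stands in for one full ZSSR train-and-upscale pass at a x2 factor; the names and signature are illustrative, not part of the original code.

    import math

    def gradual_super_resolve(image, target_scale, zssr_upscale):
        """Upscale `image` by `target_scale` using successive x2 ZSSR passes."""
        num_passes = int(math.log2(target_scale))   # e.g. 32x32 -> 256x256 is 3 passes of x2
        for _ in range(num_passes):
            # Each pass feeds the previous output back into the (re-trained) network.
            image = zssr_upscale(image, scale=2)
        return image

    # Usage (hypothetical): sr = gradual_super_resolve(lr_image, target_scale=8, zssr_upscale=run_zssr)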

Architectural changes in ZSSR (Modified ZSSR)

The fundamental problem of the existing ZSSR framework is its test-time training, which makes it hard to deploy. As a result, we concentrated on reducing the time it takes to generate high-resolution images for deployment purposes.


  • One of the popular deep learning models for image classification is MobileNet. The MobileNet model is designed to maximize accuracy while being mindful of the restricted resources of on-device or embedded applications. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. MobileNet uses depth-wise separable convolutions, which significantly reduce the number of parameters compared to a network of the same depth built with regular convolutions, resulting in lightweight deep neural networks. Inspired by this, we have replaced all the convolutions with depth-wise separable convolutions (a layer-level sketch follows this list).

  • External learning-based models must be trained on a vast pool of photos. The entropy across distinct photos is extremely large, making the features very difficult to learn. For internal learning-based approaches, however, we only have one image, which implies low entropy: we do not require a large model to capture the recurrent internal features. So, in order to reduce the complexity of the model, we experimented with the number of channels. We tried several widths and found little difference in the final outcome; finally, we used 32 channels for the entire model to reduce computation time even further.

  • We also experimented with various activation functions. Our experiments revealed that varying the activation function does not produce any significant change. Finally, we used the PReLU activation function, which showed a very small improvement.

  • The authors used a learning rate of 0.001 for training the model. Our experiments revealed that a higher learning rate leads to shorter training time without hurting the model's performance. As a result, we employed a learning rate of 0.1 with learning rate scheduling (see the setup sketch after this list).

  • Lastly, we removed the test-time augmentation. The test-time augmentation used in the ZSSR paper is the geometric self-ensemble proposed in [4], which generates 8 different outputs for the 8 rotations and flips of the test image and then combines them; the authors take the median of these 8 outputs rather than their mean. Removing this step reduced computation time, at the cost of slightly less sharpness in the images (the removed ensemble is sketched after this list for reference).
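To make the first three changes concrete, the following is a minimal PyTorch sketch of one hidden layer of the modified network: a depth-wise separable convolution with 32 channels followed by PReLU. The class and parameter names are ours and only illustrate the idea, not the exact architecture.

    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """Depth-wise separable convolution: a per-channel (depthwise) conv
        followed by a 1x1 (pointwise) conv, as popularized by MobileNet."""
        def __init__(self, in_channels, out_channels, kernel_size=3):
            super().__init__()
            self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                       padding=kernel_size // 2, groups=in_channels)
            self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    # One hidden layer of the modified ZSSR network: 32 channels and PReLU.
    layer = nn.Sequential(
        DepthwiseSeparableConv(32, 32),
        nn.PReLU(num_parameters=32),
    )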
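The higher learning rate can be set up along the following lines. The concrete scheduler shown here (ReduceLROnPlateau) is an assumption for illustration; we only state that learning rate scheduling was used.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the SR network
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=50)

    # Inside the training loop, after computing the reconstruction `loss`:
    #   optimizer.zero_grad(); loss.backward(); optimizer.step()
    #   scheduler.step(loss.item())   # drop the LR when the loss plateaus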
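For reference, the geometric self-ensemble that we removed can be sketched as follows: the test image is passed through the network in its 8 rotation/flip variants, each output is mapped back to the original orientation, and the per-pixel median is taken. This is a sketch assuming `model` is a callable on (1, C, H, W) tensors, not the original implementation.

    import torch

    def geometric_self_ensemble(model, image):
        """Median of the 8 rotation/flip variants of `image` (shape (1, C, H, W))."""
        outputs = []
        for k in range(4):                       # the 4 rotations
            for flip in (False, True):           # with and without a horizontal flip
                aug = torch.rot90(image, k, dims=(-2, -1))
                if flip:
                    aug = torch.flip(aug, dims=(-1,))
                out = model(aug)
                if flip:                         # undo the augmentation on the output
                    out = torch.flip(out, dims=(-1,))
                out = torch.rot90(out, -k, dims=(-2, -1))
                outputs.append(out)
        return torch.stack(outputs).median(dim=0).values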


All these components together significantly reduce the computation time: the average inference time on a Tesla K80 drops from 227 seconds to 48 seconds. The inference time should be considerably lower on high-end GPUs, but due to a lack of resources, we were unable to report inference times on other machines.
