Performance on Normal images
We tested both ESRGAN and ZSSR and compared the quality of their results. In Figure 7, the input image is shown on the left, and the outputs of ESRGAN and the original ZSSR are shown in the center and on the right, respectively. There is no obvious difference in output quality between the two models. We repeated this experiment on many other images from different categories and observed the same result.
Figure 7: Super-resolved images from the ESRGAN and ZSSR models (two examples; panels left to right: Input, ESRGAN, ZSSR)
Performance on non-ideal images (images with different color distribution)
When comparing the two models, we found that for non-ideal cases such as medical or satellite imagery, ESRGAN fails to preserve color information. In Figure 8, ESRGAN introduces a pinkish shade in the output image even though the input image contains no pink. ZSSR, on the other hand, introduces no such shade and preserves the color information.
This happens because the external learning-based model (ESRGAN) is trained on a large dataset that does not include images with a color encoding similar to that of X-ray images. The shift in color distribution causes it to encode the test image incorrectly. The internal learning-based model, by contrast, uses only the internal information of the input image to build the high-resolution image, so it encodes the color information accurately.
Figure 8: Super-resolved image of an X-ray (panels: Input, ESRGAN, Original ZSSR, Modified ZSSR)
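The internal learning idea described above can be illustrated with a small sketch under simplifying assumptions: a low-resolution/high-resolution training pair is built from the test image alone by downscaling it. The function name `make_internal_pair` and the block-averaging kernel are illustrative choices, not the ZSSR implementation.

```python
import numpy as np


def make_internal_pair(image, factor=2):
    """Build one (low_res, high_res) training pair from the test image itself.

    Expects an H x W x C array. The averaging kernel is an assumption made
    for this sketch.
    """
    height, width = image.shape[:2]
    # Crop so both spatial dimensions divide evenly by the factor.
    height -= height % factor
    width -= width % factor
    high_res = image[:height, :width]
    # Average each non-overlapping factor x factor block.
    blocks = high_res.reshape(
        height // factor, factor, width // factor, factor, -1
    )
    low_res = blocks.mean(axis=(1, 3))
    return low_res, high_res
```

Because both halves of the pair come from the same image, the training data shares the color distribution of the test image, which is consistent with the color behavior observed for ZSSR above.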
Performance on images with textual data
ESRGAN performed very well on images containing visual objects. To examine how ESRGAN and ZSSR handle text, which is a non-ideal case, we took an image containing both visual objects and text and passed it to both models, as seen in Figure X. The overall quality looks satisfactory to the human eye, but zooming in reveals the artifacts each model introduces. Consider the text on the board behind the car, especially the word “Masterpiece”. Figure 9 shows a significant difference between the output quality of ESRGAN and ZSSR on this text, indicating that ZSSR works better in non-ideal conditions.
It is not possible to include every possible word in the training data, so text in a test image may be new to an externally trained model, which makes super-resolving it harder. An internal learning method, however, can capture this information during test-time training. This is why internal learning works better in non-ideal cases.
The result produced by Modified ZSSR is also very close to that of the original ZSSR model, yet Modified ZSSR produced it 4 times faster than the original ZSSR.
Figure 9: Super-resolved images with a focus on textual data (panels: Input Image, Original ZSSR, ESRGAN, Modified ZSSR)
Performance on images with noise
Noise tolerance is one of the most important properties of an SR system. To test the noise tolerance of the ESRGAN and ZSSR models, we took an image and added noise by randomly blacking out some of its pixels. We did this only on the left half of the image so that the noisy half could be compared directly with the clean half.
We found that ESRGAN introduced artifacts such as blurriness while super-resolving the image. The original and Modified ZSSR performed much better and did not introduce such artifacts, as shown in Figure 10.
Figure 10: Super-resolved images with noise (panels: Input Image, Original ZSSR, ESRGAN, Modified ZSSR)
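The noise injection used in this test can be sketched roughly as follows. The function name, the `drop_fraction` value, and the fixed seed are assumptions for illustration; the fraction of pixels actually blacked out in the experiment is not stated.

```python
import numpy as np


def black_out_left_half(image, drop_fraction=0.1, seed=0):
    """Return a copy of `image` with random pixels in its left half set to 0.

    `drop_fraction` is a hypothetical parameter: the probability that any
    given pixel in the left half is blacked out.
    """
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    height, width = image.shape[:2]
    half = width // 2
    # One boolean per pixel of the left half; True means "black out".
    mask = rng.random((height, half)) < drop_fraction
    # The slice is a view, so this assignment writes into `noisy`.
    noisy[:, :half][mask] = 0
    return noisy
```

The right half is left untouched, so the two halves of a single super-resolved output can be compared side by side.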
Gradual Super-Resolution
As explained in the “Proposed Work” section, we also implemented the Gradual SR technique for ZSSR, in which the image is upscaled step by step rather than in a single step. With this technique, the quality of the results stayed the same: there was no obvious difference, as can be seen in Figure 10.
Non-Gradual Super-Resolution (1x → 4x)
Gradual Super-Resolution (1x → 2x → 4x)
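The difference between the two schedules above can be sketched as a loop over scale factors. Here `upscale_fn` is a placeholder for one full super-resolution step (for ZSSR, that would presumably include the test-time training for that step), and `nearest_neighbor_upscale` is only a trivial stand-in so that the sketch runs; neither name comes from the ZSSR code.

```python
import numpy as np


def nearest_neighbor_upscale(image, factor):
    """Stand-in upscaler: repeat every pixel `factor` times along H and W."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)


def gradual_upscale(image, upscale_fn, factors=(2, 2)):
    """Apply `upscale_fn` once per factor, feeding each output to the next step."""
    result = image
    for factor in factors:
        result = upscale_fn(result, factor)
    return result
```

With this helper, `factors=(2, 2)` corresponds to the gradual schedule (1x → 2x → 4x) and `factors=(4,)` to the non-gradual one (1x → 4x).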
Future Work
- Internal learning methods offer numerous benefits over external learning. They can be extended to problems such as spatial and temporal video super-resolution. This matters because the content of a video is highly specific, so external learning-based models will not generalize well; for temporal super-resolution, they may simply interpolate the temporal information. Internal learning approaches stand out in such scenarios. For example, they could be used to enhance old videos on YouTube. Because this has to be done only once, the extra computation is affordable.
- Combining external and internal learning methods could give the best of both worlds. One naive example is to use a pre-trained ESRGAN as the initialization for internal learning. This would aid faster convergence and work for both ideal and non-ideal cases.
- The same idea can be extended to multiple-image super-resolution (MISR). All of the input images can be passed through a shared backbone to extract the final outputs, and the geometric self-ensemble can then be used to obtain the final super-resolved image.
- The same idea can be applied to other vision tasks such as de-blurring and de-hazing.
- It can be extended to converting grayscale images and videos to RGB. This is somewhat more complicated because additional color information has to be injected into the model.
- It can also be explored for complex vision tasks such as semantic segmentation. A few works have achieved decent performance using internal learning.
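The geometric self-ensemble mentioned in the MISR item above can be sketched as follows: the input is run through the model under eight rotations and flips, each output is mapped back to the original orientation, and the results are combined. The function name and the use of a mean are assumptions for this sketch (a median is another common choice), and `upscale_fn` stands for any super-resolution model.

```python
import numpy as np


def geometric_self_ensemble(image, upscale_fn):
    """Average the outputs of `upscale_fn` over 4 rotations x 2 flips."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            # Forward transform: rotate by k * 90 degrees, then optionally flip.
            view = np.rot90(image, k, axes=(0, 1))
            if flip:
                view = np.flip(view, axis=1)
            out = upscale_fn(view)
            # Inverse transform, in reverse order: undo the flip, then the rotation.
            if flip:
                out = np.flip(out, axis=1)
            out = np.rot90(out, -k, axes=(0, 1))
            outputs.append(out)
    return np.mean(outputs, axis=0)
```

For a model that is not perfectly symmetric under these transforms, the eight outputs differ slightly, and combining them tends to smooth out orientation-specific artifacts.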