As you might have seen from our previous blogposts, BinaryEdge gathers many types of data from multiple sources and tries to apply different methodologies for analysis.
Today, we are going to take a look at the exposure of the VNC service to the internet and how we use Data Science to help us identify points of interest on screenshots and images.
What is VNC?
VNC is a service commonly used to manage remote machines. There are multiple software implementations of this protocol, but essentially a user can install a VNC server on a remote machine and use a client application to access this machine remotely.
The server can then be protected with a password for extra security, however, as we have seen from the previous blogposts, users tend not to care about security.
One important disclaimer here:
BinaryEdge only looks at open VNC servers. Those servers that request a password are completely skipped and no passwords are even tested. We have detected though, with sensors we have deployed, people trying to use weak and default passwords to try and access these servers, but this is not the case here. We have also started the process to notify the people responsible for these IP addresses, so that they can fix these as soon as possible.
Some previous research has been done on this topic as seen on this Forbes article.
How did we get VNC screenshots? (Data acquisition for this analysis)
By using the BinaryEdge engine, we are able to collect internet wide census of different services, ports and protocols. We acquire this data, analyse it and report it to our clients so that they always have updated information about their perimeters and can be less exposed to possible security threats.
For the case of the VNC service, the process goes like this:
However, as you might have guessed, the IPv4 space contains a lot of IP addresses and there is a lot of VNC machines out there. This brought us to an interesting set of problems.
There are 3 main problems when analysing these VNC images (and other kinds of screenshots in general of course, meaning that the methods shown are not VNC exclusive as we gather data from other services as well):
Lots of images - As we have mentioned, there are a lot of IP addresses out there that expose the VNC service
Lots of similar images - Lots of the devices are locked or just have similar screensavers
Lots of "boring" images - Quite a big number of "locked" Microsoft Windows systems or Linux consoles requesting login were found. We consider these as "boring" images since there isn't much happening and there's not much information to gather. A more interesting set of images would be, for instance:
We couldn't manually go and look at all the images we have acquired one by one. So, what we decided to do instead, was to use Data Science techniques to help us automatically filter and identify those interesting images.
All methods applied will be explained while trying to avoid the mathematical details behind their implementations. For more details please consult the reference list or contact us.
Since VNC screenshots had different sizes and some of them contained multiple screens, preprocessing techniques were first applied in order to normalize the image dataset (since some methods strictly require the images to have the same size). Screenshots with multiple screens were split according to the ratio of the image; we have calculated width/height, and the resulting value was used to determine how many screens had to be split. While not perfect, this method proved to be good enough for the time being. Also, the images were resized to 800 × 600. Furthermore, all images were converted to grayscale, since the color on the VNC screenshots contained no relevant information for the purpose of our study and since the calculations would be much faster this way.
When we first looked to our dataset, there was one thing that stood out - there were a lot of black/gray screens with absolutely no information. So one of the first things that came into our minds was "entropy".
Shannon's entropy , also known as information entropy, describes how much information the image contains and it is calculated by the equation:
High entropy values correspond to great information content. Hence, images with no information, i.e., all pixels have the same gray level, have approximately zero entropy.
With this in mind, our approach for distinguishing the VNC screenshots of interest was based on the assumption that images with low entropy values had no interest and could be discarded.
Entropy ≈ 0.00
To evaluate the performance of this method, we randomly selected 100 of our VNC screenshots, which contained 28 black/gray screens. Then, we've calculated a value of entropy for each image. For an entropy value lower than a 0.1 threshold, we were able to select and discard all black/gray screens.
However, this approach had obvious limitations. One example is that the Microsoft Windows lock screens had great entropy values, since the image contained a lot of detail, however, these were of no value for our purposes.
Entropy ≈ 6.16
We quickly concluded that this approach could only be used as a "preliminary selection" of non-relevant VNC screenshots.
Structural Similarity Index
Following the previous results, we definitely needed a more robust method that was able to compare the structural information between images. After some research, we have found a promising comparison method called Structural Similarity (SSIM) index .
SSIM is a widely used method for image quality assessment that perceives the change in structural information and it can be used for measuring the similarity between images. This index is a combination of three parameters, namely the brightness, the contrast and the correlation structure of the two images and is calculated by:
SSIM values range between 0 and 1, where 0 means no similarity and 1 corresponds to completely identical images.
Below are presented some examples of the application of this index using our VNC screenshots.
SSIM ≈ 0.89
SSIM ≈ 0.77
As one can observe, the first pair of images has a higher similarity value than the second pair. However, the second pair also returned a high SSIM value, regardless of the clear differences between the two images. This makes it difficult to find an optimal threshold.
In an attempt to find this optimal threshold of similarity, the following test was performed:
100 VNC screenshots were randomly selected from our dataset. This new subset included 22 black screens and 15 locked screens.
The SSIM method was applied using different threshold values of: 0.75, 0.80, 0.85, 0.90 and 0.95.
This test was performed twice, using a black screen and a locked screen as reference images.
The results obtained are shown in the plot below:
Ideally, we would find a threshold value where the percentage of true positives (images correctly classified as similar) is maximal and at the same time the percentage of false positives (dissimilar images classified as similar) is minimum.
As one can observe, the percentage of both true and false positives decreases for high values of SSIM. There is not a perfect value, but we can argue that the best trade-off is achieved for a threshold of 0.85.
Despite its simplicity, the SSIM approach proved to be a very interesting method to address the VNC screenshots similarity challenge as it returned satisfying results.
Nevertheless, the accuracy obtained still had room for improvement, which meant to us that we could investigate more sophisticated algorithms.
One of the most used algorithms for finding similar items is the K-Nearest Neighbours (KNN)  due to its simplicity and good results on basic recognition problems. However, it presents a major drawback - it's too computationally expensive for large datasets. For example, in order to find VNC screenshots that are similar to another given screenshot in a dataset, we need to compare each pair individually. If we think that a dataset can contain millions of images, it's easy to understand why this approach is impractical.
Luckily, there is an algorithm that overcomes this drawback - Locality-Sensitive Hashing (LSH) . LSH is a method for KNN search in high dimensional spaces that can drastically reduce the computational time by approximating the similarity search problem.
The detailed principles behind the LSH method and its inherent mathematics are beyond the scope of this article, as such, only a general explanation of the mechanics of the algorithm will be provided.
Instead of comparing all pairs of images in a dataset, items are hashed into buckets, such that similar items will be more likely to hash into the same buckets. Hence, only items within a certain bucket will be compared reducing the number of comparisons needed.
The main steps behind LSH are the following:
Project the data into a 1-dimensional binary space - Hamming space. Each image is mapped to a bit vector called the Hash Key. Close objects tend to be in the same bucket:
Similar images should map to nearby codes
Dissimilar images map to distant codes
- Find k codes with k smallest Hamming distances from a given query
LSH uses Hamming distances (the distance between bit vectors X and Y is equal to the number of non-matching bits) due to its low computational cost. Also, binary codes are storage efficient.
When a new query is given, i.e. when we want to find similar images to a given screenshot in the dataset, the LSH algorithm precomputes the hash code of that screenshot and then quickly compares it to the other precomputed hash codes to determine if they are close to the hash code of our query.
Representation of LSH schema:
We used an algorithm called LSH Forest  that automatically performs tuning of LSH parameters that depend on the data set, hence avoiding manual tuning.
Note that there are a lot of parameters that can - and should be - optimised to improve the accuracy of LSH.
Regarding KNN, there are two approaches to return the nearest neighbours: use the radius (distance to nearest neighbours) or the number of neighbours. For both, we need to have an idea of the range of these values a priori and this can be tricky. In a real dataset containing million of items, it's near impossible to determine the optimal k-value - we would have to know in advance how many similar images we expect to find. The same true holds to the radius value - it depends on the dataset. The choice of these the parameters is very critical since they directly determine how well the algorithm will perform.
Since there is no magic formula to obtain the optimal values for these parameters, the most simple way is to use a "trial and error" approach - run KNN within a reasonable range of values and choose that which led to the best accuracy.
Here is an example of a result obtained with our screenshots. We want the 10 most similar images, that is, the 10 nearest neighbours of the image on the left. The results obtained are shown on the right. As one can observe, the results are very satisfactory, since the images are truly similar.
The LSH method proved to be a very robust and efficient approach for our similarity problem. However, the choice of the KNN parameters is not trivial and using the "trial and error" approach could be a very time-consuming process. Ultimately we would like to obtain a system that doesn't have to be reconfigured on a case by case basis each time we want to search for a different type of screenshot.
Conclusion and further work
On this blogpost we have only presented a very small range of possible approaches for our VNC screenshots similarity problem - there are certainly other methods that could lead to very interesting outcomes.
Even the results obtained with the methods presented here might be improved by several factors, for example:
apply other preprocessing techniques to screenshots in order to increase the contrast emphasising the differences between them;
extensively test the impact of all parameters of each method and search for the optimal combination of values.
Other approaches that we believe that are worth trying are:
clustering methods, preferably ones that do not required the number of clusters to be specified a priori;
more general feature extraction methods such as template matching and maximally stable extremal regions (MSER);
application of optical character recognition (OCR) techniques to retrieve more information from the VNC screenshots.
If you would like to keep up to date with our analysis and posts please consider following us on twitter, google+ and facebook.
 C. E. Shannon. "A mathematical theory of communication." Bell Syst. Tech. J., 27:379–423,623–656, 1948.
 Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity" IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
 D. W. Aha, “Special AI review issue on lazy learning" Artificial Intelligence Review, vol. 11, 1997.
 Indyk, Piotr.; Motwani, Rajeev, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality." Proceedings of 30th Symposium on Theory of Computing. 1998
 M. Bawa, T. Condie and P. Ganesan, "LSH Forest: Self-Tuning Indexes for Similarity Search" WWW ‘05 Proceedings of the 14th international conference on World Wide Web, 651-660, 2005.