This week, the work on the server from CW 10 continued.
An index model has been built from one of the LIRE features (Edge Histogram), using a selected subset of 5 million images.
The released LIRE feature set contains thirteen different visual image descriptors. Due to limited processing resources, only one of them, Edge Histogram, has been chosen for similarity matching in this project.
Edge Histogram is a structure that represents the local edge distribution of an image using 5 types of edges: vertical, horizontal, 45-degree, 135-degree and non-directional. The image is partitioned into 16 sub-images of equal size, and an edge distribution histogram is generated for each sub-image.
The edge histogram descriptor captures the spatial distribution of edges. The distribution of edges is a good texture signature that is useful for image-to-image matching even when the underlying texture is not homogeneous.
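To make the description above concrete, here is a minimal pure-Python sketch of an edge histogram over a 4 x 4 grid of sub-images. It is not the LIRE implementation: the filter coefficients are the ones commonly given in the MPEG-7 edge histogram literature, while the function names, the block size of 2 pixels, and the edge-strength threshold of 11 are assumptions chosen for illustration. The bin quantisation used by the MPEG-7 descriptor is left out.

```python
import math

SQRT2 = math.sqrt(2)

# Filter coefficients for the four sub-blocks of one image block, in the order
# top-left, top-right, bottom-left, bottom-right.
EDGE_FILTERS = [
    ("vertical", (1, -1, 1, -1)),
    ("horizontal", (1, 1, -1, -1)),
    ("45_degree", (SQRT2, 0, 0, -SQRT2)),
    ("135_degree", (0, SQRT2, -SQRT2, 0)),
    ("non_directional", (2, -2, -2, 2)),
]


def block_means(image, top, left, half):
    """Mean intensity of the four half x half sub-blocks of one image block."""
    def mean(r0, c0):
        total = sum(image[r][c]
                    for r in range(r0, r0 + half)
                    for c in range(c0, c0 + half))
        return total / (half * half)

    return (mean(top, left), mean(top, left + half),
            mean(top + half, left), mean(top + half, left + half))


def edge_histogram(image, block=2, threshold=11.0):
    """Return 80 values: 16 sub-images x 5 edge types.

    `image` is a 2D list of grayscale intensities. `block` (even) and
    `threshold` are illustrative assumptions, not values from this project.
    """
    height, width = len(image), len(image[0])
    half = block // 2
    counts = [[0] * len(EDGE_FILTERS) for _ in range(16)]
    blocks_per_cell = [0] * 16

    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            means = block_means(image, top, left, half)
            strengths = [abs(sum(m * f for m, f in zip(means, coeffs)))
                         for _, coeffs in EDGE_FILTERS]
            # Assign the block to one of the 16 sub-images by its top-left corner.
            cell = (top * 4 // height) * 4 + (left * 4 // width)
            blocks_per_cell[cell] += 1
            # Count the block under its strongest edge type, if strong enough.
            strongest = max(range(len(strengths)), key=strengths.__getitem__)
            if strengths[strongest] >= threshold:
                counts[cell][strongest] += 1

    # Normalise each bin by the number of blocks in its sub-image.
    return [counts[cell][edge] / blocks_per_cell[cell] if blocks_per_cell[cell] else 0.0
            for cell in range(16)
            for edge in range(len(EDGE_FILTERS))]
```

For an image made of alternating dark and bright pixel columns, every block contains a vertical edge, so only the vertical bin of each sub-image is non-zero.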
Previous work shows that Edge Histogram achieves relatively good results in content-based image retrieval tasks. For this reason, Edge Histogram has been selected for this work. As previously discussed, recent state-of-the-art research shows that AlexNet features achieve better performance in content-based image retrieval tasks. However, this feature set will not be available during the time of this project.
Since the aim of this project is to prove the concept of auto tagging using a large reference image dataset, selecting and evaluating an optimal set of features is out of the scope of this work.
The LIRE features for the full 100 million image dataset are released in 9,921 compressed files (tar.gz). To build the index model for the selected subset, the following steps were taken:
- Get Edge Histogram feature from feature dataset:
Decompress the 9,921 compressed LIRE feature files released by Yahoo! and gather from them the Edge Histogram feature, on which the similarity matching will be based. To do this, a Python program has been written that loads the compressed feature files one by one, extracts the Edge Histogram feature and appends it to the feature output file. The extraction process takes approximately 9.5 hours. The final file containing the Edge Histogram feature for 100 million images is 20.3 GB.
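The extraction loop described above could look roughly like the following sketch. The on-disk layout of the feature files is not described here, so the layout used below is purely an assumption for illustration: one image per line, tab-separated, with the feature ID first and each remaining field written as `name:values`. The function name `extract_feature` and the feature name passed to it are likewise hypothetical.

```python
import io
import tarfile


def extract_feature(archive_paths, feature_name, output_path):
    """Append one named feature per image to output_path; return lines written.

    Assumed (hypothetical) line layout inside each archive member:
        <feature_id>\t<name>:<values>\t<name>:<values>...
    """
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in archive_paths:
            # Each archive is opened, scanned and closed before the next one,
            # so only one compressed file is held open at a time.
            with tarfile.open(path, "r:gz") as archive:
                for member in archive:
                    if not member.isfile():
                        continue
                    raw = archive.extractfile(member)
                    for line in io.TextIOWrapper(raw, encoding="utf-8"):
                        feature_id, *fields = line.rstrip("\n").split("\t")
                        for field in fields:
                            name, _, values = field.partition(":")
                            if name == feature_name:
                                out.write(f"{feature_id}\t{values}\n")
                                written += 1
                                break
    return written
```

Streaming line by line keeps memory use flat regardless of archive size, which matters when the combined output is tens of gigabytes.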
- Mapping entries in image meta dataset to entries in feature dataset:
The metadata released by Yahoo! contains the image ID and other information for the 100 million images. The feature files, however, identify each image by an MD5-hashed feature ID. To link the feature data with the metadata, a Python dictionary (key: image ID, value: feature ID) has been built, which returns the feature ID for a given image ID.
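A minimal sketch of such a lookup table is shown below. It assumes that each metadata row carries both the image ID and the corresponding feature ID; the column positions and the function name `build_id_lookup` are hypothetical and would need to be adjusted to the actual metadata layout.

```python
import csv
import io


def build_id_lookup(rows, image_id_col=0, feature_id_col=1):
    """Return a dict mapping image ID -> feature ID.

    `rows` is any iterable of sequences, for example a csv.reader over a
    tab-separated metadata file. The column indices are assumptions.
    """
    lookup = {}
    for row in rows:
        lookup[row[image_id_col]] = row[feature_id_col]
    return lookup


# Usage with a small in-memory stand-in for the metadata file.
sample = io.StringIO("1001\td41d8cd9\tDublin\n1002\t9e107d9d\tLondon\n")
lookup = build_id_lookup(csv.reader(sample, delimiter="\t"))
```

A dictionary gives constant-time lookups on average, at the cost of holding every key and value in memory.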
- Construct feature set of the subset data:
As described in CW10, processing the whole 100 million images requires substantial computing resources, which are only available in a computing cloud. For this project, a subset of 5 million images captured in Ireland and the UK was selected (see CW10). Based on our previous experiment, the index model built from this subset can fit into the RAM of a standard desktop machine.
To get the Edge Histogram features of the subset, a second dictionary has been built, which contains all the feature IDs of the selected subset as keys. Thus, if a given feature ID is in this key set, the feature belongs to the selected subset.
The feature output file created in the first step was then read line by line. Whenever the feature ID of a line is in the subset, that line is saved to a separate subset feature file. The size of the subset feature file is approximately 4.5 GB.
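The filtering pass can be sketched as follows. The sketch assumes each line of the feature file starts with the feature ID followed by a tab, which is an assumption about the file layout, and it uses a set instead of a dictionary because only membership is tested. The function name `filter_subset` is hypothetical.

```python
import io


def filter_subset(feature_lines, subset_ids, out):
    """Copy lines whose leading feature ID is in subset_ids; return the count.

    `feature_lines` is any iterable of text lines (for example an open file),
    and `out` is any object with a write() method.
    """
    kept = 0
    for line in feature_lines:
        feature_id, _, _ = line.partition("\t")
        if feature_id in subset_ids:
            out.write(line)
            kept += 1
    return kept


# Usage with in-memory stand-ins for the full and subset feature files.
full = io.StringIO("aaa\t1 2 3\nbbb\t4 5 6\nccc\t7 8 9\n")
subset_out = io.StringIO()
kept = filter_subset(full, {"aaa", "ccc"}, subset_out)
```

Because the input is consumed one line at a time, the 20.3 GB file never has to be loaded into memory; only the set of subset IDs does.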
- Build ANN Index Model:
At this stage, with image IDs and feature IDs available for the subset, an attempt was made to build an ANNOY index model using the unique image IDs as item identifiers, but it was unsuccessful. Two possible reasons are that the image IDs were not in sorted order, or that they were not consecutive (5 million images extracted from 100 million images).
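One possible workaround, sketched below, is to give each image a dense position (0, 1, 2, ...) and keep a separate table that maps positions back to image IDs. This is a suggestion rather than a confirmed diagnosis of the failure: Annoy's documentation notes that `add_item(i, v)` allocates memory for max(i)+1 items, so large, sparse identifiers can be costly. The function names, the 80-dimensional vector length (16 sub-images x 5 edge types), the Euclidean metric, and the tree count of 10 are assumptions for illustration.

```python
def assign_positions(image_ids):
    """Map each unique image ID to a dense position 0..n-1, and back."""
    position_to_image_id = list(image_ids)
    image_id_to_position = {
        image_id: position
        for position, image_id in enumerate(position_to_image_id)
    }
    return image_id_to_position, position_to_image_id


def build_annoy_index(vectors, dimensions=80, n_trees=10, metric="euclidean"):
    """Build an Annoy index whose item numbers are dense positions.

    `vectors` must be in the same order as the IDs given to assign_positions().
    Requires the third-party `annoy` package.
    """
    from annoy import AnnoyIndex  # imported lazily; not part of the stdlib

    index = AnnoyIndex(dimensions, metric)
    for position, vector in enumerate(vectors):
        index.add_item(position, vector)
    index.build(n_trees)
    return index


# Usage: sparse, unordered image IDs become positions 0, 1, 2.
to_position, to_image_id = assign_positions([9067738157, 42, 5000000])
```

After a query such as `index.get_nns_by_vector(v, 10)`, each returned position can be translated back to an image ID through `to_image_id`.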
The work this week was very challenging because of the number and size of the files needed to create the desired final output. The Python programs took several hours to run and crashed many times. The next step is to set up a database to store the metadata of the dataset. Once the metadata is saved in the database, the integration and testing of the auto tagging system can start.
 Won, Chee Sun, Dong Kwon Park, and Soo-Jun Park. “Efficient use of MPEG-7 edge histogram descriptor.” Etri Journal 24.1 (2002): 23-30 [Online]. Available from: http://etrij.etri.re.kr/etrij/journal/article/article.do?volume=24&issue=1&page=23 [Last Accessed 29 March 2015]
 Manjunath, Bangalore S., et al. “Color and texture descriptors.” Circuits and Systems for Video Technology, IEEE Transactions on 11.6 (2001): 703-715.
 Won, Chee Sun. “Feature extraction and evaluation using edge histogram descriptor in MPEG-7.” Advances in Multimedia Information Processing-PCM 2004. Springer Berlin Heidelberg, 2005. 583-590.