Semantic hashing is a method to map documents to a code e.g., 32-bit memory address so documents with semantically closed content will be mapped to close addresses. This method can be used to implement an information retrieval (IR) system where the query will be a document and search results will contain documents with similar content (semantics). This method was published by G.Hinton in this paper.
For searching – the query string is treated as a document i.e., it’s word-count vector is passed through the encoder to get its matching address. Now, the search result will be the documents that are stored in the query address and also the documents that are stored in close addresses. for example, close address can be addresses that are different up to 4 bits (also known as Hamming distance) from the original address.
One of the common methods in IR systems today is the tf-idf indexing technique. each document is indexed so each term in the corpus is pointing to a list of documents that contains this term. Basic searching is done by looking up the list of documents that matches each term in the search query and intersect those list leaving document contains all terms in the query. The disadvantage of this method is that search time is affected by the number of terms in the query. In contrast- semantic hashing gets the list of relevant documents in a single lookup and so is not affected by query size.
Another method in IR systems is latent-semantic-analysis (LSA). Documents are mapped to word-count-vectors, then the dimension of the vectors is reduced using SVD method. Search is done by mapping the query document to word-count-vector, then reduce it’s dimension and measure the angle between the document vector to all the vectors of the corpus documents. The disadvantage of this method is that search time is linearly depends on the size of the corpus. In contrast- semantic hashing is only affected by the size of the documents list i.e., larger corpus means more collision in memory address mappings and longer document lists. The size of the lists does not increase linearly and the documents are spread across the memory addresses.
Below, are results from the paper where it can be seen that the search quality of this technique is similar to tf-idf (which is considered state of the art for IR systems). The axis are recall and precision values of the systems. The tests included comparison of latent-semantic (LSA) system and semantic-hashing (graph on the left). Also a comparison of LSA and tf-idf and semantic-hashing that was followed by tf-idf filtering (graph on the right). It can be seen in the right graph that semantic-hashing with tf-idf filtering was close to tf-idf which shows that semantic hashing return documents that are similar to what tf-idf method would return – this basically shows that semantic-hashing is working 🙂