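This is a Python program that prints an inverted index for the files supplied to it.
Building an inverted index is one of the most important tasks performed by every information retrieval system.
For each word, the index lists every file that contains it as <file name, frequency, [word positions]>.
The output and the input files used are included below.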
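from os import listdir, path

input_path = "input"   # directory containing the text files to index
word_list = []
all_files = []         # one {word: [positions]} dictionary per file
i = 0                  # number of files processed so far
filenamelist = []      # file names without extension, parallel to all_files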
for f in listdir(input_path):
    # Ignore hidden and temporary files
    if f.startswith('.') or f.startswith('~'):
        continue
    with open(path.join(input_path, f)) as file_handle:
        word_list = file_handle.read().split()
    # Populate dictionary
    all_files.append({})
    # Start the word positions at 1, so that the first word gets position 1
    pos = 1
    for word in word_list:
        if word in all_files[i]:
            all_files[i][word].append(pos)
        else:
            # Create a list and insert the position
            all_files[i][word] = [pos]
        pos += 1
    i += 1
    filenamelist.append(path.splitext(f)[0])  # strip the extension, e.g. ".txt"
print("-" * 130)
print("Word\t\tInverted Index")
print("-" * 130)
# Iterate over every file, j = 0 .. i-1
for j in range(i):
    # Pop words from this file's dictionary until it is empty
    while all_files[j]:
        word, posting_list = all_files[j].popitem()
        output_string = "["
        output_string += "<{0}, {1}, {2}>".format(filenamelist[j], len(posting_list), posting_list)
        # Collect the postings for the same word from the other files
        for k in range(i):
            if word in all_files[k]:
                fname = filenamelist[k]
                postings = all_files[k][word]  # the word is the dictionary key
                freq = len(postings)
                output_string += "<{0}, {1}, {2}>".format(fname, freq, postings)
                # Remove the word so it is not printed again when file k is processed
                if j != k:
                    all_files[k].pop(word)
        output_string += "]"
        print("%-15s\t%s" % (word, output_string))
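The program above keeps one dictionary per file and merges them while printing. As an alternative, here is a minimal sketch (not the original program) that builds a single dictionary keyed by word in one pass. The function names build_inverted_index and format_postings, and the in-memory docs dictionary that stands in for the input files, are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_name: [1-based positions]}.

    `docs` is a hypothetical dict of document name -> text; it stands in
    for the files read from the input directory.
    """
    index = defaultdict(dict)
    for name, text in docs.items():
        # Positions start at 1, matching the program above.
        for pos, word in enumerate(text.split(), start=1):
            index[word].setdefault(name, []).append(pos)
    return index


def format_postings(postings):
    """Render one word's postings as [<name, freq, [positions]>...]."""
    entries = ["<{0}, {1}, {2}>".format(name, len(positions), positions)
               for name, positions in sorted(postings.items())]
    return "[" + "".join(entries) + "]"


if __name__ == "__main__":
    docs = {"a": "web search engines", "b": "search the web"}
    index = build_inverted_index(docs)
    # Sorting the words makes the output order repeatable.
    for word in sorted(index):
        print("%-15s\t%s" % (word, format_postings(index[word])))
```

Because every word is stored once, no entries have to be popped from other dictionaries while printing, and the words can be printed in sorted order.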
---------------------------------------------------------------------
OUTPUT:
---------------------------------------------------------------------
Word            Inverted Index
-----------------------------------------------------------------------------
text            [<history,4, [132, 186, 210, 217]><overview, 1, [90]>]
Set.            [<history,1, [99]>]
stored          [<history,1, [55]><overview, 1, [106]>]
Department      [<history, 1, [162]>]
web             [<history,1, [232]><overview, 1, [27]>]
We              [<history,1, [19]>]
group           [<history,1, [109]>]
had             [<history,1, [124]>]
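The sample output lists "Set." (with its trailing period) and "We" (capitalized) as index terms, because the program splits the text on whitespace only. One possible refinement, which is not part of the original program, is to normalize each token before indexing it. The sketch below assumes a hypothetical helper named normalized_tokens; it keeps the original 1-based positions so the numbers stay comparable with the output above.

```python
import string


def normalized_tokens(text):
    """Yield (position, token) pairs with the original 1-based positions.

    Each token is lowercased and stripped of leading and trailing
    ASCII punctuation.
    """
    for pos, raw in enumerate(text.split(), start=1):
        token = raw.strip(string.punctuation).lower()
        if token:  # skip tokens that were punctuation only, e.g. "-"
            yield pos, token


if __name__ == "__main__":
    for pos, token in normalized_tokens("Desk Set. In the 1960s, - We"):
        print("%d\t%s" % (pos, token))
```

With this change "Set." and "set" would share one index entry. Citation markers attached to a word, such as the "[5]" in "1945.[5]", are only partly stripped by this sketch and would need separate handling.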
Files used (text taken from the wikipedia.org article on information retrieval):
1. History.txt
The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945.[5] It would appear that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that
searched for documents stored on film.[6] The first description of a computer searching for information was described by Holmstrom in 1948,[7] detailing an
early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy,
Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different
retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents).[5] Large-scale
retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense along with the
National Institute of Standards and Technology (NIST), cosponsored the Text
Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of
this was to look into the information retrieval community by supplying the
infrastructure that was needed for evaluation of text retrieval methodologies
on a very large text collection. This catalyzed research on methods that scale
to huge corpora. The introduction of web search engines has boosted the need
for very large scale retrieval systems even further.
2. Introduction.txt
Information retrieval (IR) is the activity of obtaining
information resources relevant to an information need from a collection of
information resources. Searches can be based on metadata or on full-text (or
other content-based) indexing.
Automated information retrieval systems are used to reduce
what has been called "information overload". Many universities and
public libraries use IR systems to provide access to books, journals and other
documents. Web search engines are the most visible IR applications.
3. Overview.txt
An information
retrieval process begins when a user enters a query into the system. Queries
are formal statements of information needs, for example search strings in web
search engines. In information retrieval a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query,
perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are
matched against the database information. Depending on the application the data
objects may be, for example, text documents, images,[1] audio,[2] mind maps[3]
or videos. Often the documents themselves are not kept or stored directly in
the IR system, but are instead represented in the system by document surrogates
or metadata.
Most IR systems compute a numeric score on how well each object in the database matches
the query, and rank the objects according to this value. The top ranking
objects are then shown to the user. The process may then be iterated if the
user wishes to refine the query.[4]