Python program to print inverted index.


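This Python program builds and prints an inverted index for the files supplied to it.

An inverted index maps each word to the files, and the positions within those files, where the word occurs. Building one is among the most important tasks performed by an Information Retrieval system.

The output and the input files used are included below.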

 
 
# Written for Python 3
from os import listdir, path

input_path = "input"

all_files = []      # one {word: [positions]} dictionary per file
filenamelist = []   # file names without extension, parallel to all_files
for f in listdir(input_path):
    # Ignore hidden and temporary files
    if f.startswith('.') or f.startswith('~'):
        continue
    with open(path.join(input_path, f)) as file_handle:
        word_list = file_handle.read().split()

    # Populate the dictionary for this file.
    # Word positions start at 1, so that the first word gets position 1.
    positions = {}
    for pos, word in enumerate(word_list, start=1):
        # Create the position list on first sight of the word, then append
        positions.setdefault(word, []).append(pos)

    all_files.append(positions)
    # Strip the trailing extension (e.g. ".txt") from the file name
    filenamelist.append(path.splitext(f)[0])

print("-" * 130)
print("Word\t\tInverted Index")
print("-" * 130)

for j in range(len(all_files)):
    # Consume every remaining word of file j
    while all_files[j]:
        word, posting_list = all_files[j].popitem()
        output_string = "["
        output_string += "<{0}, {1}, {2}>".format(filenamelist[j], len(posting_list), posting_list)
        # Collect the postings of the same word from the other files
        for k in range(len(all_files)):
            if word in all_files[k]:
                # pop() removes the word so it is not printed again for file k
                postings = all_files[k].pop(word)
                output_string += "<{0}, {1}, {2}>".format(filenamelist[k], len(postings), postings)
        output_string += "]"
        print("%-15s\t%s" % (word, output_string))
 
---------------------------------------------------------------------
                                OUTPUT: 
---------------------------------------------------------------------
Word        Inverted Index
-----------------------------------------------------------------------------
text              [<history,4, [132, 186, 210, 217]><overview, 1, [90]>]
Set.              [<history,1, [99]>]
stored            [<history,1, [55]><overview, 1, [106]>]
Department        [<history, 1, [162]>]
web               [<history,1, [232]><overview, 1, [27]>]
We                [<history,1, [19]>]
group             [<history,1, [109]>]
had               [<history,1, [124]>]
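
The program above keeps one dictionary per file and merges them while printing. A minimal sketch of an alternative layout follows, assuming the documents are already in memory: it builds a single dictionary keyed by word, so each posting list can be looked up directly. The names build_inverted_index, format_postings and the two sample documents are hypothetical and are not part of the original program.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {document name: [1-based word positions]}.

    `documents` is a hypothetical in-memory stand-in for the input
    directory: a dict of document name -> text.
    """
    index = defaultdict(dict)
    for name, text in documents.items():
        for pos, word in enumerate(text.split(), start=1):
            index[word].setdefault(name, []).append(pos)
    return dict(index)


def format_postings(postings):
    """Render postings as [<name, frequency, positions>...]."""
    return "[" + "".join(
        "<{0}, {1}, {2}>".format(name, len(positions), positions)
        for name, positions in sorted(postings.items())
    ) + "]"


# Hypothetical sample documents, used only for illustration
docs = {
    "a": "web search engines index the web",
    "b": "the web is large",
}
index = build_inverted_index(docs)
for word in sorted(index):
    print("%-15s\t%s" % (word, format_postings(index[word])))
```

With this layout, index["web"] returns every file and position for the word in one lookup, at the cost of holding the merged index in memory.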




Files Used (text taken from the wikipedia.org article on information retrieval):
1. History.txt
The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945.[5] It would appear that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that searched for documents stored on film.[6] The first description of a computer searching for information was described by Holmstrom in 1948,[7] detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy, Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents).[5] Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
2. Introduction.txt
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
 
 
3. Overview.txt
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application the data objects may be, for example, text documents, images,[1] audio,[2] mind maps[3] or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.[4]
