
Python program to print inverted index.


This is a Python program that prints an inverted index for the files provided to it.
 
Building an inverted index is one of the most important tasks performed by an information retrieval system.
 
The output and the files used are included below.

 
 
from os import listdir, path

input_path = "input"

all_files = []     # one {word: [positions]} dictionary per file
filenamelist = []  # file names without extension, parallel to all_files
for f in listdir(input_path):
    # Ignore hidden and temporary files
    if f.startswith('.') or f.startswith('~'):
        continue
    with open(path.join(input_path, f)) as file_handle:
        word_list = file_handle.read().split()

    # Populate the dictionary for this file.
    # Word positions start at 1, so that the first word is at position 1.
    positions = {}
    for pos, word in enumerate(word_list, start=1):
        # Create the list on first sight of the word, then record the position
        positions.setdefault(word, []).append(pos)

    all_files.append(positions)
    filenamelist.append(path.splitext(f)[0])  # drop the extension, e.g. ".txt"

print("-" * 130)
print("Word\t\tInverted Index")
print("-" * 130)

for j in range(len(all_files)):
    # Consume every item of this file's dictionary
    while all_files[j]:
        word, posting_list = all_files[j].popitem()
        output_string = "["
        output_string += "<{0}, {1}, {2}>".format(filenamelist[j], len(posting_list), posting_list)
        # Append the postings from the other files and remove the word there,
        # so that each word is printed only once
        for k in range(len(all_files)):
            if word in all_files[k]:  # the word is the key in the dictionary
                postings = all_files[k].pop(word)
                output_string += "<{0}, {1}, {2}>".format(filenamelist[k], len(postings), postings)
        output_string += "]"
        print("%-15s\t%s" % (word, output_string))
 
---------------------------------------------------------------------
                                OUTPUT: 
---------------------------------------------------------------------
Word        Inverted Index
-----------------------------------------------------------------------------
text              [<history,4, [132, 186, 210, 217]><overview, 1, [90]>]
Set.              [<history,1, [99]>]
stored            [<history,1, [55]><overview, 1, [106]>]
Department        [<history, 1, [162]>]
web               [<history,1, [232]><overview, 1, [27]>]
We                [<history,1, [19]>]
group             [<history,1, [109]>]
had               [<history,1, [124]>]
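
Each row of the output pairs a word with a list of <file, frequency, positions> entries. The following is a minimal sketch of the same structure built in memory, so that it can be queried rather than only printed. It assumes the documents are passed in as a dictionary of name to text instead of being read from the "input" directory, and the names build_inverted_index and lookup are illustrative, not part of the original program.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {document name: [1-based positions]}.

    `documents` is a hypothetical dict of {name: text}; the program
    above reads files from the "input" directory instead.
    """
    index = defaultdict(dict)
    for name, text in documents.items():
        # Same tokenization as the program above: split on whitespace
        for pos, word in enumerate(text.split(), start=1):
            index[word].setdefault(name, []).append(pos)
    return dict(index)


def lookup(index, word):
    """Return (name, frequency, positions) tuples for a word, sorted by name."""
    postings = index.get(word, {})
    return [(name, len(p), p) for name, p in sorted(postings.items())]


# Two made-up sample documents, for illustration only
docs = {
    "history": "retrieval systems came into use",
    "overview": "most systems rank the objects",
}
index = build_inverted_index(docs)
print(lookup(index, "systems"))
# prints [('history', 1, [2]), ('overview', 1, [2])]
```

Keeping one dictionary keyed by word avoids the pop-and-merge pass over the per-file dictionaries, at the cost of holding all postings in a single structure.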




Files Used (text taken from the wikipedia.org article on information retrieval):
1. History.txt
The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945.[5] It would appear that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that searched for documents stored on film.[6] The first description of a computer searching for information was described by Holmstrom in 1948,[7] detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy, Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents).[5] Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
2. Introduction.txt
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
 
 
3. Overview.txt
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application the data objects may be, for example, text documents, images,[1] audio,[2] mind maps[3] or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.[4]
