This is a Python implementation of the Soundex algorithm.
The program keeps a JSON document, loaded as a dictionary, that maps each Soundex code to the words seen so far. The dictionary is consulted and extended on every run, so it keeps growing from one execution to the next.
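The load, update and save cycle described above can be sketched on its own. This is a minimal Python 3 sketch, not the program's code: the helper names `load_repository`, `remember` and `save_repository` are hypothetical, and the demo writes to a temporary file rather than the real repository.

```python
import json
import os
import tempfile


def load_repository(path):
    """Return the code -> words mapping stored at path, or {} if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path) as fp:
        return json.load(fp)


def remember(repository, code, word):
    """Record word under code unless it is already listed; return that code's words."""
    words = repository.setdefault(code, [])
    if word not in words:
        words.append(word)
    return words


def save_repository(repository, path):
    """Write the mapping back to disk as JSON."""
    with open(path, "w") as fp:
        json.dump(repository, fp)


# Demonstration against a throwaway file (hypothetical path, not the author's).
demo_path = os.path.join(tempfile.mkdtemp(), "repository.txt")

repo = load_repository(demo_path)      # first run: nothing stored yet
remember(repo, "M622", "markus")
save_repository(repo, demo_path)

repo = load_repository(demo_path)      # second run: the earlier entry is back
remember(repo, "M622", "merkus")
remember(repo, "M622", "markus")       # already listed, so ignored
save_repository(repo, demo_path)

print(load_repository(demo_path))      # → {'M622': ['markus', 'merkus']}
```

Keeping the duplicate check inside `remember` means the list for a code never holds the same word twice, however often the program is run.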
Program:
from re import sub
from os import listdir, path
import json

def remove_symbols(input_string):
    # Use a regular expression to remove everything except A-Z
    # (the caller upper-cases the word first)
    return sub('[^A-Z]+', '', input_string)

def clean(input_string):
    # Convert the characters to lower case and then use a
    # regular expression to remove non a-z chars
    return sub('[^a-z]+', '', input_string.lower())

def soundex(word):
    # Step 1: Capitalize all letters in the word and drop all punctuation marks.
    word = remove_symbols(word.upper())
    # Step 2: Retain the first letter of the word.
    first_letter = word[0]
    word = word[1:]
    # Step 3 & 4: Change ('A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y') to 0,
    # ('B', 'F', 'P', 'V') => 1, ('C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z') => 2,
    # ('D', 'T') => 3, ('L') => 4, ('M', 'N') => 5 and ('R') => 6
    pre = ['[AEIOUWHY]', '[BFPV]', '[CGJKQSXZ]', '[DT]', '[L]', '[MN]', '[R]']
    post = ['0', '1', '2', '3', '4', '5', '6']
    for find, replace in zip(pre, post):
        word = sub(find, replace, word)
    # Step 5: Collapse each run of identical adjacent digits from step 4
    # into a single digit (a digit is kept only if it differs from the one before it).
    new_word = ""
    for i in range(len(word)):
        if i == 0 or word[i] != word[i-1]:
            new_word += word[i]
    # Step 6: Remove all zeros from the string that results from step 5
    # (placed there in step 3), retaining the first character as well.
    word = first_letter + sub('0', '', new_word)
    # Step 7: Pad the string that resulted from step 6 with trailing zeros and
    # return only the first four positions, which will be of the form
    # <uppercase letter> <digit> <digit> <digit>
    length = len(word)
    if length >= 4:
        word = word[:4]
    else:
        word = word + ("0" * (4 - length))
    return word

# Load the repository of codes and words built up by earlier runs
fp = open("D:\\Vikram Projects\\Eclipse Workspace\\Soundex Algorithm\\repository.txt")
dic = json.load(fp)
fp.close()

# Read every file in the data folder into a list of cleaned words
file_words = {}
files = []
for fle in listdir("data"):
    # Ignore the ~ and . i.e. hidden / system files
    if fle.startswith('~') or fle.startswith('.'):
        continue
    f = open(path.join("data", fle))
    file_words[fle] = list()
    for line in f.readlines():
        file_words[fle] += map(clean, line.split())
    f.close()
    files.append(fle)

word = raw_input("Enter a word: ").lower()
code = soundex(word)
if code in dic:
    print word + " has following similar words: "
    print dic[code]
else:
    print word + " has no phonetically similar words."
    dic[code] = list()

# Add the word to the repository if it is not already listed
if word not in dic[code]:
    dic[code].append(word)

print "And it's present in following files: "
for fle in files:
    for similar in dic[code]:
        if str(similar) in file_words[fle]:
            print similar + " is found in --> " + fle

# Write the updated dictionary back to the repository
fp = open("repository.txt", 'w')
json.dump(dic, fp)
fp.close()
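For comparison, the seven steps in the comments can be condensed into one short function. This is a sketch written for Python 3 (the listing uses Python 2 syntax), and `soundex_sketch`, `_GROUPS` and `_DIGIT` are illustrative names that do not appear in the program. It assumes that step 5 means collapsing every run of equal adjacent digits into one digit, and that input without any letters yields an empty string.

```python
import re
from itertools import groupby

# Letter groups and their digits, as listed in steps 3 and 4.
_GROUPS = {
    "AEIOUHWY": "0",
    "BFPV": "1",
    "CGJKQSXZ": "2",
    "DT": "3",
    "L": "4",
    "MN": "5",
    "R": "6",
}
_DIGIT = {letter: digit for letters, digit in _GROUPS.items() for letter in letters}


def soundex_sketch(word):
    # Step 1: upper-case and keep only A-Z.
    letters = re.sub("[^A-Z]+", "", word.upper())
    if not letters:
        return ""  # assumption: nothing to encode
    # Step 2: keep the first letter aside.
    first, rest = letters[0], letters[1:]
    # Steps 3 and 4: map the remaining letters to digits.
    digits = "".join(_DIGIT[ch] for ch in rest)
    # Step 5: collapse each run of equal digits into one digit.
    collapsed = "".join(key for key, _ in groupby(digits))
    # Step 6: drop the zeros and put the first letter back.
    code = first + collapsed.replace("0", "")
    # Step 7: truncate or pad with zeros to four characters.
    return code[:4].ljust(4, "0")


for name in ("vikram", "veekram", "calpurnia", "brutus"):
    print(name, soundex_sketch(name))
# → vikram V265
# → veekram V265
# → calpurnia C416
# → brutus B632
```

Like the listing, the sketch codes the first letter separately from the rest and treats H, W and Y as ordinary vowels, so its results can differ from other Soundex variants.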
------------------------------------------------------------------------------
Repository.txt
{"C416": ["calpurnia", "calpoornia", "calpornia"], "V265": ["vikrant", "vikramjeet", "veekram", "vikram"], "A123": ["abheejit"], "V220": ["vishakha"], "M622": ["markus", "markoos", "merkus"], "J310": ["jaydeep", "jaydip"], "V240": ["vishal"], "V625": ["virkam"], "V230": ["viksto"], "B632": ["brutus", "brutoos"]}

Input Files:
F1.txt
Brutus killed calpurnia, brutoos the evil brother of calpornia, avenged her death. vikram gets angry when called veekram.
F2.txt
Markus is the step brother of brutus. Brutus and markus are each others good friends. calpoornia, is also a friend.
Vikram is a friend of Markus.
F3.txt
brutus and markus were classmates but they changed roads after college. They do not have any similar interests now. brutoos is a butcher and merkus weaves.
veekram and markus were also classmates but they rarely spoke.

Output
Enter a word: merkus
merkus has following similar words:
[u'markus', u'markoos']
And it's present in following files:
markus is found in --> f2.txt
markus is found in --> f3.txt
merkus is found in --> f3.txt
---------------------------------------------------------------------
Enter a word: vikram
vikram has following similar words:
[u'vikrant', u'vikramjeet', u'veekram', u'vikram']
And it's present in following files:
veekram is found in --> f1.txt
vikram is found in --> f1.txt
vikram is found in --> f2.txt
veekram is found in --> f3.txt
---------------------------------------------------------------------
Enter a word: santiago
santiago has no phonetically similar words.
And it's present in following files:
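The per-file lookup behind the "is found in" lines can be tried without any files on disk. The sketch below is written for Python 3 and holds one sentence from each input file in memory. `SAMPLE_FILES`, `build_index` and `find_in_files` are hypothetical names, and a set of words per file is used here instead of a list, which is an assumption made to keep membership tests cheap.

```python
import re

# One sentence taken from each input file, kept in memory for the sketch.
SAMPLE_FILES = {
    "f1.txt": "vikram gets angry when called veekram.",
    "f2.txt": "Vikram is a friend of Markus.",
    "f3.txt": "veekram and markus were also classmates but they rarely spoke.",
}


def clean(token):
    """Lower-case the token and strip everything except a-z."""
    return re.sub("[^a-z]+", "", token.lower())


def build_index(files):
    """Map each file name to the set of cleaned words it contains."""
    return {name: {clean(t) for t in text.split()} for name, text in files.items()}


def find_in_files(words, index):
    """Return (word, file name) pairs, file by file, for every word that occurs."""
    return [
        (word, name)
        for name in sorted(index)
        for word in words
        if word in index[name]
    ]


index = build_index(SAMPLE_FILES)
similar_words = ["vikrant", "vikramjeet", "veekram", "vikram"]
for word, name in find_in_files(similar_words, index):
    print(word + " is found in --> " + name)
# → veekram is found in --> f1.txt
# → vikram is found in --> f1.txt
# → vikram is found in --> f2.txt
# → veekram is found in --> f3.txt
```

Because `clean` lower-cases each token and strips punctuation, "Vikram" and "veekram." in the texts still match the stored spellings.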