Damerau-Levenshtein Distance (Edit distance) 구하기
Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 18. 19:36오타 수정 (misspell correction) 등에 활용할 수 있는 문자열 비교 알고리즘 (string distance metric)이 있다.
일반적으로 어떤 문자열 A에서 몇 자(character)를 수정(delete, insert, substitute, transpose)하여
B가 되는가를 숫자로 표현한 것을 A와B의 Edit distance라고 부른다.
2가지 metric 차이점
* transpose는 두개의 인접한 문자를 서로 바꾸는 것이다. ex) lettre --> letter
L-distance: insert,delete,substitute
DL-distance : insert,delete,substitute,transpose
Damerau-Levenshtein Distance is a metric for measuring how far two given strings are, in terms of 4 basic operations:
- deletion
- insertion
- substitution
- transposition
Levenshtein Distance
- deletion
- insertion
- substitution
아래는 어떤 문자열 A에 대해 여러 다른 문자들과 DL distance를 구하고자 할때
약간 시간을 단축할 수 있도록 class를 이용하는 것이다. class constructor에서 array값을 미리 채워 놓는다.
"""
Compute the Damerau-Levenshtein distance between two given
strings (s1 and s2)
from http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/
"""
def damerau_levenshtein_distance(s1, s2):
d = {}
lenstr1 = len(s1)
lenstr2 = len(s2)
for i in xrange(-1,lenstr1+1):
d[(i,-1)] = i+1
for j in xrange(-1,lenstr2+1):
d[(-1,j)] = j+1
for i in xrange(lenstr1):
for j in xrange(lenstr2):
if s1[i] == s2[j]:
cost = 0
else:
cost = 1
d[(i,j)] = min(
d[(i-1,j)] + 1, # deletion
d[(i,j-1)] + 1, # insertion
d[(i-1,j-1)] + cost, # substitution
)
if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
return d[lenstr1-1,lenstr2-1]
"""
Compute the Damerau-Levenshtein distance between a given string s1
and many other strings s2.
"""
class DLDistance:
def __init__(self, s1):
self.s1 = s1
self.d = {}
self.lenstr1 = len(self.s1)
for i in xrange(-1,self.lenstr1+1):
self.d[(i,-1)] = i+1
def distance(self, s2):
lenstr2 = len(s2)
for j in xrange(-1,lenstr2+1):
self.d[(-1,j)] = j+1
for i in xrange(self.lenstr1):
for j in xrange(lenstr2):
if self.s1[i] == s2[j]:
cost = 0
else:
cost = 1
self.d[(i,j)] = min(
self.d[(i-1,j)] + 1, # deletion
self.d[(i,j-1)] + 1, # insertion
self.d[(i-1,j-1)] + cost, # substitution
)
if i and j and self.s1[i]==s2[j-1] and self.s1[i-1] == s2[j]:
self.d[(i,j)] = min (self.d[(i,j)], self.d[i-2,j-2] + cost) # transposition
return self.d[self.lenstr1-1,lenstr2-1]
def hamming_distance(s1, s2):
assert len(s1) == len(s2)
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
def main():
base = u'whatsapp'
cmpstrs = [u'whatapp', u'what\'app', u'whatu', u'whoisthis']
dl = DLDistance(base)
for s in cmpstrs:
print damerau_levenshtein_distance(base, s)
print dl.distance(s)
print hamming_distance(u'whatsapp', u'whatapps')
if __name__ == '__main__':
main()
'Data/Text/Knowledge Analysis & Mining > Python' 카테고리의 다른 글
[python] gzip 압축 파일 읽기/쓰기 (utf-8 인코딩) (0) | 2013.03.22 |
---|---|
[python] 나눗셈 구현 (divide operator with bit-operations) (0) | 2013.03.22 |
[python] utf-8로 stdin 및 stdout 입출력 하기 (0) | 2013.03.18 |
[python] dict merge (0) | 2013.03.11 |
best 최고 python IDE - PyCharm (0) | 2013.02.14 |
WRITTEN BY
- manager@
Data Analysis, Text/Knowledge Mining, Python, Cloud Computing, Platform