python 및 머신러닝 교육, 슬로우캠퍼스

Damerau-Levenshtein Distance (Edit distance) 구하기

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 18. 19:36

share this post

오타 수정 (misspell correction) 등에 활용할 수 있는 문자열 비교 알고리즘 (string distance metric)이 있다.

일반적으로 어떤 문자열 A에서 몇 자(character)를 수정(delete, insert, substitute, transpose)하여

B가 되는가를 숫자로 표현한 것을 A와B의 Edit distance라고 부른다.

2가지 metric 차이점

* transpose는 두개의 인접한 문자를 서로 바꾸는 것이다. ex) lettre --> letter

L-distance: insert,delete,substitute

DL-distance : insert,delete,substitute,transpose

Damerau-Levenshtein Distance is a metric for measuring how far two given strings are, in terms of 4 basic operations:

deletion
insertion
substitution
transposition

Levenshtein Distance

deletion
insertion
substitution

아래는 어떤 문자열 A에 대해 여러 다른 문자들과 DL distance를 구하고자 할때

약간 시간을 단축할 수 있도록 class를 이용하는 것이다. class constructor에서 array값을 미리 채워 놓는다.

"""

Compute the Damerau-Levenshtein distance between two given

strings (s1 and s2)

from http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

"""

def damerau_levenshtein_distance(s1, s2):

d = {}

lenstr1 = len(s1)

lenstr2 = len(s2)

for i in xrange(-1,lenstr1+1):

d[(i,-1)] = i+1

for j in xrange(-1,lenstr2+1):

d[(-1,j)] = j+1

for i in xrange(lenstr1):

for j in xrange(lenstr2):

if s1[i] == s2[j]:

cost = 0

else:

cost = 1

d[(i,j)] = min(

d[(i-1,j)] + 1, # deletion

d[(i,j-1)] + 1, # insertion

d[(i-1,j-1)] + cost, # substitution

)

if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:

d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition

return d[lenstr1-1,lenstr2-1]

"""

Compute the Damerau-Levenshtein distance between a given string s1

and many other strings s2.

"""

class DLDistance:

def __init__(self, s1):

self.s1 = s1

self.d = {}

self.lenstr1 = len(self.s1)

for i in xrange(-1,self.lenstr1+1):

self.d[(i,-1)] = i+1

def distance(self, s2):

lenstr2 = len(s2)

for j in xrange(-1,lenstr2+1):

self.d[(-1,j)] = j+1

for i in xrange(self.lenstr1):

for j in xrange(lenstr2):

if self.s1[i] == s2[j]:

cost = 0

else:

cost = 1

self.d[(i,j)] = min(

self.d[(i-1,j)] + 1, # deletion

self.d[(i,j-1)] + 1, # insertion

self.d[(i-1,j-1)] + cost, # substitution

)

if i and j and self.s1[i]==s2[j-1] and self.s1[i-1] == s2[j]:

self.d[(i,j)] = min (self.d[(i,j)], self.d[i-2,j-2] + cost) # transposition

return self.d[self.lenstr1-1,lenstr2-1]

def hamming_distance(s1, s2):

assert len(s1) == len(s2)

return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def main():

base = u'whatsapp'

cmpstrs = [u'whatapp', u'what\'app', u'whatu', u'whoisthis']

dl = DLDistance(base)

for s in cmpstrs:

print damerau_levenshtein_distance(base, s)

print dl.distance(s)

print hamming_distance(u'whatsapp', u'whatapps')

if __name__ == '__main__':

main()

저작자표시 비영리 변경금지 (새창열림)

'Data/Text/Knowledge Analysis & Mining > Python' 카테고리의 다른 글

[python] gzip 압축 파일 읽기/쓰기 (utf-8 인코딩) (0)	2013.03.22
[python] 나눗셈 구현 (divide operator with bit-operations) (0)	2013.03.22
[python] utf-8로 stdin 및 stdout 입출력 하기 (0)	2013.03.18
[python] dict merge (0)	2013.03.11
best 최고 python IDE - PyCharm (0)	2013.02.14

WRITTEN BY

: manager@
Data Analysis, Text/Knowledge Mining, Python, Cloud Computing, Platform

python 및 머신러닝 교육, 슬로우캠퍼스

Damerau-Levenshtein Distance (Edit distance) 구하기

'Data/Text/Knowledge Analysis & Mining > Python' 카테고리의 다른 글

티스토리툴바