python 및 머신러닝 교육, 슬로우캠퍼스


오타 수정 (misspell correction) 등에 활용할 수 있는 문자열 비교 알고리즘 (string distance metric)이 있다.


일반적으로 어떤 문자열 A에서 몇 자(character)를 수정(delete, insert, substitute, transpose)하여

B가 되는가를 숫자로 표현한 것을 A와B의 Edit distance라고 부른다.


2가지 metric 차이점

* transpose는 두개의 인접한 문자를 서로 바꾸는 것이다.  ex) lettre --> letter


L-distance: insert,delete,substitute

DL-distance : insert,delete,substitute,transpose 


Damerau-Levenshtein Distance is a metric for measuring how far two given strings are, in terms of 4 basic operations:

  • deletion
  • insertion
  • substitution
  • transposition

 

Levenshtein Distance

  • deletion
  • insertion
  • substitution


아래는 어떤 문자열 A에 대해 여러 다른 문자들과 DL distance를 구하고자 할때

약간 시간을 단축할 수 있도록 class를 이용하는 것이다. class constructor에서 array값을 미리 채워 놓는다.



"""

Compute the Damerau-Levenshtein distance between two given

strings (s1 and s2)

from http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

"""

def damerau_levenshtein_distance(s1, s2):

    d = {}

    lenstr1 = len(s1)

    lenstr2 = len(s2)

    for i in xrange(-1,lenstr1+1):

        d[(i,-1)] = i+1

    for j in xrange(-1,lenstr2+1):

        d[(-1,j)] = j+1


    for i in xrange(lenstr1):

        for j in xrange(lenstr2):

            if s1[i] == s2[j]:

                cost = 0

            else:

                cost = 1

            d[(i,j)] = min(

                           d[(i-1,j)] + 1, # deletion

                           d[(i,j-1)] + 1, # insertion

                           d[(i-1,j-1)] + cost, # substitution

                          )

            if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:

                d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition


    return d[lenstr1-1,lenstr2-1]


"""

Compute the Damerau-Levenshtein distance between a given string s1

and many other strings s2.

"""

class DLDistance:

def __init__(self, s1):

self.s1 = s1

self.d = {}

self.lenstr1 = len(self.s1)

for i in xrange(-1,self.lenstr1+1):

self.d[(i,-1)] = i+1

def distance(self, s2):

lenstr2 = len(s2)

for j in xrange(-1,lenstr2+1):

self.d[(-1,j)] = j+1


for i in xrange(self.lenstr1):

for j in xrange(lenstr2):

if self.s1[i] == s2[j]:

cost = 0

else:

cost = 1

self.d[(i,j)] = min(

  self.d[(i-1,j)] + 1, # deletion

  self.d[(i,j-1)] + 1, # insertion

  self.d[(i-1,j-1)] + cost, # substitution

 )

if i and j and self.s1[i]==s2[j-1] and self.s1[i-1] == s2[j]:

self.d[(i,j)] = min (self.d[(i,j)], self.d[i-2,j-2] + cost) # transposition


return self.d[self.lenstr1-1,lenstr2-1]


def hamming_distance(s1, s2):

    assert len(s1) == len(s2)

    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def main():

base = u'whatsapp'

cmpstrs = [u'whatapp', u'what\'app', u'whatu', u'whoisthis']

dl = DLDistance(base)

for s in cmpstrs:

print damerau_levenshtein_distance(base, s)

print dl.distance(s)


print hamming_distance(u'whatsapp', u'whatapps')

if __name__ == '__main__':

    main()




WRITTEN BY
manager@
Data Analysis, Text/Knowledge Mining, Python, Cloud Computing, Platform

,