

Back in the early 2000s, when CRM was all the rage, mobile carriers tried to add personalization to their wireless internet portals but never made it work. Naver, too, has had little success in recent years with services like Naver Me.


As the industry source quoted below puts it, users find the act of configuring things itself a chore.

Users, it seems, are simply "more curious about what everyone else is reading a lot of, and how everyone else is reacting (comments)."


The best service is one that shows me suitable content, nicely presented, without my having to set anything up.

A serendipity service that brings you the right thing on its own, based on context-aware analysis. Google Now and Siri are headed in this direction. As we grow accustomed to such services, we will become increasingly numb to handing our personal data over to Google.




ESTinternet's Zum lets users pick "Zum apps" — widgets placed on the start page that jump straight to favorite sites or show desired content. Unlike existing portals, which display everything users might want, it strips out the information they do not need. Vice president 정상원 said, "As we thought about what would be convenient for users, it naturally led to personalization."

But some point out that leaving content selection to users can backfire: people would rather consume the content other people are interested in than tailored content — so-called "portal-style" consumption. An industry source said, "The premise of personalization services, that users know what they want, may itself be wrong."


http://www.etnews.com/news/contents/internet/2505074_1488.html




Reading text from PDFs

PDFMiner - http://www.unixuser.org/~euske/python/pdfminer/
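
A text-extraction sketch with PDFMiner (illustrative only; it assumes a pdfminer release that provides PDFPage.get_pages, e.g. 20131113, and sample.pdf is a placeholder):

# -*- coding: utf-8 -*-
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    rsrcmgr = PDFResourceManager()
    output = StringIO()
    device = TextConverter(rsrcmgr, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)   # accumulate each page's text
    device.close()
    return output.getvalue()

print pdf_to_text('sample.pdf')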



http://redjonathan.blogspot.kr/2013/06/reportlab-python-pdf.html



Hangul or Unicode


Usage is not difficult, and the manual is good. The main question will be whether Hangul prints correctly. It does, as long as you specify the font properly. Skipping the basic ASCII-only tutorial, let's look at how to print Hangul; the same approach works for any other Unicode characters.

#-*- coding: utf-8 -*-
# hello.py
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont("HCR Batang", "HANBatang-LVT.ttf"))

c = canvas.Canvas("hello.pdf")
c.setFont("HCR Batang", 16)
c.drawString(100,100, u"안녕")
c.showPage()
c.save()

A few comments on the code above:

  • Save the script in UTF-8 encoding; when embedding Hangul directly in source code, UTF-8 is the way to go.
  • TTFont and pdfmetrics let you register a TrueType font under a name so ReportLab can use it.
  • On Linux or Mac OS X you can list available font files with fc-list (other methods work too, of course).
  • The example uses the HCR Batang (함초롬바탕) LVT font. Download: 함초롬체/GSUB @ KTUG
  • The example uses Canvas, so it is not producing a document but drawing a picture and adding text to it.


Document generation


More often than not you will want to generate documents rather than draw with Canvas. Document generation uses platypus. As before, you must register the Hangul font under a name, and you must also add a new style whose ParagraphStyle points at the desired Hangul font. A simple example:

#-*- coding: utf-8 -*- 
# doc.py
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.pagesizes import A4
from reportlab.platypus import Paragraph, SimpleDocTemplate
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont("HCR Batang", "HANBatang.ttf"))

styles = getSampleStyleSheet()
styles.add(ParagraphStyle(name="Hangul", fontName="HCR Batang"))

hangul1 = u"안녕하세요."
hangul2 = u"반갑습니다."

story = []
story.append(Paragraph(hangul1, style=styles["Hangul"]))
story.append(Paragraph(hangul2, style=styles["Hangul"]))
doc = SimpleDocTemplate("doc.pdf", pagesize=A4)
doc.build(story)


Fine-grained styling within a paragraph


The example above only sets a style per Paragraph. What if you want just one word in red? In that case (there may be other ways, but) it is convenient to use XML-tag style markup. For example, instead of the string used in the document-generation example above, write:

hangul1 = u'<font color="red">안녕</font>하세요'
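
For instance, a sketch that combines this intra-paragraph markup with the Hangul style from the document-generation example above (the output file name is a placeholder):

#-*- coding: utf-8 -*-
# inline_style.py -- sketch combining the two examples above
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import Paragraph, SimpleDocTemplate
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont("HCR Batang", "HANBatang.ttf"))
styles = getSampleStyleSheet()
styles.add(ParagraphStyle(name="Hangul", fontName="HCR Batang"))

# Only the word inside the <font> tag is rendered in red.
hangul1 = u'<font color="red">안녕</font>하세요'
story = [Paragraph(hangul1, style=styles["Hangul"])]
SimpleDocTemplate("styled.pdf").build(story)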



---

http://comments.gmane.org/gmane.comp.python.reportlab.user/10022

import os

# Register the Windows Batang font (BATANG.TTC) under the name '바탕'
FONT_FILE = '%s/Fonts/%s' % (os.environ['WINDIR'], 'BATANG.TTC')
FONT_NAME = '바탕'
pdfmetrics.registerFont(TTFont(FONT_NAME, FONT_FILE))


Then, after you have picked the style:

from reportlab.lib.units import inch
from reportlab.platypus import Spacer, SimpleDocTemplate

doc = SimpleDocTemplate("phello.pdf")
Story = [Spacer(1, 2*inch)]
style = styles["Normal"]



OCR + python




PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.

Dependencies

PIL is required to work with images in memory. PyTesser has been tested with Python 2.4 in Windows XP.

Usage Example

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord






Creating PDF files with Python



Note that the similar-appearing pyfpdf of Mariano Reingart is most comparable to ReportLab, in that both ReportLab and pyfpdf emphasize document generation. Interestingly enough, pyfpdf builds in a basic HTML->PDF converter;




pdfrw - Python library to read and write PDF files - Google Project ...

17 September 2012 

e.g.) https://code.google.com/p/pdfrw/wiki/ExampleTools  --> watermarking PDFs



pyfpdf - Simple PDF generation for Python (FPDF PHP port) AKA ...


Latest Released Version: 1.7 (August 15th, 2012) - Current Development Version: 1.7.1


e.g.)  http://pyfpdf.googlecode.com/hg/tutorial/tuto3.pdf
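
A minimal pyfpdf usage sketch (assuming the 1.7-era API, where the package imports as fpdf; the file name is a placeholder):

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
pdf.cell(0, 10, "Hello from pyfpdf", ln=1)   # one full-width cell, then a line break
pdf.output("hello_fpdf.pdf")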


pypdf --> pypdf2

http://knowah.github.io/PyPDF2/

https://github.com/knowah/PyPDF2/  - pure python
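
A minimal sketch of the PyPDF2 API of that era (PdfFileReader/PdfFileWriter; the file names are placeholders):

from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader(open("input.pdf", "rb"))
print reader.getNumPages()                    # number of pages
print reader.getPage(0).extractText()         # text of the first page

writer = PdfFileWriter()                      # copy the first page into a new PDF
writer.addPage(reader.getPage(0))
with open("output.pdf", "wb") as f:
    writer.write(f)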




ReportLab 

https://bitbucket.org/rptlab/reportlab


 easy_install reportlab or pip install reportlab 


Example built on ReportLab: https://docs.djangoproject.com/en/dev/howto/outputting-pdf/









Technologies to use and learn:

twitter.com OAuth integration; fetching resources with the Twitter API

  -- oauth, python tweepy.py




== Where to find the source of installed third-party Python libraries

/usr/local/lib/python2.7/site-packages/

/usr/local/lib/python2.7/site-packages/tweepy



http://elle.atzine.com/elle/elleweb_template_fashion.iht?contId=B11_20110527_09100&menuId=M00024



MongoDB (weechat server status)

=== mongoDB configuration

port=27017

dbpath=/data/db/


== Checking that mongoDB is running (mongod daemon)

$ ps -ef | grep mongo

root      3243     1  0 Aug06 ?        00:00:06 ./mongod


== Install location

The executable lives under /home/xxxx/bin/. (No compile or install step is needed; just copying the binary works.)


== Location of the actual mongoDB data files

/data/db


== Connecting to mongoDB (running 'mongo' drops you into a command prompt)

In mongo, a db corresponds to a MySQL database, and a collection roughly to a table.

Try entering the following in order:


# mongo

> help

> show dbs

> use weechat2  

> show collections
> db.users.find()     // list every record in the users collection

> db.users.find({"id":100004})     // print the record in users whose id is 100004

> db.users.findOne()

> db.users.count()

> db.users.remove()   // delete all records

> db.users.insert( {"id":100, "name":"han"} )   // insert a record

> db.users.update()   // update records

> db.users.drop()     // drop the entire collection (table)



Comparison of mongo and MySQL commands:   http://www.mongodb.org/display/DOCS/SQL+to+Mongo+Mapping+Chart
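
As a rough illustration of that mapping, the shell queries above expressed through pymongo (a sketch; Connection is the pymongo API of that era, and weechat2/users are the names used in this post):

from pymongo import Connection

db = Connection().weechat2                  # like "use weechat2" in the shell
print db.collection_names()                 # like "show collections"
print db.users.find_one()                   # db.users.findOne()
print db.users.count()                      # db.users.count()
for u in db.users.find({"id": 100004}):     # db.users.find({"id":100004})
    print u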



Installing mongo on Mac OS X

http://www.mongodb.org/display/DOCS/Quickstart+OS+X


mongo and Python → pymongo

Sample source


pymongo can be used with syntax very similar to the mongo command prompt:

import pymongo
from pymongo import Connection   # old pymongo API (MongoClient in newer versions)

connection = Connection()
db = connection.mydb1

# follower lists keyed by user id
table = db.followers
print table
table.insert({"id": "xxxxx", "flist": [("aaa", 10), ("bbb", 20)]})

c = table.find({"id": "xxxxxx"})
for cc in c:
    flist = cc["flist"]
    print flist

# twitter collection
table = db.twitter
print table
table.insert({"id": "xxxxxx", "flist": [("aaa", 10), ("bbb", 20)]})

# rooms the user belongs to (dotted collection name: weechat.myrooms)
table = db.weechat.myrooms
print table
table.insert({"id": "xxxxxx",
              "rlist": [("roomtoken1", "membertoken1"), ("roomtoken2", "membertoken2")]})

# room info
table = db.weechat.roominfos
print table
table.insert({"id": "xxxxxx", "rlist": ["roomtoken1", "roomtoken2"]})



http://www.snailinaturtleneck.com/blog/tag/mongodb/





Korean morphological analyzers

은전한닢 (Eunjeon)    http://eunjeon.blogspot.kr

MeCab-based. By 이용운 (bibreen) and 유영호 (mousegood).

https://bitbucket.org/bibreen/mecab-ko-dic — Korean dictionary for MeCab, hosted on Bitbucket

http://mind42.com/mindmap/b269c84a-3975-48ef-946e-8900f3414661?rel=url — related explanation mind map


류창우 (Changwoo Ryu): hunspell-ko-dict

http://twitter.com/changwoo

https://code.google.com/p/ko-po-check/

http://github.com/changwoo


Python-based search engine (2009, KAIST)

https://github.com/serialx


꼬꼬마 (Kkma) Korean morphological analyzer (Seoul National University)

http://kkma.snu.ac.kr/documents/



There is a statistic, from April 2012, that 7% of Google searches contain misspellings.


http://www.seroundtable.com/google-match-misspellings-adwords-15029.html



7% Of Searches Are Misspelled: Google AdWords To Change Match Type Settings

Apr 18, 2012 • 8:36 am | Filed Under Google AdWords
 

Google announced that in about 30 days from now, the way they handle AdWords match types for exact and phrase based matching will change. Going forward, Google will auto-match for phrase and exact match even for variants including misspellings, singular/plural forms, stemmings, accents and abbreviations.

Why? Google has said that over 7% of all queries contain misspellings. Google also explained that Google's organic results already do this and they want to make sure the AdWords results do the same. Google added that in early tests it has led to an increase in clicks by 3%.

Here is an example:

[Image: AdWords Exact Variant Options]

Don't like that Google is doing this? Don't worry, you can turn it off. Once Google activates this log in to AdWords and select the campaign settings tab. Under “Advanced settings” select Keyword matching options. There will be an option to turn it off there, here is a picture:

[Image: AdWords Exact Variant Options]

Forum discussion at WebmasterWorld.




http://norvig.com/spell-correct.html


How to Write a Spelling Corrector

In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. (Yahoo and Microsoft are similar.) What surprised me is that I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about statistical language processing problems such as spelling correction. But they didn't, and come to think of it, there's no reason they should: it was my expectations that were faulty, not their knowledge.

I figured they and many others could benefit from an explanation. The full details of an industrial-strength spell corrector are quite complex (you can read a little about it here or here). What I wanted to do here is to develop, in less than a page of code, a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.

So here, in 21 lines of Python 2.5 code, is the complete spelling corrector:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The code defines the function correct, which takes a word as input and returns a likely correction of that word. For example:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

The version of edits1 shown here is a variation on one proposed by Darius Bacon; I think this is clearer than the version I originally had. Darius also fixed a bug in the function correct.

How It Works: Some Probability Theory

How does it work? First, a little theory. Given a word, we are trying to choose the most likely spelling correction for that word (the "correction" may be the original word itself). There is no way to know for sure (for example, should "lates" be corrected to "late" or "latest"?), which suggests we use probabilities. We will say that we are trying to find the correction c, out of all possible corrections, that maximizes the probability of c given the original word w:

argmax_c P(c|w)

By Bayes' Theorem this is equivalent to:

argmax_c P(w|c) P(c) / P(w)

Since P(w) is the same for every possible c, we can ignore it, giving:

argmax_c P(w|c) P(c)

There are three parts of this expression. From right to left, we have:

  1. P(c), the probability that a proposed correction c stands on its own. This is called the language model: think of it as answering the question "how likely is c to appear in an English text?" So P("the") would have a relatively high probability, while P("zxzxzxzyyy") would be near zero.

  2. P(w|c), the probability that w would be typed in a text when the author meant c. This is the error model: think of it as answering "how likely is it that the author would type w by mistake when c was intended?"

  3. argmax_c, the control mechanism, which says to enumerate all feasible values of c, and then choose the one that gives the best combined probability score.

One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and perhaps the typist's finger slipped off the "e" onto the "w". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.

Now we are ready to show how the program works. First P(c). We will read a big text file, big.txt, which consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. (On the plane home when I was writing the first version of the code all I had was a collection of Sherlock Holmes stories that happened to be on my laptop; I added the other sources later and stopped adding texts when they stopped helping, as we shall see in the Evaluation section.)

We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that "the" and "The" will be the same and then defines a word as a sequence of alphabetic characters, so "don't" will be seen as the two words "don" and "t"). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs, using the function train. It looks like this:

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

At this point, NWORDS[w] holds a count of how many times the word w has been seen. There is one complication: novel words. What happens with a perfectly good word of English that wasn't seen in our training data? It would be bad form to say the probability of a word is zero just because we haven't seen it yet. There are several standard approaches to this problem; we take the easiest one, which is to treat novel words as if we had seen them once. This general process is called smoothing, because we are smoothing over the parts of the probability distribution that would have been zero, bumping them up to the smallest possible count. This is achieved through the class collections.defaultdict, which is like a regular Python dict (what other languages call hash tables) except that we can specify the default value of any key; here we use 1.

Now let's look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). Here's a function that returns a set of all words c that are one edit away from w:

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1('something')) -- that is, the number of elements in the result of edits1('something') -- is 494.

The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target. As we shall see shortly, I put together a development corpus of 270 spelling errors, and found that only 76% of them have edit distance 1. Perhaps the examples I found are harder than typical errors. Anyway, I thought this was not good enough, so we'll need to consider edit distance 2. That's easy: just apply edits1 to all the results of edits1:

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

This is easy to write, but we're starting to get into some serious computation: len(edits2('something')) is 114,324. However, we do get good coverage: of the 270 test cases, only 3 have an edit distance greater than 2. That is, edits2 will cover 98.9% of the cases; that's good enough for me. Since we aren't going beyond edit distance 2, we can do a small optimization: only keep the candidates that are actually known words. We still have to consider all the possibilities, but we don't have to build up a big set of them. The function known_edits2 does this:

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Now, for example, known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}, rather than the set of 114,324 words generated by edits2. That speeds things up by about 10%.

Now the only part left is the error model, P(w|c). Here's where I ran into difficulty. Sitting on the plane, with no internet connection, I was stymied: I had no training data to build a model of spelling errors. I had some intuitions: mistaking one vowel for another is more probable than mistaking two consonants; making an error on the first letter of a word is less probable, etc. But I had no numbers to back that up. So I took a shortcut: I defined a trivial model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. By "known word" I mean a word that we have seen in the language model training data -- a word in the dictionary. We can implement this strategy as follows:

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The function correct chooses as the set of candidate words the set with the shortest edit distance to the original word, as long as the set has some known words. Once it identifies the candidate set to consider, it chooses the element with the highest P(c) value, as estimated by the NWORDS model.

Evaluation

Now it is time to evaluate how well this program does. On the plane I tried a few examples, and it seemed okay. After my plane landed, I downloaded Roger Mitton's Birkbeck spelling error corpus from the Oxford Text Archive. From that I extracted two test sets of corrections. The first is for development, meaning I get to look at it while I'm developing the program. The second is a final test set, meaning I'm not allowed to look at it, nor change my program after evaluating on it. This practice of having two sets is good hygiene; it keeps me from fooling myself into thinking I'm doing better than I am by tuning the program to one specific set of tests. Here I show an excerpt of the two tests and the function to run them; to see the complete set of tests (along with the rest of the program), see the file spell.py.

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
    'accomodation acommodation acomodation', 'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}

def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target,wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w!=target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print '%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )

print spelltest(tests1)
print spelltest(tests2) ## only do this after everything is debugged

This gives the following output:

{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 16, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 26, 'pct': 67, 'n': 400}

So on the development set of 270 cases, we get 74% correct in 13 seconds (a rate of 17 Hz), and on the final test set we get 67% correct (at 15 Hz).

Update: In the original version of this essay I incorrectly reported a higher score on both test sets, due to a bug. The bug was subtle, but I should have caught it, and I apologize for misleading those who read the earlier version. In the original version of spelltest, I left out the if bias: in the fourth line of the function (and the default value was bias=0, not bias=None). I figured that when bias = 0, the statement NWORDS[target] += bias would have no effect. In fact it does not change the value of NWORDS[target], but it does have an effect: it makes (target in NWORDS) true. So in effect the spelltest routine was cheating by making all the unknown words known. This was a humbling error, and I have to admit that much as I like defaultdict for the brevity it adds to programs, I think I would not have had this bug if I had used regular dicts.

Update 2: defaultdict strikes again. Darius Bacon pointed out that in the function correct, I had accessed NWORDS[w]. This has the unfortunate side-effect of adding w to the defaultdict, if w was not already there (i.e., if it was an unknown word). Then the next time, it would be present, and we would get the wrong answer. Darius correctly suggested changing to NWORDS.get. (This works because max(None, i) is i for any integer i.)

In conclusion, I met my goals for brevity, development time, and runtime speed, but not for accuracy.

Future Work

Let's think about how we could do better. (I've done some more in a separate chapter for a book.) We'll again look at all three factors of the probability model: (1) P(c); (2) P(w|c); and (3) argmax_c. We'll look at examples of what we got wrong. Then we'll look at some factors beyond the three...

  1. P(c), the language model. We can distinguish two sources of error in the language model. The more serious is unknown words. In the development set, there are 15 unknown words, or 5%, and in the final test set, 43 unknown words or 11%. Here are some examples of the output of spelltest with verbose=True:
    correct('economtric') => 'economic' (121); expected 'econometric' (1)
    correct('embaras') => 'embargo' (8); expected 'embarrass' (1)
    correct('colate') => 'coat' (173); expected 'collate' (1)
    correct('orentated') => 'orentated' (1); expected 'orientated' (1)
    correct('unequivocaly') => 'unequivocal' (2); expected 'unequivocally' (1)
    correct('generataed') => 'generate' (2); expected 'generated' (1)
    correct('guidlines') => 'guideline' (2); expected 'guidelines' (1)

    In this output we show the call to correct and the result (with the NWORDS count for the result in parentheses), and then the word expected by the test set (again with the count in parentheses). What this shows is that if you don't know that 'econometric' is a word, you're not going to be able to correct 'economtric'. We could mitigate by adding more text to the training corpus, but then we also add words that might turn out to be the wrong answer. Note the last four lines above are inflections of words that do appear in the dictionary in other forms. So we might want a model that says it is okay to add '-ed' to a verb or '-s' to a noun.

    The second potential source of error in the language model is bad probabilities: two words appear in the dictionary, but the wrong one appears more frequently. I must say that I couldn't find cases where this is the only fault; other problems seem much more serious.

    We can simulate how much better we might do with a better language model by cheating on the tests: pretending that we have seen the correctly spelled word 1, 10, or more times. This simulates having more text (and just the right text) in the language model. The function spelltest has a parameter, bias, that does this. Here's what happens on the development and final test sets when we add more bias to the correctly-spelled words:

    Bias    Dev.    Test
    0       74%     67%
    1       74%     70%
    10      76%     73%
    100     82%     77%
    1000    89%     80%

    On both test sets we get significant gains, approaching 80-90%. This suggests that it is possible that if we had a good enough language model we might get to our accuracy goal. On the other hand, this is probably optimistic, because as we build a bigger language model we would also introduce words that are the wrong answer, which this method does not do.

    Another way to deal with unknown words is to allow the result of correct to be a word we have not seen. For example, if the input is "electroencephalographicallz", a good correction would be to change the final "z" to an "y", even though "electroencephalographically" is not in our dictionary. We could achieve this with a language model based on components of words: perhaps on syllables or suffixes (such as "-ally"), but it is far easier to base it on sequences of characters: 2-, 3- and 4-letter sequences.

  2. P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the error. This causes some problems, as the examples below show. First, some cases where correct returns a word at edit distance 1 when it should return one at edit distance 2:
    correct('reciet') => 'recite' (5); expected 'receipt' (14)
    correct('adres') => 'acres' (37); expected 'address' (77)
    correct('rember') => 'member' (51); expected 'remember' (162)
    correct('juse') => 'just' (768); expected 'juice' (6)
    correct('accesing') => 'acceding' (2); expected 'assessing' (1)

    Here, for example, the alteration of 'd' to 'c' to get from 'adres' to 'acres' should count more than the sum of the two changes 'd' to 'dd' and 's' to 'ss'.

    Also, some cases where we choose the wrong word at the same edit distance:

    correct('thay') => 'that' (12513); expected 'they' (4939)
    correct('cleark') => 'clear' (234); expected 'clerk' (26)
    correct('wer') => 'her' (5285); expected 'were' (4290)
    correct('bonas') => 'bones' (263); expected 'bonus' (3)
    correct('plesent') => 'present' (330); expected 'pleasant' (97)

    The same type of lesson holds: In 'thay', changing an 'a' to an 'e' should count as a smaller change than changing a 'y' to a 't'. How much smaller? It must be at least a factor of 2.5 to overcome the prior probability advantage of 'that' over 'they'.

    Clearly we could use a better model of the cost of edits. We could use our intuition to assign lower costs for doubling letters and changing a vowel to another vowel (as compared to an arbitrary letter change), but it seems better to gather data: to get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. We need a lot of data to do this well. If we want to look at the change of one character for another, given a window of two characters on each side, that's 26^6, which is over 300 million characters. You'd want several examples of each, on average, so we need at least a billion characters of correction data; probably safer with at least 10 billion.

    Note there is a connection between the language model and the error model. The current program has such a simple error model (all edit distance 1 words before any edit distance 2 words) that it handicaps the language model: we are afraid to add obscure words to the model, because if one of those obscure words happens to be edit distance 1 from an input word, then it will be chosen, even if there is a very common word at edit distance 2. With a better error model we can be more aggressive about adding obscure words to the dictionary. Here are some examples where the presence of obscure words in the dictionary hurts us:

    correct('wonted') => 'wonted' (2); expected 'wanted' (214)
    correct('planed') => 'planed' (2); expected 'planned' (16)
    correct('forth') => 'forth' (83); expected 'fourth' (79)
    correct('et') => 'et' (20); expected 'set' (325)

  3. The enumeration of possible corrections, argmax_c. Our program enumerates all corrections within edit distance 2. In the development set, only 3 words out of 270 are beyond edit distance 2, but in the final test set, there were 23 out of 400. Here they are:
    purple perpul
    curtains courtens
    minutes muinets
    
    successful sucssuful
    hierarchy heiarky
    profession preffeson
    weighted wagted
    inefficient ineffiect
    availability avaiblity
    thermawear thermawhere
    nature natior
    dissension desention
    unnecessarily unessasarily
    disappointing dissapoiting
    acquaintances aquantences
    thoughts thorts
    criticism citisum
    immediately imidatly
    necessary necasery
    necessary nessasary
    necessary nessisary
    unnecessary unessessay
    night nite
    minutes muiuets
    assessing accesing
    necessitates nessisitates
    

    We could consider extending the model by allowing a limited set of edits at edit distance 3. For example, allowing only the insertion of a vowel next to another vowel, or the replacement of a vowel for another vowel, or replacing close consonants like "c" to "s" would handle almost all these cases.

  4. There's actually a fourth (and best) way to improve: change the interface to correct to look at more context. So far, correct only looks at one word at a time. It turns out that in many cases it is difficult to make a decision based only on a single word. This is most obvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another word anyway:
    correct('where') => 'where' (123); expected 'were' (452)
    correct('latter') => 'latter' (11); expected 'later' (116)
    correct('advice') => 'advice' (64); expected 'advise' (20)

    We can't possibly know that correct('where') should be 'were' in at least one case, but should remain 'where' in other cases. But if the query had been correct('They where going') then it seems likely that "where" should be corrected to "were".

    The context of the surrounding words can help when there are obvious errors, but two or more good candidate corrections. Consider:

    correct('hown') => 'how' (1316); expected 'shown' (114)
    correct('ther') => 'the' (81031); expected 'their' (3956)
    correct('quies') => 'quiet' (119); expected 'queries' (1)
    correct('natior') => 'nation' (170); expected 'nature' (171)
    correct('thear') => 'their' (3956); expected 'there' (4973)
    correct('carrers') => 'carriers' (7); expected 'careers' (2)

    Why should 'thear' be corrected as 'there' rather than 'their'? It is difficult to tell by the single word alone, but if the query were correct('There's no there thear') it would be clear.

    To build a model that looks at multiple words at a time, we will need a lot of data. Fortunately, Google has released a database of word counts for all sequences up to five words long, gathered from a corpus of a trillion words.

    I believe that a spelling corrector that scores 90% accuracy will need to use the context of the surrounding words to make a choice. But we'll leave that for another day...

  5. We could improve our accuracy scores by improving the training data and the test data. We grabbed a million words of text and assumed they were all spelled correctly; but it is very likely that the training data contains several errors. We could try to identify and fix those. Less daunting a task is to fix the test sets. I noticed at least three cases where the test set says our program got the wrong answer, but I believe the program's answer is better than the expected answer:
    correct('aranging') => 'arranging' (20); expected 'arrangeing' (1)
    correct('sumarys') => 'summary' (17); expected 'summarys' (1)
    correct('aurgument') => 'argument' (33); expected 'auguments' (1)

    We could also decide what dialect we are trying to train for. The following three errors are due to confusion about American versus British spelling (our training data contains both):

    correct('humor') => 'humor' (17); expected 'humour' (5)
    correct('oranisation') => 'organisation' (8); expected 'organization' (43)
    correct('oranised') => 'organised' (11); expected 'organized' (70)
  6. Finally, we could improve the implementation by making it much faster, without changing the results. We could re-implement in a compiled language rather than an interpreted one. We could have a lookup table that is specialized to strings rather than Python's general-purpose dict. We could cache the results of computations so that we don't have to repeat them multiple times. One word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually going.
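
    For instance, a minimal caching sketch (a hypothetical helper, not part of Norvig's program) that memoizes edits1:

    _edits1_cache = {}

    def edits1_cached(word):
        # return the cached edit set when we have already computed it for this word
        if word not in _edits1_cache:
            _edits1_cache[word] = edits1(word)
        return _edits1_cache[word]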

Further Reading

Errata

Originally my program was 20 lines, but Ivan Peev pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure. I could have used string.ascii_lowercase.

Thanks to Jay Liang for pointing out there are only 54n+25 distance 1 edits, not 55n+25 as I originally wrote.

Thanks to Dmitriy Ryaboy for pointing out there was a problem with unknown words; this allowed me to find the NWORDS[target] += bias bug.

Other Computer Languages

After I posted this article, various people wrote versions in different programming languages. While the purpose of this article was to show the algorithms, not to highlight Python (and certainly not to play "code golf" in an attempt to find the shortest program), the other examples may be interesting for those who like comparing languages, or for those who want to borrow an implementation in their desired language:

Language     Lines   Author (and link to implementation)
Awk          15      Tiago "PacMan" Peczenyj
Awk          28      Gregory Grefenstette
C            184     Marcelo Toledo
C++          98      Felipe Farinon
C#           43      Lorenzo Stoakes
C#           69      Frederic Torres
Clojure      18      Rich Hickey
D            23      Leonardo M
Erlang       87      Federico Feroldi
F#           16      Dejan Jelovic
F#           34      Sebastian G
Go           57      Yi Wang
Groovy       22      Rael Cunha
Haskell      24      Grzegorz
Java         35      Rael Cunha
Java         372     Dominik Schulz
Javascript   92      Shine Xavier
Javascript   53      Panagiotis Astithas
Lisp         26      Mikael Jansson
Perl         63      riffraff
PHP          68      Felipe Ribeiro
PHP          103     Joe Sanders
Python       21      Peter Norvig
Rebol        133     Cyphre
Ruby         34      Brian Adkins
Scala        23      Thomas Jung
Scheme       45      Shiro
Scheme       89      Jens Axel

Other Natural Languages

This essay has been translated into:

Thanks to all the authors for creating these implementations and translations.


Peter Norvig





Here you can find Hangul code values (Unicode) and how to decompose a syllable into its initial/medial/final jamo.


http://en.wikipedia.org/wiki/Korean_language_and_computers


Hangul Syllables Area

To find Hangul syllables in Unicode, you can apply a simple formula. The formula and tables are as follows:

[{(initial)×588}+{(medial)×28}+(final)]+44032 (0xAC00)

Initial Jamo, Medial Jamo, Final Jamo (the jamo index tables from the Wikipedia article are omitted here)

Example

For example, if you want to find the codepoint of "한" in Unicode:

  • The value of initial Jamo ㅎ is 18
  • The value of medial Jamo ㅏ is 0
  • The value of final Jamo ㄴ is 4

So the formula gives {(18×588)+(0×28)+4}+44032, and the result is 54620. This means the Unicode value of 한 is 54620 in decimal, &#54620; as a numeric character reference, and U+D55C in standard Unicode notation.
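
A small Python sketch of this formula (illustrative only; the function name and the Python 2 unichr/print usage are mine, not from the article):

# -*- coding: utf-8 -*-
# Hangul syllable composition sketch
def compose_hangul(initial, medial, final=0):
    # returns the precomposed syllable for the given jamo index values
    return unichr(0xAC00 + initial * 588 + medial * 28 + final)

# 한: initial ㅎ = 18, medial ㅏ = 0, final ㄴ = 4  ->  U+D55C (54620)
print compose_hangul(18, 0, 4)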




http://cafe.naver.com/flashdev/38144


This is a continuation of "[Free the Text, Part 1] Let's decompose Hangul!", a post I wrote some time ago under the pen name "테너스".

 

Excerpting from Part 1:

 

Hangul syllables start at Unicode code point 44032.

A syllable consists of an initial, a medial, and a final jamo:
Initials (19): 'ㄱ', 'ㄲ', 'ㄴ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅅ', 'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
Medials (21): 'ㅏ', 'ㅐ', 'ㅑ', 'ㅒ', 'ㅓ', 'ㅔ', 'ㅕ', 'ㅖ', 'ㅗ', 'ㅘ', 'ㅙ', 'ㅚ', 'ㅛ', 'ㅜ', 'ㅝ', 'ㅞ', 'ㅟ', 'ㅠ', 'ㅡ', 'ㅢ', 'ㅣ'
Finals (28): ' ', 'ㄱ', 'ㄲ', 'ㄳ', 'ㄴ', 'ㄵ', 'ㄶ', 'ㄷ', 'ㄹ', 'ㄺ', 'ㄻ', 'ㄼ', 'ㄽ', 'ㄾ', 'ㄿ', 'ㅀ', 'ㅁ', 'ㅂ', 'ㅄ', 'ㅅ', 'ㅆ', 'ㅇ', 'ㅈ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ'
(The 0th final is the blank, i.e. no final consonant.)

 

This is the core of Hangul composition.

Each consonant and vowel is assigned a code, and the code value of a composed syllable can be computed with a fixed formula.

For example, "각" is made up of

initial "ㄱ"
medial "ㅏ"
final "ㄱ"

Converting each to its code:

initial "ㄱ" -> 0
medial "ㅏ" -> 0
final "ㄱ" -> 1

Plugging these into the formula

Hangul code = 44032 + (initial × 588) + (medial × 28) + final

gives the code value of "각".

Try running the following example:

-----------------------------------------------------------------

var 초성=0;
var 중성=0;
var 종성=1;

trace(String.fromCharCode( 44032 + (초성 * 588) + (중성 * 28) + 종성));

----------------------------------------------------------------

The result is "각".

 

The attached example file "타이핑_예제" is material I previously posted as an swf only.

I am releasing the source as well; it is messy because I did not have time to clean it up.

You could adapt it to build a touch-screen keyboard.

 

To implement keyboard input you need to understand five kinds of composition:

First, getting the code when only a single consonant or vowel is entered, such as "ㄱ", "ㄴ", "ㅕ".

Second, syllables without a final, such as "가" or "러".

Third, syllables with initial, medial and final all combined, such as "각" or "쀏".

Fourth, vowel + vowel composition, such as "ㅚ".

Fifth, consonant + consonant composition, such as "ㄳ" or "ㅄ".

 

First, the code of a lone consonant is obtained with:

Hangul code = 12593 + initial code

and the code of a lone vowel with:

Hangul code = 12623 + medial code

 

Second, the code of a consonant + vowel syllable with no final is:

44032 + (initial × 588) + (medial × 28)

i.e. the full composition formula with final = 0.

 

Third, as shown above, the initial + medial + final composition formula is:

Hangul code = 44032 + (initial × 588) + (medial × 28) + final

 

Fourth, vowel + vowel composition has to be handled case by case with a switch statement.

For instance "ㅘ" (code 9) is the combination of "ㅗ" (code 8) and "ㅏ" (code 0).

 

Fifth, consonant + consonant composition is handled the same way as vowel + vowel.
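
A Python sketch of the inverse direction, decomposing a precomposed syllable back into initial/medial/final index values (illustrative only; the function name is mine):

# -*- coding: utf-8 -*-
# Hangul syllable decomposition sketch: the inverse of the composition formula
def decompose_hangul(syllable):
    # returns (initial, medial, final) jamo indices of a precomposed syllable
    offset = ord(syllable) - 44032            # 44032 = 0xAC00
    initial, rest = divmod(offset, 588)       # 588 = 21 medials * 28 finals
    medial, final = divmod(rest, 28)
    return initial, medial, final

print decompose_hangul(u'각')    # (0, 0, 1): initial ㄱ, medial ㅏ, final ㄱ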

 

Conclusion

When implementing a keyboard input interface you have to handle the five cases above: if a consonant is followed by another consonant, move on to the next character; if a consonant is followed by a vowel, compose them:

"ㄱ" then "ㄴ"  ->  "ㄱㄴ"        "ㄱ" then "ㅓ"  ->  "거"

Compose whenever composition is possible, and move on to the next character when the new input cannot be composed. Only when this handling is done cleanly do you really have a Hangul input interface.

This concludes the series on Hangul composition. Next time I will return with a different topic.


Typo correction:

    • correction of misspelled words, suggesting the corrected word.
    • Google's 'Did you mean' and 'Showing results for' features




Reference cases for Korean typo correction

Yahoo! Korea typo-correction feature video (전희원)

http://www.youtube.com/watch?v=c7YEcYjEFgk


Yahoo! Korea typo-correction paper (전희원, 한국정보공학회)


"Analysis of Typo Patterns in Korean Search Queries and Building a Query Typo-Correction System Using User Logs"

http://www.slideshare.net/gogamza/ss-6265729


전희원's (고감자) blog

http://freesearch.pe.kr/archives/tag/speller 



Lucene spellcheck package

I have been reading the Lucene code lately. I analyzed the core package once before, so this time I am focusing on the contrib packages. Among them the spellcheck module caught my eye, since I recently implemented a high-performance library of the same kind. This package contains a string-comparison class implementing the Jaro-Winkler distance, which works well on short strings such as names, and another based on n-grams; but as expected, edit distance is used as the main measure.


Damerau-Levenshtein distance
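
A compact, unoptimized Python sketch of the (restricted) Damerau-Levenshtein distance, which counts an adjacent transposition as a single edit (illustrative only, not the Lucene implementation):

def damerau_levenshtein(a, b):
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # adjacent transposition
    return d[len(a)][len(b)]

print damerau_levenshtein('speling', 'spelling')   # 1 (one insertion)
print damerau_levenshtein('teh', 'the')            # 1 (one transposition)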



How to Write a Spelling Corrector - Peter Norvig




