24 posts in 'Data/Text/Knowledge Analysis & Mining/Python'

2013.07.16 python - Loading JSON data
2013.06.26 Is the master in 'Kung Fu Panda' a raccoon?
2013.04.23 [python] Partially decompressing gzip and bzip files
2013.04.18 [python] Proxy settings with urllib2
2013.03.22 [python] Reading/writing gzip-compressed files (utf-8 encoding)
2013.03.22 [python] Implementing division (divide operator with bit-operations)
2013.03.18 Computing the Damerau-Levenshtein Distance (Edit distance)
2013.03.18 [python] Reading stdin and writing stdout as utf-8
2013.03.11 [python] dict merge
2013.02.14 Best Python IDE - PyCharm

Python and machine learning training, SlowCampus

python - Loading JSON data

Data/Text/Knowledge Analysis & Mining/Python 2013. 7. 16. 15:49


The function below reads JSON data from a UTF-8 encoded file and loads it into memory as Python data (dict, list, and so on).

If the file name is '-', it reads from stdin instead of a file.

json.loads() parses a JSON-formatted string into in-memory data.

codecs is used to read the utf-8 file.

import codecs
import json
import sys

def load_jsonfile(fname):
    # '-' means: read from stdin instead of a file
    if fname == '-':
        fp = codecs.getreader('utf-8')(sys.stdin)
    else:
        fp = codecs.open(fname, 'rb', encoding='utf-8')
    lines = fp.read()
    fp.close()
    # parse the whole text as one JSON document
    jdata = json.loads(lines)
    return jdata
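The code above targets Python 2. As a rough sketch of the same idea, assuming Python 3 (where open() accepts an encoding argument and sys.stdin.buffer exposes the raw bytes), the function could look like the following. The function name is kept from the post; everything else is an illustrative assumption.

```python
import json
import sys


def load_jsonfile(fname):
    """Load JSON from a UTF-8 file, or from stdin when fname is '-'."""
    if fname == '-':
        # Decode the raw stdin bytes as UTF-8, independent of the locale.
        return json.loads(sys.stdin.buffer.read().decode('utf-8'))
    # The with-block closes the file even if parsing fails.
    with open(fname, encoding='utf-8') as fp:
        return json.load(fp)
```

json.load(fp) reads and parses in one step, so the intermediate string is not needed.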

Attribution-NonCommercial-NoDerivs


WRITTEN BY

: manager@
Data Analysis, Text/Knowledge Mining, Python, Cloud Computing, Platform


Is the master in 'Kung Fu Panda' a raccoon?

Data/Text/Knowledge Analysis & Mining/Python 2013. 6. 26. 13:26


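The bear we commonly call a panda is the giant panda.

The hero of the movie 'Kung Fu Panda' is a giant panda. So is his master a raccoon?

No, the master is a panda too. That smaller panda is called a red panda in English.

Below is an article reporting that a red panda named Rusty escaped from the National Zoo in Washington, D.C. yesterday and was caught again. (2013/06/25)

Red panda found after escaping from Washington's National zoo

#Rusty was also trending on Twitter.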

Spotted near 20th and Biltmore near Airy View condos at 1:25. Are you missing this guy?? @NationalZoo pic.twitter.com/v7KulXpNKE
— Ashley Foughty (@AshleyFoughty) June 24, 2013

It even made CNN news.

http://edition.cnn.com/video/data/2.0/video/us/2013/06/24/tsr-todd-dnt-red-panda-found.cnn.html

http://www.usnews.com/news/articles/2013/06/24/photos-the-red-pandas-many-faces-rusty-the-red-panda-escaped-the-smithsonian-national-zoo-early-monday-in-washington-dc

The Peeking Panda: A red panda living with two giant pandas peers over a tree trunk at the River Safari Zoo in Singapore. (Roslan Rahman/AFP/Getty Images)


[python] Partially decompressing gzip and bzip files

Data/Text/Knowledge Analysis & Mining/Python 2013. 4. 23. 08:59


Usage: python gzcat.py 5000 abc.txt.gz abc.5k.txt

This saves the first 5,000 lines of abc.txt.gz to abc.5k.txt.
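The gzcat.py script itself is not included in the text above. The following is only a minimal sketch of how such a tool might work, assuming Python 3; the function name head_compressed and the choice of decompressor by file extension are assumptions made for illustration, not the original script.

```python
import bz2
import gzip
import itertools


def head_compressed(num_lines, src, dst):
    """Copy the first num_lines lines of a .gz or .bz2 file to dst.

    Returns the number of lines actually written.
    """
    # Assumption of this sketch: the extension decides the decompressor.
    opener = bz2.open if src.endswith('.bz2') else gzip.open
    written = 0
    with opener(src, 'rb') as fin, open(dst, 'wb') as fout:
        # islice stops after num_lines, so the rest of the archive
        # is never decompressed.
        for line in itertools.islice(fin, num_lines):
            fout.write(line)
            written += 1
    return written
```

With this sketch, the usage line above would correspond to head_compressed(5000, 'abc.txt.gz', 'abc.5k.txt').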


[python] Proxy settings with urllib2

Data/Text/Knowledge Analysis & Mining/Python 2013. 4. 18. 20:49


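To fetch a web page through a corporate proxy, configure urllib2's ProxyHandler.

The example below assumes the proxy address is 123.100.20.30:8080.

import urllib2

proxy = urllib2.ProxyHandler({'http': '123.100.20.30:8080', 'https': '123.100.20.30:8080'})
opener = urllib2.build_opener(proxy)
# install the opener globally so that urllib2.urlopen() goes through the proxy
urllib2.install_opener(opener)

addr = 'http://www.example.com/'  # placeholder; set this to the page you want to fetch
conn = urllib2.urlopen(addr)
print conn.read()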
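urllib2 exists only in Python 2. As a hedged sketch, assuming Python 3, the same handler lives in urllib.request. The helper name make_proxy_opener is hypothetical, and the block only builds the opener without sending a request.

```python
import urllib.request


def make_proxy_opener(proxy_addr):
    """Build an opener that routes http and https requests through proxy_addr."""
    handler = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
    return urllib.request.build_opener(handler)


# Proxy address taken from the example above.
opener = make_proxy_opener('http://123.100.20.30:8080')
# opener.open(url) would send the request through the proxy;
# urllib.request.install_opener(opener) would make urlopen() use it globally.
```

Returning the opener instead of installing it keeps the proxy setting local to the code that needs it.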


[python] Reading/writing gzip-compressed files (utf-8 encoding)

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 22. 19:32


To read and write gzip files, use the gzip module from the standard library.

The example below reads utf-8 text files that have been compressed with gzip, and opens an output file so that the processed result is written back as utf-8 and compressed with gzip.

import sys
import codecs
import gzip

# read/write gzip file
reader = codecs.getreader("utf-8")
writer = codecs.getwriter("utf-8")

# two input files given, one output file given
A = reader(gzip.open(sys.argv[1], 'rb'))
B = reader(gzip.open(sys.argv[2], 'rb'))
C = writer(gzip.open(sys.argv[3], 'wb'))

For plain text files that are utf-8 encoded but not gzip-compressed, read and write them as follows.

#A = codecs.open(sys.argv[1], 'rb', encoding='utf-8')
#B = codecs.open(sys.argv[2], 'rb', encoding='utf-8')
#C = codecs.open(sys.argv[3], 'wb', encoding='utf-8')
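In Python 3 the codecs wrappers are usually unnecessary, because gzip.open() accepts text modes ('rt', 'wt') together with an encoding. The sketch below assumes Python 3; the function copy_gzip_utf8 and its transform parameter are hypothetical names chosen for illustration.

```python
import gzip


def copy_gzip_utf8(src, dst, transform=lambda line: line):
    """Read UTF-8 text from a gzip file, transform each line, write a new gzip file."""
    with gzip.open(src, 'rt', encoding='utf-8') as fin, \
         gzip.open(dst, 'wt', encoding='utf-8') as fout:
        for line in fin:
            # line is already a decoded str; fout encodes it back to UTF-8.
            fout.write(transform(line))
```

For example, copy_gzip_utf8('in.txt.gz', 'out.txt.gz', str.upper) would write an upper-cased copy.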


[python] Implementing division (divide operator with bit-operations)

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 22. 15:37


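Implementing division: write functions that compute the quotient and remainder by hand.

1) Slow method: use a for loop --> this is far too slow for very large numbers.

2) Fast method: a faster version can be implemented with bit operations (shifts).

The two functions below can be compared.

Calling divAndMod( y, x, debug=0) runs it without debug messages.

Suggestions for optimizing divAndMod() further are very welcome ^^.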
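The two functions are not included in the text above, so what follows is only a sketch of the two approaches, assuming Python 3. It assumes that y is the dividend and x is the divisor, and that y >= 0 and x > 0; the name div_and_mod_slow is hypothetical, while divAndMod(y, x, debug=0) follows the signature quoted above.

```python
def div_and_mod_slow(y, x):
    """Quotient and remainder of y / x by repeated subtraction (about y/x steps)."""
    if x <= 0 or y < 0:
        raise ValueError("this sketch handles y >= 0 and x > 0 only")
    q = 0
    while y >= x:
        y -= x
        q += 1
    return q, y


def divAndMod(y, x, debug=0):
    """Quotient and remainder of y / x by shift-and-subtract long division."""
    if x <= 0 or y < 0:
        raise ValueError("this sketch handles y >= 0 and x > 0 only")
    # Find the largest shift such that (x << shift) <= y (0 if x > y).
    shift = 0
    while (x << (shift + 1)) <= y:
        shift += 1
    q = 0
    # Walk from the highest candidate quotient bit down to bit 0.
    while shift >= 0:
        if (x << shift) <= y:
            y -= x << shift
            q |= 1 << shift
        if debug:
            print("shift=%d q=%d rem=%d" % (shift, q, y))
        shift -= 1
    return q, y
```

The slow version needs a number of iterations proportional to the quotient, while the shift version needs a number of iterations proportional to the bit length of the quotient.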


Computing the Damerau-Levenshtein Distance (Edit distance)

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 18. 19:36


String distance metrics are string comparison algorithms that can be used for tasks such as misspelling correction.

In general, the edit distance between strings A and B is the number of single-character edits (delete, insert, substitute, transpose) needed to turn A into B.

Difference between the two metrics

* A transposition swaps two adjacent characters. ex) lettre --> letter

L-distance: insert, delete, substitute

DL-distance: insert, delete, substitute, transpose

Damerau-Levenshtein Distance is a metric for measuring how far apart two given strings are, in terms of 4 basic operations:

deletion
insertion
substitution
transposition

Levenshtein Distance uses only 3 of them:

deletion
insertion
substitution

The code below uses a class to save a little time when computing the DL distance from one string A to many other strings. The class constructor fills in part of the array ahead of time.

"""
Compute the Damerau-Levenshtein distance between two given
strings (s1 and s2)
from http://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/
"""
def damerau_levenshtein_distance(s1, s2):
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in xrange(-1,lenstr1+1):
        d[(i,-1)] = i+1
    for j in xrange(-1,lenstr2+1):
        d[(-1,j)] = j+1

    for i in xrange(lenstr1):
        for j in xrange(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i,j)] = min(
                d[(i-1,j)] + 1, # deletion
                d[(i,j-1)] + 1, # insertion
                d[(i-1,j-1)] + cost, # substitution
            )
            if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition

    return d[lenstr1-1,lenstr2-1]

"""
Compute the Damerau-Levenshtein distance between a given string s1
and many other strings s2.
"""
class DLDistance:
    def __init__(self, s1):
        self.s1 = s1
        self.d = {}
        self.lenstr1 = len(self.s1)
        # the s1 border column does not depend on s2, so fill it once here
        for i in xrange(-1,self.lenstr1+1):
            self.d[(i,-1)] = i+1

    def distance(self, s2):
        lenstr2 = len(s2)
        for j in xrange(-1,lenstr2+1):
            self.d[(-1,j)] = j+1

        for i in xrange(self.lenstr1):
            for j in xrange(lenstr2):
                if self.s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                self.d[(i,j)] = min(
                    self.d[(i-1,j)] + 1, # deletion
                    self.d[(i,j-1)] + 1, # insertion
                    self.d[(i-1,j-1)] + cost, # substitution
                )
                if i and j and self.s1[i]==s2[j-1] and self.s1[i-1] == s2[j]:
                    self.d[(i,j)] = min (self.d[(i,j)], self.d[i-2,j-2] + cost) # transposition

        return self.d[self.lenstr1-1,lenstr2-1]

def hamming_distance(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def main():
    base = u'whatsapp'
    cmpstrs = [u'whatapp', u'what\'app', u'whatu', u'whoisthis']
    dl = DLDistance(base)

    for s in cmpstrs:
        print damerau_levenshtein_distance(base, s)
        print dl.distance(s)

    print hamming_distance(u'whatsapp', u'whatapps')

if __name__ == '__main__':
    main()
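To make the difference between the two metrics concrete, here is a small sketch, assuming Python 3, that computes both distances for the 'lettre' --> 'letter' example. The function edit_distance and its transpose flag are hypothetical names. Like the code above, it only allows swaps of adjacent characters (the restricted form often called optimal string alignment).

```python
def edit_distance(s1, s2, transpose=False):
    """Levenshtein distance; with transpose=True, adjacent swaps also cost 1."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = distance between the first i chars of s1 and the first j chars of s2
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpose and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


print(edit_distance('lettre', 'letter'))                  # 2
print(edit_distance('lettre', 'letter', transpose=True))  # 1
```

Without transposition the swap of 'r' and 'e' counts as two substitutions; with it, the same change counts as one edit.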


[python] Reading stdin and writing stdout as utf-8

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 18. 11:52


Setting stdin to read utf-8

import codecs
import sys

sys.stdin = codecs.getreader('utf-8')(sys.stdin)


Writing utf-8 strings to stdout

import codecs
import sys

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)


The setdefaultencoding() approach below also works in Python 2, although it changes the default encoding for the whole process.

import sys

reload(sys)
sys.setdefaultencoding('utf-8')

The code below reads from stdin when the file name is given as '-',

and otherwise reads a regular file as utf-8.

Usage: gzip -dc xxx.gz | python doit.py -

gzip -dc decompresses to stdout.

doit.py

import codecs
import sys

fname = sys.argv[1]
if fname == '-':
    fp = codecs.getreader('utf-8')(sys.stdin)
else:
    fp = codecs.open(fname, 'rb', encoding='utf-8')

for line in fp:
    pass  # do something on line
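The snippets above are written for Python 2. As a sketch of the equivalent idea, assuming Python 3, a binary stream such as sys.stdin.buffer can be wrapped in io.TextIOWrapper to decode it as UTF-8. The helper names utf8_reader and open_input are hypothetical.

```python
import io
import sys


def utf8_reader(binary_stream):
    """Wrap a binary stream so that iterating over it yields UTF-8 decoded lines."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8')


def open_input(fname):
    """Return a UTF-8 text stream: stdin when fname is '-', else the named file."""
    if fname == '-':
        return utf8_reader(sys.stdin.buffer)
    return open(fname, encoding='utf-8')
```

A script would then loop with `for line in open_input(sys.argv[1]):`, which fits the same `gzip -dc xxx.gz | python doit.py -` usage.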


[python] dict merge

Data/Text/Knowledge Analysis & Mining/Python 2013. 3. 11. 09:15


Merging two dicts in Python.

import operator

def dict_merge(a, b, func=None):
    ''' Merge dicts A and B and return a new dict R '''
    # new dict 'r'
    r = dict(a)
    if func is None: func = operator.add
    for k, vb in b.iteritems():
        if k in r:  # key in both: add A's value and B's value (or combine them with the given func)
            r[k] = func(r[k], vb)
        else:  # key missing from A: add B's key and value to R
            r[k] = vb
    return r

a={'a':1,'b':2}
b={'c':3, 'b':4}
print dict_merge(a, b)

==>

Values under the same key are added:

{'a': 1, 'c': 3, 'b': 6}
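The func parameter lets the caller choose how values under a shared key are combined. The sketch below, assuming Python 3 (items() instead of iteritems()), shows the default addition next to max as an alternative combiner; passing max is an illustrative choice, not something from the original post.

```python
import operator


def dict_merge(a, b, func=operator.add):
    """Return a new dict; values under shared keys are combined with func."""
    r = dict(a)
    for k, vb in b.items():
        # Combine when the key exists in both dicts, otherwise copy b's value.
        r[k] = func(r[k], vb) if k in r else vb
    return r


a = {'a': 1, 'b': 2}
b = {'c': 3, 'b': 4}
print(dict_merge(a, b))       # {'a': 1, 'b': 6, 'c': 3}
print(dict_merge(a, b, max))  # {'a': 1, 'b': 4, 'c': 3}
```

Neither input dict is modified, because the merge works on a copy of a.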


Best Python IDE - PyCharm

Data/Text/Knowledge Analysis & Mining/Python 2013. 2. 14. 19:56


PyCharm supports Django, Google App Engine, and more, and even handles JavaScript and HTML editing, so it looks like the best IDE.

---- OLD ---

I tried installing Ninja and PyScripter.

PyScripter has a much cleaner interface and offers more features, such as editor font settings and a debugging mode.

PyScripter http://code.google.com/p/pyscripter/

Ninja http://ninja-ide.org/

The following error can occur when running PyScripter.

"UnicodeEncodeError: 'ascii' codec can't encode character ~~~"

Fix:

Change a single character in C:\Python27\Lib\site.py.

def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    # This was originally 0; change it to 1.
    # It enables handling based on the Windows locale.
    if 1:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
