[Machine Learning] Text Similarity Analysis

밍츠 2022. 3. 27. 15:42

Text Similarity Analysis

- Measures how similar documents are to one another, based on the words they contain.

TF-IDF

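- A weighting scheme that reduces the weight of words appearing in many documents, on the grounds that such words do little to tell one document from another.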
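To make the down-weighting concrete, here is a minimal sketch of the inverse document frequency (IDF) part of TF-IDF. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is, as far as I know, what TfidfVectorizer uses with its default settings. The toy documents and the function name smoothed_idf are made up for illustration.

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["the", "prison", "escape"],
    ["the", "mafia", "family"],
    ["the", "dream", "heist"],
]

def smoothed_idf(term, docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)  # number of documents containing the term
    return math.log((1 + n) / (1 + df)) + 1

print(smoothed_idf("the", docs))     # appears in every document, so it gets the lowest weight
print(smoothed_idf("prison", docs))  # appears in one document, so it gets a higher weight
```

Under this formula a word that appears in every document ends up with the minimum weight, while a word found in a single document is weighted more heavily.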

Cosine similarity

- A measure computed from the cosine of the angle between two vectors; it indicates how similar the two vectors are.
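The definition can be written out in a few lines of plain Python. This is only a sketch for intuition (the function name cosine is my own, and it assumes neither vector is all zeros); the rest of the post uses scikit-learn's cosine_similarity instead.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction: close to 1
print(cosine([1, 0, 0], [0, 1, 0]))  # orthogonal (no shared terms): 0
```

Because only the angle matters, a long review and a short review that use words in the same proportions still score close to 1.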

1. Computing the similarity between two movie review texts

Movies:

The Shawshank Redemption (1994)
The Godfather (1972)

Import libraries

# Load the packages needed for the similarity analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

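with open('shawshank.txt', 'r', encoding='utf-8') as file:  # "with" closes the file automatically
    lines = file.readlines()  # read every line of the review file into a list
doc1 = ' '.join(lines)  # join the lines into one string
# Loop-based alternative (concatenates without the extra space):
# doc1 = ''  # string variable that will hold the review text
# for line in lines:  # append every line in lines to doc1
#     doc1 += line

with open('godfather.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()  # read every line of the review file into a list
doc2 = ' '.join(lines)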

print(len(doc1),len(doc2))

Result: 50241 57443
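Since the same loading steps are repeated for every movie, they could be wrapped in a small helper. The sketch below is one way to do it; load_review is a hypothetical name, and it reads the file in one call with pathlib rather than joining lines. Note that this gives a slightly shorter string than ' '.join(lines), which adds a space after each line, so the lengths printed above would not match exactly.

```python
from pathlib import Path

def load_review(path):
    """Return the whole review file as a single string."""
    return Path(path).read_text(encoding="utf-8")

# Usage, assuming the review files sit next to the script:
# doc1 = load_review("shawshank.txt")
# doc2 = load_review("godfather.txt")
```

The extra spaces make no difference to the TF-IDF values, because the vectorizer splits on whitespace and punctuation anyway.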

Combine doc1 and doc2 into a corpus list

corpus = [doc1, doc2]

Build the TF-IDF matrix

vectorizer = TfidfVectorizer()  # create the TfidfVectorizer object

X = vectorizer.fit_transform(corpus).todense()

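- vectorizer.fit() + vectorizer.transform(): learn the vocabulary and IDF weights, then apply them -> vectorizer.fit_transform() does both in one call

- A TF-IDF matrix is sparse: most of its entries are 0. Storing all those zeros wastes space, so fit_transform() returns a format that keeps only the nonzero entries. Calling todense() converts it into a matrix with every entry, zeros included, filled in, which is the form we want to look at here.

-> Skipping .todense() makes no difference to the cosine similarity calculation.

-> The dense form is easier to read by eye, but it takes much more memory, so it is better avoided except for inspection.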
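Both points can be checked on a toy corpus. The sketch below uses two made-up sentences in place of the review files. It compares fit() followed by transform() against fit_transform(), then passes the sparse matrix straight to cosine_similarity. It uses toarray(), which returns a plain NumPy array, for the dense comparison; as far as I know, recent scikit-learn versions no longer accept the numpy.matrix that todense() returns, so toarray() is the safer choice there.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical), standing in for the review files.
corpus = [
    "hope is a good thing",
    "family is everything",
]

# fit() then transform() on one vectorizer ...
v1 = TfidfVectorizer()
v1.fit(corpus)
X_two_step = v1.transform(corpus)

# ... gives the same matrix as fit_transform() on a fresh one.
X_one_step = TfidfVectorizer().fit_transform(corpus)
print(np.allclose(X_two_step.toarray(), X_one_step.toarray()))

# cosine_similarity accepts the sparse matrix directly; no dense copy is needed.
sim_sparse = cosine_similarity(X_one_step[0], X_one_step[1])

# Same calculation on dense 2-D slices, for comparison.
dense = X_one_step.toarray()
sim_dense = cosine_similarity(dense[0:1], dense[1:2])
print(sim_sparse, sim_dense)
```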

print(X)

Result:

[[0.0071001  0.00332632 0.         ... 0.         0.00166316 0.        ]
 [0.00889703 0.         0.00138938 ... 0.00138938 0.         0.00138938]]

print(type(X))

Result:

<class 'numpy.matrix'>

Check the shape of X (the TF-IDF matrix)

print(X.shape)

Result:

(2, 3276)

To make the matrix values easier to read, convert the matrix to a DataFrame.

import pandas as pd

pd.DataFrame(X)

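Check which word each TF-IDF value belongs to

- Use .get_feature_names() (deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use .get_feature_names_out(), which returns a NumPy array instead of a list)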

vectorizer.get_feature_names()

- Result (truncated):

['10',
 '100',
 '10engrossing',
 '10i',
 '10one',
 '10s',
 '10the',
 '10think',
 '11',
 '13',
 '18',
 '1940s',
 '1946',
 '1947',
 '1955',
 '1966',
 '1970',
 '1970s',
 '1972',
 '1992',
 '1994',
 '1995',
 '20',
 '2001',
 '2002',
 '2003',
 '20s',
 '234th',
 '25',
 '250',
 '25m',
 '28',
 '29',
 '30',
 '327',
 '33',
 '37',
 '3m',
 '40',
 '43',
 '50',
 '500',
 '70',
 '727',
 '90',
 '90s',
 '94',
 '_a',
 '_carrie_',
 '_casablanca_',
 '_cool',
 '_papillon_',
 '_shawshank_',
 '_the',
 'abbe',
 'ability',
 'abject',
 'able',
 'about',
 'above',
 'absolute',
 'absolutely',
 'absorbed',
 'abstract',
 'abundance',
 'academy',
 'accept',
 'acclaimed',
 'accounts',
 'accused',
 'achieve',
 'achieved',
 'achieves',
 'achieving',
 'achingly',
 'acknowledge',
 'across',
 'act',
 'acted',
 'acting',
 'action',
 'actions',
 'activities',
 'actor',
 'actors',
 'acts',
 'actual',
 'actually',
 'acutely',
 ....(truncated)
 ]

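- The vocabulary includes tokens such as 10 and 100. It is a good idea to preprocess the text first, for example by removing stop words and filtering out numeric tokens.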

Check the similarity with cosine_similarity.

print("Similarity between 'The Shawshank Redemption' and 'The Godfather': ", cosine_similarity(X[0], X[1]))  # cosine similarity

Result:

Similarity between 'The Shawshank Redemption' and 'The Godfather':  [[0.9437827]]

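-> The two sets of reviews are highly similar.

I haven't seen either movie, so I can't tell, but apparently they're alike. I should watch them when I get the time. Haha

Comparing with another movie: The Shawshank Redemption (1994) & Inception (2010)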

with open('shawshank.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()  # read every line of the review file into a list
    doc1 = ' '.join(lines)

with open('inception.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()  # read every line of the review file into a list
    doc2 = ' '.join(lines)

corpus = [doc1, doc2]  # combine doc1 and doc2 into a corpus list
vectorizer = TfidfVectorizer()  # create the TfidfVectorizer object
X = vectorizer.fit_transform(corpus).todense()  # vectorize the corpus with fit_transform() and convert the result to a dense matrix

print("Similarity between 'The Shawshank Redemption' and 'Inception': ", cosine_similarity(X[0], X[1]))

Result:

Similarity between 'The Shawshank Redemption' and 'Inception':  [[0.19704257]]

-> The two sets of reviews have low similarity.

Computing the similarity among three movie review texts

# Load the movie review .txt files ("with" closes each file automatically)
with open('shawshank.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
doc1 = ' '.join(lines)

with open('godfather.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
doc2 = ' '.join(lines)

with open('inception.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
doc3 = ' '.join(lines)

# Put the review documents in a list and get their TF-IDF values as a matrix.
corpus = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).todense()

# Compute the cosine similarity between movies
print("Similarity between 'The Shawshank Redemption' and 'The Godfather': ", cosine_similarity(X[0], X[1]))
print("Similarity between 'The Shawshank Redemption' and 'Inception': ", cosine_similarity(X[0], X[2]))
print("Similarity between 'The Godfather' and 'Inception': ", cosine_similarity(X[1], X[2]))

Result:

Similarity between 'The Shawshank Redemption' and 'The Godfather':  [[0.93484399]]
Similarity between 'The Shawshank Redemption' and 'Inception':  [[0.18080469]]
Similarity between 'The Godfather' and 'Inception':  [[0.16267018]]

- Problem: as the number of documents grows, you have to keep repeating this pair-by-pair comparison by hand.
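To get a feel for how quickly the pair-by-pair approach gets out of hand, here is a small sketch that lists the pairs for the three movies and counts them for larger collections. The titles list is just shorthand I chose for the three review files.

```python
from itertools import combinations

titles = ["Shawshank", "Godfather", "Inception"]

# Every unordered pair of documents that would need its own cosine_similarity call.
pairs = list(combinations(titles, 2))
print(pairs)

# The number of pairs is n * (n - 1) / 2, so it grows quadratically with n.
for n in (3, 10, 100):
    print(n, "documents ->", n * (n - 1) // 2, "pairs")
```

Three documents need only three calls, but a hundred documents would already need 4950.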

Computing cosine similarity as one row vs. all rows

# Compute the cosine similarity between X[0] and every document in X.
cosine_similarity( X[0] , X )
pd.DataFrame(cosine_similarity(X[0],X))

Result:

Use the transpose to swap the rows and columns

cosine_similarity( X[0] , X ).T
pd.DataFrame( cosine_similarity(X[0], X).T )  # .T is the transpose

Result:

This time, compute the cosine similarity as every row vs. every row.

result = pd.DataFrame(cosine_similarity( X , X ))

result.columns = ['Shawshank', 'Godfather', 'Inception']
result.index = ['Shawshank', 'Godfather', 'Inception']

result

Result:

Visualize the result with a heatmap, a plot that colors each cell of the DataFrame according to its value.

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(font_scale=2)  # set the font scale before drawing so that it applies to this plot
plt.figure(figsize=(10, 10))

sns.heatmap(result, annot=True, fmt='f', linewidths=5, cmap='RdYlBu')

plt.tick_params(top=True, bottom=False, labeltop=True, labelbottom=False)  # show the horizontal axis labels on top
plt.show()

Result: