[Scraping] Web scraping for news articles (2)

개발자식

밍츠 2022. 3. 28. 23:23

Scraping Naver News by searching for "데이터 분석" (data analysis)

Data to collect

Article title
Publication date
Article body
Article link
Publisher (press company)

1. Required libraries

import requests
from bs4 import BeautifulSoup

import pandas as pd
from datetime import datetime  # get the current date and time
import time
import re

2. Scrape only the articles labeled "네이버 뉴스" (Naver News)

query = '데이터분석'
url = "https://search.naver.com/search.naver?where=news&query=" + query
web = requests.get(url).content
source = BeautifulSoup(web, 'html.parser')

urls_list = []

# Keep only links hosted on news.naver.com, i.e. the results labeled "네이버 뉴스"
for urls in source.find_all('a', {'class' : "info"}):
    if urls["href"].startswith("https://news.naver.com"):
        urls_list.append(urls["href"])

urls_list

3. Store each article's data in a separate list per field.

titles = [] # titles
dates = [] # publication dates
articles = [] # article bodies
article_urls = [] # article URLs
press_companies = [] # publishers

error_urls = [] # URLs that failed during crawling

for url in urls_list:
    try:
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
        web_news = requests.get(url, headers=headers).content
        source_news = BeautifulSoup(web_news, 'html.parser')

        # 2) Article title
        title = source_news.find('h3', {'id' : 'articleTitle'}).get_text()
        print('Processing article : {}'.format(title))
        # find() returns None when nothing matches, so calling .get_text() on the result raises
        # AttributeError: 'NoneType' object has no attribute 'get_text'

        # 3) Article date
        date = source_news.find('span', {'class' : 't11'}).get_text()

        # 4) Article body (strip newlines, leftover script text, and the video label)
        article = source_news.find('div', {'id' : 'articleBodyContents'}).get_text()
        article = article.replace("\n", "")
        article = article.replace("// flash 오류를 우회하기 위한 함수 추가function _flash_removeCallback() {}", "")
        article = article.replace("동영상 뉴스       ", "")
        article = article.replace("동영상 뉴스", "")
        article = article.strip()

        # 5) Publisher
        press_company = source_news.find('address', {'class' : 'address_cp'}).find('a').get_text()

        # Append to the lists only when steps 2-5 all succeeded (title, date, body, publisher), so every list keeps the same length.
        titles.append(title)
        dates.append(date)
        articles.append(article)
        article_urls.append(url) # 6) Article URL
        press_companies.append(press_company)

    except Exception:
        print('*** An error occurred while crawling this news link : {}'.format(url))
        error_urls.append(url)

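* Use Chrome DevTools to locate each element, and run the code repeatedly as you go.

Article title

- An h1 or h2 tag usually appears only once on a page, but an h3 may not, so check that the selector matches the intended element.

Article date

- The format is '2022.03.25. 오전 10:18'. String indexing can later pull out just the parts you need.

Body

- Use replace() to remove anything that is not body text, such as JavaScript and \n.

- Wrap the extraction in try/except so that an error on one news page does not stop the crawl.

- Append to the lists only after every wanted field has been extracted, so all lists stay the same length.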
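
The date note above can be sketched as follows. This is a minimal example, assuming the date string always has the shape '2022.03.25. 오전 10:18' (date, then 오전 for AM or 오후 for PM, then a 12-hour time); parse_naver_date is a hypothetical helper name, not part of the original code.

```python
from datetime import datetime

def parse_naver_date(raw):
    """Parse a string such as '2022.03.25. 오전 10:18' into a datetime.

    Assumes three whitespace-separated parts: date, 오전/오후, and HH:MM.
    """
    date_part, meridiem, time_part = raw.split()
    year, month, day = (int(x) for x in date_part.rstrip(".").split("."))
    hour, minute = (int(x) for x in time_part.split(":"))
    # Convert the 12-hour clock to a 24-hour clock.
    if meridiem == "오후" and hour != 12:
        hour += 12
    elif meridiem == "오전" and hour == 12:
        hour = 0
    return datetime(year, month, day, hour, minute)

raw = '2022.03.25. 오전 10:18'
print(raw[:10])               # plain string indexing: 2022.03.25
print(parse_naver_date(raw))  # 2022-03-25 10:18:00
```

String indexing is enough when only the day is needed. Parsing into a datetime helps if the Date column will be sorted or filtered later.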

4. Convert the data to a DataFrame and save it as an Excel file.

article_df = pd.DataFrame({'Title':titles, 
                           'Date':dates, 
                           'Article':articles, 
                           'URL':article_urls, 
                           'PressCompany':press_companies})

# to_excel() never used the encoding argument, and pandas 2.0 removed it, so it is omitted here
article_df.to_excel('result_{}.xlsx'.format(datetime.now().strftime('%y%m%d_%H%M')), index=False)
article_df.head()

Result:

- Build the DataFrame from a dictionary, one column per key.

- Use the datetime library to put the current date and time in the file name (result_YYMMDD_HHMM.xlsx).

datetime library

- now() returns the current date and time.

- strftime() converts the date and time into a string.
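
A small sketch of the file-name pattern described above. result_filename is a hypothetical helper introduced only for illustration; the format string '%y%m%d_%H%M' is the one used in the post.

```python
from datetime import datetime

def result_filename(moment, prefix="result"):
    """Build a name like 'result_220328_2323.xlsx' from a datetime."""
    # %y = two-digit year, %m = month, %d = day, %H = hour (24h), %M = minute
    return "{}_{}.xlsx".format(prefix, moment.strftime("%y%m%d_%H%M"))

fixed = datetime(2022, 3, 28, 23, 23)
print(result_filename(fixed))           # result_220328_2323.xlsx
print(result_filename(datetime.now()))  # depends on when it runs
```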

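5. Scraping across multiple pages

Scraping while moving from one page to the next is called pagination.

Inspecting the link behind each page number at the bottom of the results page shows that only the number after start= differs.

-> Each page holds 10 articles, so the value starts at 1 and increases by 10.

How pagination is implemented

1. Built from a tags (Naver News)

2. Executed inside JavaScript

Ways to step through the pages

1. Hard-coding

2. The range function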

max_page = 5
start_points = []

for point in range(1, max_page*10+1, 10): # range(1, 51, 10) -> 1, 11, 21, 31, 41
    start_points.append(str(point))

3. A while loop

max_page = 5
current_call = 1 #1,11,21,31,41,...
last_call = (max_page - 1) * 10 + 1 # 41 when max_page is 5

while current_call <= last_call:
    print(current_call) # 1, 11, 21, 31, 41 
    current_call += 10
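
The range and while approaches produce the same start values. A minimal side-by-side sketch follows; the function names and the per_page parameter are introduced here for illustration, with 10 articles per page assumed as in the post.

```python
def start_points_range(max_page, per_page=10):
    """Start values via range: 1, 11, 21, ... for max_page pages."""
    return list(range(1, max_page * per_page + 1, per_page))

def start_points_while(max_page, per_page=10):
    """The same start values, built with a while loop."""
    points = []
    current_call = 1
    last_call = (max_page - 1) * per_page + 1
    while current_call <= last_call:
        points.append(current_call)
        current_call += per_page
    return points

print(start_points_range(5))  # [1, 11, 21, 31, 41]
print(start_points_while(5))  # [1, 11, 21, 31, 41]
```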

Final:

max_page = 5
current_call = 1
last_call = (max_page - 1) * 10 + 1 # 41 when max_page is 5

while current_call <= last_call: # runs "while" the condition is true

    print('\nStarting the crawl from article number {}.'.format(current_call))
    
    url = "https://search.naver.com/search.naver?where=news&query=" + query + "&start=" + str(current_call)
    web = requests.get(url).content
    source = BeautifulSoup(web, 'html.parser')

    urls_list = []
    for urls in source.find_all('a', {'class' : "info"}):
        if urls["href"].startswith("https://news.naver.com"):
            urls_list.append(urls["href"])

    for url in urls_list:
        try:
            headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
            web_news = requests.get(url, headers=headers).content
            source_news = BeautifulSoup(web_news, 'html.parser')

            title = source_news.find('h3', {'id' : 'articleTitle'}).get_text()
            print('Processing article : {}'.format(title))

            date = source_news.find('span', {'class' : 't11'}).get_text()

            article = source_news.find('div', {'id' : 'articleBodyContents'}).get_text()
            article = article.replace("\n", "")
            article = article.replace("// flash 오류를 우회하기 위한 함수 추가function _flash_removeCallback() {}", "")
            article = article.replace("동영상 뉴스       ", "")
            article = article.replace("동영상 뉴스", "")
            article = article.strip()

            press_company = source_news.find('address', {'class' : 'address_cp'}).find('a').get_text()
            
            titles.append(title)
            dates.append(date)
            articles.append(article)
            press_companies.append(press_company)
            article_urls.append(url)
        except Exception:
            print('*** An error occurred while crawling this news link : {}'.format(url))

    # When crawling a large amount of data, it is a good idea to pause between requests.
    time.sleep(5)
    current_call += 10

    
# Collect all the data held in the per-field lists into a DataFrame and save it as an Excel file.
# The file name is result_YYMMDD_HHMM.xlsx.
article_df = pd.DataFrame({'Title':titles, 
                           'Date':dates, 
                           'Article':articles, 
                           'URL':article_urls, 
                           'PressCompany':press_companies})

article_df.to_excel('result_{}.xlsx'.format(datetime.now().strftime('%y%m%d_%H%M')), index=False)
article_df.head()

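- Paginate through page 5 with a loop.

- For the &start= parameter in the URL, convert the start number to a string with str().

- When crawling a large amount of data, pause between requests with time.sleep(5).

6. Scraping a specified date range

Set a period in the news search options and scrape the results.

First, analyze the URL.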

https://search.naver.com/search.naver?
    where=news&
    query=데이터분석&
    sort=0&
    photo=0&
    field=0&
    pd=3&
    ds=2022.01.01&
    de=2022.03.27&
    docid=&
    related=0&
    mynews=0&
    office_type=0&
    office_section_code=0&
    news_office_checked=&
    nso=so%3Ar%2Cp%3Afrom20220101to20220327&
    is_sug_officeid=0

Of these query parameters, ds and de look like the ones tied to the date.

But if you keep only where, query, ds, and de and load the URL, the search does not apply the chosen dates.

-> nso is needed as well.

https://search.naver.com/search.naver?
    where=news&
    query=데이터분석&
    ds=2022.01.01&
    de=2022.03.27&
    nso=so%3Ar%2Cp%3Afrom20220101to20220327&

With this URL, the results follow the chosen dates.

In fact, nso alone is enough.
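
A sketch of building the nso value with the standard library instead of hand-writing the percent-encoded text. It assumes the decoded form is 'so:r,p:fromYYYYMMDDtoYYYYMMDD', which is what the nso value in the URL above decodes to. build_period_url is a hypothetical helper name, and the sketch does not try to cover other nso options.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.naver.com/search.naver?"

def build_period_url(query, start_date, end_date, start=1):
    """Build a news search URL limited to a period.

    start_date and end_date are given as 'YYYY.MM.DD'; the dots are dropped
    because nso expects 'YYYYMMDD'.
    """
    nso = "so:r,p:from{}to{}".format(
        start_date.replace(".", ""), end_date.replace(".", "")
    )
    params = {"where": "news", "query": query, "nso": nso, "start": start}
    # urlencode percent-encodes ':' as %3A and ',' as %2C, and encodes the
    # Korean query as UTF-8.
    return BASE_URL + urlencode(params)

print(build_period_url("데이터분석", "2022.01.01", "2022.03.27"))
```

Letting urlencode handle the escaping avoids typing sequences such as %3A and %2C by hand.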

Final:

Nothing changes except how the URL is built.

query = '데이터분석'
start_date = '2021.01.01'
end_date = '2021.01.30'

max_page = 5

# Convert the given dates into the form the query expects.
start_date = start_date.replace(".", "")
end_date = end_date.replace(".", "")

# Crawl the requested number of pages of articles within the given period.
current_call = 1
last_call = (max_page - 1) * 10 + 1 # 41 when max_page is 5

while current_call <= last_call:

    print('\nStarting the crawl from article number {}.'.format(current_call))
    
    url = "https://search.naver.com/search.naver?where=news&query=" + query \
          + "&nso=so%3Ar%2Cp%3Afrom" + start_date \
          + "to" + end_date \
          + "%2Ca%3A&start=" + str(current_call)

    web = requests.get(url).content
    source = BeautifulSoup(web, 'html.parser')

    urls_list = []
    for urls in source.find_all('a', {'class' : "info"}):
        if urls["href"].startswith("https://news.naver.com"):
            urls_list.append(urls["href"])

    for url in urls_list:
        try:
            headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
            web_news = requests.get(url, headers=headers).content
            source_news = BeautifulSoup(web_news, 'html.parser')

            title = source_news.find('h3', {'id' : 'articleTitle'}).get_text()
            print('Processing article : {}'.format(title))

            date = source_news.find('span', {'class' : 't11'}).get_text()

            article = source_news.find('div', {'id' : 'articleBodyContents'}).get_text()
            article = article.replace("\n", "")
            article = article.replace("// flash 오류를 우회하기 위한 함수 추가function _flash_removeCallback() {}", "")
            article = article.replace("동영상 뉴스       ", "")
            article = article.replace("동영상 뉴스", "")
            article = article.strip()

            press_company = source_news.find('address', {'class' : 'address_cp'}).find('a').get_text()
            
            titles.append(title)
            dates.append(date)
            articles.append(article)
            press_companies.append(press_company)
            article_urls.append(url)
        except Exception:
            print('*** An error occurred while crawling this news link : {}'.format(url))

    # When crawling a large amount of data, it is a good idea to pause between requests.
    time.sleep(5)
    current_call += 10

    
# Collect all the data held in the per-field lists into a DataFrame and save it as an Excel file.
# The file name is result_YYMMDD_HHMM.xlsx.
article_df = pd.DataFrame({'Title':titles, 
                           'Date':dates, 
                           'Article':articles, 
                           'URL':article_urls, 
                           'PressCompany':press_companies})

article_df.to_excel('result_{}.xlsx'.format(datetime.now().strftime('%y%m%d_%H%M')), index=False)
article_df.head()

7. Scraping with a specified sort order

https://search.naver.com/search.naver?where=news&query=데이터분석&sm=tab_opt&sort=1&photo=0&field=0&pd=0&ds=&de=&docid=&related=0&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so%3Add%2Cp%3Aall&is_sug_officeid=0
https://search.naver.com/search.naver?where=news&query=데이터분석&sm=tab_opt&sort=0&photo=0&field=0&pd=0&ds=&de=&docid=&related=0&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so%3Ar%2Cp%3Aall&is_sug_officeid=0
https://search.naver.com/search.naver?where=news&query=데이터분석&sm=tab_opt&sort=2&photo=0&field=0&pd=0&ds=&de=&docid=&related=0&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so%3Ar%2Cp%3Aall&is_sug_officeid=0

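Clicking through the sort options one at a time shows that the sort parameter takes the values 0, 1, and 2.

# relevance = 0, newest first = 1, oldest first = 2

For the final code, change only the URL part of the code above.

# sort_type must be 0, 1, or 2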
url = "https://search.naver.com/search.naver?where=news&query=" + query \
          + "&sort=" + str(sort_type) \
          + "&start=" + str(current_call)
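
One way to keep the sort codes readable is to map names to the numbers observed above. This is a sketch only: SORT_TYPES and build_sorted_url are hypothetical names, and the mapping simply restates relevance = 0, newest first = 1, oldest first = 2.

```python
from urllib.parse import urlencode

# Sort codes as observed in the search URLs.
SORT_TYPES = {"relevance": 0, "newest": 1, "oldest": 2}

def build_sorted_url(query, sort_name, start=1):
    """Build a news search URL with a named sort order."""
    if sort_name not in SORT_TYPES:
        raise ValueError("sort_name must be one of {}".format(sorted(SORT_TYPES)))
    params = {
        "where": "news",
        "query": query,
        "sort": SORT_TYPES[sort_name],
        "start": start,
    }
    return "https://search.naver.com/search.naver?" + urlencode(params)

print(build_sorted_url("데이터분석", "newest", start=11))
```

Rejecting unknown names early means a typo fails loudly instead of silently sending an unexpected sort value.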

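Looking at the results scraped with the newest-first option, some articles are unrelated even though the search term was 데이터 분석.

They show up simply because the keywords occur somewhere in the article.

To search for 데이터분석 as an exact phrase, wrap it in double quotes.

-> query = '\"데이터분석\"'

The backslash (\) escapes the double quote so Python treats it as a literal character. Inside a single-quoted string the escape is optional, so '"데이터분석"' produces the same string.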
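
A short check of the quoting described above, assuming only standard Python string rules. exact_phrase is a hypothetical helper name; quote() from urllib.parse shows how the double quotes look once percent-encoded in a URL.

```python
from urllib.parse import quote

def exact_phrase(keyword):
    """Wrap a keyword in literal double quotes for an exact-phrase search."""
    return '"' + keyword + '"'

escaped = '\"데이터분석\"'           # backslash-escaped quotes
plain = exact_phrase("데이터분석")   # unescaped quotes inside single quotes

print(escaped == plain)  # True: both are the same 7-character string
print(quote(plain))      # the double quotes become %22 in the URL
```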

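In the end, you can run the crawl by supplying the search keyword, the period, the sort type, the number of pages to paginate, and so on.

This post has gotten long, so I will put the full code in the next post.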
