Python Crawling (YouTube Comments), Easy Enough for Juniors to Follow


멍청해서 기록한다 ("Recording it because I'm dumb")


개발근로자 2020. 1. 28. 14:31

1. Install Python

https://www.python.org/downloads/release/python-366/


Sigh. 3.7 is the latest version right now, but for the sake of your sanity, install 3.6. Anything else is pure pain.

2. Install Anaconda

https://www.anaconda.com/distribution/#download-section


Check that Jupyter starts.

3. Install TensorFlow (added 2020.02.07: turns out you don't need TensorFlow at all...)

pip install tensorflow

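Sigh. The current TensorFlow version is 2.0, but that doesn't seem to be supported either. Jupyter doesn't seem to work with it, so just uninstall it and reinstall an older version.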

pip install --ignore-installed --upgrade tensorflow==1.7.0

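As you can see, the command above installs 1.7.

If you already installed a newer version, run pip uninstall tensorflow and then reinstall the older version.

If it still doesn't work, run pip list to check whether a newer version is still present.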
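The pip list check above can also be done from Python. This is only a rough sketch: the helper names find_version and installed_version are made up for illustration, and it assumes pip is available to the interpreter that runs the script. It relies on pip's JSON output (pip list --format=json), which is a list of name/version entries.

```python
import json
import subprocess
import sys


def find_version(pip_list_json, package):
    """Return the version of `package` found in `pip list --format=json` output.

    Returns None when the package is not listed. Names are compared
    case-insensitively.
    """
    for entry in json.loads(pip_list_json):
        if entry["name"].lower() == package.lower():
            return entry["version"]
    return None


def installed_version(package):
    """Ask the pip that belongs to the running interpreter what is installed."""
    output = subprocess.check_output(
        [sys.executable, "-m", "pip", "list", "--format=json"],
        universal_newlines=True,  # decode the output to str (works on Python 3.6)
    )
    return find_version(output, package)


# Example call (the result depends on your environment):
# installed_version("tensorflow")
```

If installed_version("tensorflow") still reports a 2.x version after the reinstall, the old package is most likely still sitting in the environment you are actually using.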

4. Install Selenium

pip3 install selenium

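Sigh. pip and pip3 turned out to be different commands, and they can point at different Python installations. Of all the things to get stuck on, lol.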
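One way to sidestep the pip versus pip3 confusion is to run pip through the interpreter you actually use, so the package lands in that interpreter's environment. A minimal sketch of the idea follows; pip_install_command is a hypothetical helper name, not part of pip or Selenium.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


# Show which Python is running and the matching install command.
print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
print("install with:", " ".join(pip_install_command("selenium")))
```

Running the same script inside Jupyter and in a terminal is a quick way to see whether the two are using the same Python.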

5. Write the source

from selenium import webdriver as wd
from bs4 import BeautifulSoup
import time
import pandas as pd
from IPython.display import display

# Maximum width of the text shown in each column
pd.set_option('display.max_colwidth', 100)
# Maximum number of columns to display
pd.set_option('display.max_columns', 10)
# Maximum number of rows to display
pd.set_option('display.max_rows', 100)

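# Selenium 3 style. On Selenium 4, wrap the path in
# selenium.webdriver.chrome.service.Service and pass it as service=... instead.
driver = wd.Chrome(r'path to chromedriver')
url = 'URL of the video to scrape'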
driver.get(url)

# Page height before the first scroll
last_page_height = driver.execute_script("return document.documentElement.scrollHeight")

# Keep scrolling to the bottom until the height stops growing,
# which means no more comments are being loaded.
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(1.0)  # wait for the next batch of comments to load
    new_page_height = driver.execute_script("return document.documentElement.scrollHeight")

    if new_page_height == last_page_height:
        break
    last_page_height = new_page_height

html_source = driver.page_source

# quit() closes the browser and also ends the chromedriver process
driver.quit()

soup = BeautifulSoup(html_source, 'lxml')

# Author names and comment bodies (these selectors depend on YouTube's markup and may change)
youtube_user_IDs = soup.select('div#header-author > a > span')
youtube_comments = soup.select('yt-formatted-string#content-text')

def remove_whitespace(text):
    # Drops newlines, tabs, and spaces. Note that spaces between words are removed too.
    return text.replace('\n', '').replace('\t', '').replace(' ', '')

str_youtube_userIDs = []
str_youtube_comments = []

# zip() stops at the shorter list, so a length mismatch cannot raise IndexError
for user_id, comment in zip(youtube_user_IDs, youtube_comments):
    str_youtube_userIDs.append(remove_whitespace(user_id.text))
    str_youtube_comments.append(remove_whitespace(comment.text))

# for user_id, comment in zip(str_youtube_userIDs, str_youtube_comments):
#     print(user_id, comment)

pd_data = {"ID": str_youtube_userIDs, "Comment": str_youtube_comments}
youtube_pd = pd.DataFrame(pd_data)
display(youtube_pd)
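The scroll loop in the script keeps going until the page height stops changing. If a page never settles, that loop never ends, so here is a rough sketch of the same idea as a reusable function with an upper limit on the number of scrolls. The name scroll_until_stable and the max_rounds parameter are my own additions for illustration, not something Selenium provides.

```python
import time


def scroll_until_stable(get_height, scroll, pause=1.0, max_rounds=200):
    """Scroll until the page height stops changing or max_rounds is reached.

    get_height: callable that returns the current page height
    scroll:     callable that scrolls to the bottom of the page
    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll()
        time.sleep(pause)  # give the page time to load more comments
        new_height = get_height()
        if new_height == last_height:
            return round_number  # nothing new was loaded
        last_height = new_height
    return max_rounds  # gave up: the page was still growing


# With the driver from the script it could be called like this:
# scroll_until_stable(
#     lambda: driver.execute_script("return document.documentElement.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);"),
# )
```

Passing the two actions in as callables keeps the loop independent of the browser, so it can be tried out with fake height values before pointing it at a real page. If comments stop loading too early, a longer pause is the first thing to try.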

License: Attribution-NonCommercial-ShareAlike
