info@raymondandmelusine.com

Video Data Mining

Tool: Python (with YouTube API).

The goal of this task is collecting data about youtube videos. YouTube API is used to get data. Here is a quickstart to install Google APIs Client Library for Python.After the library is installed, we need to set up API’s variables.

import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors
api_service_name = “youtube”
api_version = “v3”
client_secrets_file = “client_secret_444209051230-pl8n2f9kboq8uajjnq9vdbasa5k4nc02.apps.googleusercontent.com.json”

Then we get credentials and create an API client.

flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(client_secrets_file, scopes)
credentials = flow.run_console()
youtube = googleapiclient.discovery.build(api_service_name, api_version, credentials=credentials)

request = youtube.videos().list(
part=”snippet,contentDetails,statistics”,
id=”koPmuEyP3a0″
)
response = request.execute()

Now response variable contains all data about videos in scopes. All we need to do is to get data we need from response and save it in Excel file.

wb = Workbook()
ws1 = wb.active
ws1.title = “Video”
ws1[‘A1’] = ‘Title’
ws1[‘A2’] = ‘Description’
ws1[‘A3’] = ‘Views’
ws1[‘A4’] = ‘Likes’
ws1[‘A5’] = ‘Dislikes’
ws1[‘A6’] = ‘Comments Count’
ws1[‘A7’] = ‘EN Subtitles’

for video in response[‘items’]:

ws1[‘B1’] = video[‘snippet’][‘title’]
ws1[‘B2’] = video[‘snippet’][‘description’]
ws1[‘B3’] = video[‘statistics’][‘viewCount’]
ws1[‘B4’] = video[‘statistics’][‘likeCount’]
ws1[‘B5’] = video[‘statistics’][‘dislikeCount’]
ws1[‘B6’] = video[‘statistics’][‘commentCount’]

We also want to save English subtitles in the Excel file. To do it we need to import YouTubeTranscriptApi library.

from youtube_transcript_api import YouTubeTranscriptApi
ws1[‘B7’] = YouTubeTranscriptApi.get_transcript(“koPmuEyP3a0”,languages=[‘en’])

On the second tab of the Excel file we want to save all comments people wrote for the first video in scopes.

comments_thread = youtube.commentThreads().list(
part=’snippet’,
videoId=”koPmuEyP3a0″,
maxResults = 100,
textFormat=’plainText’,
).execute()

ws2 = wb.create_sheet(‘Comments’)
ws2[‘A1’] = ‘Comment ID’
ws2[‘B1’] = ‘Comment’
ws2[‘C1’] = ‘Likes’
ws2[‘D1’] = ‘Dislikes’
wb.save(““)
wb.close()
i=2

while(True):

for item in comments_thread[‘items’]:

wb2 = load_workbook(““)
ws2 = wb2[‘Comments’]
ws2.cell(row=i,column=1).value=item[‘snippet’][‘topLevelComment’][‘snippet’][‘textDisplay’]
ws2.cell(row=i,column=2).value=item[‘snippet’][‘topLevelComment’][‘snippet’][‘likeCount’]
if ‘dislikeCount’ in item:

ws2.cell(row=i,column=3).value=item[‘snippet’][‘topLevelComment’][‘snippet’][‘dislikeCount’]

else:

ws2.cell(row=i,column=3).value=0

i=i+1

comment_childs = youtube.comments().list(
part=”snippet”,
parentId=item[‘snippet’][‘topLevelComment’][‘id’],
maxResults = 100,
textFormat=”plainText”
).execute()

for child in comment_childs[“items”]:

ws2.cell(row=i,column=2).value=child[‘snippet’][‘textDisplay’]
ws2.cell(row=i,column=3).value=child[‘snippet’][‘likeCount’]
if ‘dislikeCount’ in child:

ws2.cell(row=i,column=4).value=child[‘snippet’][‘dislikeCount’]

else:

ws2.cell(row=i,column=4).value=0

i=i+1

wb2.save(“data_.xlsx”)

if ‘nextPageToken’ in comments_thread:

comments_thread = youtube.commentThreads().list(
part=’snippet’,
videoId=”koPmuEyP3a0″,
maxResults = 100,
pageToken = comments_thread[‘nextPageToken’],
textFormat=’plainText’,
).execute()

else: break

wb2.close()

There is a problem with getting all comments of videos in YouTube because of YouTube API quota. Our default quota is 10 000 units per day. So we can get about 500 comments per day. That’s why we save data in the Excel file for every item in comments_thread. But if you have thousands of comments, it’s not the way to collect data.

There are two ways to make it faster. The first one is request additional quota, but this way didn’t work for us as we didn’t get any answer for our request. The second way is use several accounts.