Project: Netflix Prize Data

Load packages
# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.3f}'.format  # pd.reset_option('display.float_format')
# pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation
np.set_printoptions(legacy='1.25')  # use the NumPy 1.25 print style


# matplotlib options
from matplotlib import style
theme_dict = {**style.library['ggplot'], "grid.linestyle": ":", 'axes.facecolor': 'white', 'grid.color': '.6',}
so.Plot.config.theme.update(theme_dict)

# theme_dict = {**sns.axes_style("whitegrid"), "grid.linestyle": ":"}
# so.Plot.config.theme.update(theme_dict)

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

Cleaned Data

File: netflix_ratings_titles.parquet

netflix = pd.read_parquet("data/netflix_ratings_titles.parquet")
netflix.info()
<class 'pandas.DataFrame'>
RangeIndex: 1246558 entries, 0 to 1246557
Data columns (total 17 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   viewer_id     1246558 non-null  Int32         
 1   rating        1246558 non-null  Int8          
 2   watch_date    1246558 non-null  datetime64[ns]
 3   release_year  1246558 non-null  int64         
 4   title         1246558 non-null  str           
 5   record_id     1246558 non-null  str           
 6   media_type    1246558 non-null  str           
 7   director      1245701 non-null  str           
 8   cast          1246558 non-null  str           
 9   country       1246109 non-null  str           
 10  date_added    1246558 non-null  str           
 11  MPAA_rating   1246558 non-null  str           
 12  duration      1246558 non-null  str           
 13  genre         1246558 non-null  object        
 14  description   1246558 non-null  str           
 15  runtime       1245701 non-null  Int32         
 16  num_episodes  857 non-null      Int32         
dtypes: Int32(3), Int8(1), datetime64[ns](1), int64(1), object(1), str(10)
memory usage: 593.3+ MB
netflix.head(3)
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA>
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>

Data Wrangling

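Reference: Python for Data Analysis (3e) by Wes McKinney (Korean edition: 파이썬 라이브러리를 활용한 데이터 분석)

The data is brought into an analysis-ready state by combining transforms roughly like the following:

  • Select variables (columns) and observations (rows): subsetting
  • Keep only the observations (rows) that satisfy a condition: query()
  • Reorder rows by the values of one or more columns: sort_values()
  • Create new variables from existing variables and functions: assign()
  • Compute summary statistics for data split into categories: groupby(), agg(), apply()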
netflix_1990 = (
    netflix
    .loc[:, ["title", "rating", "watch_date", "release_year"]]  # select specific columns
    .query("release_year >= 1990")                              # keep only rows matching the condition
    .assign(                                                    # create new columns/variables
        decade=lambda x: x["release_year"] // 10 * 10,          # decade (10-year bins)
        weekday=lambda x: x["watch_date"].dt.day_name().str[:3] # day of the week
    )
    .sort_values("release_year")                                # sort rows by release year
)
netflix_1990
title rating watch_date release_year decade weekday
406248 Rocky V 3 2004-02-20 1990 1990 Fri
847433 Rocky V 1 2004-02-13 1990 1990 Fri
268823 Rocky V 2 2005-05-18 1990 1990 Wed
... ... ... ... ... ... ...
476261 King's Ransom 2 2005-08-31 2005 2000 Wed
476074 Coach Carter 2 2005-09-14 2005 2000 Wed
109441 The Amityville Horror 3 2005-11-07 2005 2000 Mon

1149195 rows × 6 columns

Setting the number of displayed rows
# max number of rows to display
pd.options.display.max_rows = 6

# reset to the default
pd.reset_option("display.max_rows")
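
When the row limit is only needed for a single cell, `pd.option_context` may be more convenient than setting and resetting the option by hand, because the previous value is restored when the block ends. A minimal sketch, assuming a made-up 100-row frame (`demo` is hypothetical and not part of the Netflix data):

```python
import pandas as pd

# hypothetical 100-row frame, used only for illustration
demo = pd.DataFrame({"x": range(100)})

before = pd.get_option("display.max_rows")

# show at most 6 rows inside the block; the old value comes back on exit
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    text = repr(demo)  # truncated text representation

after = pd.get_option("display.max_rows")
print(text)
```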

subsetting

Select variables (columns) and observations (rows)

  • Bracket []
  • Dot-notation .
  • iloc, loc
# Bracket [strings]
netflix[["title", "rating"]]
title rating
0 50 First Dates 4
1 Men in Black II 2
2 50 First Dates 4
... ... ...
1246555 Bad Boys II 3
1246556 Something's Gotta Give 4
1246557 The American President 5

1246558 rows × 2 columns

# Bracket [slicing]
netflix[:3]
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA>
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>
# dot-notation
netflix.title
0                  50 First Dates
1                 Men in Black II
2                  50 First Dates
                    ...          
1246555               Bad Boys II
1246556    Something's Gotta Give
1246557    The American President
Name: title, Length: 1246558, dtype: str
# `loc`: label-based indexing
netflix.loc[:, ["title", "rating"]]
title rating
0 50 First Dates 4
1 Men in Black II 2
2 50 First Dates 4
... ... ...
1246555 Bad Boys II 3
1246556 Something's Gotta Give 4
1246557 The American President 5

1246558 rows × 2 columns

# `iloc`: position-based indexing
netflix.iloc[100:103, :2]
viewer_id rating
100 563447 4
101 2233688 4
102 2174838 2
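
One difference between `loc` and `iloc` is easy to miss when the index is the default 0, 1, 2, ...: a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it. The sketch below uses a small made-up frame with string index labels (`toy` and its labels are hypothetical) to make that visible:

```python
import pandas as pd

# hypothetical toy data with a non-default string index
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "rating": [4, 2, 5, 3]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = toy.loc["r1":"r3", "rating"]  # label slice: "r3" is included
by_position = toy.iloc[0:2, 1]           # position slice: row 2 is excluded

print(by_label.tolist())     # three values
print(by_position.tolist())  # two values
```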

query()

Keep only the observations (rows) that satisfy a condition

Conditional operators
>,  >=,  <,  <=,
==  (equal to),  !=  (not equal to)
and, & (and)
or, | (or)
not, ~ (not)
in (includes), not in (not included)

netflix.query("release_year >= 1990 & release_year < 2000")
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
3 2041113 3 2003-12-30 1995 Bad Boys s6212 Movie Michael Bay Will Smith, Martin Lawrence, Téa Leoni, Tchéky... United States October 1, 2019 R 119 min [Action & Adventure, Comedies] In this fast-paced actioner, two Miami narcoti... 119 <NA>
4 948394 4 2005-04-01 1996 Dragonheart s6642 Movie Rob Cohen Sean Connery, Dennis Quaid, David Thewlis, Pet... United States January 1, 2020 PG-13 103 min [Action & Adventure, Sci-Fi & Fantasy] In ancient times when majestic fire-breathers ... 103 <NA>
8 79898 4 2005-06-24 1999 Superstar s8131 Movie Bruce McCulloch Molly Shannon, Will Ferrell, Elaine Hendrix, H... United States November 20, 2019 PG-13 82 min [Comedies] A socially awkward Catholic schoolgirl vows to... 82 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246552 1161577 4 2004-06-06 1995 Bad Boys s6212 Movie Michael Bay Will Smith, Martin Lawrence, Téa Leoni, Tchéky... United States October 1, 2019 R 119 min [Action & Adventure, Comedies] In this fast-paced actioner, two Miami narcoti... 119 <NA>
1246553 26124 3 2005-09-02 1996 The Mirror Has Two Faces s819 Movie Barbra Streisand Barbra Streisand, Jeff Bridges, Lauren Bacall,... United States June 2, 2021 PG-13 127 min [Comedies, Dramas, Romantic Movies] Tired of being single, middle-aged professor R... 127 <NA>
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min [Comedies, Dramas, Romantic Movies] The widowed president strikes up a romance wit... 113 <NA>

428354 rows × 17 columns

netflix.query("rating in [1, 5]")
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
9 268405 5 2005-11-02 2004 A Cinderella Story s128 Movie Mark Rosman Hilary Duff, Chad Michael Murray, Jennifer Coo... United States, Canada September 1, 2021 PG 95 min [Children & Family Movies, Comedies] Teen Sam meets the boy of her dreams at a danc... 95 <NA>
15 914520 1 2004-06-14 2003 S.W.A.T. s7913 Movie Clark Johnson Samuel L. Jackson, Colin Farrell, Michelle Rod... United States January 1, 2021 PG-13 117 min [Action & Adventure] A veteran cop is tasked with drafting and trai... 117 <NA>
18 86305 5 2005-10-24 2005 The Amityville Horror s8189 Movie Andrew Douglas Ryan Reynolds, Melissa George, Chloë Grace Mor... United States January 1, 2020 R 89 min [Horror Movies] This hair-raising remake of the 1979 horror hi... 89 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246548 2612134 5 2003-12-23 2003 Bad Boys II s6213 Movie Michael Bay Will Smith, Martin Lawrence, Jordi Mollà, Gabr... United States October 1, 2019 R 147 min [Action & Adventure, Comedies] In this hyperkinetic sequel, a pair of Miami n... 147 <NA>
1246551 340807 5 2003-08-12 1999 American Beauty s6136 Movie Sam Mendes Kevin Spacey, Annette Bening, Thora Birch, Wes... United States January 1, 2020 R 122 min [Dramas] While struggling to endure his tightly wound w... 122 <NA>
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min [Comedies, Dramas, Romantic Movies] The widowed president strikes up a romance wit... 113 <NA>

320750 rows × 17 columns

# keep only the rows whose title contains "Game" (case-sensitive)
netflix.loc[netflix.title.str.contains("Game"), :]  # .str: string methods

# the same filter written with query()
netflix.query("title.str.contains('Game')")
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
108 1611338 4 2003-02-27 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>
187 2634345 5 2005-06-28 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>
245 465580 3 2004-02-09 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246315 715480 4 2005-08-13 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>
1246407 2567821 2 2003-05-18 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>
1246415 1154697 4 2004-11-17 1997 The Game s601 Movie David Fincher Michael Douglas, Sean Penn, Deborah Kara Unger... United States July 1, 2021 R 129 min [Thrillers] An aloof investment banker's life spirals into... 129 <NA>

19341 rows × 17 columns
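
The operators listed above can be tried on a small frame. The sketch below is only an illustration: `toy`, the title "Game Over", and the variable `min_year` are made up. It also shows the `@` prefix, which lets a query string refer to an ordinary Python variable. `engine="python"` is passed for the string-method query as a precaution, since support for method calls can depend on the expression engine in use.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "title": ["The Game", "Bad Boys", "Superstar", "Game Over"],
    "release_year": [1997, 1995, 1999, 2004],
    "rating": [5, 3, 1, 4],
})

min_year = 1996  # plain Python variable, referenced with @ inside query()

nineties = toy.query("release_year >= @min_year and release_year < 2000")
extremes = toy.query("rating in [1, 5]")
middle = toy.query("rating not in [1, 5]")
games = toy.query("title.str.contains('Game')", engine="python")

print(nineties["title"].tolist())
print(games["title"].tolist())
```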

sort_values()

Reorder rows by the values of one or more columns

netflix.sort_values("release_year")  # ascending by release_year
netflix.sort_values("release_year", ascending=False)  # descending by release_year
netflix.sort_values(["release_year", "rating"], ascending=False)  # descending by release_year, then by rating
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
18 86305 5 2005-10-24 2005 The Amityville Horror s8189 Movie Andrew Douglas Ryan Reynolds, Melissa George, Chloë Grace Mor... United States January 1, 2020 R 89 min [Horror Movies] This hair-raising remake of the 1979 horror hi... 89 <NA>
31 666338 5 2005-08-19 2005 Coach Carter s6511 Movie Thomas Carter Samuel L. Jackson, Rob Brown, Robert Ri'chard,... United States, Germany January 1, 2020 PG-13 137 min [Dramas, Sports Movies] Controversial basketball coach Ken Carter puts... 137 <NA>
189 642572 5 2005-07-27 2005 Coach Carter s6511 Movie Thomas Carter Samuel L. Jackson, Rob Brown, Robert Ri'chard,... United States, Germany January 1, 2020 PG-13 137 min [Dramas, Sports Movies] Controversial basketball coach Ken Carter puts... 137 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1230179 2537543 1 2005-12-02 1965 The Cincinnati Kid s8250 Movie Norman Jewison Steve McQueen, Edward G. Robinson, Ann-Margret... United States November 1, 2019 TV-14 103 min [Classic Movies, Dramas] In Depression-era New Orleans, cocksure stud p... 103 <NA>
1232551 455843 1 2004-01-12 1965 Doctor Zhivago s6620 Movie David Lean Omar Sharif, Julie Christie, Geraldine Chaplin... United States, Italy, United Kingdom, Liechten... November 1, 2019 PG-13 200 min [Classic Movies, Dramas, Romantic Movies] A young physician and his beautiful mistress g... 200 <NA>
1240585 2185183 1 2005-06-07 1965 Doctor Zhivago s6620 Movie David Lean Omar Sharif, Julie Christie, Geraldine Chaplin... United States, Italy, United Kingdom, Liechten... November 1, 2019 PG-13 200 min [Classic Movies, Dramas, Romantic Movies] A young physician and his beautiful mistress g... 200 <NA>

1246558 rows × 17 columns
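
`ascending` also accepts a list with one entry per sort column, so the direction can differ by column. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2004, 1995, 2004, 1995],
    "rating": [3, 5, 4, 2],
})

# newest year first; within the same year, lowest rating first
mixed = toy.sort_values(["release_year", "rating"], ascending=[False, True])
print(mixed)
```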

assign()

Create new variables from existing variables and functions

For example, to create a new value in 10-year units, such as the 1990s, the 2000s, and so on:

netflix["release_year"]  # a pandas Series object; its values are a NumPy array
0          2004
1          2002
2          2004
           ... 
1246555    2003
1246556    2003
1246557    1995
Name: release_year, Length: 1246558, dtype: int64
netflix["release_year"] // 10  # floor division applied to the whole array
0          200
1          200
2          200
          ... 
1246555    200
1246556    200
1246557    199
Name: release_year, Length: 1246558, dtype: int64
netflix.assign(
    decade=netflix["release_year"] // 10 * 10, # decade (10-year bins)
)
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes decade
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA> 2000
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246555 1991791 3 2004-01-06 2003 Bad Boys II s6213 Movie Michael Bay Will Smith, Martin Lawrence, Jordi Mollà, Gabr... United States October 1, 2019 R 147 min [Action & Adventure, Comedies] In this hyperkinetic sequel, a pair of Miami n... 147 <NA> 2000
1246556 925565 4 2005-11-20 2003 Something's Gotta Give s8056 Movie Nancy Meyers Jack Nicholson, Diane Keaton, Keanu Reeves, Fr... United States August 1, 2019 PG-13 128 min [Comedies, Romantic Movies] Still sexy at 60, Harry Sanborn wines and dine... 128 <NA> 2000
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min [Comedies, Dramas, Romantic Movies] The widowed president strikes up a romance wit... 113 <NA> 1990

1246558 rows × 18 columns

# creating values with a lambda
netflix.assign(
    decade=lambda x: x["release_year"] // 10 * 10, # decade (10-year bins)
    weekday=lambda x: x["watch_date"].dt.day_name().str[:3] # day of the week
)
)
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes decade weekday
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000 Wed
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA> 2000 Fri
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000 Mon
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246555 1991791 3 2004-01-06 2003 Bad Boys II s6213 Movie Michael Bay Will Smith, Martin Lawrence, Jordi Mollà, Gabr... United States October 1, 2019 R 147 min [Action & Adventure, Comedies] In this hyperkinetic sequel, a pair of Miami n... 147 <NA> 2000 Tue
1246556 925565 4 2005-11-20 2003 Something's Gotta Give s8056 Movie Nancy Meyers Jack Nicholson, Diane Keaton, Keanu Reeves, Fr... United States August 1, 2019 PG-13 128 min [Comedies, Romantic Movies] Still sexy at 60, Harry Sanborn wines and dine... 128 <NA> 2000 Sun
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min [Comedies, Dramas, Romantic Movies] The widowed president strikes up a romance wit... 113 <NA> 1990 Thu

1246558 rows × 19 columns
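
Two properties of `assign()` with lambdas are worth seeing on a small example: a later keyword can use a column created earlier in the same call, and the original frame is left unchanged. A minimal sketch on made-up rows (`toy` is hypothetical; the dates were chosen to fall on a Wednesday, a Saturday, and a Sunday):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "release_year": [2004, 1995, 1987],
    "watch_date": pd.to_datetime(["2004-06-23", "2005-08-27", "2005-03-20"]),
})

result = toy.assign(
    decade=lambda x: x["release_year"] // 10 * 10,
    weekday=lambda x: x["watch_date"].dt.day_name().str[:3],
    # `weekday` was created just above, so this lambda can already use it
    weekend=lambda x: x["weekday"].isin(["Sat", "Sun"]),
)

print(result)
print(list(toy.columns))  # the original frame still has only two columns
```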

groupby(), agg(), apply()

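Compute summary statistics for data split into categories

  • groupby() splits the data into meaningful groups so that each group can be analyzed
  • It is most often used together with methods that compute statistics, such as .count(), .sum(), .mean(), .min(), and .max()

The table below lists efficient methods that are often used with groupby()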

Source: Ch.10 in Python for Data Analysis (3e) by Wes McKinney

netflix.columns
Index(['viewer_id', 'rating', 'watch_date', 'release_year', 'title',
       'record_id', 'media_type', 'director', 'cast', 'country', 'date_added',
       'MPAA_rating', 'duration', 'genre', 'description', 'runtime',
       'num_episodes'],
      dtype='str')
netflix.groupby("title")["rating"].mean()
title
50 First Dates                 3.752
A Cinderella Story             3.625
About a Boy                    3.589
                                ... 
Training Day                   3.571
Tremors 4: The Legend Begins   3.043
Under Siege                    3.313
Name: rating, Length: 94, dtype: Float64
netflix.groupby("viewer_id")["rating"].agg(["mean", "std"])
mean std
viewer_id
6 3.250 1.035
7 4.071 1.141
8 4.000 <NA>
... ... ...
2649409 4.000 <NA>
2649426 4.000 0.000
2649429 4.833 0.408

316804 rows × 2 columns

def min_max(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])  # returns a Series

netflix.groupby("title")["rating"].apply(min_max)
title                            
50 First Dates                min    1
                              max    5
A Cinderella Story            min    1
                                    ..
Tremors 4: The Legend Begins  max    5
Under Siege                   min    1
                              max    5
Name: rating, Length: 188, dtype: int8
def standardize_by_user(x):
    return (x - x.mean()) / x.std()

(
    netflix
    .groupby("viewer_id")["rating"]
    .apply(standardize_by_user)
    .reset_index(level=0)
)
viewer_id rating
422187 6 0.725
723559 6 -0.242
726864 6 0.725
... ... ...
1009754 2649429 0.408
1132274 2649429 0.408
1228870 2649429 0.408

1246558 rows × 2 columns
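
In the `apply()` result above the rows come back grouped by viewer, and `reset_index(level=0)` is needed to recover `viewer_id`. When the goal is a value per original row, `transform()` is a possible alternative: it returns a Series aligned with the original index. The sketch below is an assumption-labeled illustration on made-up ratings (`toy` and `standardize` are hypothetical names). A viewer with a single rating has no standard deviation, so the standardized value is missing.

```python
import pandas as pd

# hypothetical toy data: viewer 8 has only one rating
toy = pd.DataFrame({
    "viewer_id": [6, 6, 6, 7, 7, 8],
    "rating": [2, 3, 4, 5, 3, 4],
})

def standardize(x):
    # z-score within one group; std() is the sample standard deviation
    return (x - x.mean()) / x.std()

# transform() keeps the original row order and index
z = toy.groupby("viewer_id")["rating"].transform(standardize)
print(toy.assign(rating_z=z))
```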

Exploratory Data Analysis (EDA)

Rating distribution

mean_ratings = (
    netflix
    .groupby("title")["rating"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
mean_ratings
title mean std count
0 50 First Dates 3.752 1.002 72913
1 A Cinderella Story 3.625 1.036 17594
2 About a Boy 3.589 0.952 36596
... ... ... ... ...
91 Training Day 3.571 0.989 36707
92 Tremors 4: The Legend Begins 3.043 1.194 947
93 Under Siege 3.313 1.019 14769

94 rows × 4 columns

mean_ratings.sort_values("mean", ascending=False)
title mean std count
43 Kabhi Khushi Kabhie Gham 4.237 1.008 97
86 The Pianist 4.116 0.901 30497
18 Coach Carter 4.071 0.891 30022
... ... ... ... ...
80 The Flintstones 2.441 1.045 5203
40 Jeans 2.269 1.023 93
14 Boom 1.882 1.107 51

94 rows × 4 columns

(
    so.Plot(mean_ratings, x="mean", y="std")
    .add(so.Dots(alpha=.5), pointsize="count")
    .add(so.Line(), so.PolyFit(5))
    .scale(pointsize=(3, 20))
    .layout(size=(8, 7))
)

netflix2 = (
    netflix
    .assign(
        decade=lambda x: x["release_year"] // 10 * 10,  # decade (10-year bins)
        weekday=lambda x: x["watch_date"].dt.day_name().str[:3],
        weekend=lambda x: x["weekday"].isin(["Sat", "Sun"]),
        title_length=lambda x: x["title"].str.len()
    )
)

netflix2["weekday"] = netflix2["weekday"].astype("category").cat.set_categories(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

netflix2.head()
viewer_id rating watch_date release_year title record_id media_type director cast country ... MPAA_rating duration genre description runtime num_episodes decade weekday weekend title_length
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States ... PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000 Wed False 14
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States ... PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA> 2000 Fri False 15
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States ... PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA> 2000 Mon False 14
3 2041113 3 2003-12-30 1995 Bad Boys s6212 Movie Michael Bay Will Smith, Martin Lawrence, Téa Leoni, Tchéky... United States ... R 119 min [Action & Adventure, Comedies] In this fast-paced actioner, two Miami narcoti... 119 <NA> 1990 Tue False 8
4 948394 4 2005-04-01 1996 Dragonheart s6642 Movie Rob Cohen Sean Connery, Dennis Quaid, David Thewlis, Pet... United States ... PG-13 103 min [Action & Adventure, Sci-Fi & Fantasy] In ancient times when majestic fire-breathers ... 103 <NA> 1990 Fri False 11

5 rows × 21 columns
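
The conversion of `weekday` to a category with an explicit category list matters for grouping: plain strings are sorted alphabetically, while a categorical is sorted in the order its categories were given. A small sketch on made-up rows (`toy` and `days` are hypothetical names) shows the effect:

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# hypothetical toy data
toy = pd.DataFrame({
    "weekday": ["Sun", "Mon", "Wed", "Mon"],
    "rating": [5, 3, 4, 1],
})

# plain strings: groups come out in alphabetical order
alphabetical = toy.groupby("weekday")["rating"].mean().index.tolist()

# categorical with an explicit category order: groups follow the calendar
toy_cat = toy.assign(
    weekday=toy["weekday"].astype("category").cat.set_categories(days)
)
by_calendar = toy_cat.groupby("weekday", observed=True)["rating"].mean()

print(alphabetical)
print(by_calendar)
```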

mean_ratings_by_wday = (
    netflix2
    .groupby(["weekday"])["rating"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
mean_ratings_by_wday
weekday mean std count
0 Mon 3.587 1.060 214181
1 Tue 3.581 1.060 221189
2 Wed 3.593 1.061 206516
... ... ... ... ...
4 Fri 3.581 1.070 167458
5 Sat 3.586 1.075 123017
6 Sun 3.590 1.072 133268

7 rows × 4 columns

Ratings by day of the week

(
    so.Plot(mean_ratings_by_wday, x="weekday", y="mean")
    .add(so.Dot(), pointsize="count")
)

Title length?

(
    so.Plot(netflix2, x="title_length", y="rating")
    .add(so.Dot(), so.Agg("mean"))
)

Analysis by viewer

viewing_count = netflix.groupby("viewer_id")["rating"].size()
viewing_count
viewer_id
6           8
7          14
8           1
           ..
2649409     1
2649426     4
2649429     6
Name: rating, Length: 316804, dtype: Int64
# visualization using a pandas method
viewing_count.hist(bins=100);

viewing_count_df = viewing_count.reset_index(name="count")
(
    so.Plot(viewing_count_df, x="count")
    .add(so.Bars(), so.Hist(bins=50))
    .limit(y=(0, 100))
)

user_stats = (
    netflix
    .groupby("viewer_id")["rating"]
    .agg(["mean", "std", "count"])
)
user_stats
mean std count
viewer_id
6 3.250 1.035 8
7 4.071 1.141 14
8 4.000 <NA> 1
... ... ... ...
2649409 4.000 <NA> 1
2649426 4.000 0.000 4
2649429 4.833 0.408 6

316804 rows × 3 columns

user_stat_30 = user_stats.query("count >= 30")
user_stat_30
mean std count
viewer_id
57633 2.176 1.314 34
110938 3.781 0.941 32
147386 3.667 0.711 30
... ... ... ...
2606799 3.237 1.515 38
2625420 2.419 0.923 31
2626336 2.516 1.235 31

78 rows × 3 columns

(
    so.Plot(user_stat_30, x="mean", y="std")
    .add(so.Dot(alpha=.5))
    .add(so.Line(color=".5"), so.PolyFit(5))
    .layout(size=(8, 7))
)

Analysis by genre

netflix["genre"]
0                               [Comedies, Romantic Movies]
1          [Action & Adventure, Comedies, Sci-Fi & Fantasy]
2                               [Comedies, Romantic Movies]
                                 ...                       
1246555                      [Action & Adventure, Comedies]
1246556                         [Comedies, Romantic Movies]
1246557                 [Comedies, Dramas, Romantic Movies]
Name: genre, Length: 1246558, dtype: object
netflix_long = netflix.explode('genre')
netflix_long
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min Comedies After falling for a pretty art teacher who has... 99 <NA>
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min Romantic Movies After falling for a pretty art teacher who has... 99 <NA>
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min Action & Adventure Will Smith and Tommy Lee Jones reprise their r... 88 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min Comedies The widowed president strikes up a romance wit... 113 <NA>
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min Dramas The widowed president strikes up a romance wit... 113 <NA>
1246557 811068 5 2004-08-26 1995 The American President s8188 Movie Rob Reiner Michael Douglas, Annette Bening, Martin Sheen,... United States January 1, 2021 PG-13 113 min Romantic Movies The widowed president strikes up a romance wit... 113 <NA>

2515217 rows × 17 columns
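
`explode()` turns each element of a list-valued cell into its own row and repeats the other columns, which is why the long table above has more rows (2515217) than the original (1246558): a rating is counted once for every genre of its title. A minimal sketch on two made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# hypothetical toy data with a list-valued genre column
toy = pd.DataFrame({
    "title": ["50 First Dates", "Bad Boys"],
    "rating": [4, 3],
    "genre": [["Comedies", "Romantic Movies"], ["Action & Adventure", "Comedies"]],
})

toy_long = toy.explode("genre")                           # index labels repeat: 0, 0, 1, 1
toy_long_reset = toy.explode("genre", ignore_index=True)  # fresh index: 0, 1, 2, 3
counts = toy_long["genre"].value_counts()

print(toy_long)
print(counts)
```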

(
    so.Plot(netflix_long, y="genre")  # put genre on the y-axis
    .add(so.Bar(), so.Hist("proportion"))
)

genre_mean = (
    netflix_long
    .groupby('genre')['rating']
    .agg(['mean', 'std', 'count'])
    .reset_index()
)
genre_mean
genre mean std count
0 Action & Adventure 3.518 1.074 466348
1 Anime Series 3.627 1.361 126
2 Children & Family Movies 3.319 1.098 106728
... ... ... ... ...
15 Sports Movies 3.819 1.032 49586
16 TV Dramas 3.765 1.167 731
17 Thrillers 3.551 0.998 133365

18 rows × 4 columns

genre_mean.nlargest(3, "mean")
genre mean std count
9 Independent Movies 3.907 0.993 70433
15 Sports Movies 3.819 1.032 49586
10 International Movies 3.809 0.971 107332
order_by_mean = genre_mean.sort_values("mean", ascending=False)["genre"].values

(
    so.Plot(genre_mean, y="genre", x="mean")
    .add(so.Bar())
    .scale(y=so.Nominal(order=order_by_mean))  # set the order of the bars
    .limit(x=(3, 4.1))
)

(
    so.Plot(genre_mean, y="genre", x="std")
    .add(so.Bar())
    .scale(y=so.Nominal(order=order_by_mean))  # set the order of the bars
    .limit(x=(.9, 1.4))
)

netflix_long["weekday"] = netflix_long["watch_date"].dt.day_name().str[:3]
netflix_long["weekday"] = netflix_long["weekday"].astype("category").cat.set_categories(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

genre_mean_by_weekday = (
    netflix_long
    .groupby(['genre', 'weekday'], observed=True)['rating']
    .agg(['mean', 'std', 'count'])
    .reset_index()
)
genre_mean_by_weekday
genre weekday mean std count
0 Action & Adventure Mon 3.517 1.072 80952
1 Action & Adventure Tue 3.506 1.067 82126
2 Action & Adventure Wed 3.527 1.071 76901
... ... ... ... ... ...
123 Thrillers Fri 3.536 1.008 18130
124 Thrillers Sat 3.561 1.014 13853
125 Thrillers Sun 3.559 0.994 14223

126 rows × 5 columns

order_by_mean = genre_mean.sort_values("mean", ascending=False)["genre"].values

(
    so.Plot(genre_mean_by_weekday, y="genre", x="mean", color="weekday")
    .add(so.Bar(), so.Dodge())
    .scale(y=so.Nominal(order=order_by_mean))  # set the order of the bars
    # .facet("weekday")
    .limit(x=(3.3, 4.1))
    .layout(size=(9, 9))
)

mean_ratings_by_genre = (
    netflix_long
    .groupby(["title", "genre"])["rating"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
mean_ratings_by_genre
title genre mean std count
0 50 First Dates Comedies 3.752 1.002 72913
1 50 First Dates Romantic Movies 3.752 1.002 72913
2 A Cinderella Story Children & Family Movies 3.625 1.036 17594
... ... ... ... ... ...
206 Tremors 4: The Legend Begins Horror Movies 3.043 1.194 947
207 Tremors 4: The Legend Begins Sci-Fi & Fantasy 3.043 1.194 947
208 Under Siege Action & Adventure 3.313 1.019 14769

209 rows × 5 columns

(
    so.Plot(mean_ratings_by_genre, x="mean", y="std")
    .add(so.Dots(alpha=.5), pointsize="count")
    .add(so.Line(color=".3"), so.PolyFit(1))
    .facet("genre", wrap=4)
    .share(y=False)
    .scale(pointsize=(3, 20))
    .layout(size=(9, 13))
)

Release year

netflix.head(3)
viewer_id rating watch_date release_year title record_id media_type director cast country date_added MPAA_rating duration genre description runtime num_episodes
0 2191540 4 2004-06-23 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>
1 998702 2 2005-08-26 2002 Men in Black II s7444 Movie Barry Sonnenfeld Tommy Lee Jones, Will Smith, Rip Torn, Lara Fl... United States October 1, 2019 PG-13 88 min [Action & Adventure, Comedies, Sci-Fi & Fantasy] Will Smith and Tommy Lee Jones reprise their r... 88 <NA>
2 1914163 4 2005-03-21 2004 50 First Dates s6019 Movie Peter Segal Adam Sandler, Drew Barrymore, Rob Schneider, S... United States December 1, 2020 PG-13 99 min [Comedies, Romantic Movies] After falling for a pretty art teacher who has... 99 <NA>
netflix.groupby(["release_year", "title"])["rating"].agg(["mean", "count"])
mean count
release_year title
1965 Doctor Zhivago 3.976 8810
The Cincinnati Kid 3.651 501
1968 Rosemary's Baby 3.608 16421
... ... ... ...
2005 King's Ransom 2.958 4040
Paheli 3.452 31
The Amityville Horror 3.484 9896

94 rows × 2 columns

(
    so.Plot(netflix, x="release_year", y="rating")
    .add(so.Line(marker="."), so.Agg("mean"))
)

# group release years into decades
netflix["decade"] = netflix["release_year"] // 10 * 10
(
    so.Plot(netflix, x="decade", y="rating")
    .add(so.Line(marker="."), so.Agg("mean"))
)
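
The decade column relies on integer floor division: `// 10` drops the last digit of the year and `* 10` restores the scale. A minimal sketch on toy data (the values below are made up for illustration and are not taken from the Netflix dataset):

```python
import pandas as pd

# Toy data: hypothetical values, not from the Netflix dataset
toy = pd.DataFrame({
    "release_year": [1965, 1968, 1999, 2004, 2005],
    "rating": [4, 3, 5, 2, 4],
})

# 1965 // 10 -> 196, then 196 * 10 -> 1960
toy["decade"] = toy["release_year"] // 10 * 10

# Mean rating per decade, the same quantity so.Agg("mean") plots
decade_mean = toy.groupby("decade")["rating"].mean()
print(decade_mean)
```

Because `//` and `*` have the same precedence and associate left to right, no parentheses are needed.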

Project-Native implementation

netflix = pd.read_parquet("data/netflix_ratings_titles.parquet")

Basic statistics

  • Total viewing records: 1246558
  • Total viewers: 316804
  • Unique titles: 94
  • Total viewing time: 2391943.3 hours
  • Overall mean rating: 3.59/5
# total number of viewing records (rows)
netflix.shape
(1246558, 17)
# total number of viewers
netflix["viewer_id"].nunique()
316804
# number of unique titles
netflix["title"].nunique()
94
# total viewing time in hours (runtime is in minutes)
netflix["runtime"].sum() / 60
2391943.316666667
# overall mean rating: 3.59/5
netflix["rating"].mean()
3.58698833106843
def get_basic_stats(db):
    return {
        "Total viewing records": db.shape[0],
        "Total viewers": db["viewer_id"].nunique(),
        "Unique titles": db["title"].nunique(),
        "Total viewing time (hours)": db["runtime"].sum() / 60,  # runtime is in minutes
        "Overall mean rating": db["rating"].mean()
    }

get_basic_stats(netflix)
{'Total viewing records': 1246558,
 'Total viewers': 316804,
 'Unique titles': 94,
 'Total viewing time (hours)': 2391943.316666667,
 'Overall mean rating': 3.58698833106843}
Switch the display of NumPy scalars back to the legacy format (plain values rather than np.float64(...))
np.set_printoptions(legacy='1.25')

Viewer analysis

Top 5 most active viewers

  1. 387418: 45 views
  2. 305344: 43 views
  3. 1664010: 43 views
  4. 2439493: 40 views
  5. 2606799: 38 views
# top 5 most active viewers
netflix.value_counts("viewer_id")[:5].reset_index(name="count")
viewer_id count
0 387418 45
1 305344 43
2 1664010 43
3 2439493 40
4 2606799 38

Per-viewer statistics

[6] Views: 8, mean rating: 3.25/5

[7] Views: 14, mean rating: 4.07/5

[8] Views: 1, mean rating: 4.00/5

[10] Views: 1, mean rating: 4.00/5

[59] Views: 1, mean rating: 4.00/5

# per-viewer statistics
user_stats = netflix.groupby("viewer_id")["rating"].agg(["mean", "std", "count"])
user_stats[:5]  # show only the first 5 records
mean std count
viewer_id
6 3.250 1.035 8
7 4.071 1.141 14
8 4.000 <NA> 1
10 4.000 <NA> 1
59 4.000 <NA> 1
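
Viewers with a single rating have a missing `std` because pandas computes the sample standard deviation (`ddof=1`) by default, which divides by n - 1 and is therefore undefined for one observation. A minimal sketch on made-up data; with plain integer columns the missing value prints as `NaN` rather than `<NA>`:

```python
import pandas as pd

# Toy data: viewer 6 has three ratings, viewer 8 has only one
toy = pd.DataFrame({
    "viewer_id": [6, 6, 6, 8],
    "rating": [3, 4, 5, 4],
})

# Default std uses ddof=1 (sample standard deviation)
stats = toy.groupby("viewer_id")["rating"].agg(["mean", "std", "count"])
print(stats)

# The population standard deviation (ddof=0) is defined for a single value
pop_std = toy.groupby("viewer_id")["rating"].std(ddof=0)
print(pop_std)
```

Whether a single-rating viewer should count as "zero spread" or as "unknown spread" is an analysis decision; the default treats it as unknown.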

Title analysis

Top 10 titles by mean rating (rated by at least 3 viewers)

  1. Kabhi Khushi Kabhie Gham
    Mean rating: 4.24/5 (97 ratings)
  2. The Pianist
    Mean rating: 4.12/5 (30497 ratings)
  3. Coach Carter
    Mean rating: 4.07/5 (30022 ratings)
  4. Kuch Kuch Hota Hai
    Mean rating: 4.00/5 (271 ratings)
  5. Doctor Zhivago
    Mean rating: 3.98/5 (8810 ratings)
  6. Like Water for Chocolate
    Mean rating: 3.97/5 (13459 ratings)
  7. American Beauty
    Mean rating: 3.96/5 (77407 ratings)
  8. Lock, Stock and Two Smoking Barrels
    Mean rating: 3.93/5 (19899 ratings)
  9. Charlotte’s Web
    Mean rating: 3.92/5 (13693 ratings)
  10. Jaws
    Mean rating: 3.88/5 (40425 ratings)
# top 10 titles by mean rating (rated by at least 3 viewers)
def get_top_rated_media(n=10, min_ratings=3):
    return (
        netflix
        .groupby("title")["rating"]
        .agg(["mean", "count"])
        .query("count >= @min_ratings")
        .nlargest(n, "mean")
    )

get_top_rated_media()
mean count
title
Kabhi Khushi Kabhie Gham 4.237 97
The Pianist 4.116 30497
Coach Carter 4.071 30022
... ... ...
Lock, Stock and Two Smoking Barrels 3.933 19899
Charlotte's Web 3.917 13693
Jaws 3.880 40425

10 rows × 2 columns
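
The function above reads the module-level `netflix` frame. As a sketch of one possible variant (the name `top_rated` and the `db` parameter are illustrative, not part of the original code), passing the DataFrame in explicitly makes the same chain easy to try on a small example:

```python
import pandas as pd

def top_rated(db, n=10, min_ratings=3):
    """Illustrative variant: same chain, but the data is an argument."""
    return (
        db.groupby("title")["rating"]
        .agg(["mean", "count"])
        .query("count >= @min_ratings")  # drop titles with too few ratings
        .nlargest(n, "mean")             # keep the n highest means
    )

# Toy data: title B has the highest mean but only one rating
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C", "C"],
    "rating": [4, 5, 3, 5, 2, 3, 4],
})

top = top_rated(toy, n=2, min_ratings=3)
print(top)
```

The `min_ratings` filter is what keeps a title with a single 5 from outranking titles with many ratings.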

Genre analysis

Per-genre statistics

Comedies: views: 606749, mean rating: 3.51/5, viewers: 243514, total viewing time: 1083382.5 hours

Action & Adventure: views: 466348, mean rating: 3.52/5, viewers: 217876, total viewing time: 912268.1 hours

Dramas: views: 428928, mean rating: 3.75/5, viewers: 200592, total viewing time: 874892.6 hours

…

# per-genre statistics: view count, mean rating, number of viewers, total viewing time

netflix_long = netflix.explode('genre')

genre_stats = (
    netflix_long.groupby("genre")
    .agg(
        count=("rating", "size"),
        mean_rating=("rating", "mean"),
        total_viewers=("viewer_id", "nunique"),  # nunique(): number of unique values
        total_runtime=("runtime", "sum"),
    )
    .assign(
        total_runtime=lambda x: x["total_runtime"] / 60
    )
    .reset_index()
)
genre_stats
genre count mean_rating total_viewers total_runtime
0 Action & Adventure 466348 3.518 217876 912268.133
1 Anime Series 126 3.627 126 0.000
2 Children & Family Movies 106728 3.319 69248 163799.883
... ... ... ... ... ...
15 Sports Movies 49586 3.819 45149 104607.100
16 TV Dramas 731 3.765 731 0.000
17 Thrillers 133365 3.551 89527 271388.767

18 rows × 5 columns
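
Because `explode` repeats a record once per genre in its list, per-genre counts and runtimes overlap and add up to more than the overall totals. A minimal sketch with made-up records illustrates the effect, using the same named-aggregation syntax as above:

```python
import pandas as pd

# Toy data: three viewing records, two of them for a two-genre title
toy = pd.DataFrame({
    "viewer_id": [1, 2, 2],
    "title": ["X", "X", "Y"],
    "rating": [4, 2, 5],
    "genre": [
        ["Comedies", "Romantic Movies"],
        ["Comedies", "Romantic Movies"],
        ["Dramas"],
    ],
})

# One row per (record, genre) pair: 2 + 2 + 1 = 5 rows
toy_long = toy.explode("genre")

genre_counts = toy_long.groupby("genre").agg(
    count=("rating", "size"),
    total_viewers=("viewer_id", "nunique"),
)
print(genre_counts)
print(genre_counts["count"].sum(), "genre rows vs", len(toy), "records")
```

For that reason the per-genre figures are best compared with each other rather than summed.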

# visualize the genre distribution: view counts
order_by_count = genre_stats.sort_values("count", ascending=False)["genre"].values
(
    so.Plot(genre_stats, y="genre", x="count")
    .add(so.Bar())
    .scale(y=so.Nominal(order=order_by_count))
)

Time analysis

Monthly viewing trend

1999-11: 1 view
1999-12: 30 views
2000-01: 1009 views
2000-02: 1089 views
2000-03: 916 views

netflix = netflix.assign(
    year_month=lambda x: x["watch_date"].dt.to_period("M")
)
# monthly viewing trend
netflix.value_counts("year_month").sort_index()
year_month
1999-11        1
1999-12       30
2000-01     1009
           ...  
2005-10    65141
2005-11    38613
2005-12    21827
Freq: M, Name: count, Length: 74, dtype: int64
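
`dt.to_period("M")` maps every timestamp to its calendar month, so counting the resulting values gives views per month. A minimal sketch on made-up dates; it uses the Series form of `value_counts`, which yields the same counts as the DataFrame call above:

```python
import pandas as pd

# Toy data: four hypothetical viewing dates
toy = pd.DataFrame({
    "watch_date": pd.to_datetime(
        ["1999-12-30", "2000-01-05", "2000-01-20", "2000-02-01"]
    ),
})

# Each date collapses to a monthly Period such as 2000-01
toy = toy.assign(year_month=lambda x: x["watch_date"].dt.to_period("M"))

# sort_index() orders the months chronologically instead of by count
monthly = toy["year_month"].value_counts().sort_index()
print(monthly)
```

Periods sort chronologically, which is why `sort_index()` is enough to get a time-ordered series.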
viewing_by_month = netflix.value_counts("year_month").sort_index().reset_index()

(
    so.Plot(viewing_by_month, x="count", y="year_month")
    .add(so.Bars())
    .layout(size=(6, 10))
)

# visualization: let seaborn compute the counts directly
(
    so.Plot(netflix, y="year_month")
    .add(so.Bars(), so.Hist())
    .layout(size=(6, 10))
)

# viewing pattern by weekday
netflix = netflix.assign(
    weekday=lambda x: x["watch_date"].dt.day_name().str[:3]  # three-letter weekday name
)

# convert to a categorical so that weekdays sort from Mon to Sun
netflix["weekday"] = netflix["weekday"].astype("category").cat.set_categories(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
netflix.value_counts("weekday").sort_index()
weekday
Mon    214181
Tue    221189
Wed    206516
        ...  
Fri    167458
Sat    123017
Sun    133268
Name: count, Length: 7, dtype: int64
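
Plain strings sort alphabetically (Fri, Mon, Sat, ...), so the weekday column is turned into a categorical with an explicit category order. A minimal sketch on made-up dates, here using `pd.Categorical` with `ordered=True` as one alternative to `cat.set_categories`:

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Toy data: Fri, Sat, Thu, Mon
toy = pd.DataFrame({
    "watch_date": pd.to_datetime(
        ["2005-09-16", "2004-08-07", "2003-05-22", "2005-09-19"]
    ),
})

weekday = toy["watch_date"].dt.day_name().str[:3]
print(sorted(weekday))  # alphabetical order, not calendar order

# The categories list fixes the sort order
toy["weekday"] = pd.Categorical(weekday, categories=days, ordered=True)

# Unobserved weekdays are kept with a count of 0
counts = toy["weekday"].value_counts().sort_index()
print(counts)
```

The same category order is what seaborn uses to arrange the bars in the weekday plots.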
# visualize the viewing pattern by weekday
(
    so.Plot(netflix, y="weekday")
    .add(so.Bars(), so.Hist())
)

Rating analysis

Overall rating distribution

Rating 5: 266619
Rating 4: 433481
Rating 3: 365585
Rating 2: 126742
Rating 1: 54131

# overall rating distribution
netflix.value_counts("rating").sort_index()
rating
1     54131
2    126742
3    365585
4    433481
5    266619
Name: count, dtype: int64
# visualization
(
    so.Plot(netflix, y="rating")
    .add(so.Bar(), so.Hist(discrete=True))
)

Summary report

Viewing records summary report

  • Total viewing records: 1246558
  • Total viewers: 316804
  • Unique titles: 94
  • Total viewing time: 2391943.3 hours
  • Overall mean rating: 3.59/5

Most popular genre: Comedies

Top 3 titles by mean rating:

  1. Kabhi Khushi Kabhie Gham (mean: 4.24/5, 97 ratings)
  2. The Pianist (mean: 4.12/5, 30497 ratings)
  3. Coach Carter (mean: 4.07/5, 30022 ratings)
def get_summary_report(db):
    # NOTE: genre_stats and get_top_rated_media() read the module-level
    # netflix data, not the db argument
    return {
        "Total viewing records": db.shape[0],
        "Total viewers": db["viewer_id"].nunique(),
        "Unique titles": db["title"].nunique(),
        "Total viewing time (hours)": db["runtime"].sum() / 60,
        "Overall mean rating": db["rating"].mean(),
        "Most popular genre": genre_stats.set_index("genre")["count"].idxmax(),
        "Top 3 titles by mean rating": get_top_rated_media(3, 3).reset_index().to_dict(orient="records")
    }

get_summary_report(netflix)
{'Total viewing records': 1246558,
 'Total viewers': 316804,
 'Unique titles': 94,
 'Total viewing time (hours)': 2391943.316666667,
 'Overall mean rating': 3.58698833106843,
 'Most popular genre': 'Comedies',
 'Top 3 titles by mean rating': [{'title': 'Kabhi Khushi Kabhie Gham',
   'mean': 4.237113402061856,
   'count': 97},
  {'title': 'The Pianist', 'mean': 4.116142571400466, 'count': 30497},
  {'title': 'Coach Carter', 'mean': 4.070748118046765, 'count': 30022}]}

Converting to a module and using it

Moving the recurring analysis patterns into a netflix_utils.py module makes the code easier to reuse and read.

import netflix_utils as nu

# load the data and configure display settings
nu.setup_display(max_rows=6)
netflix = nu.load_netflix_data("data/netflix_ratings.parquet")
netflix.head(3)
movie_id user_id rating date title genre year
0 3282 972104 4 2005-09-16 Sideways [Comedy, Drama, Romance] 2004
1 143 2297762 5 2004-08-07 The Game [Drama, Mystery, Thriller] 1997
2 1744 1489846 3 2003-05-22 Beverly Hills Cop [Action, Comedy, Crime] 1984
# add derived variables in one step (decade, weekday, title_length)
netflix_enriched = nu.enrich_netflix(netflix)
netflix_enriched.head(3)
movie_id user_id rating date title genre year decade weekday title_length
0 3282 972104 4 2005-09-16 Sideways [Comedy, Drama, Romance] 2004 2000 Fri 8
1 143 2297762 5 2004-08-07 The Game [Drama, Mystery, Thriller] 1997 1990 Sat 8
2 1744 1489846 3 2003-05-22 Beverly Hills Cop [Action, Comedy, Crime] 1984 1980 Thu 17
# per-movie statistics (only movies rated by at least 30 users)
m_stats = nu.movie_stats(netflix, min_count=30)
m_stats.sort_values("mean", ascending=False).head()
title mean std count
734 The Sixth Sense 4.329 0.793 15166
733 The Silence of the Lambs 4.310 0.815 12940
101 Braveheart 4.289 0.901 13590
72 Batman Begins 4.215 0.860 5529
510 Sense and Sensibility 4.195 0.896 1133
# per-user statistics (only users with at least 30 ratings)
u_stats = nu.user_stats(netflix, min_count=30)
u_stats.head()
mean std count
user_id
1333 2.674 0.778 43
2213 3.871 0.846 31
2455 3.433 0.817 30
2905 3.700 1.418 30
3321 2.977 1.012 43
# per-genre statistics
g_stats = nu.genre_stats(netflix)
g_stats.sort_values("mean", ascending=False)
genre mean std count
10 Film-Noir 3.965 0.935 9020
20 War 3.908 1.037 32304
3 Biography 3.878 0.995 121688
... ... ... ... ...
4 Comedy 3.528 1.067 770697
12 Horror 3.386 1.065 177556
17 Sci-Fi 3.332 1.090 115480

22 rows × 4 columns

# generic aggregation: ratings by weekday
netflix_enriched = nu.add_ordered_weekday(netflix)
nu.rating_stats_by(netflix_enriched, "weekday")
weekday mean std count
0 Mon 3.584 1.054 324006
1 Tue 3.582 1.054 331272
2 Wed 3.590 1.055 311012
... ... ... ... ...
4 Fri 3.589 1.061 245186
5 Sat 3.596 1.065 184385
6 Sun 3.596 1.060 199471

7 rows × 4 columns
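
The source of netflix_utils.py is not shown on this page. As a rough idea of what a generic helper like `nu.rating_stats_by` could look like, here is a hypothetical stand-in that wraps the groupby/agg pattern used throughout this page; the signature and the `rating_col` parameter are assumptions:

```python
import pandas as pd

def rating_stats_by(db, by, rating_col="rating"):
    """Hypothetical stand-in: mean, std and count of ratings per group."""
    return (
        db.groupby(by, observed=True)[rating_col]  # observed=True skips unused categories
        .agg(["mean", "std", "count"])
        .reset_index()
    )

# Toy data, not from the Netflix dataset
toy = pd.DataFrame({
    "weekday": ["Mon", "Mon", "Tue"],
    "rating": [3, 5, 4],
})

by_day = rating_stats_by(toy, "weekday")
print(by_day)
```

Taking the grouping column as a parameter lets one function cover weekday, decade, genre, or any other grouping.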

# visualization: scatter plot of mean vs. standard deviation
nu.plot_mean_vs_std(m_stats, title="Movie: Mean vs Std")

# visualization: genre ranking
nu.plot_genre_ranking(g_stats, metric="mean", title="Genre Mean Rating")

# visualization: rating trend by year
nu.plot_trend_by_year(netflix, x_col="year", title="Rating Trend by Year")

# standardize ratings within each user (z-score)
standardized = nu.standardize_rating_by(netflix, group_col="user_id")
standardized.head()
user_id rating
117541 6 -0.483
324856 6 1.932
412793 6 -0.483
448624 6 1.932
604841 6 -0.483
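
A per-user z-score subtracts each user's own mean rating and divides by that user's standard deviation, which puts generous and harsh raters on a common scale. The module's code is not shown, so the following is only a hypothetical sketch of how such a function might be written with `groupby(...).transform`; the signature is an assumption:

```python
import pandas as pd

def standardize_rating_by(db, group_col, rating_col="rating"):
    """Hypothetical sketch: z-score of ratings within each group."""
    grouped = db.groupby(group_col)[rating_col]
    # transform() returns one value per original row, aligned on the index
    z = (db[rating_col] - grouped.transform("mean")) / grouped.transform("std")
    return db[[group_col]].assign(**{rating_col: z})

# Toy data, not from the Netflix dataset
toy = pd.DataFrame({
    "user_id": [6, 6, 6, 7, 7],
    "rating": [3, 4, 5, 2, 4],
})

z = standardize_rating_by(toy, "user_id")
print(z)
```

With the default sample standard deviation, users who have only one rating get a missing z-score, and users who always give the same rating hit a division by zero, so both cases need a decision before further analysis.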

This work © 2025 by Sungkyun Cho is licensed under CC BY-NC-SA 4.0