
[AI Integrated Course] Week 6: Titanic Practice (Mini Project 2)

by vodkassi 2021. 3. 8.

# Titanic, the textbook machine learning project

 

In week 6 (actually the last day of week 5 plus the first day of week 6), we worked on the 'Kaggle Titanic project', the entry point for just about everyone who studies machine learning. (Kaggle dataset: www.kaggle.com/c/titanic)

 


์ดํ‹€์ด๋ผ๋Š” ์งง์€ ์‹œ๊ฐ„ ๋™์•ˆ ์ง„ํ–‰๋˜์—ˆ๊ธฐ์—, 'ํ”„๋กœ์ ํŠธ'๋ผ๊ณ  ๋ถ€๋ฅด๊ธฐ์—” ์•ฝ๊ฐ„์˜ ์–ดํ๊ฐ€ ์žˆ๋Š” ๊ฒƒ ๊ฐ™๊ณ  '์‹ค์Šตํ™œ๋™' ์ •๋„๊ฐ€ ์ ๋‹นํ•œ ๊ฒƒ ๊ฐ™๋‹ค. ๋ณธ ์‹ค์Šต์˜ ๋ชฉ์ ์€ ์ •ํ˜•๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜์—ฌ ์ผ์ฃผ์ผ ๊ฐ„ ๋ฐฐ์šด ๋จธ์‹ ๋Ÿฌ๋‹ ์ด๋ก ๋“ค์„ ์ฝ”๋“œ๋กœ ์ง์ ‘ ๊ตฌํ˜„ํ•ด๋ณด๋Š” ๋ฐ ์žˆ์—ˆ๊ธฐ์—, ๋”ฐ๋กœ ์ฃผ์ œ๋ฅผ ์ •ํ•˜๊ฑฐ๋‚˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š”๋ฐ ๋ถˆํ•„์š”ํ•œ ์—๋„ˆ์ง€๋ฅผ ๋“ค์ด์ง€ ์•Š์•„๋„ ๋˜์—ˆ๋‹ค. 

 

ํŒ€์›๋“ค๊ณผ ์ธ์‚ฌ๋ฅผ ๋‚˜๋ˆ„๊ณ  ๋ฐ”๋กœ ์ž‘์—…์„ ์‹œ์ž‘ํ–ˆ๋Š”๋ฐ, ์ž‘์—… ์ˆœ์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์•˜๋‹ค:

  1. ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ (EDA)
  2. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ (๋ถˆํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์ œ๊ฑฐ, ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๊ธฐ) 
  3. ํ…Œ์ŠคํŠธ/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
  4. ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ์„ค๊ณ„
  5. ๋ชจ๋ธ ๊ฒ€์ฆ ๋ฐ hyperparameter tuning

# ํƒ€์ดํƒ€๋‹‰ ๋ฐ์ดํ„ฐ๋Š” ์–ด๋–ป๊ฒŒ ์ƒ๊ฒผ๋Š”๊ฐ€ 

 

๋ณธ๊ฒฉ์ ์œผ๋กœ ์‹ค์Šต์— ๋“ค์–ด๊ฐ€๊ธฐ์— ์•ž์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํŽด๋ณด๋Š” ์‹œ๊ฐ„์„ ๊ฐ€์กŒ๋Š”๋ฐ, ๋‹จ์ˆœํžˆ ์ˆ˜๋Ÿ‰์ ์ธ ์šฐ์—ด์„ ๋”ฐ์ง€๊ธฐ๋ณด๋‹ค ์‹ค์ œ๋กœ ์ƒ์กดํ•œ ์ž๋“ค๊ณผ (survived) ์ฃฝ์€ ์ž๋“ค(dead)์˜ ํŠน์ง•์ด ์กด์žฌํ•˜๋Š”์ง€ ์•Œ์•„๋ณด๊ณ ์ž ํ–ˆ๋‹ค. 

 

ํƒ‘์Šน์ž ์—ฐ๋ น๋Œ€๋ณ„ ์ƒ์กด์ž
ํƒ‘์Šน์ž ํ˜ธ์นญ๋ณ„ ์ƒ์กด์ž
ํƒ‘์Šน์ž ํƒ‘์Šน์ง€๋ณ„ ์ƒ์กด์ž

 

ํƒ‘์Šน์ž ์ˆ™์†Œ ๊ตฌ์—ญ๋ณ„ ์ƒ์กด์ž

 

ํƒ‘์Šน์ž ํ‹ฐ์ผ“ ๊ฐ€๊ฒฉ๋Œ€๋ณ„ ์ƒ์กด์ž

At this point each team member took a few categories and computed the proportions of survivors and deaths, but in the end a single plotting function handled all of the visualizations.

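import pandas as pd

# data_df is the training DataFrame loaded earlier
def survived_graph(value):
    # Count survivors (Survived == 1) and deaths (Survived == 0) per category of `value`
    survived = data_df[data_df['Survived'] == 1][value].value_counts()
    dead = data_df[data_df['Survived'] == 0][value].value_counts()
    # One row per outcome, one column per category
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(5, 5))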

์ƒ์กด์ž๋Š” 1, ์‚ฌ๋ง์ž๋Š” 0์œผ๋กœ ๊ธฐ๋ก๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์›๋ณธ ๋ฐ์ดํ„ฐ์—์„œ ํ•ด๋‹น ๊ธฐ๋ก์— ๋งž๋Š” ํ–‰๋“ค์„ ์„ ํƒํ•ด ์ถ”์ถœํ•œ๋’ค, ์ผ์ •ํ•œ ๊ธฐ์ค€์— ๋งž์ถ”์–ด ๋ฐ” ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฌ๋„๋ก ํ–ˆ๋‹ค. 

 

๊ฒฐ๊ณผ๋ฅผ 1์ฐจ์›์ ์œผ๋กœ ํ•ด์„ํ•ด ๋ณด์•˜์„ ๋•Œ, ๋‚จ์„ฑ๋ณด๋‹ค๋Š” ์—ฌ์„ฑ์ด, ์ฒญ๋…„์ธต๋ณด๋‹ค๋Š” ์œ ๋…„์ธต์ด, C ์ง€์—ญ์—์„œ ํƒ‘์Šนํ•œ ์Šน๊ฐ๋“ค์ด ์ƒ์กด์œจ์ด ๋” ๋†’๋‹ค๋Š” ์‚ฌ์‹ค์„ ์œ ์ถ”ํ•ด๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. 


# Dropping and filling in

 

ํƒ€์ดํƒ€๋‹‰ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•  ๋•Œ ๋งž๋‹ฅ๋œจ๋ฆฐ ์–ด๋ ค์›€ ์ค‘ ํ•˜๋‚˜๋Š” ๋ฐ์ดํ„ฐ์˜ ๊ฒฐ์ธก์น˜์˜€๋‹ค. ํƒ‘์Šน๊ฐ๋“ค์˜ ์—ฐ๋ น, ํƒ‘์Šน์ง€, ํ‹ฐ์ผ“๊ฐ’, ๊ฐ์‹ค ์œ„์น˜ ๋“ฑ๋“ฑ์˜ ์—ด๋“ค์—์„œ ๊ฒฐ์ธก์น˜๊ฐ€ ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋ฐœ์ƒํ•˜๊ณ  ์žˆ์Œ์„ ํ™•์ธํ•œ ์ดํ›„, ๋นˆ ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ์ฑ„์›Œ๋„ฃ์„์ง€ ๊ณ ๋ฏผํ•ด๋ณด์•˜๋‹ค. 

 

์šฐ์„ , ์—ฐ๋ น์˜ ๊ฒฝ์šฐ, ์šฐ๋ฆฌ ์กฐ๊ฐ€ ์„ ํƒํ•œ ๋ฐฉ๋ฒ•์€ ๊ธฐ์กด ๋ฐ์ดํ„ฐ์—์„œ ๊ฒฐ์ธก๊ฐ’์ด ์•„๋‹Œ ์—ฐ๋ น์„ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ™œ์šฉํ•˜์—ฌ ๊ฒฐ์ธก์น˜๊ฐ€ ๋ฐœ์ƒํ•  ๋•Œ๋งˆ๋‹ค ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ ํƒํ–ˆ๋‹ค. 

# ๊ธฐ์กด ๋ฐ์ดํ„ฐ์˜ Age ๋ฅผ ํ•˜๋‚˜์˜ pool ๋กœ ๋งŒ๋“ค์–ด ๋žœ๋ค ์ถ”์ถœํ›„ ๊ฐ’ ์ง€์ •

age_list = [ x for x in x_data['Age'].dropna()] 

for idx, x in enumerate(x_data['Age'].isnull()):
    new_x = random.choice(age_list)
    if x == True:
        x_data.at[idx, 'Age'] = new_x

 

๊ฐ์‹ค ์œ„์น˜์˜ ๊ฒฝ์šฐ, ๋˜ ๋‹ค๋ฅธ ์—ด์ด์—ˆ๋˜ 'Pclass' ์— ๋”ฐ๋ผ ๋ฐฐ์ •๋œ ๊ฐ์‹ค ์œ„์น˜๊ฐ€ ๋‹ฌ๋ผ์ง„๋‹ค๋Š” ์ ์„ ํ™œ์šฉํ•˜์—ฌ, ํ•ด๋‹น ์Šน๊ฐ์˜ Pclass ์ด ๊ฐ–๋Š” ๊ฐ์‹ค ์œ„์น˜์˜ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ถ”์ถœํ•˜๋„๋ก ํ–ˆ๋‹ค. 

# ์„ ์‹ค์œ ํ˜•์˜ null๊ฐ’์„ Pclass์— ๋”ฐ๋ผ randomํ•˜๊ฒŒ ๋ฐฐ์ •ํ•ด์ค„ ํ•จ์ˆ˜ ์„ ์–ธ
def fillnull(df):
    colList = []
    
    P1_CabinList = ["A", "B", "C", "D", "E"]
    P2_CabinList = ["D", "E", "F"]
    P3_CabinList = ["E", "F", "G"]
    for row in df.index:
        CabinName = df["Cabin"][row]
        PclassName = df["Pclass"][row]
        if CabinName == "N":
            if PclassName == 1:
                colList.append(random.choice(P1_CabinList))
            elif PclassName == 2:
                colList.append(random.choice(P2_CabinList))
            else:
                colList.append(random.choice(P3_CabinList))
        else:
            colList.append(CabinName)
    
    return colList

x_data["Cabin"] = fillnull(x_data)

 

์ด์™ธ์—๋„ ํƒ‘์Šน์ง€์˜ ๊ฒฝ์šฐ ํƒ‘์Šน๊ฐ์ด ๊ฐ€์žฅ ๋งŽ์•˜๋˜ 'S'๋กœ, ํ‹ฐ์ผ“ ๊ฐ€๊ฒฉ์€ ์ค‘๊ฐ„๊ฐ’์œผ๋กœ ๊ฒฐ์ธก์น˜๋ฅผ ์ „๋ถ€ ์ฑ„์›Œ๋„ฃ์—ˆ๋‹ค.


# ์ตœ์ ์˜ ๋ชจ๋ธ ์ฐพ๊ธฐ

 

๋ฐ์ดํ„ฐ๋ฅผ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์— ์ž…๋ ฅํ•˜๊ธฐ ์ด์ „์— ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ์ธ ์ž๋ฃŒ๋“ค์„ ์ •ํ˜•๋ฐ์ดํ„ฐ๋กœ ๋ฐ”๊ฟ”์ฃผ๋Š” ์ž‘์—…์„ ์šฐ์„ ์ ์œผ๋กœ ๊ฑฐ์น˜๊ณ , ์ •ํ˜•๋ฐ์ดํ„ฐ๋Š” ํ‘œ์ค€ํ™”ํ•ด์ฃผ๋Š” ์ž‘์—…์„ ์ง„ํ–‰ํ–ˆ๋‹ค. Age ์™€ Fare ์—ด์—๋Š” ๊ฐ๊ฐ Standard Scalar ๋ฅผ ํ™œ์šฉํ•ด ๋ชจ๋“  ์ˆ˜๊ฐ€ -1 ๊ณผ 1 ์˜ ๋ฒ”์œ„์— ํฌํ•จ๋˜๋„๋ก ํ–ˆ๊ณ , ๋‚˜๋จธ์ง€ ์—ด๋“ค์— ๋Œ€ํ•ด์„œ๋Š” one-hot encoding ์„ ์ง„ํ–‰ํ–ˆ๋‹ค. (Label encoding ์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ์ƒ๊ฐํ–ˆ์œผ๋‚˜, label encoding ์˜ ๊ฒฝ์šฐ ๋ฐ์ดํ„ฐ ๊ฐ„์˜ ์šฐ์—ด์ด ๋ฐœ์ƒํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์–ด ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ธฐ๋กœ ๊ฒฐ์ •ํ–ˆ๋‹ค.)

 

๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์„ค๊ณ„ํ•œ ๋ชฉ์ ์€ '์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅํ–ˆ์„ ๋•Œ, ์ด ์‚ฌ๋žŒ์ด ์‚ด ํ™•๋ฅ ์ด ๋†’์€์ง€, ์ฃฝ์„ ํ™•๋ฅ ์ด ๋†’์€์ง€ ์•Œ์•„๋‚ด๊ธฐ' ์œ„ํ•จ์ด์—ˆ๋‹ค. ๊ฒฐ๊ณผ๊ฐ’์€ ๊ณง '์ƒ์กด' ํ˜น์€ '์‚ฌ๋ง' ์œผ๋กœ ๋‚˜๋‰˜๊ฒŒ ๋  ๊ฒƒ์ด์—ˆ๊ธฐ์—, ์ด์ง„๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ์ฃผ๋กœ ํ™œ์šฉํ–ˆ๋‹ค.

 

We tried several models, including Gradient Boosting Classifier, XGBoost, Support Vector Machine, Logistic Regression, and K-Nearest Neighbor, and applied GridSearchCV only to the Gradient Boosting Classifier, which had performed best. This gave us our best performance, and the final result was:

Accuracy on Training set: 0.884

Accuracy on Test set: 0.840
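
The tuning step can be sketched as follows. Everything specific here is an assumption: the data is synthetic (`make_classification`), and the parameter grid, the 3-fold setting, and the split ratio are hypothetical, since the post does not record the values that were actually searched. The numbers this prints are unrelated to the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the preprocessed Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid; the values searched in the exercise are not recorded
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination is scored with 3-fold cross-validation on the training part only
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Accuracy on Training set: {search.score(X_train, y_train):.3f}")
print(f"Accuracy on Test set: {search.score(X_test, y_test):.3f}")
```

Keeping the test split out of the grid search is the point of the design: the test accuracy then estimates performance on data that played no part in choosing the hyperparameters.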

 

We also plotted a Feature Importance graph, which ranks the variables by how much the model relies on them, and learned that whether a passenger had the title 'Mr' was a strong dividing line between survival and death.
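
For reference, a fitted scikit-learn `GradientBoostingClassifier` exposes these scores through its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder column names (`feature_0`, ...), so it only shows the mechanics and says nothing about the Titanic features themselves.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; the column names are placeholders, not real Titanic features
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot(kind="barh")` would turn this into a bar chart. These scores describe what the fitted model uses, which is not the same thing as a correlation with the outcome.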


# ์ด๋ฒˆ ์‹ค์Šต์˜ KPT (Keep, Problem, Try)

 

  • Keep: What should we keep doing? Doing the EDA thoroughly. Examining the data closely makes it possible to form sound hypotheses for the later analysis.
  • Problem: What should we improve? Don't give up too quickly on approaches that seem difficult. For example, when filling in missing values it would have been better to try grouping for age and ticket fare, or we could have applied a machine learning model up front to predict them. These ideas all crossed our minds, and it is a pity we dropped them so quickly for lack of time and experience.
  • Try: What should we try next? Rather than applying whatever model is at hand, justify the choice of model on the basis of theory.

* KPT is an Agile retrospective technique frequently used in Design Thinking. We tried it because it is said to suit looking back over the process once a project is finished.
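
The grouping idea mentioned under Problem could look like the sketch below. Grouping by `Pclass` and using the median are assumptions chosen for illustration (the post does not say which grouping was considered), and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Median Age within each Pclass, broadcast back to the original row positions
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each missing Age with the median of that passenger's own group
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Compared with drawing from one global pool, this keeps the filled values closer to what is typical for similar passengers.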

 

 

 

 
