๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐Ÿ—‚.์ž๊ฒฉ์ฆ/๐Ÿ“.๋น…๋ฐ์ดํ„ฐ๋ถ„์„๊ธฐ์‚ฌ

[์‹ค๊ธฐ] ๋น…๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ธฐ์‚ฌ ์ž‘์—…ํ˜• 2 ์ •๋ฆฌ (Python)

by ๐Ÿ’พ๊ณ ๊ตฌ๋งˆ๋ง›ํƒ•๋จน๊ณ ์‹ถ๋‹ค 2023. 12. 3.

๋น…๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ธฐ์‚ฌ ์‹ค๊ธฐ -  ์ž‘์—…ํ˜• 2 ์ •๋ฆฌ๋ณธ

๐Ÿšจ ๋ชจ๋“  ์ฝ”๋“œ๋Š” ํŒŒ์ด์ฌ ๊ธฐ์ค€์ž…๋‹ˆ๋‹ค.

 

๐Ÿ‘€ ์ž‘์—…ํ˜• 2 ๋ฌธ์ œ ํ’€์ด ์ˆœ์„œ

1 ๋‹จ๊ณ„ : ๋ฐ์ดํ„ฐ ํŒŒ์•…ํ•˜๊ธฐ, ๋ฐ์ดํ„ฐ ํ•™์Šต ๋ชจ๋ธ ์„ ์ •(๋ถ„๋ฅ˜, ํšŒ๊ท€)
2 ๋‹จ๊ณ„ : ๋ฐ์ดํ„ฐ ์ •๋ฆฌ : ๋ถˆํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ, ๊ฒฐ์ธก์น˜, ๋ฌธ์žํ˜• ๋“ฑ ์ „์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์ฒดํฌํ•˜๊ธฐ
3 ๋‹จ๊ณ„ : ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ 

4 ๋‹จ๊ณ„ : ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌํ•˜๊ธฐ, ๋ชจ๋ธ ํ•™์Šต ์ „ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ํ…Œ์ŠคํŠธ

   - ๋งŒ์•ฝ xtrain, xtest, ytrain์˜ ํ˜•ํƒœ๊ฐ€ ์•„๋‹Œ train, test๋งŒ ์ฃผ์–ด์ง„๋‹ค๋ฉด ํ•„์ˆ˜๋กœ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌํ•ด์•ผํ•จ

5 ๋‹จ๊ณ„ : ๋ฐ์ดํ„ฐ ํ•™์Šต : ๋ฌธ์ œ์— ๋งž๋Š” ๋ฐ์ดํ„ฐ ๋ชจ๋ธ ์„ ํƒํ•˜์—ฌ ํ•™์Šต์‹œํ‚ค๊ธฐ

 

6 ๋‹จ๊ณ„ : ์ œ์ถœํ•˜๊ธฐ ์ „ ๋ฐ์ดํ„ฐ ํ™•์ธํ•˜๊ณ  ์ œ์ถœํ•˜๊ธฐ

 

โœ๏ธ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

1 ) ๋ถˆํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์‚ญ์ œ

drop ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์‚ญ์ œ

๋งŒ์•ฝ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์ด ํ•„์š”ํ•˜๋‹ค๋ฉด pop์„ ํ†ตํ•ด ๋”ฐ๋กœ ์ €์žฅ ( ์ด๋•Œ ์›๋ณธ ๋ฐ์ดํ„ฐ์—์„œ๋Š” ์‚ฌ๋ผ์ง)

x_train = x_train.drop(columns=['id'])
y_train = y_train.drop(columns=['id'])
x_test_id = x_test.pop('id')
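The pop behavior mentioned above is easy to verify: it both returns the column as a Series and removes it from the original frame (toy data):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'val': [10, 20, 30]})

ids = df.pop('id')           # returned as a Series...
print(df.columns.tolist())   # ...and gone from the original frame: ['val']
```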

 

2 ) ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

isnull()๋กœ ํ™•์ธํ•˜๊ณ , fillna()๋ฅผ ํ†ตํ•ด ๊ฐ’ ์ฑ„์›Œ๋„ฃ๊ธฐ

mean(), mode()๋“ฑ ๋ฌธ์ œ์— ๊ฒฐ์ธก์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•ด์•ผํ•˜๋Š”์ง€ ์•ˆ๋‚˜์™€์žˆ์œผ๋ฉด describe()๋ฅผ ํ†ตํ•ด ๊ฒฐ์ธก์น˜๋ฅผ ์ •ํ•˜๋ฉด๋œ๋‹ค.

์ž˜ ๋ชจ๋ฅด๋ฉด ์ผ๋‹จ ํ‰๊ท  ์•„๋‹ˆ๋ฉด ์ตœ๋นˆ๊ฐ’์ด๋ฉด ๋œ๋‹ค.

์ตœ๋นˆ๊ฐ’์œผ๋กœ ์ฒ˜๋ฆฌ ํ•  ์‹œ value_counts()๋ฅผ ํ†ตํ•ด ํ™•์ธํ•  ๊ฒƒ, ๊ฒฐ์ธก์น˜๊ฐ€ ์ตœ๋นˆ๊ฐ’์ผ ์ˆ˜ ์žˆ์Œ

# print(x_test['ํ™˜๋ถˆ๊ธˆ์•ก'].describe())
x_train['ํ™˜๋ถˆ๊ธˆ์•ก'] = x_train['ํ™˜๋ถˆ๊ธˆ์•ก'].fillna(x_train['ํ™˜๋ถˆ๊ธˆ์•ก'].mean())
x_test['ํ™˜๋ถˆ๊ธˆ์•ก'] = x_test['ํ™˜๋ถˆ๊ธˆ์•ก'].fillna(x_test['ํ™˜๋ถˆ๊ธˆ์•ก'].mean())

train['AnnualIncome'] = train['AnnualIncome'].fillna(train['AnnualIncome'].min())
test['AnnualIncome'] = test['AnnualIncome'].fillna(test['AnnualIncome'].min())
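For the mode case mentioned above, note that mode() returns a Series (ties are possible), so [0] is needed to get a scalar. A minimal sketch with a made-up column:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', None, 'M']})

# check value_counts() first: make sure the most frequent value is a real one
print(df['Gender'].value_counts())

# mode() returns a Series, so take the first element
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
print(df['Gender'].tolist())  # ['M', 'F', 'M', 'M', 'M']
```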

 

3 ) ๋ฌธ์žํ˜• ์ฒ˜๋ฆฌ โญ๏ธ ( 7ํšŒ ์ž‘์—…ํ˜• 2 ๋ฌธ์žํ˜• ์ฒ˜๋ฆฌ ๅฟ… )

object : ์›ํ•ซ์ธ์ฝ”๋”ฉ, ๋ผ๋ฒจ ์ธ์ฝ”๋”ฉ ๋“ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•ด์•ผํ•จ. ์ฒ˜๋ฆฌํ•˜๊ธฐ ์• ๋งคํ•  ๊ฒฝ์šฐ ์‚ญ์ œ๋ฅผ ํ•  ์ˆ˜ ์žˆ์Œ
์›ํ•ซ์ธ์ฝ”๋”ฉ : pd.get_dummies()
๋ผ๋ฒจ์ธ์ฝ”๋”ฉ : from sklearn.preprocessing import LabelEncoder

col = ['GraduateOrNot', 'FrequentFlyer', 'EverTravelledAbroad']
train = pd.get_dummies(data = train, columns = col)
test = pd.get_dummies(data = test, columns = col)

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
train['Employment Type'] = encoder.fit_transform(train['Employment Type'])
test['Employment Type'] = encoder.transform(test['Employment Type'])  # transform only: reuse the encoder fitted on train
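One caveat when running get_dummies separately on train and test: if a category appears in only one of them, the column sets no longer match. Reindexing test against the train columns is one defensive fix (a sketch with made-up data):

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
test = pd.DataFrame({'color': ['red', 'blue']})   # 'green' never appears in test

train = pd.get_dummies(train, columns=['color'])
test = pd.get_dummies(test, columns=['color'])

# test is missing the color_green column, so the shapes no longer match;
# reindex test against the train columns, filling absent dummies with 0
test = test.reindex(columns=train.columns, fill_value=0)
print(test.columns.tolist())
```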

 

4 ) ์Šค์ผ€์ผ๋ง

describe : ํ•ด๋‹น ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ์Šค์ผ€์ผ๋ง ๋ฐฉ๋ฒ•์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๊ฒฐ์ธก์น˜๋ฅผ ์–ด๋–ค ๊ฐ’์œผ๋กœ ์‚ฌ์šฉํ•  ์ง€ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋‹ค.

StandardScaler, RobustScaler, MinMaxScaler ๋“ฑ์ด ์žˆ์Œ
์ด์ƒ์น˜์— ์˜ํ–ฅ์„ ์ž˜ ๋ฐ›์ง€ ์•Š๋Š” RobustScaler์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ, ๋งŒ์•ฝ ์ตœ์†Œ, ์ตœ๋Œ€๊ฐ’์ด ๋„ˆ๋ฌด ํฐ ์ฐจ์ด๋ฅผ ๋ณด์ธ๋‹ค๋ฉด MinMaxScaler๋ฅผ ์ถ”์ฒœํ•œ๋‹ค.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
col = ['GRE Score','TOEFL Score']
xtrain[col] = scaler.fit_transform(xtrain[col])
xtest[col] = scaler.transform(xtest[col])
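The outlier claim above can be checked numerically: a single extreme value squashes the MinMaxScaler output toward 0, while RobustScaler (median- and IQR-based) keeps the spread of the normal values (synthetic numbers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

mm = MinMaxScaler().fit_transform(x)
rb = RobustScaler().fit_transform(x)

# MinMax: the normal values are crushed toward 0 by the outlier
print(mm.ravel())
# Robust: the normal values keep a usable spread around the median
print(rb.ravel())
```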

 

โœ๏ธ ๋ฐ์ดํ„ฐ ํ•™์Šต ๋ฐ ์ œ์ถœ

1 ) ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ

๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ๋ชจ๋ธ์— ํ•™์Šต์‹œํ‚จ๋‹ค. 

๋ฐ์ดํ„ฐ ํ•™์Šต์„ ํ†ตํ•ด n_estimators ์™€ max_depth์˜ ๊ฐ’์„ ๋ณ€๊ฒฝํ•˜๋ฉฐ ํ•™์Šต ๊ฒฐ๊ณผ๋ฅผ ๋น„๊ตํ•ด ์ตœ์ ์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์„ ์ฐพ๋Š”๋‹ค. ์—ฌ๊ธฐ์„œ ํ•™์Šต๊ฒฐ๊ณผ๋Š” ์ฑ„์  ๋ฐฉ์‹์— ๋”ฐ๋ผ ๊ฐ’์„ ํ™•์ธํ•˜๋ฉด ๋œ๋‹ค.

( from sklearn.metrics import roc_auc_score / import accuracy_score / import r2_score ๋“ฑ )

# ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(train, train_travel, test_size = 0.2)

# ๋ฐ์ดํ„ฐ ํ•™์Šต
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 50, max_depth = 8) # ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ
model.fit(xtrain, ytrain['Reached.on.Time_Y.N'])
model_pred = model.predict_proba(xtest)

# ํ•™์Šต ๊ฒฐ๊ณผ
from sklearn.metrics import roc_auc_score
print(roc_auc_score(ytest['Reached.on.Time_Y.N'],model_pred[:,1]))

 

2 ) ๋ฐ์ดํ„ฐ ํ•™์Šต ๋ชจ๋ธ

๋ถ„๋ฅ˜ : RandomForestClassifier, XGBClassifier
ํšŒ๊ท€ : RandomForestRegressor, XGBRegressor

 

๐Ÿšจ ๋งŒ์•ฝ ํ•™์Šต๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ๋•Œ ์‚ฌ์šฉํ•ด์•ผํ•˜๋Š” y_test ๊ฐ’์ด ์กด์žฌํ•˜์ง€์•Š๋Š”๋‹ค๋ฉด ์ตœ์ ์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ’์„ ์ฐพ์ง€ ์•Š๊ณ  ' model = RandomForestRegressor() ' ๊ธฐ๋ณธ์œผ๋กœ ์‚ฌ์šฉํ•˜๋ฉด ๋œ๋‹ค.

# ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
# from sklearn.model_selection import train_test_split
# x_train1, x_test1, y_train1, y_test1 = train_test_split(x_train, y_train, test_size = 0.2)

# ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’ ์ฐพ๊ธฐ ( ์ด๋–„๋Š” ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌํ•œ x_test1์„ ์‚ฌ์šฉํ•จ)
# from sklearn.ensemble import RandomForestRegressor
# model3 = RandomForestRegressor(n_estimators = 100, max_depth=5)
# model3.fit(x_train1, np.ravel(y_train1))
# model_pred3 = model3.predict(x_test1)

# from sklearn.metrics import mean_squared_error
# rmse2 = np.sqrt(mean_squared_error(y_test1, model_pred3))
# print("Mean Squared Error:", rmse2)

# ์‹ค์ œ ์ œ์ถœ์„ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ํ•™์Šต ์‹œํ‚ค๊ธฐ ( ์ด๋•Œ๋Š” ์‹ค์ œ ์ œ์ถœํ•  x_test๋ฅผ ์‚ฌ์šฉํ•จ)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100, max_depth=5)
model.fit(x_train, np.ravel(y_train))
model_pred = model.predict(x_test)
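When no y_test is given, an alternative to a manual split is cross-validation on the training data: cross_val_score refits the model on several folds and averages the score, so hyperparameter settings can still be compared. A sketch on synthetic data (in the real exam you would pass your own x_train and y_train):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# synthetic regression data standing in for x_train / y_train
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# compare two hyperparameter settings by mean CV score (R^2 by default)
for depth in [2, 5]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=3)
    print(depth, round(scores.mean(), 3))
```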


    

3 ) ๊ฒฐ๊ณผ ํ™•์ธํ•˜์—ฌ ์ œ์ถœ

result = pd.DataFrame({'enrollee_id' : xtest_id, 'target' : model_pred})
print(result)

pd.DataFrame({'enrollee_id' : xtest_id, 'target' : model_pred}).to_csv('result.csv', index=False)
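After writing the CSV, reading it back is a cheap sanity check that the row count and column names match what the grader expects (the ids and predictions below are hypothetical):

```python
import pandas as pd

# hypothetical submission values
x_test_id = [101, 102, 103]
model_pred = [0.2, 0.7, 0.5]

result = pd.DataFrame({'enrollee_id': x_test_id, 'target': model_pred})
result.to_csv('result.csv', index=False)

# read the file back and confirm the shape and columns
check = pd.read_csv('result.csv')
print(check.shape, check.columns.tolist())
```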

 

๐Ÿ“š ๋ฌธ์ œ ํ’€์ด 

1.    ๋ถ„๋ฅ˜ ( RandomForestClassifier )

# ๋ฐ์ดํ„ฐ ํŒŒ์•…
# print(train.info()) # ์‚ญ์ œ : ID / pop : Segmentation  
# print(test.info()) # pop : ID 

train = train.drop(columns = ['ID'])
train_seg = train.pop('Segmentation')
test_id = test.pop('ID')

# object columns : Gender / Ever_Married / Graduated / Profession / Spending_Score / Var_1
# One-hot encoding : Gender, Ever_Married, Graduated, Spending_Score
# Label encoding : Profession, Var_1

col = ['Gender', 'Ever_Married', 'Graduated', 'Spending_Score']
train = pd.get_dummies(data=train, columns=col)
test = pd.get_dummies(data=test, columns=col)

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
train['Profession'] = encoder.fit_transform(train['Profession'])
test['Profession'] = encoder.transform(test['Profession'])

train['Var_1'] = encoder.fit_transform(train['Var_1'])
test['Var_1'] = encoder.transform(test['Var_1'])

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(train, train_seg, test_size = 0.2)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(xtrain, ytrain)
model_pred = model.predict(test)

result = pd.DataFrame({'ID': test_id, 'Segmentation': model_pred})
print(result)

pd.DataFrame({'ID': test_id, 'Segmentation': model_pred}).to_csv('submission.csv', index = False)

 

2.    ํšŒ๊ท€ ( RandomForestRegressor ) โญ๏ธ ( 7ํšŒ ์ž‘์—…ํ˜• 2 ํšŒ๊ท€๋ฌธ์ œ ์ œ์ถœ)

# ๋ณดํ—˜ ์š”๊ธˆ??? ํšŒ๊ท€ ๋ชจ๋ธ RandomForestRegressor
# Insurance Prediction (Regression)
# ์˜ค๋Š˜ ์ €ํฌ๋Š” ์˜๋ฃŒ๋ณดํ—˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•ด ํ•œ ์‚ฌ๋žŒ์ด ๋ณดํ—˜๋ฃŒ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋‚ผ์ง€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํšŒ๊ท€ ๋ฌธ์ œ๋ฅผ ๋‹ค๋ค„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

# print(x_train.info()) # unnecessary column: id / missing values: none / object : sex, smoker, region
# print(x_test.info()) # missing values: none / object : sex, smoker, region
# print(y_train.info())

# ๊ฒฐ์ธก์น˜์žˆ๋Š”์ง€ ํ™•์ธ
# print(x_train.isnull().sum())
# print(x_test.isnull().sum())
# print(y_train.isnull().sum())

# ๋ถˆํ•„์š” ์ปฌ๋Ÿผ ์‚ญ์ œ
x_train = x_train.drop(columns=['id'])
x_test_id = x_test.pop('id')
y_train = y_train.drop(columns=['id'])

# ์›ํ•ซ ์ธ์ฝ”๋”ฉ : sex, smoker / ๋ผ๋ฒจ : region
ocol = ['sex', 'smoker']
x_train = pd.get_dummies(data = x_train, columns = ocol)
x_test = pd.get_dummies(data = x_test, columns = ocol)

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
x_train['region'] = encoder.fit_transform(x_train['region'])
x_test['region'] = encoder.transform(x_test['region'])

# ์Šค์ผ€์ผ๋ง
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
rcol = ['age', 'bmi']
x_train[rcol] = scaler.fit_transform(x_train[rcol])
x_test[rcol] = scaler.transform(x_test[rcol])

# ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
# from sklearn.model_selection import train_test_split
# x_train1, x_test1, y_train1, y_test1 = train_test_split(x_train, y_train, test_size = 0.2)

# ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’ ์ฐพ๊ธฐ ( ์ด๋–„๋Š” ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌํ•œ x_test1์„ ์‚ฌ์šฉํ•จ)
# from sklearn.ensemble import RandomForestRegressor
# model3 = RandomForestRegressor(n_estimators = 100, max_depth=5)
# model3.fit(x_train1, np.ravel(y_train1))
# model_pred3 = model3.predict(x_test1)

# from sklearn.metrics import mean_squared_error
# rmse2 = np.sqrt(mean_squared_error(y_test1, model_pred3))
# print("RMSE:", rmse2)

# ์‹ค์ œ ์ œ์ถœ์„ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ํ•™์Šต ์‹œํ‚ค๊ธฐ ( ์ด๋•Œ๋Š” ์‹ค์ œ ์ œ์ถœํ•  x_test๋ฅผ ์‚ฌ์šฉํ•จ)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100, max_depth=5)
model.fit(x_train, np.ravel(y_train))
model_pred = model.predict(x_test)

result = pd.DataFrame({'id': x_test_id , 'charges':model_pred})#.to_csv('123.csv', index=False)
print(result)

# y_test is not provided for this problem, so the final score cannot be computed locally (see the note above on using default hyperparameters)

 

Pixabay๋กœ๋ถ€ํ„ฐ ์ž…์ˆ˜๋œ Pexels๋‹˜์˜ ์ด๋ฏธ์ง€ ์ž…๋‹ˆ๋‹ค.

 

