
[Practical Exam] Big Data Analysis Engineer: Task Type 1 Summary (Python)

by 💾고구마맛탕먹고싶다 2023. 12. 1.

๋น…๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ธฐ์‚ฌ ์‹ค๊ธฐ -  ์ž‘์—…ํ˜• 1 ์ •๋ฆฌ๋ณธ

🚨 All code is written in Python.

 

1.    Using a function

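# Return 80% of Sales for rows where Events == 1; otherwise return Sales unchanged
def df_events(x):
    if x['Events'] == 1:
        return x['Sales'] * 0.8
    else:
        return x['Sales']

# axis=1 passes each row to df_events
df['RSales'] = df.apply(df_events, axis=1)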
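
As a rough sketch, the same rule can also be written without a row-wise apply by using np.where, which evaluates the condition for the whole column at once. The sample values below are made up for illustration; only the column names come from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the exam dataset)
df = pd.DataFrame({'Events': [1, 0, 1], 'Sales': [100.0, 200.0, 50.0]})

def df_events(x):
    # Row-wise version: 80% of Sales when Events == 1
    if x['Events'] == 1:
        return x['Sales'] * 0.8
    return x['Sales']

df['RSales'] = df.apply(df_events, axis=1)

# Vectorized version of the same rule
df['RSales_vec'] = np.where(df['Events'] == 1, df['Sales'] * 0.8, df['Sales'])

print(df)
```

Both columns should hold the same values; the vectorized form is usually faster on large frames.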

 

2.    Using merge and dropna ⭐️ (dropna appeared in Task Type 1 of the 7th exam)

# Merge the basic1 and basic3 data on the 'f4' column
df = b1.merge(b3, how='inner', on='f4')

# Drop rows of the merged data where r2 is missing
df = df.dropna(subset=['r2'])
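
To make the effect of the two steps concrete, here is a small sketch on made-up frames. The names b1, b3, f4, and r2 follow the snippet above; the 'id' column and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the basic1 and basic3 data
b1 = pd.DataFrame({'id': [1, 2, 3], 'f4': ['ENFJ', 'ISFJ', 'INTP']})
b3 = pd.DataFrame({'f4': ['ENFJ', 'ISFJ', 'ESTJ'], 'r2': [0.5, np.nan, 0.9]})

# Inner merge keeps only f4 values present in both frames (ENFJ, ISFJ)
merged = b1.merge(b3, how='inner', on='f4')

# dropna(subset=...) removes rows only when r2 itself is missing
cleaned = merged.dropna(subset=['r2'])

print(len(merged), len(cleaned))
```

Passing subset restricts the check to the listed columns, so missing values in other columns do not cause rows to be dropped.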

 

3.    Sorting

# Ascending order: 1, 2, 3, 4
df = df[df['f2'] == 0].sort_values('age', ascending=True).head(20)

# Descending order: 4, 3, 2, 1
df = df.reset_index().sort_values('f5', ascending=False)

 

4.    IQR ⭐️ (appeared in Task Type 1 of the 7th exam)

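import numpy as np

# First and third quartiles and the interquartile range
Q1 = np.percentile(df['Fare'], 25)
Q3 = np.percentile(df['Fare'], 75)
IQR = Q3 - Q1

# Rows below the lower fence and rows above the upper fence
out1 = df[df['Fare'] < (Q1 - 1.5 * IQR)]
out3 = df[df['Fare'] > (Q3 + 1.5 * IQR)]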
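
One thing to watch with this pattern: np.percentile returns NaN when the column contains missing values, while Series.quantile skips them. The sketch below uses made-up Fare values (including one NaN) to show the difference and collects both kinds of outliers in a single frame.

```python
import numpy as np
import pandas as pd

# Hypothetical Fare values with one missing entry
df = pd.DataFrame({'Fare': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 100.0, np.nan]})

# np.percentile propagates NaN; Series.quantile ignores it
q1_numpy = np.percentile(df['Fare'], 25)
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Comparisons with NaN are False, so the missing row is not flagged
outliers = df[(df['Fare'] < lower) | (df['Fare'] > upper)]

print(q1_numpy, Q1, Q3, len(outliers))
```

If the NumPy call is preferred, np.nanpercentile is the variant that ignores missing values.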

 

5.    Rounding up (ceil), rounding down (floor), and truncation

up = np.ceil(df['age']).mean()
down = np.floor(df['age']).mean()
drop = np.trunc(df['age']).mean()
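
For non-negative data such as ages, floor and trunc give the same result; they differ only for negative numbers. A minimal sketch with two made-up values:

```python
import numpy as np

values = np.array([-1.5, 1.5])

ceil_vals = np.ceil(values)    # rounds toward +infinity
floor_vals = np.floor(values)  # rounds toward -infinity
trunc_vals = np.trunc(values)  # drops the fractional part (rounds toward zero)

print(ceil_vals, floor_vals, trunc_vals)
```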

 

6.    Skewness and kurtosis after scaling

df['SalePrice'] = np.log1p(df['SalePrice']) # log(1 + x) transform
spskew2 = df['SalePrice'].skew() # Series.skew(): skewness
spkurt2 = df['SalePrice'].kurt() # Series.kurt(): excess kurtosis (normal distribution = 0)
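
Exam questions of this type often ask for the statistic before and after the transform, so it can help to keep the original column instead of overwriting it. The sketch below assumes a made-up, strongly right-skewed SalePrice column; the 'SalePrice_log' name is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices
df = pd.DataFrame({'SalePrice': [1.0, 10.0, 100.0, 1000.0, 10000.0]})

skew_before = df['SalePrice'].skew()

# Store the transformed values in a new column so both versions stay available
df['SalePrice_log'] = np.log1p(df['SalePrice'])
skew_after = df['SalePrice_log'].skew()

# np.expm1 is the inverse of np.log1p
restored = np.expm1(df['SalePrice_log'])

print(skew_before, skew_after)
```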

 

7.    Grouping ⭐️ (appeared in Task Type 1 of the 7th exam)

# Group by 'city' and 'f2' and compute the group sums
df2 = df.groupby(['city','f2']).sum()
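
How .sum() treats non-numeric columns has varied across pandas versions, so one cautious option is to select the columns to aggregate explicitly. The sketch below uses made-up data; the 'f1' and 'name' columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with one numeric and one text column
df = pd.DataFrame({
    'city': ['Seoul', 'Seoul', 'Busan', 'Busan'],
    'f2': [0, 0, 1, 1],
    'f1': [10, 20, 30, 40],
    'name': ['a', 'b', 'c', 'd'],
})

# Aggregate only the numeric column of interest
df2 = df.groupby(['city', 'f2'])[['f1']].sum()

# reset_index turns the group keys back into ordinary columns
flat = df2.reset_index()

print(flat)
```

After reset_index, the result can be sorted or filtered like any other frame, which is convenient when the question asks for the top group.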

 

8.    Replacing values

# Replace 'ESFJ' with 'ISFJ' in the 'f4' column
df['f4'] = df['f4'].replace('ESFJ', 'ISFJ')

 

9.    Finding upper and lower cutoff values

# ์ƒ์œ„ 5%์™€ ํ•˜์œ„ 5%
low = df['f5M'].quantile(0.05)
high = df['f5M'].quantile(0.95)
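
The two cutoffs are typically used to filter rows afterwards. A minimal sketch, assuming a made-up 'f5M' column holding the values 1 to 100:

```python
import numpy as np
import pandas as pd

# Hypothetical column with values 1..100
df = pd.DataFrame({'f5M': np.arange(1, 101, dtype=float)})

low = df['f5M'].quantile(0.05)
high = df['f5M'].quantile(0.95)

# Rows at or below the 5th percentile and at or above the 95th percentile
bottom = df[df['f5M'] <= low]
top = df[df['f5M'] >= high]

print(low, high, len(bottom), len(top))
```

Whether the comparison should be strict (<) or inclusive (<=) depends on the wording of the question.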

 

10. Date conversion

df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['wo'] = df['Date'].dt.dayofweek # Used to split weekdays and weekends (Monday=0, Sunday=6)
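
Since dayofweek numbers Monday as 0 and Sunday as 6, a weekend flag is simply a check for values of 5 or more. The sketch below uses made-up dates; the 'is_weekend' column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical dates: Friday, Saturday, Sunday, Monday
df = pd.DataFrame({'Date': ['2023-12-01', '2023-12-02', '2023-12-03', '2023-12-04']})

df['Date'] = pd.to_datetime(df['Date'])
df['wo'] = df['Date'].dt.dayofweek

# Saturday (5) and Sunday (6) count as the weekend
df['is_weekend'] = df['wo'] >= 5

print(df)
```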

 

11.  Splitting into equal-sized groups

๋™์ผํ•œ ๊ฐœ์ˆ˜๋กœ ๋‚˜์ด ์ˆœ์œผ๋กœ 3๊ทธ๋ฃน์œผ๋กœ ๋‚˜๋ˆˆ ๋’ค ๊ฐ ๊ทธ๋ฃน์˜ ์ค‘์•™๊ฐ’์„ ๋”ํ•˜์‹œ์˜ค
df['range'] = pd.qcut(df['age'], q=3, labels=['1', '2', '3'])
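
The qcut call covers only the grouping step of the task. One way to finish it, sketched here on made-up ages, is to take the median of each group and add them up.

```python
import pandas as pd

# Hypothetical ages
df = pd.DataFrame({'age': [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Three quantile-based bins with (roughly) equal row counts
df['range'] = pd.qcut(df['age'], q=3, labels=['1', '2', '3'])

# Median age within each bin, then the sum of the three medians
medians = df.groupby('range', observed=True)['age'].median()
total = medians.sum()

print(medians)
print(total)
```

With ties or a row count that is not divisible by 3, the groups may not be exactly the same size, so it is worth checking df['range'].value_counts().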

 

12.  Dropping the later rows when duplicates occur

# keep='first' is the default: the first row for each age is kept, later duplicates are dropped
df = df.drop_duplicates(subset=['age'])
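
The keep parameter controls which of the duplicated rows survives. A short sketch on made-up data (the 'name' column is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows where age 20 appears twice
df = pd.DataFrame({'age': [20, 30, 20], 'name': ['a', 'b', 'c']})

# Default keep='first': the later duplicate ('c') is removed
first = df.drop_duplicates(subset=['age'])

# keep='last': the earlier duplicate ('a') is removed instead
last = df.drop_duplicates(subset=['age'], keep='last')

print(first)
print(last)
```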

 

13.  Other frequently used functions

Name                               Usage
Variance                           df['f1'].var()
Mean                               df['f5'].mean()
Median (filling missing values)    df['f1'].fillna(df['f1'].median())
Mode (filling missing values)      df['f1'].fillna(df['f1'].mode()[0])
Standard deviation                 df[conI]['f1'].std()
Cumulative sum                     df[con]['f1'].cumsum()

 
