[Big Data Analysis Engineer][Type 2] Working through a practice problem (multiclass classification)

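[Problem]

Automobile market segmentation

o An automobile company has divided its market into four segments to develop a new strategy. Using the existing customer segmentation data, predict which segment each new customer belongs to.

 - Data: X_train, y_train, X_test

 - Target (y): "Segmentation" (1, 2, 3, 4)

 - Evaluation: Macro F1-score

 - Submission format: ID, Segmentation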
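
Because the evaluation metric is macro F1, it helps to see what that average actually computes: an F1 score per class, then an unweighted mean, so each of the four segments counts equally regardless of its size. The following is a minimal pure-Python sketch under that definition. The function name `macro_f1` and the sample labels are made up for illustration and are not part of the exam data. For these inputs it should agree with `f1_score(..., average='macro')` from scikit-learn.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Hypothetical labels for the four segments
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]

score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7333
```

Here classes 1 and 3 each score 2/3 and classes 2 and 4 each score 4/5, so the macro average is 11/15.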

 

[Solution]

# Macro F1 for multiclass classification: f1_score(y_val, pred, average='macro')

# Import libraries
import pandas as pd

# Read X_train.csv, Y_train.csv, and X_test.csv
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('Y_train.csv')
X_test = pd.read_csv('X_test.csv')
# print(X_train.shape, y_train.shape, X_test.shape)

###### EDA
# print(X_train.shape, y_train.shape, X_test.shape)
# print(X_train.head(3))
# print(y_train.head(3))
# print(X_train.info())
# print(y_train.info())
# print(X_test.info())
# print(X_train.isnull().sum())
# print(X_test.isnull().sum())
# print(y_train['Segmentation'].value_counts())

###### Data preprocessing
# Drop the ID column from the training data; set the test IDs aside for the submission file
# print(X_train.shape, y_train.shape)
X_train = X_train.drop('ID', axis=1)
X_id = X_test.pop('ID')
# print(X_train.shape, y_train.shape)
# print(help(pd.DataFrame.drop))

# Split numeric and categorical columns
n_train = X_train.select_dtypes(exclude = 'object').copy()
n_test = X_test.select_dtypes(exclude = 'object').copy()

c_train = X_train.select_dtypes(include = 'object').copy()
c_test = X_test.select_dtypes(include = 'object').copy()
# print(n_train.head(3))
# print(n_train.shape, c_train.shape)

# Numeric columns: MinMaxScaler (fit on train, then apply to test)
from sklearn.preprocessing import MinMaxScaler
cols = X_train.select_dtypes(exclude='object').columns
scaler = MinMaxScaler()
n_train[cols] = scaler.fit_transform(n_train[cols])
n_test[cols] = scaler.transform(n_test[cols])
# print(n_train.head(2), n_test.head(2))

# Categorical columns: one-hot encode with pd.get_dummies()
# Concatenate train and test first so both end up with the same dummy columns
cols = X_train.select_dtypes(include='object').columns
all_df = pd.concat([c_train, c_test])
all_df = pd.get_dummies(all_df[cols])
# print(all_df.shape)

# Split the encoded frame back into train rows and test rows
line = int(c_train.shape[0])
c_train = all_df.iloc[:line, :].copy()
c_test = all_df.iloc[line:, :].copy()

X_train = pd.concat([n_train, c_train], axis=1)
X_test = pd.concat([n_test, c_test], axis=1)
# print(X_train.shape, X_test.shape)
# print(X_train.head(3))

#### Split into training and validation sets
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_train,
                                            y_train['Segmentation'],
                                            test_size= 0.2,
                                            random_state = 2022)
# print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

#### Build and evaluate the model
# Classification (RandomForest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=400, max_depth=8, random_state=2022)
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
# print(pred)
# print(help(RandomForestClassifier))

# Evaluation (macro F1-score)
from sklearn.metrics import f1_score
# print(f1_score(y_val, pred, average='macro'))

# Validation macro F1 for each setting tried:
# n_estimators=200, max_depth=4, random_state=2022 -> 0.5068786784466484
# n_estimators=400, max_depth=8, random_state=2022 -> 0.5359599787533972
# n_estimators=500, max_depth=8, random_state=2022 -> 0.5347514180146223

#### Final prediction on X_test and building the submission file
# print(X_test.head(3))
pred = model.predict(X_test)
# print(pred)

submit = pd.DataFrame({
    'ID' : X_id,
    'Segmentation' : pred
})

print(submit)

submit.to_csv('0000.csv', index=False)
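
The scores recorded in the comments of the script came from re-running the model by hand with different `n_estimators` and `max_depth` values. The same comparison can be written as a small loop. This is only a sketch: it uses synthetic four-class data from `make_classification` as a stand-in for the real X_train and y_train, and the candidate settings are arbitrary small values chosen to run quickly, so the scores it prints say nothing about the exam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the real training set (assumption)
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=2022)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=2022)

# Hypothetical (n_estimators, max_depth) pairs to compare
candidates = [(50, 4), (50, 8), (100, 8)]

scores = {}
for n_estimators, max_depth in candidates:
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   random_state=2022)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_val)
    # Same metric as the problem statement: macro-averaged F1
    scores[(n_estimators, max_depth)] = f1_score(y_val, pred, average='macro')

# Pick the setting with the highest validation score
best_params = max(scores, key=scores.get)
for params, score in scores.items():
    print(params, score)
print('best:', best_params)
```

With the real data, `X` and `y` would be replaced by the preprocessed X_train and `y_train['Segmentation']`, and the best setting would then be used for the final prediction on X_test.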