Code Craft #4 - Solving Kaggle's Spaceship Titanic with Optuna - A Complete Walkthrough for Hyperparameter Optimization
Introduction
Welcome to this comprehensive walkthrough on solving Kaggle’s Spaceship Titanic competition using Optuna for hyperparameter optimization. In this blog, we will delve into the details of how to efficiently tune hyperparameters and select features to build a robust machine learning model. This blog complements the video tutorial, providing all the code snippets and explanations in a written format for easy reference.
Understanding Hyperparameter Optimization
Hyperparameter optimization is a crucial step in the machine learning pipeline. It involves finding the best values for the settings a learning algorithm does not learn from the data itself, such as the number of trees in a forest or their maximum depth. These settings can significantly impact model performance, and tuning them correctly is often the difference between a mediocre model and a top-performing one.
Optuna is an automatic hyperparameter optimization framework designed to make tuning fast and efficient. Instead of exhaustively trying combinations the way grid search does, it uses sampling algorithms such as the Tree-structured Parzen Estimator (TPE, its default sampler) to focus trials on promising regions of the search space.
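To make this concrete, here is a minimal, self-contained Optuna example, independent of the competition code below, that minimizes a simple quadratic. The names toy_objective and toy_study, the parameter name x, and the bounds are all arbitrary choices for illustration.

import optuna

# A toy objective: Optuna calls this once per trial, suggesting a new x each time
def toy_objective(trial):
    x = trial.suggest_float('x', -10.0, 10.0)
    return (x - 2) ** 2

# TPE is the default sampler; direction='minimize' because we return an error
toy_study = optuna.create_study(direction='minimize')
toy_study.optimize(toy_objective, n_trials=30)
print(toy_study.best_params)  # Should land near {'x': 2.0}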
Loading and Exploring the Dataset
First, let's import the necessary libraries and load the dataset.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import optuna
# Load the train and test dataset
data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
# Check the shape of the data
data.shape
# Output: (8693, 14)
# Display the first few rows
data.head(2)
# Output: First two rows of the dataset
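Before preprocessing, it helps to know where the missing values are and which columns are non-numeric. This quick inspection (not shown in the video) motivates the imputation and encoding steps below.

# Count missing values per column and check the dtypes before imputing
print(data.isna().sum())
print(data.dtypes)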
Data Preprocessing
Next, we preprocess the data to prepare it for model training.
# Fill missing values with 0 (a simple, uniform imputation strategy)
data.fillna(0, inplace=True)
# Drop the Passenger ID Column as it's the Unique Identifier
data.drop('PassengerId', inplace=True, axis=1)
# Convert mixed-type (object) columns to string so they encode consistently
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].astype(str)
# Identify categorical columns
categorical_cols = [col for col in data.columns if data[col].dtype == 'object']
# Apply ordinal encoding to categorical columns, keeping each fitted
# encoder so the same mapping can be reused on the test set
oe = {}
for col in categorical_cols:
    oe[col] = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    data[col] = oe[col].fit_transform(data[col].values.reshape(-1, 1))
# Separating features and target
X = data.drop('Transported', axis=1)
y = data['Transported']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
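Before tuning, it can be useful to record a baseline so we can judge how much Optuna actually helps. A minimal sketch, assuming the split above and using scikit-learn's default settings (this step is not part of the original walkthrough):

# Baseline: an untuned RandomForestClassifier for comparison with the tuned model
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)
print('Baseline accuracy:', accuracy_score(y_test, baseline.predict(X_test)))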
Defining the Objective Function for Optuna
We define an objective function that Optuna will maximize. Each trial samples the number of trees, the maximum tree depth, and how many features to keep via SelectKBest, then scores the resulting pipeline on the held-out split. The fitted pipeline is stored on the trial with set_user_attr so the best model can be retrieved later without retraining (note this keeps every trial's model in memory, which is fine for 50 trials).
def objective(trial):
    """
    Objective function for Optuna: tunes hyperparameters and the number
    of selected features at the same time.
    """
    n_estimators = trial.suggest_int('n_estimators', 100, 1000)
    max_depth = trial.suggest_int('max_depth', 2, 32)
    k_best = trial.suggest_int('k_best', 1, X_train.shape[1])

    # Univariate feature selection over all columns
    select_k_best = SelectKBest(score_func=f_classif, k=k_best)
    preprocessor = ColumnTransformer([('select', select_k_best, X.columns)],
                                     remainder='passthrough')

    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier(n_estimators=n_estimators,
                                                                max_depth=max_depth,
                                                                random_state=42))])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Keep the fitted pipeline so the best trial's model can be reused later
    trial.set_user_attr('classifier', clf)
    return accuracy
Running the Optimization
We create a study and run the optimization for 50 trials. (Optuna's default TPE sampler is randomly seeded, so repeated runs may land on slightly different optima.)
# Running the optimization
study = optuna.create_study(study_name='best_features_and_hyperparameters', direction='maximize')
study.optimize(objective, n_trials=50)
# Output: Logs of each trial with accuracy and parameters
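Optuna also makes it easy to inspect the search itself. As an optional step (not in the original walkthrough), study.trials_dataframe() summarizes every trial, and the built-in visualization module, which requires plotly to be installed, plots the optimization history:

# Tabular summary of all trials: sampled parameters, accuracy ('value'), timing
print(study.trials_dataframe().sort_values('value', ascending=False).head())

# Interactive accuracy-vs-trial plot (requires plotly)
optuna.visualization.plot_optimization_history(study).show()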
Extracting the Best Hyperparameters and Features
After optimization, we extract the best hyperparameters and selected features.
# Extracting the best hyperparameters and selected features
print('Best hyperparameters: ', study.best_params)
best_trial = study.best_trial

# Re-fit a selector with the best k to see which columns it keeps
selector = SelectKBest(score_func=f_classif, k=best_trial.params['k_best']).fit(X_train, y_train)
print('Selected features: ', X.columns[selector.get_support()])
# Output: Best hyperparameters and selected features
Preparing the Submission
Finally, we prepare and submit our predictions to Kaggle.
# Load the test data
submission_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

# Keep PassengerId aside for the submission file; the model was not
# trained on it, so it must not be part of the features we predict on
passenger_ids = submission_data['PassengerId']
submission_data.drop('PassengerId', inplace=True, axis=1)

# Mirror the training preprocessing: fill missing values with 0 and
# cast object columns to string before encoding
submission_data.fillna(0, inplace=True)
for col in submission_data.columns:
    if submission_data[col].dtype == 'object':
        submission_data[col] = submission_data[col].astype(str)

# Apply the fitted ordinal encoders; unseen categories map to -1
for column, encoder in oe.items():
    submission_data[column] = encoder.transform(submission_data[column].values.reshape(-1, 1))

# Generate predictions with the best pipeline found by the study
best_clf = best_trial.user_attrs['classifier']
y_pred = best_clf.predict(submission_data)

# Build the submission with only the required columns
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Transported': y_pred})

# Save the DataFrame to the Kaggle working directory and upload to submit
submission.to_csv('/kaggle/working/submission.csv', index=False, header=True)
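As a quick, optional sanity check before uploading (not part of the original walkthrough), you can re-read the file and confirm its shape and columns:

# Verify the submission file: one row per test passenger, two columns
check = pd.read_csv('/kaggle/working/submission.csv')
print(check.shape)
print(check.head())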
Conclusion
After submitting our predictions to the Kaggle competition, we achieved a public leaderboard score of 0.79611, i.e. 79.611% accuracy. This result highlights how effective Optuna can be for joint hyperparameter optimization and feature selection.
I hope you found this tutorial insightful and valuable. Keep experimenting with different models and hyperparameters to further improve your score. For more detailed information and code snippets, you can refer to the video linked in the description.
Happy coding, and best of luck in your machine learning journey!