Classify Buyers and Non-buyers on Edmunds.com

Implemented binary logistic classification using XGBoost, and tuned hyperparameters with random search in parallel.

View on GitHub

About the Project

The main goal of this project was to answer one question: "If a customer leaves information on Edmunds.com for a particular car, is he/she going to buy the car?". By determining how likely a certain user will buy cars, we then decided if to pursue this user or not.

Methodology

  • Data was provided by Edmunds.com (data size: 1.7GB & 5 Tables)
  • Engineered 12 new features from 5 tables
  • Prepared dataset for modeling (missing value removal, outlier removal, bucketing, onehot encoding, and etc.)
  • Reduced data dimensionality with Chi-square test
  • Standardized numeric features
  • Activated parallel computing capabilities
  • Implemented binary logistic classification using XGBoost
  • Tuned hyperparameter using random search (Hyperparameter tuned: eta, nrounds, max_depth, eval_metric='AUC')
  • Used 3-fold cross validation to avoid overfitting
  • Model accuracy: 86.8%

Further Details

For more information, check out the Presentation Deck here and the Project Report here.

About DataFest

ASA DataFestTM is a data hackathon for undergraduate students, sponsored by the American Statistical Association and founded at UCLA, in 2011.

For more information, check out the official website here

Back