Python, scikit-learn, Classification, Clustering

Data Science Portfolio

A collection of ML and statistical modeling projects completed during the Post Graduate Program in Data Science & Business Analytics at Texas McCombs School of Business (2022-2024).

ReneWind — Predictive Maintenance for Wind Turbines

Objective: Build and tune classification models that identify generator failures in wind turbines early enough to repair components before breakdown, reducing maintenance costs.

  • Techniques: Oversampling and undersampling, Regularization, Hyperparameter tuning, Random Forest, XGBoost.
  • Key Challenge: Highly imbalanced dataset: failures are rare but costly to miss.
  • Result: A tuned model that prioritizes catching failures, so fewer go undetected and generator downtime is reduced.
  • Relevance: Directly applicable to my work in industrial predictive maintenance.
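
To illustrate the imbalance handling listed above, here is a minimal sketch of random oversampling followed by a Random Forest. It is not the project's actual pipeline: the data is synthetic (generated with `make_classification`), the roughly 5% failure rate is an assumption, and recall on the failure class is used as one plausible metric for "catching failures".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for sensor data: roughly 5% positives ("failures").
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Oversample the minority class in the training split only, so the test
# split keeps the original class balance.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate(
    [np.zeros(len(majority), dtype=int), np.ones(len(minority_up), dtype=int)]
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_bal, y_bal)

# Recall on the failure class: the share of true failures the model flags.
failure_recall = recall_score(y_test, model.predict(X_test))
print(f"Recall on failures: {failure_recall:.2f}")
```

Resampling after the train/test split matters: duplicating minority rows before splitting would leak copies of the same row into both sets and inflate the score.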

Trade&Ahead — Stock Market Clustering

Objective: Analyze stock data to group stocks based on financial attributes and provide insights into the characteristics of each market segment.

  • Techniques: K-means Clustering, Hierarchical Clustering, Cluster Profiling.
  • Key Challenge: Feature scaling and determining the optimal number of clusters for meaningful profiling.
  • Result: Identified distinct groups of stocks to assist in segment-based investment analysis.
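
The two challenges named above, scaling and choosing the number of clusters, can be sketched as follows. This is an illustration on synthetic data, not the project's code: the three blob centers are made up, and the silhouette score is one common selection criterion assumed here (the elbow method is another).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for financial attributes: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
# Put the second attribute on a much larger scale; without scaling it
# would dominate the Euclidean distances K-means relies on.
X[:, 1] *= 1000
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for several k and score each partition.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print("Silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
print("Selected k:", best_k)
```

Once k is fixed, cluster profiling usually amounts to comparing per-cluster means of the original, unscaled attributes.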

EasyVisa — Visa Approval Prediction

Objective: Build a predictive model that supports the visa approval process by identifying which applicant profiles are most likely to be certified or denied.

  • Techniques: Bagging, Random Forest, AdaBoost, Gradient Boosting, XGBoost, Stacking Classifier.
  • Key Challenge: Tuning hyperparameters with GridSearchCV to find the best-performing model and the factors that most influence its predictions.
  • Result: Developed an ensemble-based model that identifies significant factors influencing visa status.
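
As a rough sketch of the tuning step described above, the snippet below runs `GridSearchCV` over a small grid for a Gradient Boosting classifier. The data is synthetic, and the grid values, the F1 scoring choice, and the 3-fold setting are assumptions for illustration rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the applicant data.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Hypothetical grid: kept small so the search finishes quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid=param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_score = search.score(X_test, y_test)  # F1 on held-out data
importances = search.best_estimator_.feature_importances_

print("Best parameters:", best_params)
print(f"Held-out F1: {test_score:.2f}")
```

The `feature_importances_` of the refit best estimator are one way to rank influential factors, although impurity-based importances can be biased toward high-cardinality features.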

INN Hotels — Booking Cancellation Prediction

Objective: Analyze booking data to identify factors influencing cancellations and build a predictive model to forecast cancellations in advance.

  • Techniques: Logistic Regression, Decision Trees, Pruning, selecting the optimal classification threshold from the ROC curve.
  • Key Challenge: Addressing multicollinearity among predictors and forecasting cancellations precisely enough to balance the business impact of wrong predictions.
  • Result: Actionable model supporting the formulation of profitable cancellation and refund policies.
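
Threshold selection from the ROC curve, as listed in the techniques above, might look like the sketch below. It uses synthetic data and picks the threshold that maximizes Youden's J statistic (TPR minus FPR); that criterion is an assumption, since the page does not say which rule was applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for booking data: about 30% positives ("cancelled").
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_clusters_per_class=1,
    weights=[0.7, 0.3],
    random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the threshold on training data so the test set stays untouched.
train_proba = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, train_proba)
best_idx = np.argmax(tpr - fpr)  # Youden's J = TPR - FPR
best_threshold = thresholds[best_idx]

# Apply the chosen threshold instead of the default 0.5.
test_proba = model.predict_proba(X_test)[:, 1]
y_pred = (test_proba >= best_threshold).astype(int)
auc = roc_auc_score(y_test, test_proba)

print(f"Chosen threshold: {best_threshold:.3f}, test AUC: {auc:.3f}")
```

If false alarms and missed cancellations carry different costs, a cost-weighted criterion would be a more fitting replacement for Youden's J.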

ReCell — Dynamic Pricing for Refurbished Devices

Objective: Analyze data on used devices to build a model that supports a dynamic pricing strategy and identifies the factors that influence price.

  • Techniques: EDA, Linear Regression, verification of linear regression assumptions.
  • Key Challenge: Validating regression assumptions to ensure reliable price predictions across diverse device conditions.
  • Result: Created a strategic model to optimize the valuation and pricing of refurbished hardware.
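
Two of the usual assumption checks can be sketched in a few lines: multicollinearity via the variance inflation factor (VIF) and a basic residual check. The data and the helper `variance_inflation_factors` are hypothetical, written here for illustration; the project may well have used a library routine instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R^2_j), regressing each column on the others."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic predictors: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.05 * rng.normal(size=n)
X = np.column_stack([a, b, c])
y = 3.0 * a - 2.0 * b + rng.normal(scale=0.5, size=n)

vifs = variance_inflation_factors(X)
print("VIF per column (a, b, c):", np.round(vifs, 1))

# Refit without the collinear column and inspect the residuals.
model = LinearRegression().fit(X[:, :2], y)
residuals = y - model.predict(X[:, :2])
print(f"Residual mean: {residuals.mean():.2e}, std: {residuals.std():.3f}")
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity. Residual plots against fitted values would be the next step for checking linearity and homoscedasticity.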

E-news Express — A/B Testing & Statistical Inference

Objective: Determine whether a new landing page for an online news portal is more effective at gathering new subscribers than the existing page.

  • Techniques: Hypothesis Testing, A/B testing, Statistical Inference, Data Visualization.
  • Key Challenge: Testing whether conversion depends on preferred language and whether the observed differences are statistically significant.
  • Result: Data-driven conclusion on the effectiveness of the new landing page to guide business growth.
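
A comparison of conversion rates like the one described above is often framed as a one-sided two-proportion z-test. The sketch below uses only the standard library. The counts are hypothetical and do not come from the project, and the helper `two_proportion_z_test` is defined here for illustration.

```python
import math


def two_proportion_z_test(conv_new, n_new, conv_old, n_old):
    """One-sided test of H1: the new page converts at a higher rate."""
    p_new = conv_new / n_new
    p_old = conv_old / n_old
    # Pooled conversion rate under H0 (both pages convert equally).
    p_pool = (conv_new + conv_old) / (n_new + n_old)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    # P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 120 of 1000 visitors convert on the new page,
# 90 of 1000 on the old page.
z_stat, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z_stat:.3f}, one-sided p = {p_value:.4f}")
```

With these made-up counts the p-value falls below 0.05, so the null hypothesis of equal conversion would be rejected at the 5% level. Testing dependence on preferred language would typically use a chi-square test of independence instead.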

FoodHub — Order Analysis & Business Insights

Objective: Perform exploratory data analysis on registered customer orders to extract actionable insights for a food aggregator company.

  • Techniques: Univariate analysis, Bivariate analysis, Python-based EDA.
  • Key Challenge: Identifying key variables from order data to answer critical business optimization questions.
  • Result: Provided insights that directly assist the company in improving service and business operations.
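
For readers unfamiliar with the terms above, univariate analysis looks at one column at a time and bivariate analysis relates two columns. A minimal pandas sketch follows, with a made-up `orders` table whose column names and values are purely hypothetical.

```python
import pandas as pd

# Hypothetical order data; not taken from the project's dataset.
orders = pd.DataFrame(
    {
        "cuisine": ["Italian", "Italian", "Thai", "Thai", "Thai", "Mexican"],
        "day_type": ["Weekend", "Weekday", "Weekend", "Weekend", "Weekday", "Weekend"],
        "order_cost": [20.0, 30.0, 12.0, 18.0, 15.0, 25.0],
        "delivery_minutes": [25, 30, 20, 22, 28, 24],
    }
)

# Univariate: the distribution of a single variable.
cost_summary = orders["order_cost"].describe()
cuisine_counts = orders["cuisine"].value_counts()

# Bivariate: how one variable changes with another.
mean_cost_by_cuisine = orders.groupby("cuisine")["order_cost"].mean()
mean_delivery_by_day = orders.groupby("day_type")["delivery_minutes"].mean()

print(cuisine_counts)
print(mean_cost_by_cuisine)
print(mean_delivery_by_day)
```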

Skills Applied

  • Programming: Python (pandas, NumPy, scikit-learn, matplotlib, seaborn)
  • Supervised Learning: Linear Regression, Logistic Regression, Decision Trees
  • Ensemble Techniques: Random Forest, Bagging, Boosting (XGBoost, AdaBoost, Gradient Boosting), Stacking
  • Unsupervised Learning: K-Means Clustering, Hierarchical Clustering
  • Statistics: Hypothesis Testing, A/B testing, Statistical Inference
  • Model Optimization: Regularization, Hyperparameter tuning (GridSearchCV), ROC curve and AUC