Data Science Portfolio
A collection of ML and statistical modeling projects completed during the Post Graduate Program in Data Science & Business Analytics at Texas McCombs School of Business (2022-2024).
ReneWind — Predictive Maintenance for Wind Turbines
Objective: Build and tune classification models that identify generator failures in wind turbines early, so that repairs can happen before breakdown and maintenance costs are reduced.
- Techniques: Oversampling and undersampling, Regularization, Hyperparameter tuning, Random Forest, XGBoost.
- Key Challenge: Highly imbalanced dataset — failures are rare but costly.
- Result: A tuned model that prioritizes detecting failures in order to minimize generator downtime.
- Relevance: Directly applicable to my work in industrial predictive maintenance.
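The resampling step listed above can be sketched in a few lines. This is a minimal illustration of random oversampling and undersampling with NumPy, not the project's actual code: the data is synthetic, the function names are hypothetical, and the project may have used a different resampling method.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows (with replacement) until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    rng.shuffle(idx)
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate(
        [rng.choice(np.flatnonzero(y == cls), size=target, replace=False) for cls in classes]
    )
    rng.shuffle(idx)
    return X[idx], y[idx]

# Synthetic stand-in for sensor data: roughly 5% of rows are failures (label 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Resampling should be applied to the training split only, so that validation and test data keep the real class ratio.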
Trade&Ahead — Stock Market Clustering
Objective: Analyze stock data to group stocks based on financial attributes and provide insights into the characteristics of each market segment.
- Techniques: K-means Clustering, Hierarchical Clustering, Cluster Profiling.
- Key Challenge: Feature scaling and determining the optimal number of clusters for meaningful profiling.
- Result: Identified distinct groups of stocks to assist in segment-based investment analysis.
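A minimal sketch of the scaling and cluster-count selection described above, using scikit-learn. The data is synthetic, and the silhouette score is only one of several common criteria for choosing the number of clusters; the project may have relied on another, such as the elbow method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of financial attributes (not the project data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X[:, 0] *= 1000  # one feature on a much larger scale, as raw financial columns often are

# Scale first: K-means uses Euclidean distance, so unscaled features would dominate.
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate number of clusters with the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# Cluster profiling: mean of each original (unscaled) feature per cluster.
profile = np.vstack([X[final_labels == c].mean(axis=0) for c in range(best_k)])
```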
EasyVisa — Visa Approval Prediction
Objective: Build a predictive model that supports the visa approval process by identifying the applicant profiles most likely to be certified or denied.
- Techniques: Bagging, Random Forest, AdaBoost, Gradient Boosting, XGBoost, Stacking Classifier.
- Key Challenge: Tuning hyperparameters with GridSearchCV across several ensemble models to find the best-performing configuration.
- Result: Developed an ensemble-based model that identifies significant factors influencing visa status.
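The GridSearchCV tuning mentioned above might look roughly like the following. It is a sketch on synthetic data with a deliberately small, hypothetical parameter grid and an assumed F1 scoring choice; the project's actual models, grids, and metrics are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for applicant data (not the project data).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid, kept small so the search runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumed metric for this sketch
    cv=3,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

# Impurity-based importances give a first look at which features drive predictions.
importances = best_model.feature_importances_
```

Note that the grid search selects hyperparameters; the influential factors come from inspecting the fitted model afterwards, here through `feature_importances_`.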
INN Hotels — Booking Cancellation Prediction
Objective: Analyze booking data to identify factors influencing cancellations and build a predictive model to forecast cancellations in advance.
- Techniques: Logistic Regression, Decision Trees, Pruning, Choosing the optimal classification threshold from the ROC curve.
- Key Challenge: Addressing multicollinearity and weighing the business cost of missed cancellations against the cost of false alarms.
- Result: Actionable model supporting the formulation of profitable cancellation and refund policies.
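One common way to pick a classification threshold from the ROC curve is to maximize Youden's J statistic (true positive rate minus false positive rate). The sketch below assumes that criterion and uses synthetic data; the project may have chosen its threshold differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for booking data: about 30% of rows are cancellations (label 1).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, proba)

# Skip index 0: roc_curve prepends a sentinel threshold above every score.
j_scores = tpr[1:] - fpr[1:]
best_threshold = thresholds[1:][np.argmax(j_scores)]

auc = roc_auc_score(y_val, proba)
y_pred = (proba >= best_threshold).astype(int)
```

The AUC summarizes ranking quality across all thresholds, while the chosen threshold fixes one operating point on the curve.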
ReCell — Dynamic Pricing for Refurbished Devices
Objective: Analyze used-device data to build a model that predicts device prices, supports a dynamic pricing strategy, and identifies price-influencing factors.
- Techniques: Linear Regression, EDA, Verification of linear regression assumptions.
- Key Challenge: Validating regression assumptions to ensure reliable price predictions across diverse device conditions.
- Result: Created a strategic model to optimize the valuation and pricing of refurbished hardware.
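Two of the usual assumption checks, multicollinearity via variance inflation factors (VIF) and normality of residuals, can be sketched with NumPy and SciPy. The data is synthetic and the helper names are hypothetical; this is an illustration of the checks, not the project's own procedure.

```python
import numpy as np
from scipy import stats

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and residuals."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta, y - X1 @ beta

def vif(X):
    """Variance inflation factor per feature: 1 / (1 - R^2) from regressing it on the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, resid = fit_ols(others, X[:, j])
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic stand-in for device attributes and prices (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=500)

beta, residuals = fit_ols(X, y)
vifs = vif(X)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
shapiro_stat, shapiro_p = stats.shapiro(residuals)
```

A VIF near 1 indicates little multicollinearity; values above roughly 5 to 10 are a common rule-of-thumb warning sign.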
E-news Express — A/B Testing & Statistical Inference
Objective: Determine if a new landing page for an online news portal is more effective at gathering new subscribers than the existing design.
- Techniques: Hypothesis Testing, A/B testing, Statistical Inference, Data Visualization.
- Key Challenge: Testing whether conversion depends on preferred language and whether the observed differences are statistically significant.
- Result: Data-driven conclusion on the effectiveness of the new landing page to guide business growth.
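The two questions above map onto two standard tests: a one-sided two-proportion z-test for the conversion rates, and a chi-square test of independence for conversion versus language. The sketch below uses made-up illustrative counts, not the project's data or results.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for illustration only.
conversions_new, visitors_new = 33, 50
conversions_old, visitors_old = 21, 50

# Two-proportion z-test, H1: the new page converts at a higher rate than the old one.
p_new = conversions_new / visitors_new
p_old = conversions_old / visitors_old
p_pool = (conversions_new + conversions_old) / (visitors_new + visitors_old)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors_new + 1 / visitors_old))
z_stat = (p_new - p_old) / se
p_value = stats.norm.sf(z_stat)  # one-sided upper-tail p-value

# Chi-square test of independence: conversion status (rows) vs. preferred language (columns).
# Hypothetical contingency table with three languages.
table = np.array([
    [18, 16, 20],  # converted
    [16, 18, 12],  # not converted
])
chi2, p_lang, dof, expected = stats.chi2_contingency(table)
```

In both cases the p-value is compared against a significance level chosen before looking at the data, commonly 0.05.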
FoodHub — Order Analysis & Business Insights
Objective: Perform exploratory data analysis on registered customer orders to extract actionable insights for a food aggregator company.
- Techniques: Univariate analysis, Bivariate analysis, Python-based EDA.
- Key Challenge: Identifying key variables from order data to answer critical business optimization questions.
- Result: Provided insights that directly assist the company in improving service and business operations.
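As a rough illustration of the univariate and bivariate steps, here is a small pandas sketch. The table and its column names are hypothetical placeholders, not the project's dataset.

```python
import pandas as pd

# Tiny hypothetical order table; column names and values are illustrative only.
orders = pd.DataFrame({
    "cuisine": ["Korean", "Japanese", "Japanese", "Indian", "Korean", "Indian"],
    "order_cost": [30.75, 12.08, 12.23, 29.20, 11.59, 25.22],
    "day_type": ["Weekend", "Weekend", "Weekday", "Weekend", "Weekday", "Weekday"],
    "delivery_minutes": [20, 23, 28, 15, 24, 30],
})

# Univariate: distribution of a single numeric or categorical column.
cost_summary = orders["order_cost"].describe()
cuisine_counts = orders["cuisine"].value_counts()

# Bivariate: how one variable changes with another.
mean_delivery = orders.groupby("day_type")["delivery_minutes"].mean()
cost_delivery_corr = orders["order_cost"].corr(orders["delivery_minutes"])
```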
Skills Applied
- Programming: Python (pandas, NumPy, scikit-learn, matplotlib, seaborn)
- Supervised Learning: Linear Regression, Logistic Regression, Decision Trees
- Ensemble Techniques: Random Forest, Bagging, Boosting (XGBoost, AdaBoost, Gradient Boosting), Stacking
- Unsupervised Learning: K-Means Clustering, Hierarchical Clustering
- Statistics: Hypothesis Testing, A/B testing, Statistical Inference
- Model Optimization: Regularization, Hyperparameter tuning (GridSearchCV), ROC curve and AUC