Artifact Overview
This artifact documents an AI-guided scenario-based simulation in which I worked through three progressively complex real-world ML data challenges. For each scenario, I diagnosed the problem, proposed a solution with step-by-step reasoning, and evaluated trade-offs between different approaches. The session was conducted using the AIML-500 Data Challenge Coach on SchoolAI and covered missing data handling, class imbalance and fairness, and data privacy with scalability constraints.
I selected this artifact because it demonstrates applied ML problem-solving — not theoretical knowledge, but the ability to reason through ambiguous data situations under structured feedback. Where earlier artifacts showcased tool exploration (Artifact 1), collaborative communication (Artifact 2), and conceptual depth (Artifact 3), this artifact captures decision-making and trade-off analysis in realistic engineering contexts.
Scenarios Covered
The session was structured around three scenarios of increasing complexity, each requiring a decision followed by a detailed justification of the reasoning behind it.
1. Missing Data
Handling randomly scattered missing values in a student performance dataset prior to model training.
2. Class Imbalance & Fairness
Addressing underrepresented demographic groups in a predictive model and separating accuracy from fairness metrics.
3. Privacy & Scalability
Designing a production ML pipeline that satisfies data privacy requirements while maintaining performance at scale.
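The privacy-budget idea behind the third scenario can be sketched with the Laplace mechanism, one standard way to spend an epsilon budget when releasing an aggregate statistic. This is a minimal illustration, not the approach discussed in the session; the counting query and all values are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release a noisy statistic satisfying epsilon-differential privacy.

    Noise scale grows as sensitivity / epsilon: a tighter privacy budget
    (smaller epsilon) means more noise and lower utility.
    """
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(42)
true_count = 1_000  # hypothetical: number of students who passed
# Sensitivity of a counting query is 1: adding or removing
# one person changes the count by at most 1.
loose = laplace_mechanism(true_count, sensitivity=1, epsilon=1.0, rng=rng)
tight = laplace_mechanism(true_count, sensitivity=1, epsilon=0.01, rng=rng)
```

The `loose` release is usually close to the true count, while the `tight` release can be off by hundreds, which is the utility-versus-privacy trade-off in concrete form.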
Highlight: Scenario 1 — Missing Data
The first scenario presented a student pass/fail prediction dataset with 10% of values missing randomly in the hours_studied and prior_grade columns. I outlined a six-step plan covering inspection, imputation strategy, train/test split ordering, and post-imputation validation.
Highlight: Probing Question — Correlated Features
After the initial plan, the coach asked a targeted follow-up: if hours_studied correlates strongly with attendance_rate, what imputation method would leverage that relationship and what are its downsides? This required moving beyond univariate imputation thinking.
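One answer to that follow-up is model-based imputation, which regresses each incomplete column on the others and so exploits the correlation. The sketch below uses scikit-learn's `IterativeImputer` on synthetic data where `hours_studied` tracks a hypothetical `attendance_rate` column; the downside, as the question implies, is extra compute and the risk of baking a misspecified relationship into the imputed values.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data where hours_studied closely tracks attendance_rate.
rng = np.random.default_rng(1)
attendance = rng.uniform(0.5, 1.0, 300)
hours = 40 * attendance + rng.normal(0, 1.0, 300)
df = pd.DataFrame({"attendance_rate": attendance, "hours_studied": hours})

# Knock out 15% of hours_studied, keeping the true values for comparison.
missing = rng.random(300) < 0.15
truth = df.loc[missing, "hours_studied"].to_numpy()
df.loc[missing, "hours_studied"] = np.nan

# Each column is regressed on the others, so the attendance signal
# drives the fill-in values rather than a single global statistic.
imputed = IterativeImputer(random_state=1).fit_transform(df)
errors = np.abs(imputed[missing, 1] - truth)
```

Because the two columns are strongly correlated here, the imputed hours land close to the truth; a univariate median would ignore that signal entirely.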
Tools & Platform
- SchoolAI — AI-powered coaching platform hosting the Data Challenge Scenarios simulation
- AIML-500 Data Challenge Coach — Custom chatbot presenting progressively complex real-world ML data problems
- Python / scikit-learn (conceptual) — SimpleImputer, KNNImputer, IterativeImputer, train_test_split, cross-validation
- ML Fairness Frameworks (conceptual) — Demographic parity, equalized odds, subgroup performance evaluation
- Differential Privacy Concepts — Epsilon/delta budgeting, privacy accounting, utility vs. privacy trade-offs
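The subgroup-evaluation idea from the fairness list above can be illustrated with a toy example. The groups, labels, and predictions here are invented purely to show the mechanics: overall accuracy looks strong while one subgroup fares much worse, and the demographic parity gap is the difference in positive-prediction rates between groups.

```python
import numpy as np
import pandas as pd

# Invented predictions for two demographic groups, A (majority) and B.
g = np.array(["A"] * 80 + ["B"] * 20)
y_true = np.concatenate([np.ones(40), np.zeros(40), np.ones(10), np.zeros(10)])
y_pred = np.concatenate([
    np.ones(38), np.zeros(2),   # A positives: 38/40 caught
    np.zeros(40),               # A negatives: all correct
    np.ones(3), np.zeros(7),    # B positives: only 3/10 caught
    np.zeros(10),               # B negatives: all correct
])

overall_acc = (y_true == y_pred).mean()  # 0.91 — looks healthy
per_group = pd.DataFrame({
    "group": g,
    "correct": y_true == y_pred,
    "pred_pos": y_pred == 1,
}).groupby("group").mean()

# Demographic parity gap: spread in positive-prediction rates.
dp_gap = per_group["pred_pos"].max() - per_group["pred_pos"].min()
```

Group B's accuracy is only 0.65 despite the 0.91 overall figure, which is exactly why Scenario 2 insisted on separating accuracy from fairness metrics.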
Value Proposition
This artifact demonstrates that I can apply foundational ML knowledge to messy, real-world data problems — not just in theory, but through structured reasoning and decision-making under uncertainty. For employers or collaborators, it shows I understand where ML systems fail in practice: biased data, leakage, privacy risk, and scalability bottlenecks. I can think critically about trade-offs rather than defaulting to simple solutions, and I know how to structure a technically defensible approach before writing a single line of code.
The scenarios covered here reflect the kinds of decisions a data scientist or ML engineer faces on every real project. The ability to reason through them clearly — and to adjust thinking based on feedback — is the skill that separates engineers who build reliable systems from those who produce fragile ones.
Reflection
I selected this artifact for my portfolio because it captures applied problem-solving rather than just knowledge recall. The scenario format required me to commit to a decision and defend it, which is much closer to what real engineering work feels like than answering quiz questions or summarizing readings.
The most important takeaway from this session concerned data leakage. I had understood the concept before, but the chatbot's probing question — asking specifically about imputation order relative to the train/test split — made me articulate exactly why the ordering matters and what goes wrong when it is violated. That kind of precision is what I want to carry into future projects.
For a technical hiring audience, I would highlight the Scenario 2 work on fairness metrics and subgroup evaluation, since it shows awareness that overall accuracy is not a sufficient success criterion. For a non-technical audience, I would focus on the core insight from Scenario 1: that filling in missing data is not just a cleanup step, it is a modeling decision with real consequences for what the model learns.
If I were to redo this artifact, I would document the full Scenario 3 exchange in more detail, as the privacy and scalability discussion covered differential privacy, feature minimization, and production architecture trade-offs — all highly relevant to real-world ML deployment. That content deserves more visibility in the portfolio.