From 685c2412761bff78b63431a649c857fe3ccfd883 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 21:00:02 -0700 Subject: [PATCH 01/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index 3505a3a..e6a0adb 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -6,8 +6,9 @@ "source": [ "# Anomaly Detection in Azure ML Studio\n", "\n", - "Example anomaly detection machine learning job using a hybrid of Azure AI Machine Learning console resources combined with native Python in this Jupyter notebook which is intended to be uploaded and executed from 'Authoring -> Notebooks' in the web console. \n", + "This notebook demonstrates a fraud detection system built using Azure Machine Learning. It processes transaction data, trains a model to identify unusual activity, and evaluates its performance. The goal is to detect potential fraud while minimizing disruption to legitimate customers.\n", "\n", + "The process includes data preparation, model training, evaluation, and deployment considerations. Each step is explained in simple terms to help non-technical stakeholders understand how the system works and what decisions it supports.\n", "### Requirements\n", "\n", "1. 
You should already have created an Azure account, and created a [Subscription](https://techcommunity.microsoft.com/discussions/azure/understanding-azure-account-subscription-and-directory-/34800) and a [Workspace](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace?view=azureml-api-2).\n", From 69d220fc6b713f1a0e17e88ef5963a8049523163 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:02:08 -0700 Subject: [PATCH 02/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index e6a0adb..aa9e5a1 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -9,6 +9,22 @@ "This notebook demonstrates a fraud detection system built using Azure Machine Learning. It processes transaction data, trains a model to identify unusual activity, and evaluates its performance. The goal is to detect potential fraud while minimizing disruption to legitimate customers.\n", "\n", "The process includes data preparation, model training, evaluation, and deployment considerations. Each step is explained in simple terms to help non-technical stakeholders understand how the system works and what decisions it supports.\n", + "\n", + "Raw Transaction Data\n", + " ↓\n", + "Data Cleaning & Preparation\n", + " ↓\n", + "Feature Processing\n", + " ↓\n", + "Model Training (Anomaly Detection)\n", + " ↓\n", + "Model Evaluation\n", + " ↓\n", + "Deployment (Fraud Detection Service)\n", + " ↓\n", + "Monitoring & Improvement\n", + "\n", + "\n", "### Requirements\n", "\n", "1. 
You should already have created an Azure account, and created a [Subscription](https://techcommunity.microsoft.com/discussions/azure/understanding-azure-account-subscription-and-directory-/34800) and a [Workspace](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace?view=azureml-api-2).\n", From 6b12115f23fee877dc769b8623bf72b254942d9d Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:13:09 -0700 Subject: [PATCH 03/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index aa9e5a1..64abcea 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -6,9 +6,21 @@ "source": [ "# Anomaly Detection in Azure ML Studio\n", "\n", - "This notebook demonstrates a fraud detection system built using Azure Machine Learning. It processes transaction data, trains a model to identify unusual activity, and evaluates its performance. The goal is to detect potential fraud while minimizing disruption to legitimate customers.\n", + "# Credit Card Fraud Detection System (Azure ML Pipeline)\n", "\n", - "The process includes data preparation, model training, evaluation, and deployment considerations. Each step is explained in simple terms to help non-technical stakeholders understand how the system works and what decisions it supports.\n", + "## Executive Summary\n", + "\n", + "This notebook presents a fraud detection system built using Azure Machine Learning. It processes transaction data, identifies unusual patterns, and flags potential fraud.\n", + "\n", + "The goal is to detect fraudulent activity while minimizing disruption to legitimate customers. 
The system focuses on balancing two key risks:\n", + "- Flagging valid transactions incorrectly (false positives)\n", + "- Missing actual fraud cases\n", + "\n", + "This notebook explains each step of the process in simple terms, from data preparation to model evaluation and deployment considerations, to support informed business decisions.\n", + "\n", + "---\n", + "\n", + "## End-to-End Pipeline Overview\n", "\n", "Raw Transaction Data\n", " ↓\n", From 8668e96502a27c4f01db944b681a52703faf7444 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:25:14 -0700 Subject: [PATCH 04/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 18 ------------------ 1 file changed, 18 deletions(-) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index 64abcea..cdf0ef2 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -18,24 +18,6 @@ "\n", "This notebook explains each step of the process in simple terms, from data preparation to model evaluation and deployment considerations, to support informed business decisions.\n", "\n", - "---\n", - "\n", - "## End-to-End Pipeline Overview\n", - "\n", - "Raw Transaction Data\n", - " ↓\n", - "Data Cleaning & Preparation\n", - " ↓\n", - "Feature Processing\n", - " ↓\n", - "Model Training (Anomaly Detection)\n", - " ↓\n", - "Model Evaluation\n", - " ↓\n", - "Deployment (Fraud Detection Service)\n", - " ↓\n", - "Monitoring & Improvement\n", - "\n", "\n", "### Requirements\n", "\n", From d75461de6d15375ffbe3b5e5e47ba39ca13ce4e8 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:27:49 -0700 Subject: [PATCH 05/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 27 ++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index cdf0ef2..f021b95 100644 --- 
a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -18,6 +18,33 @@ "\n", "This notebook explains each step of the process in simple terms, from data preparation to model evaluation and deployment considerations, to support informed business decisions.\n", "\n", + "## End-to-End Process Overview\n", + "\n", + "Raw Transaction Data \n", + "\n", + "↓ \n", + "\n", + "Data Cleaning & Preparation \n", + "\n", + "↓ \n", + "\n", + "Feature Processing \n", + "\n", + "↓ \n", + "\n", + "Model Training \n", + "\n", + "↓ \n", + "\n", + "Model Evaluation \n", + "\n", + "↓ \n", + "\n", + "Deployment \n", + "\n", + "↓ \n", + "\n", + "Monitoring & Improvement\n", "\n", "### Requirements\n", "\n", From 6ba6db825967cc59732f40267954e357c0e89bd3 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:30:52 -0700 Subject: [PATCH 06/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index f021b95..a287b30 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -46,6 +46,18 @@ "\n", "Monitoring & Improvement\n", "\n", + "\n", + "## Azure ML Components\n", + "\n", + "| Component | Purpose |\n", + "|----------|--------|\n", + "| Dataset | Stores transaction data |\n", + "| Compute | Runs training and processing |\n", + "| Pipeline | Automates workflow |\n", + "| Model | Detects fraud patterns |\n", + "| Endpoint | Enables real-time predictions |\n", + "| Monitoring | Tracks model performance |\n", + "\n", "### Requirements\n", "\n", "1. 
You should already have created an Azure account, and created a [Subscription](https://techcommunity.microsoft.com/discussions/azure/understanding-azure-account-subscription-and-directory-/34800) and a [Workspace](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace?view=azureml-api-2).\n", From 17ad43b36579d8ad6a8318418187b9f9dd41ac8c Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:32:06 -0700 Subject: [PATCH 07/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 1 + 1 file changed, 1 insertion(+) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index a287b30..6d39654 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -49,6 +49,7 @@ "\n", "## Azure ML Components\n", "\n", + "\n", "| Component | Purpose |\n", "|----------|--------|\n", "| Dataset | Stores transaction data |\n", From 8b88aa8da4e648a8a1f61e5c09b1be590a859734 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:34:23 -0700 Subject: [PATCH 08/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index 6d39654..522e150 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -59,10 +59,22 @@ "| Endpoint | Enables real-time predictions |\n", "| Monitoring | Tracks model performance |\n", "\n", - "### Requirements\n", "\n", - "1. You should already have created an Azure account, and created a [Subscription](https://techcommunity.microsoft.com/discussions/azure/understanding-azure-account-subscription-and-directory-/34800) and a [Workspace](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace?view=azureml-api-2).\n", - "2. 
This exercise assumes that you've already downloaded the [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) dataset from Kaggle and stored it in 'Assets -> Data' inside your Azure account under the name 'creditcard_fraud'.\n", From aee3811cf9d831b48d40d4634ab42fd99a5ed8e6 Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 22:46:02 -0700 Subject: [PATCH 09/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 46 ++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index 522e150..bca7d29 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -74,6 +74,52 @@ "Each step improves accuracy and reduces false alerts.\n", "\n", "\n", + "## Feature Processing\n", + "\n", + "The dataset includes processed features (V1–V28) created using a statistical transformation (PCA).\n", + "\n", + "These features help detect patterns but do not directly represent real-world transaction details. 
Because of this, extreme values can strongly influence model decisions.\n", + "\n", + "Careful handling of these values is important to reduce false positives.\n", + "\n", + "\n", + "## Business Impact\n", + "\n", + "### False Positives vs Missed Fraud\n", + "\n", + "- False positives → customer frustration and lost transactions \n", + "- Missed fraud → financial loss and security risk \n", + "\n", + "The goal is to balance both.\n", + "\n", + "\n", + "### Risks\n", + "\n", + "- Model may flag unusual but valid transactions \n", + "- Data changes over time may reduce accuracy \n", + "- Data adjustments may introduce bias if not reviewed \n", + "\n", + "\n", + "### Recommendations\n", + "\n", + "- Improve data quality by handling extreme values \n", + "- Monitor model performance continuously \n", + "- Deploy changes gradually \n", + "\n", + "\n", + "### Stakeholder Communication\n", + "\n", + "- Share regular updates \n", + "- Clearly explain limitations \n", + "- Set realistic expectations \n", + "\n", + "## Conclusion\n", + "\n", + "This system provides a strong starting point for fraud detection using machine learning.\n", + "\n", + "While the current model has limitations, especially with false positives, it demonstrates how data-driven approaches can improve fraud detection.\n", + "\n", + "Ongoing improvements in data quality, model tuning, and monitoring will be key to long-term success.\n", "\n", "\n", "### About Jupyter Notebooks\n", From 879a8e5fa6682d2ccc400d7ff62da289e551f04d Mon Sep 17 00:00:00 2001 From: Dave Date: Mon, 23 Mar 2026 23:49:03 -0700 Subject: [PATCH 10/10] Updated notebook --- jupyter/anomaly_detection_creditcard.ipynb | 394 --------------------- 1 file changed, 394 deletions(-) diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb index bca7d29..38717d6 100644 --- a/jupyter/anomaly_detection_creditcard.ipynb +++ b/jupyter/anomaly_detection_creditcard.ipynb @@ -121,402 +121,8 @@ "\n", "Ongoing 
improvements in data quality, model tuning, and monitoring will be key to long-term success.\n", - "\n", - "\n", - "### About Jupyter Notebooks\n", - "\n", - "Jupyter Notebooks originated from the **IPython project**, which was created to provide an interactive computing environment for Python. Over time, it evolved into the broader **Jupyter Project** (\"JU\" for Julia, \"PY\" for Python, and \"R\" for R), supporting multiple programming languages. What makes Jupyter Notebooks so popular in data science is their ability to **combine code, outputs, text, math, and visualizations in one place**. This format is ideal for exploration, analysis, documentation, and teaching. Because of this flexibility and transparency, Jupyter has become a standard across cloud platforms like **Google Colab, Azure AI Machine Learning, and AWS SageMaker** — each offering hosted environments where users can write and execute notebooks with scalable cloud compute. These platforms support the same notebook format (.ipynb), making it easy to move your work between local machines and cloud services.\n", - "\n", - "Thus, you can run this notebook either from the Azure AI Machine Learning web console or locally, assuming that you've created and activated the [Python virtual environment](https://realpython.com/python-virtual-environments-a-primer/) provided in the course [GitHub repository](https://github.com/FullStackWithLawrence/azureml-example) for Python 3.9.\n", - "\n", - "\n", - "## Workflow\n", - "\n", - "### Step 1: Import the PyPI packages\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Step 1: Import Packages and Connect to your Azure Workspace\n", - "from azureml.core import Workspace, Dataset # see https://pypi.org/project/azureml-core/\n", - "import pandas as pd # see https://pandas.pydata.org/docs/\n", - "from sklearn.ensemble import IsolationForest # see 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html\n", - "from sklearn.metrics import classification_report # see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html\n", - "from azureml.core.model import Model # see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model?view=azure-ml-py " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2: Load the Credit Card Fraud Dataset from Azure ML\n", - "\n", - "Retrieve the dataset from our existing workspace, and set it up for use with Pandas.\n", - "\n", - "**IMPORTANT: be mindful of the size of the dataset that you're working with. For example, if you run this notebook locally then be aware that you're downloading around 150 MiB from your Azure workspace. When running locally this snippet will take approximately 4 minutes to run.**\n", - "\n", - "#### What This Code Does\n", - "\n", - "- **`Workspace.from_config()`** connects to your Azure ML workspace using the `config.json` file (you should already have this if you followed earlier lectures).\n", - "- **`Dataset.get_by_name(...)`** loads the dataset you previously uploaded and registered in the Azure ML web interface.\n", - "- **`.to_pandas_dataframe()`** converts the Azure Dataset into a standard pandas DataFrame so you can explore and manipulate it with Python.\n", - "- **`df.head()`** shows the first 5 rows of the data — this is just a quick preview to confirm that the dataset loaded correctly.\n", - "\n", - "#### Why This Matters\n", - "\n", - "This is the standard pattern you’ll use throughout Azure ML when working with registered datasets in notebooks. 
It keeps your workflow consistent and lets you:\n", - "- Avoid re-uploading data every time.\n", - "- Ensure reproducibility across experiments and pipelines.\n", - "- Easily switch to remote compute environments without changing your code.\n", - "\n", - "#### Console output\n", - "\n", - "You might (probably) see a few console output messages. This is expected. They come from Azure’s background systems for logging and monitoring. \n", - "Unless you see an actual `ERROR` or `Traceback`, you can **safely ignore** any of the following.\n", - "\n", - "- **`Warning: Falling back to use azure cli login credentials.`** \n", - " - Azure is using your Azure CLI login (`az login`) for authentication.\n", - " - ✅ This is normal and expected for local development.\n", - " - ⚠️ For production, consider using `ServicePrincipalAuthentication` or `MsiAuthentication`.\n", - "\n", - "- **`{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe'}`** \n", - " - The dataset is being converted to a pandas DataFrame.\n", - " - Azure is not auto-detecting column types.\n", - " - ✅ This means `to_pandas_dataframe()` is working.\n", - "\n", - "- **`Timeout was exceeded in force_flush().`** \n", - " - A background telemetry system couldn’t send logging data in time.\n", - " - ✅ This is safe to ignore. 
It has no effect on your code or data.\n", - "\n", - "- **`Overriding of current TracerProvider / LoggerProvider / MeterProvider is not allowed`** \n", - " - Azure's telemetry was already initialized; it's skipping a duplicate setup.\n", - " - ✅ This is common and harmless in notebook environments.\n", - "\n", - "- **`Attempting to instrument while already instrumented`** \n", - " - Azure ML SDK tried to attach diagnostics tools (e.g., to pandas or HTTP), but they were already connected.\n", - " - ✅ This is internal setup noise — not an error.\n", - "\n", - "In Azure AI Machine Learning Studio - Notebooks you might encounter this error:\n", - "\n", - "```console\n", - "UserErrorException: UserErrorException:\n", - "\tMessage: The workspace configuration file config.json, could not be found in /synfs/notebook/0/aml_notebook_mount or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top.\n", - "\tInnerException None\n", - "\tErrorResponse \n", - "{\n", - " \"error\": {\n", - " \"code\": \"UserError\",\n", - " \"message\": \"The workspace configuration file config.json, could not be found in /synfs/notebook/0/aml_notebook_mount or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. 
You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top.\"\n", - " }\n", - "}\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# You only need to run this if you've imported this notebook to Azure AI Machine Learning Studio - Notebook,\n", - "# in which case you'll also need to upload the config.json file to the same directory as this notebook,\n", - "# and then execute this code to determine the current working directory.\n", - "import os\n", - "print(\"Current working directory:\", os.getcwd())\n", - "print(\"Files in this directory:\", os.listdir())\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if you're running locally then use this ...\n", - "path = None\n", - "\n", - "# alternatively, if you're running in Azure AI Machine Learning Studio - Notebook, then use this ...\n", - "# (make sure to upload the config.json file to the same directory as this notebook)\n", - "# and then uncomment the following line, replacing the placeholder with your username:\n", - "# path='Users/[REPLACE-THIS-WITH-YOUR-USERNAME]/config.json'\n", - "ws = Workspace.from_config(path=path)\n", - "dataset = Dataset.get_by_name(ws, name='creditcard_fraud')\n", - "df = dataset.to_pandas_dataframe()\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 3: Prepare the Data\n", - "\n", - "We're going to normalize the distribution of the transaction `Amount` column, which helps the model treat transaction amounts on the same scale as the other features (which are already normalized).\n", - "\n", - "#### What This Code Does\n", - "\n", - "- **Standardizes the `Amount` column**: \n", - " We scale the `Amount` feature so that it 
has a mean of 0 and a standard deviation of 1. \n", - " \n", - "- **Creates feature and label sets**:\n", - " - `X` contains the features used to make predictions.\n", - " - `y` contains the target variable: `Class` (where `1 = fraud` and `0 = normal`).\n", - "\n", - "- We also drop the `Time` column since it doesn't contribute meaningfully to anomaly detection in this context.\n", - "\n", - "#### Why This Matters\n", - "\n", - "Many machine learning algorithms — including Isolation Forest — perform better when numeric features are on a similar scale. \n", - "Also, splitting the data into `X` and `y` is a standard step that prepares it for training and evaluation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df['Amount'] = (df['Amount'] - df['Amount'].mean()) / df['Amount'].std()\n", - "X = df.drop(columns=['Class', 'Time'])\n", - "y = df['Class']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 4: Train the model\n", - "\n", - "The **Isolation Forest** algorithm is a popular unsupervised method for **detecting anomalies** in high-dimensional datasets. Instead of learning what “normal” looks like, it works by **isolating outliers** — rare points that are easier to separate from the rest of the data. It does this by randomly splitting the dataset using decision trees and measuring how quickly a data point can be isolated. The idea is that **anomalies require fewer splits to isolate**, because they are different from everything else. Isolation Forest is widely used in **fraud detection**, **network security**, and **industrial monitoring** because it is **fast, efficient**, and handles **large datasets** with many features. In our code, we set the `contamination` parameter to roughly match the known fraction of fraud cases in the dataset." 
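The `contamination` choice described above can be sanity-checked against the dataset's known class balance. A minimal sketch, assuming the class counts quoted elsewhere in this notebook (492 fraud cases out of 284,807 transactions); in a live session you could derive the counts from `y` itself:

```python
# Back-of-the-envelope check that contamination=0.0017 matches the fraud rate.
# The counts below are the dataset's published class balance; with the
# notebook's variables you could use int(y.sum()) and len(y) instead.
fraud_cases = 492
total_transactions = 284_807

fraud_rate = fraud_cases / total_transactions
print(round(fraud_rate, 4))  # 0.0017 -- the value passed to IsolationForest
```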
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model = IsolationForest(contamination=0.0017, random_state=42)\n", - "model.fit(X)\n", - "y_pred = model.predict(X)\n", - "y_pred = [1 if x == -1 else 0 for x in y_pred]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 5: Evaluating the Anomaly Detection Model\n", - "\n", - "The table below is a summary of metrics that are calculated from the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). It shows how well our model identified normal and fraudulent transactions:\n", - "\n", - "| Metric | What It Means |\n", - "|--------------|--------------------------------------------------------------------------------|\n", - "| **Precision** | How often the model was *correct* when it said a transaction was fraud |\n", - "| **Recall** | How many of the *actual fraud cases* the model successfully found |\n", - "| **F1-Score** | A balance between precision and recall — like a combined performance score |\n", - "| **Support** | The number of examples in each group (normal or fraud) in the real data |\n", - "\n", - "#### Results Summary\n", - "\n", - "| Class | Description | Precision | Recall | F1-Score | Support |\n", - "|-------|------------------------|-----------|--------|----------|---------|\n", - "| `0` | Normal transactions | **1.00** | **1.00** | **1.00** | 284,315 |\n", - "| `1` | Fraudulent transactions| **0.29** | **0.28** | **0.28** | 492 |\n", - "\n", - "#### Interpretation (In Simple Terms)\n", - "\n", - "- The model is **excellent at recognizing normal transactions** — it almost never makes a mistake with those.\n", - "- However, it **struggles to correctly catch fraud**:\n", - " - When it says a transaction is fraud, it’s **only right 29% of the time**.\n", - " - It **only finds 28% of the real fraud cases** — it misses most of them.\n", - "\n", - "#### Overall Accuracy\n", - "\n", - "- The 
model is **99.9% accurate**, but this is misleading.\n", - "- Because **fraud cases are very rare**, the model can look “perfect” just by saying everything is normal.\n", - "- That’s why we look at **precision**, **recall**, and **F1-score** for a fuller picture.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Step 5: Evaluate Model\n", - "print(classification_report(y, y_pred))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 6 (Optional): Register the Model\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import joblib # see https://joblib.readthedocs.io/en/latest/\n", - " # Joblib is a set of tools to provide lightweight pipelining in Python\n", - "joblib.dump(model, 'isolation_forest.pkl')\n", - "Model.register(model_path='isolation_forest.pkl',\n", - " model_name='creditcard_if_model',\n", - " workspace=ws)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 7: Visualize a Count of Predicted Anomalies\n", - "\n", - "The chart below is a typical summarization of an anomaly detection analysis. 
It shows how many transactions the model predicted as **normal (0)** and **anomalies/fraud (1)**:\n", - "\n", - "- **X-axis**: The prediction labels.\n", - " - `0` means the model thinks the transaction is **normal**.\n", - " - `1` means the model thinks the transaction is **fraud** or **anomalous**.\n", - "- **Y-axis**: The total number of transactions in each category.\n", - "\n", - "#### How to Interpret This Chart\n", - "\n", - "- You will (hopefully) see a **very tall bar for `0`** and a **very short bar for `1`**.\n", - "- This is because **fraud is rare** in the dataset (only 492 out of 284,807 transactions).\n", - "- The model is trained to detect outliers, so it **flags a small number of transactions as anomalies** (which is expected).\n", - "- If the number of predicted frauds is **close to the actual number** (around 500), that’s a good sign that the model is well-calibrated.\n", - "\n", - "#### Why This Matters\n", - "\n", - "- This simple chart gives a **quick health check** of how aggressive or conservative the model is in flagging anomalies.\n", - "- If the model predicts **too many anomalies**, it might be overreacting.\n", - "- If it predicts **almost none**, it might be too cautious — missing fraud cases." 
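This calibration check can also be done numerically rather than by reading the bar chart. A minimal sketch, assuming the figures quoted above (284,807 transactions, 492 known frauds) and the `contamination` value from Step 4; comparing the result against `sum(y_pred)` from the training cell would give the model's actual flag count:

```python
# With contamination=0.0017, Isolation Forest flags roughly that fraction
# of all rows as anomalies, so the expected flag count should sit near the
# number of known fraud cases if the model is well calibrated.
contamination = 0.0017
n_transactions = 284_807  # total rows in the credit card dataset
known_frauds = 492        # actual fraud cases in the labels

expected_flags = round(contamination * n_transactions)
print(expected_flags)  # 484, close to the 492 known frauds
```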
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "\n", - "# Add predictions to the original dataframe\n", - "df['predicted_anomaly'] = y_pred\n", - "\n", - "# Count of predicted anomalies\n", - "sns.countplot(x='predicted_anomaly', data=df)\n", - "plt.title('Count of Predicted Anomalies')\n", - "plt.xlabel('Anomaly (1) vs Normal (0)')\n", - "plt.ylabel('Count')\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 7 (continued): Visualize Transaction Amount by Prediction Class\n", - "\n", - "The boxplot below compares the **amount of money** in transactions that the model predicted as **normal (0)** or **anomalous/fraud (1)**.\n", - "\n", - "- **X-axis**: The model’s prediction.\n", - " - `0` = predicted normal transaction\n", - " - `1` = predicted fraud/anomaly\n", - "- **Y-axis**: The dollar **amount** of each transaction (standardized)\n", - "\n", - "#### How to Interpret This Chart\n", - "\n", - "- Each box shows how transaction amounts are distributed for each prediction class.\n", - "- The **line in the middle** of each box is the **median** transaction amount.\n", - "- The **height of the box** shows where most transaction amounts fall.\n", - "- **Dots outside the box** are **outliers** — unusual values far from the average.\n", - "\n", - "#### What This Tells Us\n", - "\n", - "- You may see that predicted frauds (`1`) tend to have **more extreme** or **variable amounts**.\n", - "- This could suggest that the model is flagging **unusually high or low transaction amounts** as suspicious.\n", - "- If the fraud predictions have a **much wider range**, it means the model may be reacting to extreme values — which is common in anomaly detection.\n", - "\n", - "#### Usefulness\n", - "\n", - "This chart helps you:\n", - "- Understand what kinds of amounts the model thinks are 
suspicious.\n", - "- Spot any bias in the model (e.g. only flagging large transactions).\n", - "- Decide whether you need to normalize, transform, or engineer features differently." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(10, 6))\n", - "sns.boxplot(data=df, x='predicted_anomaly', y='Amount')\n", - "plt.title('Transaction Amount by Prediction Class')\n", - "plt.show()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 7 (continued): SHAP Beeswarm Plot – Feature Importance for Anomaly Detection\n", - "\n", - "The beeswarm plot below is generated using **SHAP** (SHapley Additive exPlanations). It helps explain **which features influenced the model's decisions**, and **how strongly**. We only analyze the first 100 transactions here in order to keep the visualization fast and readable.\n", - "\n", - "#### How to Read the SHAP Beeswarm Plot\n", - "\n", - "- **Each dot** represents a single transaction.\n", - "- **Each row** is one feature (like `V1`, `V2`, `Amount`, etc.).\n", - "- **Color** shows the feature value for that transaction:\n", - " - **Red = high** value\n", - " - **Blue = low** value\n", - "- **Horizontal position** shows **impact on the model’s prediction**:\n", - " - Dots farther to the right **push the model toward predicting fraud**.\n", - " - Dots farther to the left **push the model toward predicting normal**.\n", - "\n", - "#### What This Tells Us\n", - "\n", - "- The **topmost features** are the most important ones in the model’s decisions.\n", - "- For example, if `V14` is at the top and its red dots are far right, it means:\n", - " - High values of `V14` increase the chance that the model flags a transaction as fraud.\n", - "- This plot helps us understand **why** the model flagged certain transactions as anomalies.\n", - "\n", - "#### Why Use SHAP?\n", - "\n", - "- SHAP adds transparency to the model, even for complex 
algorithms like Isolation Forest.\n", - "- Helps **build trust**, especially in sensitive tasks like fraud detection.\n", - "- Guides feature selection and **future model improvements**." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import shap\n", - "\n", - "explainer = shap.Explainer(model, X)\n", - "shap_values = explainer(X[:100])\n", - "shap.plots.beeswarm(shap_values)" - ] } ], "metadata": {