diff --git a/jupyter/anomaly_detection_creditcard.ipynb b/jupyter/anomaly_detection_creditcard.ipynb
index 3505a3a..c7f2b6a 100644
--- a/jupyter/anomaly_detection_creditcard.ipynb
+++ b/jupyter/anomaly_detection_creditcard.ipynb
@@ -8,141 +8,328 @@
"\n",
"Example anomaly detection machine learning job using a hybrid of Azure AI Machine Learning console resources combined with native Python in this Jupyter notebook which is intended to be uploaded and executed from 'Authoring -> Notebooks' in the web console. \n",
"\n",
- "### Requirements\n",
+ "### Requirements for implementing the Proof of Concept\n",
"\n",
"1. You should already have created an Azure account, and created a [Subscription](https://techcommunity.microsoft.com/discussions/azure/understanding-azure-account-subscription-and-directory-/34800) and a [Workspace](https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace?view=azureml-api-2).\n",
"2. This exercise assumes that you've already downloaded the [Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) dataset from Kaggle, and stored it in 'Assets -> Data' inside your Azure account under the name 'creditcard_fraud'.\n",
"\n",
- "### About Jupyter Notebooks\n",
+ "### Summary\n",
"\n",
- "Jupyter Notebooks originated from the **IPython project**, which was created to provide an interactive computing environment for Python. Over time, it evolved into the broader **Jupyter Project** (\"JU\" for Julia, \"PY\" for Python, and \"R\" for R), supporting multiple programming languages. What makes Jupyter Notebooks so popular in data science is their ability to **combine code, outputs, text, math, and visualizations in one place**. This format is ideal for exploration, analysis, documentation, and teaching. Because of this flexibility and transparency, Jupyter has become a standard across cloud platforms like **Google Colab, Azure AI Machine Learning, and AWS SageMaker** — each offering hosted environments where users can write and execute notebooks with scalable cloud compute. These platforms support the same notebook format (.ipynb), making it easy to move your work between local machines and cloud services.\n",
+ "This proof-of-concept demonstrates how machine learning may be effectively leveraged to identify fraudulent credit card transactions when it is needed most: at the time of the transaction. \n",
"\n",
- "Thus, you can run this notebook from either of the Azure AI Machine Learning web console, or locally, assuming that you've created and activated the [Python virtual environment](https://realpython.com/python-virtual-environments-a-primer/) provided in the course [GitHub repository](https://github.com/FullStackWithLawrence/azureml-example) for Python 3.9\n",
+ "This document outlines a process to implement a production-ready solution.\n",
"\n",
+ "Success depends on leveraging the organization’s unique capabilities and data assets, with structured opportunities for stakeholders across risk, operations, and customer experience to contribute to ongoing model improvement.\n",
"\n",
- "## Workflow\n",
+ "#### The Objective\n",
"\n",
- "### Step 1: import the PyPi packages\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Step 1: Import Packages and Connect to your Azure Workspace\n",
- "from azureml.core import Workspace, Dataset # see https://pypi.org/project/azureml-core/\n",
- "import pandas as pd # see https://pandas.pydata.org/docs/\n",
- "from sklearn.ensemble import IsolationForest # see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html\n",
- "from sklearn.metrics import classification_report # see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html\n",
- "from azureml.core.model import Model # see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model?view=azure-ml-py "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Step 2: Load the Credit Card Fraud Dataset from Azure ML\n",
- "\n",
- "Retrieve the dataset from our existing workspace, and set this up for use with Pandas.\n",
- "\n",
- "**IMPORTANT: be mindful of the size of the dataset that you're working with. For example, if you run this notebook locally then be aware that you're downloading around 150Mib from your Azure workspace. When running locally this snippet will take approximately 4 minutes to run.**\n",
+ "Fraudulent transactions are rare: in our proof-of-concept dataset, only 0.17% of transactions are proven fraudulent. This type of problem is classified as anomaly detection and has two relevant metrics, *Recall* and *Precision*, which are used throughout the process.\n",
"\n",
- "#### What This Code Does\n",
+ "- *Recall* answers the question, _\"Of all actual fraud cases, how many did we catch?\"_\n",
+ "  - Low recall means fewer fraudulent transactions are flagged.\n",
+ "  - Therefore, lower recall implies *higher operational costs* incurred in dealing with fraud after the transaction settles.\n",
+ "- *Precision* answers the question, _\"Of all transactions we flagged as fraud, how many were actually fraud?\"_\n",
+ "  - Low precision means normal transactions will be flagged as fraud.\n",
+ "  - This *erodes customers' perception of value and discourages their use of our service*.\n",
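+ "\n",
+ "As a quick sketch of these two measures, using the scikit-learn helpers this notebook imports later (the labels below are hypothetical):\n",
+ "\n",
+ "```python\n",
+ "from sklearn.metrics import precision_score, recall_score\n",
+ "\n",
+ "y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # actual labels (1 = fraud)\n",
+ "y_pred = [0, 0, 0, 1, 1, 1, 0, 0]  # the model's flags\n",
+ "print(precision_score(y_true, y_pred))  # 2 of 3 flags were real fraud -> ~0.67\n",
+ "print(recall_score(y_true, y_pred))     # caught 2 of 4 fraud cases -> 0.5\n",
+ "```\n",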
"\n",
- "- **`Workspace.from_config()`** connects to your Azure ML workspace using the `config.json` file (you should already have this if you followed earlier lectures).\n",
- "- **`Dataset.get_by_name(...)`** loads the dataset you previously uploaded and registered in the Azure ML web interface.\n",
- "- **`.to_pandas_dataframe()`** converts the Azure Dataset into a standard pandas DataFrame so you can explore and manipulate it with Python.\n",
- "- **`df.head()`** shows the first 5 rows of the data — this is just a quick preview to confirm that the dataset loaded correctly.\n",
+ "#### The Process\n",
"\n",
- "#### Why This Matters\n",
+ "Using historical transaction data, we train an anomaly detection model to classify transactions as normal or fraudulent. \n",
"\n",
- "This is the standard pattern you’ll use throughout Azure ML when working with registered datasets in notebooks. It keeps your workflow consistent and lets you:\n",
- "- Avoid re-uploading data every time.\n",
- "- Ensure reproducibility across experiments and pipelines.\n",
- "- Easily switch to remote compute environments without changing your code.\n",
+ "This class of machine learning problem, anomaly detection, is complex but achievable, with several open-source models available to use.\n",
"\n",
- "#### Console output\n",
+ "During the process, our goal is to balance the two measures of Recall and Precision at rates aligned with your organization's strategy.\n",
"\n",
- "You might (probably) see a few console output messages. This is expected. They come from Azure’s background systems for logging and monitoring. \n",
- "Unless you see an actual `ERROR` or `Traceback`, you can **safely ignore** any of the following.\n",
+ "Model training consists of configuring an algorithm with customized values for a large number of variables.\n",
"\n",
- "- **`Warning: Falling back to use azure cli login credentials.`** \n",
- " - Azure is using your Azure CLI login (`az login`) for authentication.\n",
- " - ✅ This is normal and expected for local development.\n",
- " - ⚠️ For production, consider using `ServicePrincipalAuthentication` or `MsiAuthentication`.\n",
+ "A simplified analogy is a vehicle that stores driver seat settings. Initially, the driver adjusts the seat manually based on comfort. Over time, the system remembers these adjustments and applies them automatically. Similarly, a machine learning model learns patterns from historical data and uses those patterns to make future predictions, with periodic updates as new information becomes available.\n",
"\n",
- "- **`{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe'}`** \n",
- " - The dataset is being converted to a pandas DataFrame.\n",
- " - Azure is not auto-detecting column types.\n",
- " - ✅ This means `to_pandas_dataframe()` is working.\n",
- "\n",
- "- **`Timeout was exceeded in force_flush().`** \n",
- " - A background telemetry system couldn’t send logging data in time.\n",
- " - ✅ This is safe to ignore. It has no effect on your code or data.\n",
- "\n",
- "- **`Overriding of current TracerProvider / LoggerProvider / MeterProvider is not allowed`** \n",
- " - Azure's telemetry was already initialized; it's skipping a duplicate setup.\n",
- " - ✅ This is common and harmless in notebook environments.\n",
+ "## Workflow\n",
+ "### Diagram\n",
+ "\n",
"\n",
- "- **`Attempting to instrument while already instrumented`** \n",
- " - Azure ML SDK tried to attach diagnostics tools (e.g., to pandas or HTTP), but they were already connected.\n",
- " - ✅ This is internal setup noise — not an error.\n",
+ "### 1. Data Preparation & Training Setup\n",
"\n",
- "In Azure AI Machine Learning Studio - Notebooks you might encounter this error:\n",
+ "| Azure ML Component | Role in the Process |\n",
+ "| ------------------ | ------------------- |\n",
+ "| 1 a) Dataset (Credit Card Data) | Provides historical transaction data used to train and evaluate the model. |\n",
+ "| 1 b) Jupyter Notebook (Azure ML / VS Code) | Used to explore data, document the workflow, and prepare features for modeling. |\n",
+ "| 1 c) Data Processing (Prepare the Data) | Transforms raw transaction data into structured inputs suitable for machine learning. |\n",
+ "| 1 d) Train/Test Split | Separates data into training and validation sets to ensure unbiased model evaluation. |\n",
"\n",
- "```console\n",
- "UserErrorException: UserErrorException:\n",
- "\tMessage: The workspace configuration file config.json, could not be found in /synfs/notebook/0/aml_notebook_mount or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top.\n",
- "\tInnerException None\n",
- "\tErrorResponse \n",
- "{\n",
- " \"error\": {\n",
- " \"code\": \"UserError\",\n",
- " \"message\": \"The workspace configuration file config.json, could not be found in /synfs/notebook/0/aml_notebook_mount or its parent directories. Please check whether the workspace configuration file exists, or provide the full path to the configuration file as an argument. You can download a configuration file for your workspace, via http://ml.azure.com and clicking on the name of your workspace in the right top.\"\n",
- " }\n",
- "}```\n",
- "\n"
+ "#### Preliminary Setup Steps"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
- "# You only need to run this if you've imported this notebook to Azure AI Machine Learning Studio - Notebook,\n",
- "# in which case you'll also need to upload the config.json file to the same directory as this notebook,\n",
- "# and then execute this code to determine the current working directory.\n",
- "import os\n",
- "print(\"Current working directory:\", os.getcwd())\n",
- "print(\"Files in this directory:\", os.listdir())\n"
+ "# Step 1: Import Packages and Connect to your Azure Workspace\n",
+ "from azureml.core import Workspace, Dataset # see https://pypi.org/project/azureml-core/\n",
+ "import pandas as pd # see https://pandas.pydata.org/docs/\n",
+ "from sklearn.ensemble import IsolationForest # see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html\n",
+ "from sklearn.metrics import classification_report # see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html\n",
+ "from azureml.core.model import Model # see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model?view=azure-ml-py \n",
+ "from sklearn.model_selection import train_test_split # see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n"
]
},
{
- "cell_type": "code",
- "execution_count": null,
+ "cell_type": "markdown",
"metadata": {},
- "outputs": [],
- "source": []
+ "source": [
+ "#### 1a) Dataset (Credit Card Data)\n",
+ "\n",
+ "To ensure accurate training, the dataset *must* be kept in a *known location* where the state of the data is tracked and versioned. We recommend using the dataset-management features of Kaggle or Azure ML for this purpose.\n",
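+ "\n",
+ "A sketch of one way to register the downloaded file as the `creditcard_fraud` dataset this notebook expects (the local path and datastore folder are assumptions for illustration):\n",
+ "\n",
+ "```python\n",
+ "from azureml.core import Workspace, Dataset\n",
+ "\n",
+ "ws = Workspace.from_config()\n",
+ "datastore = ws.get_default_datastore()\n",
+ "datastore.upload_files(['data/creditcard.csv'], target_path='creditcard/')\n",
+ "ds = Dataset.Tabular.from_delimited_files(datastore.path('creditcard/creditcard.csv'))\n",
+ "ds.register(workspace=ws, name='creditcard_fraud', create_new_version=True)\n",
+ "```\n",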
+ "\n",
+ "#### 1b) Jupyter Notebook (Python / Azure ML / VS Code)\n",
+ "\n",
+ "This allows for analysis and exploration of data in a tracked and reproducible way. Given the complexity, usage of such tools is highly recommended for this type of work. This notebook is a Jupyter Notebook.\n",
+ "\n",
+ "Python is favored for this activity because a large community of data scientists and machine learning experts have created free and open-source modules for all to use."
+ ]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading dataset from local file...\n",
+ "Dataset shape: (284807, 31)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " Time V1 V2 V3 V4 V5 V6 V7 \\\n",
+ "0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n",
+ "1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 \n",
+ "2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 \n",
+ "3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 \n",
+ "4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 \n",
+ "\n",
+ " V8 V9 ... V21 V22 V23 V24 V25 \\\n",
+ "0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n",
+ "1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 \n",
+ "2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 \n",
+ "3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 \n",
+ "4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 \n",
+ "\n",
+ " V26 V27 V28 Amount Class \n",
+ "0 -0.189115 0.133558 -0.021053 149.62 0 \n",
+ "1 0.125895 -0.008983 0.014724 2.69 0 \n",
+ "2 -0.139097 -0.055353 -0.059752 378.66 0 \n",
+ "3 -0.221929 0.062723 0.061458 123.50 0 \n",
+ "4 0.502292 0.219422 0.215153 69.99 0 \n",
+ "\n",
+ "[5 rows x 31 columns]"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "# if you're running locally then use this ...\n",
- "path = None\n",
- "\n",
- "# alternatively, if you're running in Azure AI Machine Learning Studio - Notebook, then use this ...\n",
- "# (make sure to upload the config.json file to the same directory as this notebook)\n",
- "# and then execute this code to determine the current working directory.\n",
- "path='Users/[REPLACE-THIS-WITH-YOUR-USERNAME]/config.json'\n",
- "ws = Workspace.from_config(path=path)\n",
- "dataset = Dataset.get_by_name(ws, name='creditcard_fraud')\n",
- "df = dataset.to_pandas_dataframe()\n",
+ "import pandas as pd\n",
+ "from pathlib import Path\n",
+ "\n",
+ "# =========================\n",
+ "# LOCAL MODE (default)\n",
+ "# =========================\n",
+ "path = Path(\"data/creditcard.csv\")\n",
+ "\n",
+ "if path.exists():\n",
+ " print(\"Loading dataset from local file...\")\n",
+ " df = pd.read_csv(path)\n",
+ "\n",
+ "# =========================\n",
+ "# AZURE MODE (fallback)\n",
+ "# =========================\n",
+ "else:\n",
+ " print(\"Local file not found. Attempting Azure ML load...\")\n",
+ " \n",
+ " from azureml.core import Workspace, Dataset\n",
+ " \n",
+ " ws = Workspace.from_config() # assumes config.json is present\n",
+ " dataset = Dataset.get_by_name(ws, name='creditcard_fraud')\n",
+ " df = dataset.to_pandas_dataframe()\n",
+ "\n",
+ "# Preview data\n",
+ "print(\"Dataset shape:\", df.shape)\n",
"df.head()"
]
},
@@ -150,30 +337,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 3: Prepare the Data\n",
+ "#### 1c) Data Processing (Prepare the data)\n",
"\n",
- "we're going to normalize the distribution of the transaction $ amount column, which helps the model treat transaction amounts on the same scale as the other features (which are already normalized).\n",
+ "Sometimes data must be restructured to make it suitable for model training and analysis. For our proof of concept, we're going to normalize the distribution of the transaction amount (`Amount`) column, which helps the model treat transaction amounts on the same scale as the other, already-normalized features.\n",
"\n",
- "#### What This Code Does\n",
- "\n",
- "- **Standardizes the `Amount` column**: \n",
- " We scale the `Amount` feature so that it has a mean of 0 and a standard deviation of 1. \n",
- " \n",
- "- **Creates feature and label sets**:\n",
- " - `X` contains the features used to make predictions.\n",
- " - `y` contains the target variable: `Class` (where `1 = fraud` and `0 = normal`).\n",
- "\n",
- "- We also drop the `Time` column since it doesn't contribute meaningfully to anomaly detection in this context.\n",
- "\n",
- "#### Why This Matters\n",
- "\n",
- "Many machine learning algorithms — including Isolation Forest — perform better when numeric features are on a similar scale. \n",
- "Also, splitting the data into `X` and `y` is a standard step that prepares it for training and evaluation."
+ "We also drop the `Time` column since it doesn't contribute meaningfully to anomaly detection in this context."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
@@ -186,233 +359,258 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 4: Train the model\n",
"\n",
- "The **Isolation Forest** algorithm is a popular unsupervised method for **detecting anomalies** in high-dimensional datasets. Instead of learning what “normal” looks like, it works by **isolating outliers** — rare points that are easier to separate from the rest of the data. It does this by randomly splitting the dataset using decision trees and measuring how quickly a data point can be isolated. The idea is that **anomalies require fewer splits to isolate**, because they are different from everything else. Isolation Forest is widely used in **fraud detection**, **network security**, and **industrial monitoring** because it is **fast, efficient**, and handles **large datasets** with many features. In our code, we set the `contamination` parameter to roughly match the known fraction of fraud cases in the dataset."
+ "#### 1d) Train/Test Split\n",
+ "\n",
+ "The concept of Train/Test splitting is fundamental to Machine Learning and Model Training. \n",
+ "\n",
+ "- Training data is *only* used in the model training.\n",
+ "- Testing data is *only* used to evaluate the training results of Model Training.\n",
+ "\n",
+ "If testing data were used during training, it would compromise the integrity of the evaluation process."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 26,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Training samples: 227845\n",
+ "Testing samples: 56962\n"
+ ]
+ }
+ ],
"source": [
- "model = IsolationForest(contamination=0.0017, random_state=42)\n",
- "model.fit(X)\n",
- "y_pred = model.predict(X)\n",
- "y_pred = [1 if x == -1 else 0 for x in y_pred]"
+ "# Split into training and testing sets\n",
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
+ " X, \n",
+ " y, \n",
+ " test_size=0.2, # 20% test, 80% train\n",
+ " random_state=42, # ensures reproducibility\n",
+ " stratify=y # IMPORTANT for imbalanced data\n",
+ ")\n",
+ "\n",
+ "print(\"Training samples:\", X_train.shape[0])\n",
+ "print(\"Testing samples:\", X_test.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 5: Evaluating the Anomaly Detection Model\n",
+ "*Additional information:* Note the use of `stratify`. This ensures that fraud cases are proportionally represented in both the training and testing datasets. It is concepts like this that require the holistic understanding of data that data scientists bring.\n",
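+ "\n",
+ "A quick sanity check of what `stratify` preserves (a sketch; `y_train` and `y_test` come from the split above):\n",
+ "\n",
+ "```python\n",
+ "# The fraud rate should be (nearly) identical in both splits\n",
+ "print(y_train.mean())  # roughly 0.0017 in both\n",
+ "print(y_test.mean())\n",
+ "```\n",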
"\n",
- "The table below is a summary of metrics that are calculated from the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). It shows how well our model identified normal and fraudulent transactions:\n",
+ "### 2. Model Training & Evaluation\n",
"\n",
- "| Metric | What It Means |\n",
- "|--------------|--------------------------------------------------------------------------------|\n",
- "| **Precision** | How often the model was *correct* when it said a transaction was fraud |\n",
- "| **Recall** | How many of the *actual fraud cases* the model successfully found |\n",
- "| **F1-Score** | A balance between precision and recall — like a combined performance score |\n",
- "| **Support** | The number of examples in each group (normal or fraud) in the real data |\n",
+ "| Azure ML Component or Processing Step | Role in the Process |\n",
+ "| ----------------------------------------------------------- | ------------------------------------------------------------------------------------ |\n",
+ "| 2a) Compute Instance / Compute Cluster | Executes model training and evaluation processes |\n",
+ "| 2b) Experiment | Tracks multiple model runs and configurations for comparison |\n",
+ "| 2c) Models (Logistic Regression, Isolation Forest, Autoencoder) | Different algorithms used to detect fraudulent patterns in transaction data |\n",
+ "| 2d) Evaluation Metrics (Precision, Recall, F1-Score) | Measures model performance, with emphasis on detecting rare fraud cases effectively |\n",
"\n",
- "#### Results Summary\n",
+ "#### 2a) Compute Instance / Compute Cluster\n",
"\n",
- "| Class | Description | Precision | Recall | F1-Score | Support |\n",
- "|-------|------------------------|-----------|--------|----------|---------|\n",
- "| `0` | Normal transactions | **1.00** | **1.00** | **1.00** | 284,315 |\n",
- "| `1` | Fraudulent transactions| **0.29** | **0.28** | **0.28** | 492 |\n",
+ "Azure ML compute resources are used to execute model training and evaluation tasks. These environments provide the processing power required to efficiently train models on large transaction datasets and support iterative experimentation.\n",
"\n",
- "#### Interpretation (In Simple Terms)\n",
+ "As always, there is a balance between cost and time that must be struck. It is recommended to provide budgeted access to these resources so that these trade-off assessments are performed by the experts.\n",
"\n",
- "- The model is **excellent at recognizing normal transactions** — it almost never makes a mistake with those.\n",
- "- However, it **struggles to correctly catch fraud**:\n",
- " - When it says a transaction is fraud, it’s **only right 29% of the time**.\n",
- " - It **only finds 28% of the real fraud cases** — it misses most of them.\n",
+ "#### 2b) Experiment\n",
"\n",
- "#### Overall Accuracy\n",
+ "Each model run is recorded as part of an experiment, enabling comparison across different algorithms, configurations, and parameters. This ensures that model selection is based on measurable performance rather than assumptions.\n",
"\n",
- "- The model is **99.9% accurate**, but this is misleading.\n",
- "- Because **fraud cases are very rare**, the model can look “perfect” just by saying everything is normal.\n",
- "- That’s why we look at **precision**, **recall**, and **F1-score** for a fuller picture.\n"
+ "The Azure ML environment enables the automation of such experiments by providing a drag-and-drop interface built to perform such analysis effectively.\n",
+ "\n",
+ "#### 2c) Models (Logistic Regression, Isolation Forest, Autoencoder)\n",
+ "\n",
+ "Multiple modeling approaches are evaluated to capture different types of fraud patterns:\n",
+ "\n",
+ "- *Logistic Regression* identifies linear relationships between transaction features and fraud likelihood\n",
+ "- *Isolation Forest* detects anomalies by isolating unusual patterns in the data\n",
+ "- *Autoencoder* learns typical transaction behavior and flags deviations as potential fraud\n",
+ "\n",
+ "Evaluating multiple models ensures that the selected approach aligns with both the statistical characteristics of the data and the organization’s risk tolerance.\n"
]
},
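+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For orientation, the three candidate models might be instantiated along these lines (a sketch; the `MLPRegressor` stand-in for an autoencoder and the specific parameter values are assumptions, not part of this notebook):\n",
+ "\n",
+ "```python\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.ensemble import IsolationForest\n",
+ "from sklearn.neural_network import MLPRegressor\n",
+ "\n",
+ "candidates = {\n",
+ "    'logistic_regression': LogisticRegression(max_iter=1000, class_weight='balanced'),\n",
+ "    'isolation_forest': IsolationForest(contamination=0.0017, random_state=42),\n",
+ "    'autoencoder': MLPRegressor(hidden_layer_sizes=(16, 8, 16), random_state=42),\n",
+ "}\n",
+ "```"
+ ]
+ },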
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
- "# Step 5: Evaluate Model\n",
- "print(classification_report(y, y_pred))"
+ "# An example of one training instance with two different items being configured.\n",
+ "# In practice, this is run multiple times to create a model ready for organizational\n",
+ "# review and potential deployment to our Point of Sale solutions.\n",
+ "\n",
+ "# Initialize model\n",
+ "model = IsolationForest(contamination=0.0017, random_state=42)\n",
+ "\n",
+ "# Train on training data only\n",
+ "model.fit(X_train)\n",
+ "\n",
+ "# Predict on test data\n",
+ "y_pred = model.predict(X_test)\n",
+ "\n",
+ "# Convert output: -1 → 1 (fraud), 1 → 0 (normal)\n",
+ "y_pred = [1 if x == -1 else 0 for x in y_pred]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 6 (Optional): Register the Model\n"
+ "#### 2d) Evaluation Metrics (Precision, Recall, F1-Score)\n",
+ "\n",
+ "The table below is a summary of metrics that are calculated from the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). It shows how well our model identified normal and fraudulent transactions:\n",
+ "\n",
+ "| Metric | What It Means |\n",
+ "|--------------|-----------------|\n",
+ "| **Recall** | Of all actual fraud cases, how many did we catch? |\n",
+ "| **Precision** | Of all transactions we flagged as fraud, how many were actually fraud? |\n",
+ "| **F1-Score** | A balance between precision and recall — like a combined performance score. |\n",
+ "| **Support** | The number of examples in each group (normal or fraud) in the real data. |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " 0 1.00 1.00 1.00 56864\n",
+ " 1 0.32 0.33 0.32 98\n",
+ "\n",
+ " accuracy 1.00 56962\n",
+ " macro avg 0.66 0.66 0.66 56962\n",
+ "weighted avg 1.00 1.00 1.00 56962\n",
+ "\n"
+ ]
+ }
+ ],
"source": [
- "import joblib # see https://joblib.readthedocs.io/en/latest/\n",
- " # Joblib is a set of tools to provide lightweight pipelining in Python\n",
- "joblib.dump(model, 'isolation_forest.pkl')\n",
- "Model.register(model_path='isolation_forest.pkl',\n",
- " model_name='creditcard_if_model',\n",
- " workspace=ws)\n"
+ "# Use the testing data, which we previously set aside for evaluation, to score the performance of the trained model\n",
+ "\n",
+ "print(classification_report(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 7: Visualize a Count of Predicted Anomalies\n",
+ "##### Proof of Concept Results Summary and Demonstration of Precision and Recall\n",
"\n",
- "The chart below is a typical summarization of an anomaly detection analysis. It shows how many transactions the model predicted as **normal (0)** and **anomalies/fraud (1)**:\n",
+ "| Class | Description | Precision | Recall | F1-Score | Support |\n",
+ "|-------|------------------------|-----------|--------|----------|---------|\n",
+ "| `0` | Normal transactions | **1.00** | **1.00** | **1.00** | 56,864 |\n",
+ "| `1` | Fraudulent transactions| **0.32** | **0.33** | **0.32** | 98 |\n",
"\n",
- "- **X-axis**: The prediction labels.\n",
- " - `0` means the model thinks the transaction is **normal**.\n",
- " - `1` means the model thinks the transaction is **fraud** or **anomalous**.\n",
- "- **Y-axis**: The total number of transactions in each category.\n",
+ "##### Interpretation (In Simple Terms)\n",
"\n",
- "#### How to Interpret This Chart\n",
+ "- The model is **excellent at recognizing normal transactions** — it almost never makes a mistake with those.\n",
+ "- However, it **struggles to correctly catch fraud**:\n",
+ " - Precision: When it says a transaction is fraud, it’s **only right 32% of the time**.\n",
+ " - Recall: It **only finds 33% of the real fraud cases** — it misses most of them.\n",
"\n",
- "- You will (hopefully) see a **very tall bar for `0`** and a **very short bar for `1`**.\n",
- "- This is because **fraud is rare** in the dataset (only 492 out of 284,807 transactions).\n",
- "- The model is trained to detect outliers, so it **flags a small number of transactions as anomalies** (which is expected).\n",
- "- If the number of predicted frauds is **close to the actual number** (around 500), that’s a good sign that the model is well-calibrated.\n",
+ "##### Overall Accuracy\n",
"\n",
- "#### Why This Matters\n",
+ "- The model is **99.9% accurate**, but this is misleading.\n",
+ "- Because **fraud cases are very rare**, the model can look “perfect” just by saying everything is normal.\n",
+ "- That’s why we look at **precision**, **recall**, and **F1-score** for a fuller picture.\n",
"\n",
- "- This simple chart gives a **quick health check** of how aggressive or conservative the model is in flagging anomalies.\n",
- "- If the model predicts **too many anomalies**, it might be overreacting.\n",
- "- If it predicts **almost none**, it might be too cautious — missing fraud cases."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "\n",
- "# Add predictions to the original dataframe\n",
- "df['predicted_anomaly'] = y_pred\n",
- "\n",
- "# Count of predicted anomalies\n",
- "sns.countplot(x='predicted_anomaly', data=df)\n",
- "plt.title('Count of Predicted Anomalies')\n",
- "plt.xlabel('Anomaly (1) vs Normal (0)')\n",
- "plt.ylabel('Count')\n",
- "plt.show()\n"
+ "##### Is it ready for production?\n",
+ "\n",
+ "- No. A precision of 32% means roughly two out of every three fraud alerts are false positives, and a recall of 33% means most real fraud goes undetected. Additional analysis and tuning are needed."
]
},
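The precision and recall figures above can be reproduced with scikit-learn's `classification_report`. Below is a minimal sketch using a small synthetic label set; the class counts and prediction pattern here are illustrative assumptions, not the notebook's actual predictions:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Illustrative labels: 95 normal (0) and 5 fraudulent (1) transactions.
y_true = [0] * 95 + [1] * 5

# Hypothetical predictions: 2 false positives, 2 true positives, 3 missed frauds.
y_pred = [0] * 93 + [1] * 2 + [1] * 2 + [0] * 3

# Precision = TP / (TP + FP) = 2 / 4; Recall = TP / (TP + FN) = 2 / 5.
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.4
print(classification_report(y_true, y_pred, digits=2))
```

Note how overall accuracy here is 95% even though most fraud was missed, which is exactly why the per-class metrics matter on imbalanced data.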
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 7 (continued): Visualize Transaction Amount by Prediction Class\n",
+ "### 3. Decision & Scoring Layer\n",
"\n",
- "The boxplot below compares the **amount of money** in transactions that the model predicted as **normal (0)** or **anomalous/fraud (1)**.\n",
+ "| Azure ML Component or Process Step | Role in the Process |\n",
+ "| ------------------ | ------------------ |\n",
+ "| 3a) Model Selection | Identifies the model that best balances fraud detection with operational impact |\n",
+ "| 3b) Prediction / Scoring | Applies the selected model to assign fraud risk scores to transactions |\n",
+ "| 3c) Business Decision Layer | Determines actions such as approving, flagging, or investigating transactions |\n",
"\n",
- "- **X-axis**: The model’s prediction.\n",
- " - `0` = predicted normal transaction\n",
- " - `1` = predicted fraud/anomaly\n",
- "- **Y-axis**: The dollar **amount** of each transaction (standardized)\n",
+ "In this phase, the selected model is applied to real-world transactions, and its outputs are translated into actionable business decisions. The focus is not only on model performance, but on how predictions are used to manage fraud risk while maintaining a positive customer experience.\n",
"\n",
- "#### How to Interpret This Chart\n",
+ "#### 3a) Model Selection\n",
"\n",
- "- Each box shows how transaction amounts are distributed for each prediction class.\n",
- "- The **line in the middle** of each box is the **median** transaction amount.\n",
- "- The **height of the box** shows where most transaction amounts fall.\n",
- "- **Dots outside the box** are **outliers** — unusual values far from the average.\n",
+ "The model from the evaluation phase is chosen for its ability to balance fraud detection (recall) against the operational impact of false positives. This decision reflects the organization’s risk tolerance: overly aggressive detection may disrupt legitimate transactions, while overly lenient detection may allow fraud to go unnoticed.\n",
"\n",
- "#### What This Tells Us\n",
+ "#### 3b) Prediction / Scoring\n",
"\n",
- "- You may see that predicted frauds (`1`) tend to have **more extreme** or **variable amounts**.\n",
- "- This could suggest that the model is flagging **unusually high or low transaction amounts** as suspicious.\n",
- "- If the fraud predictions have a **much wider range**, it means the model may be reacting to extreme values — which is common in anomaly detection.\n",
+ "The selected model is used to assign a fraud risk score to each transaction. Rather than producing a simple “fraud” or “not fraud” output, the model generates a probability or score indicating the likelihood of fraud. This enables more flexible and nuanced decision-making.\n",
"\n",
- "#### Usefulness\n",
+ "#### 3c) Business Decision Layer\n",
"\n",
- "This chart helps you:\n",
- "- Understand what kinds of amounts the model thinks are suspicious.\n",
- "- Spot any bias in the model (e.g. only flagging large transactions).\n",
- "- Decide whether you need to normalize, transform, or engineer features differently."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.figure(figsize=(10, 6))\n",
- "sns.boxplot(data=df, x='predicted_anomaly', y='Amount')\n",
- "plt.title('Transaction Amount by Prediction Class')\n",
- "plt.show()\n"
+ "Model outputs are translated into operational actions based on predefined thresholds and business rules:\n",
+ "\n",
+ "- *Approve* - Low-risk transactions are processed without interruption\n",
+ "- *Flag* - Medium-risk transactions may trigger alerts or additional verification\n",
+ "- *Investigate* - High-risk transactions are escalated for manual review\n",
+ "\n",
+ "These decisions are designed to balance fraud prevention with customer experience and operational efficiency. Thresholds can be adjusted over time to reflect changing business priorities and evolving fraud patterns."
]
},
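The three-tier decision layer described above can be sketched as a simple threshold function over the model's fraud risk score. The function name and threshold values below are illustrative assumptions, not values from the notebook; real cutoffs would come from the business rules the text mentions:

```python
def route_transaction(fraud_score: float,
                      flag_threshold: float = 0.50,
                      investigate_threshold: float = 0.90) -> str:
    """Map a fraud risk score in [0.0, 1.0] to a business action.

    Thresholds are illustrative placeholders and would be tuned to the
    organization's risk tolerance.
    """
    if fraud_score >= investigate_threshold:
        return "investigate"   # high risk: escalate for manual review
    if fraud_score >= flag_threshold:
        return "flag"          # medium risk: alert or extra verification
    return "approve"           # low risk: process without interruption

print(route_transaction(0.05))  # approve
print(route_transaction(0.65))  # flag
print(route_transaction(0.97))  # investigate
```

Because the thresholds are plain parameters, adjusting them over time (as the text suggests) is a configuration change rather than a model change.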
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Step 7 (continued): SHAP Beeswarm Plot – Feature Importance for Anomaly Detection\n",
+ "### 4. Continuous Feedback & Improvement\n",
"\n",
- "The beeswarm plot below is generated using **SHAP** (SHapley Additive exPlanations). It helps explain **which features influenced the model's decisions**, and **how strongly**. We only analyze the first 100 transactions here in order to keep the visualization fast and readable.\n",
+ "*Continuous feedback and refinement are essential to maintaining model effectiveness, as fraud patterns evolve and business priorities shift over time.*\n",
"\n",
- "#### How to Read the SHAP Beeswarm Plot\n",
+ "| Azure ML Component | Role in the Process |\n",
+ "| ------------------ | ------------------- |\n",
+ "| Monitoring & Logging | Tracks ongoing model performance, including false positives and missed fraud |\n",
+ "| Stakeholder Input (Risk, Operations, Customer Experience) | Provides operational feedback on model effectiveness and business impact |\n",
+ "| Retraining Pipeline (Future State) | Incorporates new data and feedback to refine and improve model performance over time |\n",
"\n",
- "- **Each dot** represents a single transaction.\n",
- "- **Each row** is one feature (like `V1`, `V2`, `Amount`, etc.).\n",
- "- **Color** shows the feature value for that transaction:\n",
- " - **Red = high** value\n",
- " - **Blue = low** value\n",
- "- **Horizontal position** shows **impact on the model’s prediction**:\n",
- " - Dots farther to the right **push the model toward predicting fraud**.\n",
- " - Dots farther to the left **push the model toward predicting normal**.\n",
+ "Fraud detection is not a one-time implementation but an ongoing process. This phase ensures that the model remains effective over time by incorporating performance monitoring, stakeholder input, and iterative refinement.\n",
"\n",
- "#### What This Tells Us\n",
+ "#### 4a) Monitoring & Logging\n",
"\n",
- "- The **topmost features** are the most important ones in the model’s decisions.\n",
- "- For example, if `V14` is at the top and its red dots are far right, it means:\n",
- " - High values of `V14` increase the chance that the model flags a transaction as fraud.\n",
- "- This plot helps us understand **why** the model flagged certain transactions as anomalies.\n",
+ "Once the model is applied to real-world transactions, its performance is continuously monitored. Key indicators include false positives (legitimate transactions incorrectly flagged) and missed fraud cases. Tracking these metrics over time helps identify performance degradation, emerging fraud patterns, and potential model drift.\n",
"\n",
- "#### Why Use SHAP?\n",
+ "#### 4b) Stakeholder Input (Risk, Operations, Customer Experience)\n",
"\n",
- "- SHAP adds transparency to the model, even for complex algorithms like Isolation Forest.\n",
- "- Helps **build trust**, especially in sensitive tasks like fraud detection.\n",
- "- Guides feature selection and **future model improvements**."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import shap\n",
+ "Operational teams provide critical feedback on how the model performs in practice:\n",
+ "\n",
+ "- Risk & Compliance Teams assess whether fraud detection levels meet organizational and regulatory expectations\n",
+ "- Operations Teams evaluate the impact on investigation workload and processing efficiency\n",
+ "- Customer Experience Teams monitor the effect of false positives on customer satisfaction\n",
+ "\n",
+ "This feedback is captured in a structured way and translated into measurable improvement goals, rather than being used to alter the model directly.\n",
+ "\n",
+ "#### 4c) Retraining Pipeline (Future State)\n",
+ "\n",
+ "Over time, new transaction outcomes (e.g., confirmed fraud or false alarms) are incorporated into the dataset. This enriched data is used to retrain the model, improving its ability to detect evolving fraud patterns.\n",
+ "\n",
+ "Retraining may include:\n",
+ "\n",
+ "- Updating model parameters\n",
+ "- Incorporating new features\n",
+ "- Adjusting decision thresholds\n",
"\n",
- "explainer = shap.Explainer(model, X)\n",
- "shap_values = explainer(X[:100])\n",
- "shap.plots.beeswarm(shap_values)"
+ "This iterative process ensures that the model adapts to changing conditions and continues to align with business objectives."
]
}
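The false-positive tracking described in step 4a can be prototyped with a rolling window over confirmed alert outcomes. The class below is a hypothetical sketch (its name and API are not part of the notebook); a production deployment would log these metrics through Azure ML's monitoring facilities instead:

```python
from collections import deque


class AlertMonitor:
    """Track the false-positive rate of fraud alerts over a rolling window."""

    def __init__(self, window: int = 1000):
        # Each entry: 1 = alert was a false positive, 0 = confirmed fraud.
        self.outcomes = deque(maxlen=window)

    def record_alert(self, confirmed_fraud: bool) -> None:
        self.outcomes.append(0 if confirmed_fraud else 1)

    def false_positive_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


monitor = AlertMonitor(window=100)
for confirmed in [True, False, False, True]:  # 2 of 4 alerts were real fraud
    monitor.record_alert(confirmed)
print(monitor.false_positive_rate())  # 0.5
```

A sustained rise in this rate is one concrete signal of the model drift the text describes, and would feed the retraining pipeline in 4c.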
],
"metadata": {
"kernelspec": {
- "display_name": "venv (3.9.6)",
+ "display_name": "fraud-azureml-py39",
"language": "python",
"name": "python3"
},
@@ -426,7 +624,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.9.6"
+ "version": "3.9.25"
}
},
"nbformat": 4,
diff --git a/jupyter/img/ML_Fraud_Workflow_Feedback.png b/jupyter/img/ML_Fraud_Workflow_Feedback.png
new file mode 100644
index 0000000..cf940d4
Binary files /dev/null and b/jupyter/img/ML_Fraud_Workflow_Feedback.png differ