From 84c84cb0c4f8c951b681c60d8115991b753e4fb1 Mon Sep 17 00:00:00 2001 From: Dennis Leung <43949365+dmhleung@users.noreply.github.com> Date: Thu, 5 Mar 2026 18:47:11 +1100 Subject: [PATCH] Some amendments by Dennis to 03 and 04 Script 4 is still not checked thru by Dennis. Will continue next time. --- quarto_scripts/03-Example3-1and3-11Caes.qmd | 7 +- quarto_scripts/04-Example3-2BTR.qmd | 322 +++++++++++--------- 2 files changed, 173 insertions(+), 156 deletions(-) diff --git a/quarto_scripts/03-Example3-1and3-11Caes.qmd b/quarto_scripts/03-Example3-1and3-11Caes.qmd index 7c7129b..709dc01 100644 --- a/quarto_scripts/03-Example3-1and3-11Caes.qmd +++ b/quarto_scripts/03-Example3-1and3-11Caes.qmd @@ -213,8 +213,7 @@ The codebase is in Fortran, and is very fast. We can use `glmnet` to fit the same multinomial model by setting `lambda = 0` (giving no weight to the regularisation step) to get the unregularized estimates. -If you look up `?glmnet`, pay attention to the multinomial parameterization, -since the returned coefficients are not on the same baseline coding as `multinom()`. +If you look up `?glmnet`, pay attention to what the documentation says about the symmetric multinomial model, which has a different parametrization; [see the section on "Multinomial Regression" in their vignette](https://glmnet.stanford.edu/articles/glmnet.html#multinomial-regression-family-multinomial) for the precise form of this parametrization. Because of this different parametrization, the returned fitted coefficients won't be the same as those from `multinom()`. ```{r} #| label: glmnet-multinomial-fit @@ -384,7 +383,7 @@ because it fits a wide variety of multivariate response regression models. It is not as fast as `glmnet` or `nnet` for large datasets, because it uses the Fisher scoring algorithm (like the book).
-Is is good practice to +It is good practice to [read the documentation](https://search.r-project.org/CRAN/refmans/VGAM/html/vglm.html) for the `vglm()` function with `?vglm` to understand the arguments and output structure. @@ -638,4 +637,4 @@ for (pkg in packages) { cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "") } } -``` \ No newline at end of file +``` diff --git a/quarto_scripts/04-Example3-2BTR.qmd b/quarto_scripts/04-Example3-2BTR.qmd index 7c7598f..9102d24 100644 --- a/quarto_scripts/04-Example3-2BTR.qmd +++ b/quarto_scripts/04-Example3-2BTR.qmd @@ -1,13 +1,17 @@ --- title: "Ordinal Regression: Breathing Test Results." subtitle: "@fahrmeir_multivariate_2001 Examples 3.2 & 3.4" -date: today +editor: + markdown: + wrap: 72 --- ## TODO -- Look at 3.3.3 pg 31/40 for variable information on thetas and gammas. -- Write up more about the different link functions (2025-04-11 lecture 23 min) +- Look at 3.3.3 pg 31/40 for variable information on thetas and + gammas. +- Write up more about the different link functions (2025-04-11 lecture + 23 min) ## Data Loading and Setup @@ -21,13 +25,21 @@ library(readr) library(tidyr) ``` -We load the BTR dataset. -It contains breathing test results normal, borderline, and abnormal -(which we can consider ordered categories), -and considered the effects of the covariates age and smoking status. +We load the BTR dataset. It contains breathing test results of normal, +borderline, and abnormal (which we can consider ordered categories), and +considers the effects of the covariates age and smoking status. ```{r} #| label: load-btr +btr_dat <- read_csv( + "data/Example3-2BTR.csv" + ) +``` + +If we just call `read_csv()` as is, we can see that the variables `BTR`, `Age`, and `Smoking` are all read as characters. This is not what we want, so we will use the `col_types` argument to specify that we want them as factors.
+ +```{r} +#| label: load-btr-factor btr_dat <- read_csv( "data/Example3-2BTR.csv", col_types = cols( @@ -41,9 +53,9 @@ btr_dat <- read_csv( btr_dat ``` -We can see that `BTR`, `Age`, and `Smoking` are all factors. -We check if the breathing test result (`BTR`) is encoded as an ordered factor, -and if not, we convert it to an ordered factor. +We can see that `BTR`, `Age`, and `Smoking` are all factors now. We check if +the breathing test result (`BTR`) is encoded as an ordered factor, and +if not, we convert it to an ordered factor. ```{r} #| label: order-btr @@ -52,25 +64,28 @@ btr_dat$BTR <- as.ordered(btr_dat$BTR) ``` We can now confirm the conversion was successful. + ```{r} #| label: check-btr-ordered class(btr_dat$BTR) btr_dat$BTR ``` -The default ordering happens to be good for us, -but if we needed to change the order, we could specify it with +The default ordering happens to be good for us, but if we needed to +change the order, we could specify it with `factor(..., levels = c(...), ordered = TRUE)`. We now examine the current contrast coding scheme: + ```{r} #| label: get-contrasts-opt getOption("contrasts") ``` `"contr.treatment"` corresponds to dummy coding for unordered factors. -For ordered factors, the default is `"contr.poly"` (orthogonal polynomials). -We consider `Age` and `Smoking` as unordered factors, and we will use dummy coding. +For ordered factors, the default is `"contr.poly"` (orthogonal +polynomials). We consider `Age` and `Smoking` as unordered factors, and +we will use dummy coding. 
```{r} #| label: show-contrasts contrasts(btr_dat$Smoking) contrasts(btr_dat$Age) ``` -The default reference categories are `never-smoked` for `Smoking`, -and the `<40` for Age, -which matches @fahrmeir_multivariate_2001, -however they use **effect coding** rather than **dummy coding.** +The default reference categories are `never-smoked` for `Smoking`, and +`<40` for `Age`, which matches @fahrmeir_multivariate_2001; however, +they use **effect coding** rather than **dummy coding.** -::: {.callout-note} -- **Dummy coding**: Reference category is 0, other categories are 1; -coefficients represent difference from reference category -- **Effect coding**: Reference category is -1, other categories are 0 or 1; -coefficients represent difference from overall mean +::: callout-note +- **Dummy coding**: Reference category is coded 0; each other category + gets its own indicator coded 1; coefficients represent differences + from the reference category +- **Effect coding**: Reference category is coded -1, other categories + are 0 or 1; coefficients represent differences from the overall mean ::: ## Model Fitting: Proportional Odds Models -The `polr()` function fits cumulative logit (and other) models. -[Looking at the documentation](https://search.r-project.org/CRAN/refmans/MASS/html/polr.html) -using `?polr`, -the sign of coefficients in `polr()` is **opposite** to the textbook formula. +The `polr()` function fits cumulative logit (and other) models. [Looking +at the +documentation](https://search.r-project.org/CRAN/refmans/MASS/html/polr.html) +using `?polr`, we see that the sign of the coefficients in `polr()` is +**opposite** to the textbook formula. -We first fit a cumulative logistic/proportional odds model -with an intercept, age, smoking. -We also include an interaction effect between age and smoking. +We first fit a cumulative logistic/proportional odds model with an +intercept, age, and smoking. We also include an interaction effect between +age and smoking.
```{r} #| label: fit-logit-int @@ -115,21 +130,21 @@ btr_logit_inter <- polr( ::: {.callout-tip collapse="true"} ## `polr()` Function Syntax -- **`formula`**: Model structure `response ~ predictors`; -response must be an ordered factor -- **`data`**: Data frame containing the variables -- **`weights`**: Case weights (here, frequency counts for each covariate combination) -- **`Hess = TRUE`**: Return the Hessian matrix; -needed for `summary()` and confidence intervals -- **`method`**: Link function---`"logistic"` -(default, proportional odds), `"probit"`, `"cloglog"`, `"loglog"`, or `"cauchit"` +- **`formula`**: Model structure `response ~ predictors`; response + must be an ordered factor +- **`data`**: Data frame containing the variables +- **`weights`**: Case weights (here, frequency counts for each + covariate combination) +- **`Hess = TRUE`**: Return the Hessian matrix; needed for `summary()` + and confidence intervals +- **`method`**: Link function---`"logistic"` (default, proportional + odds), `"probit"`, `"cloglog"`, `"loglog"`, or `"cauchit"` ::: -The `polr()` function is limited. -It can model standard cumulative models -(see section 3.3.1 of @fahrmeir_multivariate_2001), -but doesn't allow for extended cumulative model with threshold variables -(see section 3.3.2 of @fahrmeir_multivariate_2001). +The `polr()` function is limited. It can model standard cumulative +models (see section 3.3.1 of @fahrmeir_multivariate_2001), but doesn't +allow for extended cumulative models with threshold variables (see +section 3.3.2 of @fahrmeir_multivariate_2001). ```{r} #| label: summary-logit-int summary(btr_logit_inter) ``` ::: {.callout-important collapse="true"} ## SELF TEST: Interpret Coefficients in `polr()` for a Former Smoker age 40 to 59 -A former smoker who is Age 40 to 59 then the cumulative odds of having a normal test -compared to not having a normal test is $\exp(2.8334)$ -times the odds for a person who has never smoked.
+For a former smoker aged 40 to 59, the cumulative odds of having a +normal test compared to not having a normal test are $\exp(2.8334)$ times +the odds for a person who has never smoked. -If smoking has no association with the breathing test results, -then we would expect this number to be 1. +If smoking has no association with the breathing test results, then we +would expect this number to be 1. ::: +There is no residual readily available, but we can observe the linear +predictor $\eta$ **as defined in the documentation page for the `polr()` +function, which does not include the intercept**. -There is no residual readily available, but we can observe the linear predictor -(represented as $\eta$ in the documentation, -but referred to in our subject with $\mathbf{x} \hat{\boldsymbol{\gamma}}$). ```{r} #| label: linear-predictor btr_logit_inter$lp ``` -The $\zeta$'s are our theshold parameters $\theta$ +The $\zeta$'s are the threshold parameters $\theta$'s as defined in our +textbook and slides. ```{r} #| label: threshold-zeta @@ -202,12 +218,14 @@ summary(btr_logit_main) Again the signs of the coefficients are the opposite to our model. -Since the interaction model is nested within the main effects model, -we can compare them with a likelihood ratio test. +Since the interaction model is nested within the main effects model, we +can compare them with a likelihood ratio test. ::: {.callout-important collapse="true"} ## SELF TEST: How Many Degrees of Freedom will the Likelihood Ratio Test have? + +The test has 2 degrees of freedom since we are adding two interaction +terms (`Age40to59:Smoking2Former` and `Age40to59:Smoking3Current`).
::: We can use `anova()`: @@ -220,26 +238,26 @@ anova(btr_logit_inter, btr_logit_main) ::: {.callout-important collapse="true"} ## SELF TEST: Interpret the Likelihood Ratio Test Results -We reject the null hypothesis that there is no interaction, -at all reasonable significance levels. +We reject the null hypothesis that there is no interaction, at all +reasonable significance levels. ::: -::: {.callout-note} -The LR statistic in the `anova()` output is the difference -between the deviances of the two models. -Each deviance reported is $-2 \times \log(\text{likelihood})$ for that model. +::: callout-note +The LR statistic in the `anova()` output is the difference between the +deviances of the two models. Each deviance reported is +$-2 \times \log(\text{likelihood})$ for that model. ::: ## Alternative Link Functions -The proportional odds assumption can be tested -by fitting models with different link functions. -The `polr()` function supports: +The proportional odds assumption can be tested by fitting models with +different link functions. The `polr()` function supports: -- **logistic**: cumulative logit model (default) -- **probit**: cumulative normal model -- **cloglog**: complementary log-log (grouped Cox/proportional hazards) -- **loglog**: extreme maximal-value distribution. +- **logistic**: cumulative logit model (default) +- **probit**: cumulative normal model +- **cloglog**: complementary log-log (grouped Cox/proportional + hazards) +- **loglog**: extreme maximal-value distribution. See Figure 3.1 of @fahrmeir_multivariate_2001 for some good diagrams demonstrating the shapes of these link functions. @@ -263,7 +281,8 @@ summary(btr_ph) ### Extreme Maximal-Value Distribution -This model is just a reparameterisation of the proportional hazard model. +This model is just a reparameterisation of the proportional hazard +model. 
```{r} #| label: fit-loglog-int @@ -280,9 +299,8 @@ summary(btr_extm) ## Matching our Results to @fahrmeir_multivariate_2001 Example 3.4 -By default, R uses dummy coding. -To match to the textbook, we need to use effect coding. -We switch the global contrast option and refit. +By default, R uses dummy coding. To match to the textbook, we need to +use effect coding. We switch the global contrast option and refit. ```{r} #| label: set-effect-coding @@ -293,8 +311,8 @@ contrasts(btr_dat$Smoking) contrasts(btr_dat$Age) ``` -We now have the final categories coded as -1, as required. -Therefore we refit both models: +We now have the final categories coded as -1, as required. Therefore we +refit both models: ```{r} #| label: fit-logit-eff @@ -318,7 +336,8 @@ summary(btr_logit_inter_eff) summary(btr_logit_main_eff) ``` -The values reported in the summaries now match the textbook (excluding rounding errors). +The values reported in the summaries now match the textbook (excluding +rounding errors). In addition, we can redo the likelihood ratio test with the new models to confirm we get the same result. @@ -332,20 +351,19 @@ As expected, the likelihood ratio test results are the same as before. ## Saturated Model and Deviance Calculation -In each of the regressed model summaries, -there is a value "Residual Deviance" reported. -We need to be careful to understand what this exactly means, -because as we have previously seen, -this can be highly package dependent. +In each of the regressed model summaries, there is a value "Residual +Deviance" reported. We need to be careful to understand what this +exactly means, because as we have previously seen, this can be highly +package dependent. -The deviance reported by `polr()` is -$-2 \times \log(\text{likelihood})$ for the fitted model. -We can manually compute the **residual deviance** (difference from saturated model). +The deviance reported by `polr()` is $-2 \times \log(\text{likelihood})$ +for the fitted model. 
We can manually compute the **residual deviance** (difference from saturated model). -We first compute the log-likelihood of the saturated model. -We create a matrix with each row corresponding to a unique covariate combination, -and each column counting the frequency of each response per category -for that covariate combination. +We first compute the log-likelihood of the saturated model. We create a +matrix with each row corresponding to a unique covariate combination, +and each column counting the frequency of each response per category for +that covariate combination. ```{r} #| label: make-count-matrix @@ -372,12 +390,11 @@ probs <- data_mat / rowSums(data_mat) probs ``` -This is the also the matrix of maximum likelihood estimates -for the saturated model. -We can manually calculate the log-likelihood of the saturated model -using the multinomial distribution -by first calculating the likelihood for each cell, (treating $0 \log(0) := 0$), -and summing over all cells to get the total log-likelihood for the saturated model. +This is also the matrix of maximum likelihood estimates for the +saturated model. We can manually calculate the log-likelihood of the +saturated model using the multinomial distribution by first calculating +the log-likelihood contribution of each cell (treating $0 \log(0) := 0$), and summing +over all cells to get the total log-likelihood for the saturated model. ```{r} #| label: loglike-matrix @@ -394,22 +411,19 @@ loglike_saturated ::: {.callout-note collapse="true"} ## SELF TEST: Explain Why We Compute the Saturated Log-Likelihood with this Sum -The above computation is under the ungrouped view of the data.
The +probability mass function for the multinomial distribution is hence: $$ \Pr(Y_1 = y_1, \ldots, Y_k = y_k) = \prod_{i = 1}^{6} p_1^{y_{i1}} p_2^{y_{i2}} p_3^{y_{i3}}, -$$ -where $y_i$ are the observed counts and $p_i$ are the probabilities for each category. -The log-likelihood is then: -$$\sum_{i,j} y_{ij} \log(p_{ij}),$$ -where we treat $0 \log(0) = 0$. +$$ where $y_{ij}$ are the observed counts and $p_{ij}$ are the probabilities +for category $j$ in covariate combination $i$. The log-likelihood is then: +$$\sum_{i,j} y_{ij} \log(p_{ij}),$$ where we treat $0 \log(0) = 0$. ::: -We can recover the deviance reported in Table 3.3 of @fahrmeir_multivariate_2001 -by computing the residual deviance as the difference -between the reported 'deviance' of the fitted model -and the deviance of the saturated model. +We can recover the deviance reported in Table 3.3 of +@fahrmeir_multivariate_2001 by computing the residual deviance as the +difference between the reported 'deviance' of the fitted model and the +deviance of the saturated model. ```{r} #| label: residual-dev-logit residual_dev_logit <- 2 * loglike_saturated + btr_logit_inter_eff$deviance residual_dev_logit ``` -This confirms that the so called 'deviance' reported by `polr()` -is actually just twice the negative of the log-likelihood of the fitted model. -This is very different from what we usually consider the deviance. +This confirms that the so-called 'deviance' reported by `polr()` is +actually just twice the negative of the log-likelihood of the fitted +model. This is very different from what we usually consider the +deviance.
### Returning the Deviance for Alternative Link Functions -We fit models with complementary log-log and loglog link functions using effect coding, -then summarize results in a comparison table: +We fit models with complementary log-log and loglog link functions using +effect coding, then summarize results in a comparison table: -::: {.callout-tip} +::: callout-tip The table is built in three steps: -1. Fit one model per link and extract coefficient estimates -with `coef(summary(...))[ ,"Value"]`. -2. Build two small tibbles (one for coefficients, one for residual deviance) -and stack them with `bind_rows()`. -3. Use `mutate(across(where(is.numeric), ~ round(., 4)))` -to round all numeric columns for cleaner display. +1. Fit one model per link and extract coefficient estimates with + `coef(summary(...))[ ,"Value"]`. +2. Build two small tibbles (one for coefficients, one for residual + deviance) and stack them with `bind_rows()`. +3. Use `mutate(across(where(is.numeric), ~ round(., 4)))` to round all + numeric columns for cleaner display. -`rownames(coef(...))` supplies the parameter labels -so each estimate aligns across link functions. +`rownames(coef(...))` supplies the parameter labels so each estimate +aligns across link functions. ::: ```{r} @@ -495,9 +510,10 @@ results_table ::: {.callout-tip collapse="true"} ## Alternative: Nested Data + Functional Mapping Workflow -This can be written in a nested/list-column style using functional programming. -The idea is to store one row per link function, fit each model via `map()`, -then map again to extract coefficients and residual deviance. +This can be written in a nested/list-column style using functional +programming. The idea is to store one row per link function, fit each +model via `map()`, then map again to extract coefficients and residual +deviance. 
```{r} #| label: fit-alt-links-eff-functional @@ -550,15 +566,14 @@ results_table_map <- bind_rows(coef_long, dev_long) |> results_table_map ``` - ::: -We have now matched Table 3.3 of @fahrmeir_multivariate_2001 -(with inverse signs for the coefficients). +We have now matched Table 3.3 of @fahrmeir_multivariate_2001 (with +inverse signs for the coefficients). -It is **highly recommended** that you read through Example 3.4 in the textbook -to get an idea of intepreting the results. It is worth considering here -that the proportional hazard has the lowest deviance, +It is **highly recommended** that you read through Example 3.4 in the +textbook to get an idea of how to interpret the results. It is worth +considering here that the proportional hazard model has the lowest deviance, and gives a much smaller threshold parameter. ## Other Tricks and Commands @@ -574,8 +589,8 @@ residuals(btr_logit_inter_eff) ### Fitted Values and Predictions -Fitted probabilities can be extracted directly -or using the generic `predict()` function: +Fitted probabilities can be extracted directly or using the generic +`predict()` function: ```{r} #| label: fitted-probs @@ -588,8 +603,8 @@ Both approaches yield identical results. ### Model Selection with AIC -The `step()` function performs automatic model selection using the AIC criterion. -(not directly covered in this course unless time allows) +The `step()` function performs automatic model selection using the AIC +criterion. (not directly covered in this course unless time allows) ```{r} #| label: model-selection-aic step(btr_logit_inter_eff) ``` ### Variance-Covariance Matrix and Standard Errors -We can extract the estimated variance-covariance matrix of the parameter estimates -using `vcov()`. -This should line up with the inverse of the Fisher information matrix. +We can extract the estimated variance-covariance matrix of the parameter +estimates using `vcov()`.
This should line up with the inverse of the +Fisher information matrix. ```{r} #| label: vcov-extraction @@ -608,9 +623,9 @@ logit_vcov <- vcov(btr_logit_inter_eff) logit_vcov ``` -From this we can match the standard errors reported in the summary output, -since they are the square root of the diagonal elements -of the variance-covariance matrix: +From this we can match the standard errors reported in the summary +output, since they are the square root of the diagonal elements of the +variance-covariance matrix: ```{r} #| label: vcov-comparison @@ -620,11 +635,11 @@ se_from_summary <- summary(btr_logit_inter_eff)$coefficients[, "Std. Error"] se_from_vcov == se_from_summary ``` -We should be able to recover the same confidence intervals -by using the variance-covariance matrix and a normal approximation (Wald intervals). +We should be able to recover the same confidence intervals by using the +variance-covariance matrix and a normal approximation (Wald intervals). -We can check the covariance matrix -against the inverse of the Hessian matrix (Fisher information) returned by `polr()`: +We can check the covariance matrix against the inverse of the Hessian +matrix (Fisher information) returned by `polr()`: ```{r} #| label: hessian-comparison @@ -633,6 +648,7 @@ hessian_inv ``` We can see that the matrices are slightly different. + ```{r} #| label: 04-example3-2btr-chunk-32 all.equal(hessian_inv, logit_vcov) @@ -640,16 +656,17 @@ all.equal(hessian_inv, logit_vcov) ### Confidence Intervals -R provides the generic method `confint()` -to compute confidence intervals for model parameters. +R provides the generic method `confint()` to compute confidence +intervals for model parameters. 
```{r} #| label: confint-profile confint(btr_logit_inter_eff) |> round(3) ``` -We can attempt to recovery the same confidence intervals using a Wald approximation -based on the coefficient estimates and their standard errors: +We can attempt to recover the same confidence intervals using a Wald +approximation based on the coefficient estimates and their standard +errors: ```{r} #| label: confint-wald-manual-upper wald_upper <- coeff_est + qnorm(0.975) * sqrt(diag(vcov(btr_logit_inter_eff))) wald_lower <- coeff_est + qnorm(0.025) * sqrt(diag(vcov(btr_logit_inter_eff))) tibble(wald_lower, wald_upper) ``` -These values are different from the confidence intervals computed by `confint()`. -This is because the `confint()` function uses the profile likelihood method, -which is more suitable for some non-linear regression problems. +These values are different from the confidence intervals computed by +`confint()`. This is because the `confint()` function uses the profile +likelihood method, which is more suitable for some non-linear regression +problems. -For more details see @venables_modern_2002, p. 220, -and run `profile(btr_logit_inter_eff)`. +For more details see @venables_modern_2002, p. 220, and run +`profile(btr_logit_inter_eff)`. ```{r} #| label: package-citations @@ -688,4 +706,4 @@ for (pkg in packages) { cat("- **", pkg, "**: ", cit_text, "\n\n", sep = "") } } -``` \ No newline at end of file +```