diff --git a/lectures/_static/quant-econ.bib b/lectures/_static/quant-econ.bib
index 82f5cc7ec..6cf319fcd 100644
--- a/lectures/_static/quant-econ.bib
+++ b/lectures/_static/quant-econ.bib
@@ -3631,3 +3631,54 @@ @article{shwartz_ziv_tishby2017
journal = {arXiv preprint arXiv:1703.00810},
year = 2017
}
+
+@article{kihlstrom_mirman1975,
+ author = {Kihlstrom, Richard E. and Mirman, Leonard J.},
+ title = {Information and Market Equilibrium},
+ journal = {The Bell Journal of Economics},
+ volume = {6},
+ number = {1},
+ pages = {357--376},
+ year = {1975},
+ publisher = {The RAND Corporation}
+}
+
+@article{muth1961,
+ author = {Muth, John F.},
+ title = {Rational Expectations and the Theory of Price Movements},
+ journal = {Econometrica},
+ volume = {29},
+ number = {3},
+ pages = {315--335},
+ year = {1961}
+}
+
+@article{radner1972,
+ author = {Radner, Roy},
+ title = {Existence of Equilibrium of Plans, Prices, and Price Expectations in a Sequence of Markets},
+ journal = {Econometrica},
+ volume = {40},
+ number = {2},
+ pages = {289--303},
+ year = {1972}
+}
+
+@article{arrow1964,
+ author = {Arrow, Kenneth J.},
+ title = {The Role of Securities in the Optimal Allocation of Risk-bearing},
+ journal = {Review of Economic Studies},
+ volume = {31},
+ number = {2},
+ pages = {91--96},
+ year = {1964}
+}
+
+@article{grossman1976,
+ author = {Grossman, Sanford J.},
+ title = {On the Efficiency of Competitive Stock Markets Where Trades Have Diverse Information},
+ journal = {Journal of Finance},
+ volume = {31},
+ number = {2},
+ pages = {573--585},
+ year = {1976}
+}
\ No newline at end of file
diff --git a/lectures/_toc.yml b/lectures/_toc.yml
index f322b2864..2edbf16c1 100644
--- a/lectures/_toc.yml
+++ b/lectures/_toc.yml
@@ -43,6 +43,7 @@ parts:
- file: exchangeable
- file: likelihood_bayes
- file: blackwell_kihlstrom
+ - file: information_market_equilibrium
- file: mix_model
- file: navy_captain
- file: merging_of_opinions
diff --git a/lectures/information_market_equilibrium.md b/lectures/information_market_equilibrium.md
new file mode 100644
index 000000000..bfb19ac3f
--- /dev/null
+++ b/lectures/information_market_equilibrium.md
@@ -0,0 +1,1541 @@
+---
+jupytext:
+ text_representation:
+ extension: .md
+ format_name: myst
+ format_version: 0.13
+ jupytext_version: 1.17.1
+kernelspec:
+ display_name: Python 3 (ipykernel)
+ language: python
+ name: python3
+---
+
+(information_market_equilibrium)=
+
+# Information and Market Equilibrium
+
+```{contents} Contents
+:depth: 2
+```
+
+## Overview
+
+This lecture studies two questions about the **informational role of prices**
+posed and
+answered by {cite:t}`kihlstrom_mirman1975`.
+
+1. *When do prices transmit inside information?*
+ - An informed insider observes a private
+ signal correlated with an unknown state of the world and adjusts demand
+ accordingly.
+ - Equilibrium prices shift.
+ - Under what conditions can an outside observer *infer* the
+ insider's private signal from the equilibrium price?
+
+2. *Do Bayesian price expectations converge?*
+ - In a stationary stochastic exchange
+ economy, an uninformed observer uses the history of market prices and
+ Bayes' Law to form
+ beliefs about the economy's structure and hence about its induced price
+ distribution.
+ - Do those expectations eventually
+ agree with those of a fully informed observer?
+
+Kihlstrom and Mirman's answers rely on two classical ideas from statistics:
+
+- **Blackwell sufficiency**: a random variable $\tilde{y}$ is said to be
+ *sufficient* for a random variable
+ $\tilde{y}'$ with respect to an unknown state if knowing $\tilde{y}$ gives
+ all the
+ information about the state that $\tilde{y}'$ contains.
+- **Bayesian consistency**: as the sample grows, posterior beliefs eliminate
+ models that imply the wrong **price distribution**, so even when structure is
+ not identified from prices the posterior mass on the true **reduced form**
+ still converges to one.
+
+Important findings of {cite:t}`kihlstrom_mirman1975` are:
+
+- Equilibrium prices transmit inside information *if and only if* the map from
+ the
+ insider's posterior distribution to the equilibrium price is one-to-one on
+ the set of
+ posteriors that can actually arise from the signal.
+ - For the two-state case ($S = 2$), invertibility holds when the informed
+ agent's utility is homothetic and the elasticity of substitution is everywhere
+ either below one or above one.
+- In the dynamic economy, as information accumulates, Bayesian price
+ expectations converge to **rational expectations**, even when the deep
+ structure is not identified from prices alone.
+
+```{note}
+{cite:t}`kihlstrom_mirman1975` use the terms "reduced-form model" and
+"structural model" in the way that careful econometricians do.
+
+Reduced-form and structural models come in pairs.
+
+To each structure or structural model
+there is a reduced form, or collection of reduced forms, underlying different
+possible regressions.
+```
+
+The lecture is organized as follows.
+
+1. Set up the static two-commodity model and define equilibrium.
+2. State the price-revelation theorem and the invertibility conditions.
+3. Illustrate invertibility and its failure with numerical examples using CES
+ and
+ Cobb-Douglas preferences.
+4. Introduce the dynamic stochastic economy and derive the Bayesian convergence
+ result.
+5. Simulate Bayesian learning from price observations.
+
+This lecture builds on ideas in {doc}`blackwell_kihlstrom` and
+{doc}`likelihood_bayes`.
+
+We start by importing some Python packages.
+
+```{code-cell} ipython3
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.optimize import brentq
+from scipy.stats import norm
+```
+
+
+## Setup
+
+### Preferences, endowments, and the unknown state
+
+The economy has two goods.
+
+Good 2 is the numeraire (price normalized to 1).
+
+Good 1 trades at price $p > 0$.
+
+An unknown parameter $\bar{a}$ affects the value of good 1.
+
+Agent $i$'s expected utility
+from a bundle $(x_1^i, x_2^i)$ is
+
+$$
+U^i(x_1^i, x_2^i)
+ = \sum_{s=1}^{S} u^i(a_s x_1^i,\, x_2^i)\, P^i(\bar{a} = a_s),
+$$
+
+where $P^i$ is agent $i$'s subjective probability distribution over the finite
+state space
+$A = \{a_1, \ldots, a_S\}$.
+
+Each agent starts with an endowment $w^i$ of good 2 and a share $\theta^i$ of
+the
+representative firm.
+
+In the paper's formal model, a single firm transforms good 2 into good 1
+according to
+$y_1 = f(y_2)$ with $f' < 0$ and chooses production to maximize
+
+$$
+\pi(p) = \max_{y_2 \leq 0} \{p f(y_2) + y_2\}.
+$$
+
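+For instance, with the hypothetical technology $f(y_2) = \sqrt{-y_2}$, the
+first-order condition $p f'(y_2) + 1 = 0$ gives the input choice
+$y_2 = -p^2/4$, output $y_1 = p/2$, and maximized profit $\pi(p) = p^2/4$.
+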
+The firm's profit $\pi$ is then distributed to households according to the
+shares
+$\theta^i$.
+
+Agent
+$i$'s budget constraint is
+
+$$
+p x_1^i + x_2^i = w^i + \theta^i \pi.
+$$
+
+Agents maximize expected utility subject to their budget constraints.
+
+A **competitive
+equilibrium** is a price $\hat{p}$ that clears both markets simultaneously.
+
+For most of what follows, the production side matters only through the induced
+equilibrium price map, so when we turn to numerical illustrations we will
+suppress production and use a pure-exchange / portfolio interpretation to keep
+the calculations transparent.
+
+### The informed agent's problem
+
+Suppose **agent 1** (the insider) observes a private signal $\tilde{y}$
+correlated with
+$\bar{a}$ before trading, where $\tilde{y}$ takes values in a finite set $Y$.
+
+Before the signal arrives, agent 1 has prior beliefs
+$\mu_0 = P^1$.
+
+Upon observing $\tilde{y} = y$, agent 1 updates to the
+**posterior** $\mu_y = (\mu_{y1}, \ldots, \mu_{yS})$ via Bayes' rule:
+
+$$
+\mu_{ys} = P(\bar{a} = a_s \mid \tilde{y} = y).
+$$
+
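+Here is a minimal numerical sketch of this update, with hypothetical numbers:
+two states, a binary signal, and likelihood matrix
+$\phi_a(y) = P(\tilde y = y \mid \bar a = a)$.
+
+```{code-cell} ipython3
+φ = np.array([[0.8, 0.2],    # P(y | a = a_1) for y = 0, 1
+              [0.3, 0.7]])   # P(y | a = a_2)
+μ0 = np.array([0.5, 0.5])    # prior over (a_1, a_2)
+
+y = 0                        # observed signal value
+μ_y = φ[:, y] * μ0           # Bayes' rule: prior times likelihood...
+μ_y /= μ_y.sum()             # ...renormalized
+print(μ_y)
+```
+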
+Because agent 1's demand depends on $\mu_y$, the new equilibrium price satisfies
+
+$$
+\hat{p} = p(\mu_y).
+$$
+
+Outside observers who see $\hat{p}$ but not $\tilde{y}$ can try to *back out*
+the
+insider's posterior from the price.
+
+Define the set of realized posteriors
+
+$$
+M = \{\mu_y : y \in Y,\; P(\tilde y = y) > 0\}.
+$$
+
+The key question is whether the map $\mu \mapsto p(\mu)$ is one-to-one on $M$.
+
+To answer that question, we now translate "information in prices" into
+Blackwell's language of sufficiency.
+
+(price_revelation_theorem)=
+## Price revelation
+
+### Blackwell sufficiency
+
+The price variable $p(\mu_{\tilde{y}})$ *accurately transmits* the insider's
+private
+information if observing the equilibrium price is just as informative about
+$\bar{a}$ as
+observing the signal $\tilde{y}$ directly.
+
+In Blackwell's language ({cite:t}`blackwell1951` and {cite:t}`blackwell1953`),
+this means
+$p(\mu_{\tilde{y}})$ is **sufficient** for $\tilde{y}$.
+
+```{prf:definition} Sufficiency
+:label: ime_def_sufficiency
+
+A random variable $\tilde{y}$ is *sufficient* for $\tilde{y}'$ with
+respect to $\bar{a}$ if there exists a conditional distribution $P(y' \mid y)$,
+**independent of** $\bar{a}$, such that
+
+$$
+\phi'_a(y') = \sum_{y \in Y} P(y' \mid y)\, \phi_a(y)
+\quad \text{for all } a \text{ and all } y',
+$$
+
+where $\phi_a(y) = P(\tilde{y} = y \mid \bar{a} = a)$.
+
+Thus, once $\tilde{y}$ is known, $\tilde{y}'$ provides no additional information
+about $\bar{a}$.
+```
+
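+To make the definition concrete, here is a small numerical check with
+hypothetical numbers: a signal $\tilde y'$ obtained by garbling $\tilde y$
+through a kernel $P(y' \mid y)$ that does not depend on the state satisfies
+the identity above, so $\tilde y$ is sufficient for $\tilde y'$.
+
+```{code-cell} ipython3
+φ = np.array([[0.8, 0.2],        # φ_a(y) for a = a_1
+              [0.3, 0.7]])       # φ_a(y) for a = a_2
+garble = np.array([[0.9, 0.1],   # P(y' | y), independent of the state
+                   [0.2, 0.8]])
+
+φ_prime = φ @ garble             # φ'_a(y') = Σ_y P(y' | y) φ_a(y)
+print(φ_prime)                   # rows (one per state) still sum to one
+```
+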
+{cite:t}`kihlstrom_mirman1975` show that
+
+```{prf:lemma} Posterior Sufficiency
+:label: ime_lemma_posterior_sufficiency
+
+The posterior distribution $\mu_{\tilde{y}}$ is a sufficient statistic for
+$\tilde{y}$.
+```
+
+```{prf:proof} (Sketch)
+The posterior $\mu_{\tilde{y}}$ satisfies
+
+$$
+P(\bar{a} = a_s \mid \mu_{\tilde{y}} = \mu_y,\; \tilde{y} = y) = \mu_{ys}
+ = P(\bar{a} = a_s \mid \mu_{\tilde{y}} = \mu_y).
+$$
+
+This identity says that once the posterior is known, conditioning on the
+original signal
+$\tilde y$ does not change beliefs about $\bar a$.
+
+Equivalently, the conditional law of $\tilde y$ given $\mu_{\tilde y}$ is
+independent of
+$\bar a$, so $\mu_{\tilde y}$ is sufficient for $\tilde y$ in Blackwell's sense.
+```
+
+Now let's think about the mapping from
+belief to price.
+
+```{prf:theorem} Price Revelation
+:label: ime_theorem_price_revelation
+
+In the model outlined above, the price random variable $p(\mu_{\tilde{y}})$ is
+sufficient for the random variable $\tilde{y}$ if and only if the function
+$p(P^1)$ is invertible on the set of prices
+
+$$
+\mathcal{P} = \Bigl\{\, p(\mu_y) : y \in Y,\;
+ P(\tilde{y} = y) = \sum_{a \in A} \phi_a(y)\,\mu_0(a) > 0 \Bigr\}.
+$$
+```
+
+The logic is
+
+$$
+\tilde y \quad \longrightarrow \quad \mu_{\tilde y} \quad \longrightarrow \quad
+p(\mu_{\tilde y}).
+$$
+
+The first arrow loses no information about $\bar a$ by
+{prf:ref}`ime_lemma_posterior_sufficiency`, and the theorem asks when the second
+arrow also loses no information.
+
+The proof has two parts.
+
+If $p(\cdot)$ is one-to-one on $M$, then observing the price is equivalent to
+observing the
+posterior itself because
+
+$$
+P(\mu_{\tilde y} = \mu \mid p(\mu_{\tilde y}) = \hat p)
+= \begin{cases}
+1 & \text{if } \mu = p^{-1}(\hat p), \\
+0 & \text{otherwise.}
+\end{cases}
+$$
+
+This conditional distribution is independent of the state, so price is
+sufficient for the
+posterior; together with {prf:ref}`ime_lemma_posterior_sufficiency`, price is
+therefore
+sufficient for the signal.
+
+Conversely, if two different posteriors in $M$ generated the same price, an
+observer of the price could not tell which posterior had occurred, and the paper
+shows formally that in this case the conditional distribution of the posterior
+given price would depend on the state, so price could not be sufficient.
+
+Before turning to invertibility itself, it helps to keep in mind the two
+economic interpretations emphasized in the paper.
+
+### Two interpretations
+
+#### Insider trading in a stock market
+
+Good 1 is a risky asset with random return $\bar{a}$; good 2 is "money".
+
+An insider's demand reveals private information about the return.
+
+If the invertibility condition holds, outside observers can read the insider's
+signal from
+the equilibrium stock price.
+
+#### Price as a quality signal
+
+Good 1 has uncertain quality $\bar{a}$.
+
+Experienced consumers (who have sampled the good) observe a signal correlated
+with quality
+and buy accordingly.
+
+Uninformed consumers can infer quality from the market price, provided
+invertibility holds.
+
+(invertibility_conditions)=
+## Invertibility and the elasticity of substitution
+
+When does the belief-to-price map fail to be invertible?
+
+{prf:ref}`ime_theorem_invertibility_conditions`
+shows that for a two-state economy ($S = 2$), the answer depends on the
+**elasticity of
+substitution** $\sigma$ of agent 1's utility function.
+
+Before stating the theorem, it helps to see the two intermediate steps in the
+paper's
+argument.
+
+```{prf:lemma} Same Price Implies Same Allocation
+:label: ime_lemma_same_price_same_allocation
+
+Assume that $u^i$ has continuous first partial derivatives and that $u^i$ is
+quasi-concave.
+
+Let $p \in \mathcal{P}$.
+
+If there exist two measures $\mu^*$ and $\mu'$ in $M$ such that
+$p(\mu^*, P^2, \ldots, P^n) = p(\mu', P^2, \ldots, P^n) = p$, then
+
+$$
+x^i(\mu^*, P^2, \ldots, P^n) = x^i(\mu', P^2, \ldots, P^n), \quad
+i = 1, \ldots, n.
+$$
+```
+
+Fix the beliefs of all agents except agent 1.
+
+The lemma says that if two posterior beliefs $\mu^*$ and $\mu'$ for agent 1
+both support the same equilibrium price $p$, then they support the same
+equilibrium allocation for every trader.
+
+The intuition is that when the price is unchanged, the demands of the
+uninformed traders are unchanged too, so market clearing forces the informed
+agent's bundle to be unchanged as well.
+
+This lemma lets us define the informed agent's equilibrium bundle as a function
+of price alone:
+
+$$
+x(p) = (x_1(p), x_2(p)).
+$$
+
+Throughout, $u^i_j$ denotes the partial derivative of $u^i$ with respect to its
+$j$-th argument.
+
+Whenever the informed agent consumes positive amounts of both goods, optimality
+of $x(p)$
+under posterior $\mu$ gives the interior first-order condition
+
+$$
+p = \frac{\sum_{s=1}^S a_s u_1^1(a_s x_1(p), x_2(p))\, \mu(a_s)}
+ {\sum_{s=1}^S u_2^1(a_s x_1(p), x_2(p))\, \mu(a_s)}.
+$$
+
+For a fixed price $p$, the bundle $x(p)$ is fixed too, so invertibility boils
+down to
+whether this equation admits a unique posterior $\mu$.
+
+```{prf:lemma} Unique Posterior at a Given Price
+:label: ime_lemma_unique_posterior
+
+Assume that the first partial derivatives of $u^1$ exist and that $u^1$ is
+quasi-concave.
+
+Also assume that agent 1 always consumes positive quantities of both goods.
+
+Then $p(P^1)$ is invertible on $\mathcal{P}$ if for each $p \in \mathcal{P}$
+there exists a unique probability measure $\mu \in M$ such that
+
+$$
+\frac{\sum_{s=1}^S a_s\, u^1_1(a_s x_1(p), x_2(p))\, \mu(a_s)}
+ {\sum_{s=1}^S u^1_2(a_s x_1(p), x_2(p))\, \mu(a_s)} = p.
+$$
+```
+
+If two different posteriors gave the same price, then by
+{prf:ref}`ime_lemma_same_price_same_allocation` they would support the same
+bundle $x(p)$, so both posteriors would solve the first-order condition at
+that price, contradicting the assumed uniqueness.
+
+### The two-state first-order condition
+
+With $S = 2$ and $\mu = (q,\, 1-q)$, define
+
+$$
+\alpha_s(p) = a_s\, u^1_1(a_s x_1(p),\, x_2(p)), \qquad
+\beta_s(p) = u^1_2(a_s x_1(p),\, x_2(p)), \qquad s = 1, 2.
+$$
+
+Then the first-order condition becomes
+
+$$
+p = \frac{\alpha_1(p)\, q + \alpha_2(p)\, (1-q)}
+ {\beta_1(p)\, q + \beta_2(p)\, (1-q)}.
+$$
+
+At a fixed price $p$, the quantities $\alpha_s(p)$ and $\beta_s(p)$ are
+constants, so
+uniqueness of the posterior is the same as uniqueness of the scalar $q$ solving
+this
+equation.
+
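+Because the right-hand side is linear-fractional in $q$, equating it to $p$
+gives a linear equation; the sketch below, with hypothetical constants, shows
+that a fixed price pins down at most one $q$ whenever
+$\alpha_1 \beta_2 - \alpha_2 \beta_1 \neq 0$.
+
+```{code-cell} ipython3
+α1, α2, β1, β2 = 1.8, 0.6, 0.9, 1.1   # hypothetical values at a fixed price
+p = 1.2
+
+# p (β1 q + β2 (1-q)) = α1 q + α2 (1-q) is linear in q
+q = (p * β2 - α2) / ((α1 - α2) - p * (β1 - β2))
+print(q)
+```
+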
+```{prf:theorem} Invertibility Conditions
+:label: ime_theorem_invertibility_conditions
+
+Assume that the first partial derivatives of $u^1$ exist and that $u^1$ is
+quasi-concave and homothetic.
+
+Also suppose that the informed agent always consumes positive quantities of
+both goods in all equilibrium allocations.
+
+If $S = 2$ and the elasticity of substitution of $u^1$ is either always less
+than one or always greater than one, then $p(P^1)$ is invertible on
+$\mathcal{P}$.
+
+If $u^1$ is Cobb-Douglas (elasticity of substitution constant and equal to
+one), then $p(P^1)$ is constant on $\mathcal{P}$.
+```
+
+When $\sigma = 1$ the income and substitution effects exactly cancel, so
+agent 1's demand for good 1 does not respond to changes in beliefs about
+$\bar{a}$.
+
+Because the demand is unchanged, the market-clearing price is unchanged too,
+and the price reveals nothing about the insider's signal.
+
+### CES utility
+
+For concreteness we work with a simplified example using the
+**constant-elasticity-of-substitution** (CES) utility function
+
+$$
+u(c_1, c_2) = \bigl(c_1^{\rho} + c_2^{\rho}\bigr)^{1/\rho}, \qquad \rho \in
+(-\infty,0) \cup (0,1),
+$$
+
+whose elasticity of substitution is $\sigma = 1/(1-\rho)$.
+
+- $\rho \to 0$: Cobb-Douglas ($\sigma = 1$).
+- $\rho < 0$: $\sigma < 1$ (complements).
+- $0 < \rho < 1$: $\sigma > 1$ (substitutes).
+
+Pertinent partial derivatives are
+
+$$
+u_1(c_1,c_2) = \bigl(c_1^\rho + c_2^\rho\bigr)^{1/\rho - 1}\, c_1^{\rho-1},
+\qquad
+u_2(c_1,c_2) = \bigl(c_1^\rho + c_2^\rho\bigr)^{1/\rho - 1}\, c_2^{\rho-1}.
+$$
+
+This CES example is only an illustration, because the theorem itself covers any
+homothetic utility with elasticity everywhere above one or everywhere below one.
+
+With that example in hand, we can compute the equilibrium price directly as a
+function of the posterior.
+
+### Equilibrium price as a function of the posterior
+
+We focus on agent 1 as the *only* informed trader who absorbs one unit of good 1
+at
+equilibrium (i.e., $x_1 = 1$).
+
+Let $W_1 = w^1 + \theta^1 \pi$ denote agent 1's total wealth (endowment plus
+profit share).
+
+Agent 1's budget constraint then reduces to
+$x_2 = W_1 - p$, and the equilibrium price is the unique $p \in (0, W_1)$
+satisfying
+the first-order condition
+
+$$
+p \bigl[q\, u_2(a_1,\, W_1-p) + (1-q)\, u_2(a_2,\, W_1-p)\bigr]
+= q\, a_1\, u_1(a_1,\, W_1-p) + (1-q)\, a_2\, u_1(a_2,\, W_1-p).
+$$
+
+For Cobb-Douglas utility ($\sigma = 1$, with $u = \sqrt{c_1 c_2}$), we have
+$a_s\, u_1(a_s, x_2) = \tfrac{1}{2}\sqrt{a_s x_2}$ and
+$u_2(a_s, x_2) = \tfrac{1}{2}\sqrt{a_s / x_2}$, so the common factor
+$q \sqrt{a_1} + (1-q)\sqrt{a_2}$ cancels from both sides and the first-order
+condition reduces to $p = W_1 - p$.
+
+This gives $p^* = W_1/2$ regardless of the posterior $q$, confirming that no
+information is transmitted through the price in the Cobb-Douglas case.
+
+We solve this first-order condition for the equilibrium price numerically below.
+
+```{code-cell} ipython3
+def ces_derivatives(c1, c2, ρ):
+ """
+ Return CES marginal utilities.
+
+ Use the Cobb-Douglas limit near rho = 0.
+ """
+ if abs(ρ) < 1e-4:
+ u1 = 0.5 * np.sqrt(c2 / c1)
+ u2 = 0.5 * np.sqrt(c1 / c2)
+ else:
+ common = (c1**ρ + c2**ρ)**(1 / ρ - 1)
+ u1 = common * c1**(ρ - 1)
+ u2 = common * c2**(ρ - 1)
+ return u1, u2
+
+
+def eq_price(q, a1, a2, W1, ρ):
+ """Return the equilibrium price for posterior q."""
+ def residual(p):
+ x2 = W1 - p
+ u1_s1, u2_s1 = ces_derivatives(a1, x2, ρ)
+ u1_s2, u2_s2 = ces_derivatives(a2, x2, ρ)
+ lhs = p * (q * u2_s1 + (1 - q) * u2_s2)
+ rhs = q * a1 * u1_s1 + (1 - q) * a2 * u1_s2
+ return lhs - rhs
+
+ try:
+ return brentq(residual, 1e-6, W1 - 1e-6, xtol=1e-10)
+ except ValueError:
+ return np.nan
+```
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: equilibrium price vs posterior
+ name: fig-eq-price-posterior
+---
+a1, a2 = 2.0, 0.5 # state values (a1 > a2)
+W1 = 4.0
+
+q_grid = np.linspace(0.05, 0.95, 200)
+
+ρ_values = [-0.5, 0.0, 0.5]
+ρ_labels = [
+ r"$\rho = -0.5$ ($\sigma = 0.67$, complements)",
+ r"$\rho = 0$ ($\sigma = 1$, Cobb-Douglas)",
+ r"$\rho = 0.5$ ($\sigma = 2$, substitutes)",
+]
+
+fig, ax = plt.subplots(figsize=(8, 5))
+
+for ρ, label in zip(ρ_values, ρ_labels):
+ prices = [eq_price(q, a1, a2, W1, ρ) for q in q_grid]
+ ax.plot(q_grid, prices, label=label, lw=2)
+
+ax.set_xlabel(r"posterior probability $q = \Pr(\bar{a} = a_1)$", fontsize=12)
+ax.set_ylabel("equilibrium price $p^*(q)$", fontsize=12)
+ax.legend(fontsize=10)
+plt.tight_layout()
+plt.show()
+```
+
+The plot confirms {prf:ref}`ime_theorem_invertibility_conditions`.
+
+For CES with $\sigma \neq 1$, the equilibrium price is strictly monotone in $q$.
+
+An outside observer who knows the equilibrium map $p^*(\cdot)$ can therefore
+invert the price uniquely to recover $q$, so the inside information is fully
+transmitted.
+
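+As a check, here is a minimal sketch that recovers the posterior from an
+observed price by numerically inverting the monotone map $p^*(\cdot)$ in the
+$\rho = 0.5$ case:
+
+```{code-cell} ipython3
+ρ = 0.5
+q_true = 0.7
+p_obs = eq_price(q_true, a1, a2, W1, ρ)
+
+# invert p*(q) = p_obs for q on (0, 1)
+q_hat = brentq(lambda q: eq_price(q, a1, a2, W1, ρ) - p_obs,
+               1e-3, 1 - 1e-3)
+print(f"true q = {q_true}, recovered q = {q_hat:.6f}")
+```
+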
+For Cobb-Douglas ($\sigma = 1$), the price is flat in $q$, so information is
+never transmitted through the market.
+
+```{code-cell} ipython3
+p_cd = [eq_price(q, a1, a2, W1, ρ=0.0) for q in q_grid]
+
+print(f"Cobb-Douglas (rho=0): min p* = {min(p_cd):.6f}, "
+ f"max p* = {max(p_cd):.6f}, "
+ f"range = {max(p_cd)-min(p_cd):.2e}")
+print(f"Analytical CD price = W1/2 = {W1/2:.6f}")
+```
+
+Every computed price equals $W_1/2 = 2.0$ up to numerical tolerance,
+confirming the analytical result that the Cobb-Douglas equilibrium price is
+independent of $q$ and of the state values $a_1, a_2$.
+
+The numerical plot shows monotonicity, and the next subsection connects that
+pattern back to the proof of {prf:ref}`ime_theorem_invertibility_conditions`.
+
+(price_monotonicity)=
+### Why monotonicity depends on $\sigma$
+
+Fix a price $p$ and treat $\alpha_s(p)$ and $\beta_s(p)$ as constants.
+
+The right-hand side of the two-state first-order condition
+
+$$
+\frac{\alpha_1(p)\, q + \alpha_2(p)\, (1-q)}
+ {\beta_1(p)\, q + \beta_2(p)\, (1-q)}
+$$
+
+is then a function of $q$ alone, with derivative
+
+$$
+\frac{\partial}{\partial q}
+\frac{\alpha_1 q + \alpha_2 (1-q)}
+ {\beta_1 q + \beta_2 (1-q)}
+= \frac{\alpha_1 \beta_2 - \alpha_2 \beta_1}
+ {\bigl[\beta_1 q + \beta_2 (1-q)\bigr]^2}.
+$$
+
+So the sign is determined by $\alpha_1 \beta_2 - \alpha_2 \beta_1$, and if that
+sign is constant then for each fixed price there is at most one posterior weight
+$q$ consistent with the first-order condition, which is exactly what
+{prf:ref}`ime_theorem_invertibility_conditions` requires.
+
+Using
+
+$$
+\frac{\alpha_s}{\beta_s}
+ = \frac{a_s\, u_1(a_s x_1, x_2)}{u_2(a_s x_1, x_2)}
+ = a_s^{(\sigma-1)/\sigma}\,\Bigl(\frac{x_2}{x_1}\Bigr)^{1/\sigma},
+$$
+
+one can show
+
+$$
+\frac{\partial}{\partial a}\,\frac{\alpha}{\beta}
+ = \frac{(\sigma - 1)}{\sigma}\, a^{-1/\sigma}\,
+ \Bigl(\frac{x_2}{x_1}\Bigr)^{1/\sigma}.
+$$
+
+For the CES specification, this derivative is positive when $\sigma > 1$,
+negative when
+$\sigma < 1$, and *zero when $\sigma = 1$*.
+
+In other words, for CES utility the ratio $\alpha_s / \beta_s$ moves
+monotonically with the state value $a_s$ unless $\sigma = 1$, which makes the
+fixed-price first-order-condition expression monotone in $q$ and in turn
+delivers invertibility.
+
+The vanishing derivative in the Cobb-Douglas case means the marginal rate of
+substitution is
+independent of $a_s$, so the informed agent's demand, and hence the equilibrium
+price, does
+not respond to changes in beliefs.
+
+Let us visualize the ratio $\alpha_s / \beta_s$ as a function of $a_s$ for
+different
+values of $\sigma$:
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: marginal rate of substitution
+ name: fig-mrs-alpha-beta
+---
+a_vals = np.linspace(0.3, 3.0, 300)
+x1_fix, x2_fix = 1.0, 1.0
+
+fig, ax = plt.subplots(figsize=(7, 4))
+for ρ in [-0.5, -1e-6, 0.5]:
+ σ = 1 / (1 - ρ) if abs(ρ) > 1e-8 else 1.0
+ ratios = []
+ for a in a_vals:
+ u1, u2 = ces_derivatives(a * x1_fix, x2_fix, ρ)
+ ratios.append(a * u1 / u2)
+ ax.plot(a_vals, ratios, label=rf"$\sigma = {σ:.2f}$", lw=2)
+
+ax.set_xlabel(r"state value $a_s$", fontsize=12)
+ax.set_ylabel(r"$\alpha_s / \beta_s = a_s u_1 / u_2$", fontsize=12)
+ax.axhline(y=1.0, color="black", lw=0.8, ls="--")
+ax.legend(fontsize=10)
+plt.tight_layout()
+plt.show()
+```
+
+When $\sigma = 1$ the ratio is constant across all $a_s$ values, so
+information about the state has no effect on the marginal rate of substitution.
+
+For $\sigma < 1$ the ratio is decreasing in $a_s$, and for $\sigma > 1$ it is
+increasing, making the equilibrium price strictly monotone in the posterior $q$
+in both cases.
+
+The static analysis asks whether a current price reveals current private
+information, whereas the next section asks what a whole history of prices
+reveals over time.
+
+(bayesian_price_expectations)=
+## Bayesian price expectations in a dynamic economy
+
+We now turn to a question addressed in Section 3 of
+{cite:t}`kihlstrom_mirman1975`.
+
+### A stochastic exchange economy
+
+Time is discrete: $t = 1, 2, \ldots$
+
+In each period $t$:
+
+1. Consumer $i$ receives a random endowment $\omega_i^t$.
+2. Markets open; competitive prices $p^t = p(\omega^t)$ clear all markets.
+3. Consumers trade and consume.
+
+The endowment vectors $\{\tilde{\omega}^t\}$ are **i.i.d.** with density
+$f(\omega^t \mid \lambda)$, where $\lambda = (\lambda_1, \ldots, \lambda_K)$ is
+a
+**structural parameter vector** (of dimension $K$) that is *fixed but unknown*.
+
+The equilibrium price at time $t$ is a deterministic function of $\omega^t$, so
+$\{p^t\}$ is also i.i.d.
+
+For any measurable price set $P$, let
+
+$$
+W(P) = \{\omega^t : p(\omega^t) \in P\}.
+$$
+
+Then
+
+$$
+P_\lambda(p^t \in P) = P_\lambda(\omega^t \in W(P))
+= \int_{W(P)} f(\omega^t \mid \lambda)\, d\omega^t.
+$$
+
+The induced price density is denoted by $g(p^t \mid \lambda)$.
+
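+To see how a structural density over endowments induces a price density,
+here is a hedged sketch under purely hypothetical assumptions: endowments are
+independent Gamma draws with shape parameters $\lambda = (2, 3)$ and the price
+map is $p(\omega) = \omega_1 / (\omega_1 + \omega_2)$, in which case the
+induced density $g(p \mid \lambda)$ is $\text{Beta}(2, 3)$.
+
+```{code-cell} ipython3
+---
+mystnb:
+  figure:
+    caption: induced price density
+    name: fig-induced-price-density
+---
+from scipy.stats import beta
+
+rng = np.random.default_rng(3)
+λ = (2.0, 3.0)                              # hypothetical structural parameters
+ω = rng.gamma(shape=λ, size=(100_000, 2))   # i.i.d. endowment draws
+p = ω[:, 0] / (ω[:, 0] + ω[:, 1])           # hypothetical price map p(ω)
+
+fig, ax = plt.subplots(figsize=(7, 4))
+ax.hist(p, bins=60, density=True, alpha=0.4, label="simulated prices")
+p_grid = np.linspace(0.01, 0.99, 200)
+ax.plot(p_grid, beta.pdf(p_grid, *λ), "k-", lw=2,
+        label=r"induced density $g(p \mid \lambda)$")
+ax.set_xlabel("price $p$", fontsize=12)
+ax.legend(fontsize=10)
+plt.show()
+```
+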
+For a given structure $\lambda$, this density is the observable implication of
+the model, and when several structures imply the same density we group them
+into a single reduced-form class.
+
+The next issue is therefore what an observer can and cannot infer about the
+structure from price data alone.
+
+### The identification problem
+
+Because the map $\omega \mapsto p(\omega)$ is many-to-one, observing prices
+loses
+information relative to observing endowments.
+
+In particular, it may be impossible to
+recover $\lambda$ from $g(p \mid \lambda)$ even with infinite price data.
+
+To handle this, partition $\Lambda$ into equivalence classes $\mu$ such that
+$\lambda \in \mu$ and $\lambda' \in \mu$ whenever $g(p \mid \lambda) = g(p \mid
+\lambda')$
+for all $p$.
+
+The equivalence class $\mu$ containing the true $\lambda$ is the **reduced
+form** relevant for price data.
+
+An observer who knows the infinite price history learns
+$\mu$ but not necessarily $\lambda$.
+
+Once that distinction is clear, Bayesian updating can be written down directly.
+
+### Bayesian updating
+
+An uninformed observer begins with a prior $h(\lambda)$ over $\lambda \in
+\Lambda$.
+
+If the observer could see endowments directly, the posterior would be
+
+$$
+h(\lambda \mid \omega^1, \ldots, \omega^t)
+ = \frac{h(\lambda)\, \prod_{\tau=1}^{t} f(\omega^\tau \mid \lambda)}
+ {\displaystyle\sum_{\lambda' \in \Lambda}
+ h(\lambda')\, \prod_{\tau=1}^{t} f(\omega^\tau \mid \lambda')},
+$$
+
+and the paper appeals to a Bayesian consistency result to conclude that this
+posterior concentrates on the true structure $\bar \lambda$.
+
+After observing the price sequence $(p^1, \ldots, p^t)$, the observer's Bayesian
+posterior is
+
+$$
+h(\lambda \mid p^1, \ldots, p^t)
+ = \frac{h(\lambda)\, \prod_{\tau=1}^{t} g(p^\tau \mid \lambda)}
+ {\displaystyle\sum_{\lambda' \in \Lambda}
+ h(\lambda')\, \prod_{\tau=1}^{t} g(p^\tau \mid \lambda')}.
+$$
+
+Price data cannot distinguish structures inside the same reduced-form class.
+
+Indeed, if
+$\lambda$ and $\lambda'$ belong to the same class $\mu$, then
+$g(\cdot \mid \lambda) = g(\cdot \mid \lambda')$, so
+
+$$
+\frac{h(\lambda \mid p^1, \ldots, p^t)}
+ {h(\lambda' \mid p^1, \ldots, p^t)}
+= \frac{h(\lambda)}{h(\lambda')}
+$$
+
+for every sample history, so the relative odds within an observationally
+equivalent class never change.
+
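+A short simulation with three hypothetical structures, two of which share a
+price density, makes the point: no matter how many prices arrive, the
+posterior odds between the observationally equivalent structures stay at
+their prior value.
+
+```{code-cell} ipython3
+rng = np.random.default_rng(0)
+h = np.array([0.2, 0.5, 0.3])        # prior over (λ_1, λ_2, λ_3)
+means = np.array([2.0, 2.0, 1.2])    # λ_1 and λ_2 imply the same price density
+
+for p in rng.normal(2.0, 0.4, size=50):
+    h = h * norm.pdf(p, loc=means, scale=0.4)
+    h /= h.sum()
+
+print(f"posterior odds h(λ_1)/h(λ_2) = {h[0]/h[1]:.6f} (prior odds = 0.4)")
+```
+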
+At time $t$, the observer's price expectations for the next period are
+
+$$
+g(p^{t+1} \mid p^1, \ldots, p^t)
+ = \sum_{\lambda \in \Lambda} g(p^{t+1} \mid \lambda)\,
+ h(\lambda \mid p^1, \ldots, p^t).
+$$
+
+### The convergence theorem
+
+```{prf:theorem} Bayesian Convergence
+:label: ime_theorem_bayesian_convergence
+
+Let $\bar\lambda$ be the true
+structural parameter and $\bar\mu$ the reduced form that contains $\bar\lambda$.
+
+Assume the prior assigns positive probability to the reduced-form class $\bar\mu$.
+
+Define the posterior mass on a reduced-form class by
+
+$$
+H_t(\mu) = \sum_{\lambda \in \mu} h(\lambda \mid p^1, \ldots, p^t).
+$$
+
+Because all structures inside a class imply the same $g(\cdot \mid \lambda)$,
+the
+predictive density can equivalently be written as
+
+$$
+g(p^{t+1} \mid p^1, \ldots, p^t)
+ = \sum_{\mu} g(p^{t+1} \mid \mu)\, H_t(\mu).
+$$
+
+Then
+
+$$
+\lim_{t \to \infty} H_t(\mu)
+ = \begin{cases} 1 & \text{if } \mu = \bar\mu, \\ 0 & \text{otherwise,}
+ \end{cases}
+$$
+
+with probability one.
+
+Consequently,
+
+$$
+\lim_{t \to \infty} g(p^{t+1} \mid p^1, \ldots, p^t) = g(p \mid \bar\mu),
+$$
+
+which equals the rational-expectations price distribution for a fully informed
+observer.
+```
+
+```{note}
+Note that the theorem only requires the prior to assign positive probability to the reduced-form class $\bar\mu$ that contains the true structure $\bar\lambda$.
+
+This is implied by, but weaker than, assigning positive probability to the true
+structural parameter $\bar\lambda$ itself.
+
+A prior could place zero mass on $\bar\lambda$
+while still placing positive mass on other structures inside $\bar\mu$.
+```
+
+The important distinction is that price observers need not learn $\bar \lambda$
+itself.
+
+They only learn which reduced-form class is correct.
+
+That is enough for forecasting because every $\lambda \in \bar \mu$ generates
+the same price density $g(\cdot \mid \bar \mu)$.
+
+Rational price expectations emerge from
+learning the
+reduced form, not from identifying every structural detail of the economy.
+
+Here "rational expectations" means that the observer's predictive distribution
+for next
+period's price matches the objective price distribution generated by the true
+reduced form.
+
+Let's now turn to a simple simulation.
+
+(bayesian_simulation)=
+## Simulating Bayesian learning from prices
+
+We illustrate the theorem with a two-state example.
+
+Two possible reduced forms $\mu_1$ and $\mu_2$ generate prices
+$p^t \sim N(\bar{p}_i, \sigma_p^2)$ for $i = 1, 2$ respectively.
+
+The observer knows the two possible price distributions (the reduced forms) but
+not which
+one governs the data.
+
+This is a **Bayesian model selection** problem of the kind we studied in {doc}`likelihood_bayes`.
+
+With a prior $h_0$ on $\mu_1$ and the observed price $p^t$, the posterior weight
+on $\mu_1$
+after period $t$ is
+
+$$
+h_t = \frac{h_{t-1}\, g(p^t \mid \mu_1)}{h_{t-1}\, g(p^t \mid \mu_1)
+ + (1-h_{t-1})\, g(p^t \mid \mu_2)}.
+$$
+
+We consider a numerical example with two normal distributions that have different means.
+
+```{code-cell} ipython3
+def simulate_bayesian_learning(
+ p_bar_true, p_bar_alt, σ_p, T, h0, n_paths, seed=42
+):
+ """Simulate posterior learning between two Gaussian reduced forms."""
+ rng = np.random.default_rng(seed)
+ h_paths = np.zeros((n_paths, T + 1))
+ h_paths[:, 0] = h0
+
+ for path in range(n_paths):
+ h = h0
+ prices = rng.normal(p_bar_true, σ_p, size=T)
+ for t, p in enumerate(prices):
+ g_true = norm.pdf(p, loc=p_bar_true, scale=σ_p)
+ g_alt = norm.pdf(p, loc=p_bar_alt, scale=σ_p)
+ denom = h * g_true + (1 - h) * g_alt
+ h = h * g_true / denom
+ h_paths[path, t + 1] = h
+
+ return h_paths
+
+
+def plot_bayesian_learning(h_paths, p_bar_true, p_bar_alt, ax):
+ """Plot posterior beliefs over time."""
+ T = h_paths.shape[1] - 1
+ t_grid = np.arange(T + 1)
+
+ for path in h_paths:
+ ax.plot(t_grid, path, alpha=0.25, lw=0.8, color="steelblue")
+
+ median_path = np.median(h_paths, axis=0)
+ ax.plot(t_grid, median_path, color="navy", lw=2, label="median posterior")
+
+ ax.axhline(
+ y=1.0,
+ color="black",
+ ls="--",
+ lw=1.2,
+ label="true model weight = 1",
+ )
+ ax.set_xlabel("period $t$", fontsize=12)
+ ax.set_ylabel(r"$h_t$ = posterior weight on true model", fontsize=12)
+ ax.legend(fontsize=10)
+```
+
+We consider two cases, one easy to learn and one harder, using $T = 300$
+periods, $n = 40$ simulated paths, a diffuse prior $h_0 = 0.5$, and a common
+standard deviation $\sigma_p = 0.4$.
+
+- *Easy case*: true model $N(2.0,\, 0.4^2)$, alternative $N(1.2,\, 0.4^2)$.
+- *Hard case*: true model $N(2.0,\, 0.4^2)$, alternative $N(1.8,\, 0.4^2)$.
+
+Whether learning is easy or hard depends on how close the true distribution is
+to the alternative hypothesis.
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: bayesian learning across paths
+ name: fig-bayesian-learning
+---
+T = 300
+h0 = 0.5 # diffuse prior
+n_paths = 40
+σ_p = 0.4
+
+fig, axes = plt.subplots(1, 2, figsize=(12, 5))
+
+# Distinct reduced forms
+p_bar_true, p_bar_alt = 2.0, 1.2
+h_paths = simulate_bayesian_learning(p_bar_true, p_bar_alt, σ_p, T, h0, n_paths)
+plot_bayesian_learning(h_paths, p_bar_true, p_bar_alt, axes[0])
+
+# Similar reduced forms
+p_bar_true, p_bar_alt = 2.0, 1.8
+h_paths_hard = simulate_bayesian_learning(
+ p_bar_true, p_bar_alt, σ_p, T, h0, n_paths
+)
+plot_bayesian_learning(h_paths_hard, p_bar_true, p_bar_alt, axes[1])
+
+plt.tight_layout()
+plt.show()
+```
+
+In both panels the posterior weight on the true model converges to 1 with
+probability one,
+though convergence is slower when the two price distributions are similar (right
+panel).
+
+### Price expectations vs. rational expectations
+
+We now verify that the observer's price expectations converge to the
+rational-expectations
+distribution $g(p \mid \bar\mu)$.
+
+We continue to use the parameterization of the "easy-to-learn" example above
+($\bar{p}_{\text{true}} = 2.0$, $\bar{p}_{\text{alt}} = 1.2$, $\sigma_p = 0.4$),
+now extending to $T = 1{,}000$ periods with a single simulated path and prior
+$h_0 = 0.5$.
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: price distribution convergence
+ name: fig-price-convergence
+---
+def price_expectation(h_t, p_bar_true, p_bar_alt, σ_p, p_grid):
+ """Return the predictive price density at posterior weight h_t."""
+ return (
+ h_t * norm.pdf(p_grid, loc=p_bar_true, scale=σ_p)
+ + (1 - h_t) * norm.pdf(p_grid, loc=p_bar_alt, scale=σ_p)
+ )
+
+
+p_bar_true, p_bar_alt = 2.0, 1.2
+σ_p = 0.4
+n_paths = 1
+T_long = 1000
+
+h_paths_long = simulate_bayesian_learning(
+ p_bar_true, p_bar_alt, σ_p, T_long, h0=0.5, n_paths=n_paths, seed=7
+)
+
+p_grid = np.linspace(0.0, 3.5, 300)
+re_density = norm.pdf(p_grid, loc=p_bar_true, scale=σ_p)
+
+fig, ax = plt.subplots(figsize=(8, 5))
+snapshots = [0, 1, 3, 5, 10]
+palette = plt.cm.Blues(np.linspace(0.3, 1.0, len(snapshots)))
+
+for t_snap, col in zip(snapshots, palette):
+ h_t = h_paths_long[0, t_snap]
+ dens = price_expectation(h_t, p_bar_true, p_bar_alt, σ_p, p_grid)
+ ax.plot(
+ p_grid,
+ dens,
+ color=col,
+ lw=2,
+ label=rf"$t = {t_snap}$, $h_t = {h_t:.3f}$",
+ )
+
+ax.plot(p_grid, re_density, "k--", lw=2,
+ label=r"rational expectations $g(p \mid \bar{\mu})$")
+ax.set_xlabel("price $p$", fontsize=12)
+ax.set_ylabel("density", fontsize=12)
+ax.legend(fontsize=9)
+plt.tight_layout()
+plt.show()
+```
+
+The sequence of predictive densities (shades of blue) converges to the
+rational-expectations
+density (dashed black line) as experience accumulates.
+
+This illustrates {prf:ref}`ime_theorem_bayesian_convergence`.
+
+We can now sharpen the point by looking at a case in which the reduced form is
+learned but the underlying structure is not.
+
+(km_extension_nonidentification)=
+### Learning the reduced form without identifying the structure
+
+The convergence result is particularly striking because the observer converges
+to
+*rational expectations* even when the underlying **structure** $\lambda$ is
+*not identified* by prices.
+
+To illustrate this, consider a case with *three* possible structures
+$\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}$ but only *two* reduced forms
+$\mu_1 = \{\lambda^{(1)}, \lambda^{(2)}\}$ and $\mu_2 = \{\lambda^{(3)}\}$
+(because $\lambda^{(1)}$ and $\lambda^{(2)}$ generate the same price
+distribution).
+
+The three structures have price means $\bar{p}_1 = \bar{p}_2 = 2.0$ and
+$\bar{p}_3 = 1.2$, with common standard deviation $\sigma_p = 0.4$, a
+uniform prior $h_0 = (1/3, 1/3, 1/3)$, and $T = 400$ periods over $30$ paths.
+
+The true structure is $\lambda^{(1)}$.
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: learning with non-identification
+ name: fig-nonidentification
+---
+def simulate_learning_3struct(
+ T, h0_vec, p_bar_vec, σ_p, true_idx, n_paths, seed=0
+):
+ """Simulate learning with three structures and two reduced forms."""
+ rng = np.random.default_rng(seed)
+ h_paths = np.zeros((n_paths, T + 1, 3))
+ h_paths[:, 0, :] = h0_vec
+
+ for path in range(n_paths):
+ h = np.array(h0_vec, dtype=float)
+ prices = rng.normal(p_bar_vec[true_idx], σ_p, size=T)
+ for t, p in enumerate(prices):
+ likelihoods = norm.pdf(p, loc=p_bar_vec, scale=σ_p)
+ h = h * likelihoods
+ h /= h.sum()
+ h_paths[path, t + 1, :] = h
+
+ return h_paths
+
+
+# Structures 0 and 1 share the same reduced form
+p_bar_vec = np.array([2.0, 2.0, 1.2])
+h0_vec = np.array([1 / 3, 1 / 3, 1 / 3])
+σ_p = 0.4
+T = 400
+true_idx = 0 # Structure 0 is observationally equivalent to 1
+
+h_paths_3 = simulate_learning_3struct(
+ T, h0_vec, p_bar_vec, σ_p, true_idx, n_paths=30
+)
+t_grid = np.arange(T + 1)
+
+fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)
+struct_labels = [
+ r"$\lambda^{(1)}$",
+ r"$\lambda^{(2)}$",
+ r"$\lambda^{(3)}$",
+]
+
+for k, (ax, label) in enumerate(zip(axes, struct_labels)):
+ for path in h_paths_3:
+ ax.plot(t_grid, path[:, k], alpha=0.25, lw=0.8, color="steelblue")
+ ax.plot(t_grid, np.median(h_paths_3[:, :, k], axis=0),
+ color="navy", lw=2, label=f"median weight on {label}")
+ ax.set_xlabel("period $t$", fontsize=11)
+ ax.legend(fontsize=9)
+
+axes[0].set_ylabel("posterior weight", fontsize=11)
+plt.tight_layout()
+plt.show()
+```
+
+The observer correctly rules out $\lambda^{(3)}$ (the wrong reduced form) with
+probability
+one, but cannot distinguish $\lambda^{(1)}$ from $\lambda^{(2)}$ because they
+generate an
+identical price distribution.
+
+Nevertheless, the observer's **price expectations** converge
+to rational expectations because both structures imply the same reduced form
+$\bar\mu$.
+
+
+## Exercises
+
+```{exercise}
+:label: km_ex1
+
+Consider a two-state economy ($a_1 = 2$,
+$a_2 = 0.5$) where the informed agent has **CARA** (constant absolute risk
+aversion)
+preferences over portfolio wealth:
+
+$$
+u(W) = -e^{-\gamma W}, \quad W = x_2 + \bar{a}\, x_1.
+$$
+
+The agent chooses $x_1$ to maximize
+
+$$
+q\,u(W_1) + (1-q)\,u(W_2), \quad W_s = w - p\,x_1 + a_s\,x_1,
+$$
+
+subject to the budget constraint $p\,x_1 + x_2 = w$.
+
+Total supply of good 1 is $X_1 = 1$.
+
+1. Derive the first-order condition for the informed agent's optimal $x_1$.
+
+1. Use the market-clearing condition $x_1 = 1$ (the informed agent absorbs the
+ entire
+supply) to obtain an implicit equation for the equilibrium price $p^*(q)$.
+Solve it
+numerically for $q \in (0,1)$ and several values of $\gamma$.
+
+1. Show numerically that $p^*(q)$ is monotone in $q$, so the invertibility
+ condition
+holds in this example. Explain why this is economically similar to the $\sigma >
+1$ case in
+{prf:ref}`ime_theorem_invertibility_conditions`, but not a direct application of
+that theorem.
+```
+
+```{solution-start} km_ex1
+:class: dropdown
+```
+
+For the first-order condition, define $W_s = w + (a_s - p)\,x_1$ for
+$s = 1, 2$.
+
+Then the FOC is
+
+$$
+q\,(a_1 - p)\,\gamma\, e^{-\gamma W_1}
+= (1-q)\,(p - a_2)\,\gamma\, e^{-\gamma W_2},
+$$
+
+or equivalently (dividing by $\gamma$ and rearranging)
+
+$$
+q\,(a_1 - p)\, e^{-\gamma(a_1-p) x_1}
+ = (1-q)\,(p - a_2)\, e^{\gamma(p-a_2) x_1}.
+$$
+
+Setting $x_1 = 1$ (the informed agent absorbs all supply), this becomes a
+scalar root-finding problem in $p$:
+
+$$
+F(p;\,q,\gamma) \equiv
+ q\,(a_1-p)\,e^{-\gamma(a_1-p)} - (1-q)\,(p-a_2)\,e^{\gamma(p-a_2)} = 0.
+$$
+
+```{code-cell} ipython3
+from scipy.optimize import brentq
+
+def F_cara(p, q, a1, a2, γ, x1=1.0):
+ """Residual for the CARA equilibrium condition."""
+ return (q * (a1 - p) * np.exp(-γ * (a1 - p) * x1)
+ - (1 - q) * (p - a2) * np.exp(γ * (p - a2) * x1))
+
+a1, a2 = 2.0, 0.5
+q_grid = np.linspace(0.05, 0.95, 200)
+γ_values = [0.5, 1.0, 2.0, 5.0]
+colors_sol = plt.cm.plasma(np.linspace(0.15, 0.85, len(γ_values)))
+
+fig, ax = plt.subplots(figsize=(8, 5))
+for γ, color in zip(γ_values, colors_sol):
+ p_eq = [brentq(F_cara, a2, a1,
+ args=(q, a1, a2, γ))
+ for q in q_grid]
+ ax.plot(q_grid, p_eq, lw=2, color=color,
+ label=rf"$\gamma = {γ}$")
+
+ax.set_xlabel(r"posterior $q = \Pr(\bar a = a_1)$", fontsize=12)
+ax.set_ylabel("equilibrium price $p^*(q)$", fontsize=12)
+ax.set_title("CARA preferences: equilibrium prices", fontsize=12)
+ax.legend(fontsize=10)
+plt.tight_layout()
+plt.show()
+```
+
+The price is strictly increasing in $q$ for every $\gamma > 0$.
+
+The reason is that portfolio utility $u(x_2 + \bar{a}\,x_1)$ treats the two
+goods as perfect substitutes in creating wealth, so a higher posterior
+probability of the high-return state raises the marginal value of the risky
+asset and pushes the equilibrium price upward.
+
+This behavior is similar in spirit to the $\sigma > 1$ case in
+{prf:ref}`ime_theorem_invertibility_conditions`, but it is not a direct
+consequence of that theorem because CARA utility over wealth is not homothetic
+in the two-good representation used in the theorem.
+
+Here monotonicity is verified directly from the specific first-order condition.
+
+```{solution-end}
+```
+
+```{exercise}
+:label: km_ex2
+
+In the Bayesian learning simulation, the speed of
+convergence to rational expectations is determined by the **Kullback-Leibler
+divergence**
+between the two reduced forms.
+
+The KL divergence from $g(\cdot \mid \mu_2)$ to $g(\cdot \mid \mu_1)$, for two
+normal
+distributions with means $\bar{p}_1$ and $\bar{p}_2$ and common variance
+$\sigma_p^2$, is
+
+$$
+D_{KL}(\mu_1 \| \mu_2) = \frac{(\bar{p}_1 - \bar{p}_2)^2}{2\sigma_p^2}.
+$$
+
+1. For the "easy" case ($\bar{p}_1 = 2.0$, $\bar{p}_2 = 1.2$) and the "hard"
+ case
+($\bar{p}_1 = 2.0$, $\bar{p}_2 = 1.8$), compute $D_{KL}$ for $\sigma_p = 0.4$.
+
+1. Re-run the simulations from the lecture for both cases with $n=100$ paths.
+ For each
+path compute the first period $T_{0.99}$ at which $h_t \geq 0.99$. Plot
+histograms of
+$T_{0.99}$ for both cases.
+
+1. How does the median $T_{0.99}$ scale with $D_{KL}$? Verify numerically that
+roughly $T_{0.99} \approx C / D_{KL}$ for some constant $C$.
+```
+
+```{solution-start} km_ex2
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+σ_p = 0.4
+
+def kl_normal(p1, p2, σ):
+ """Return the KL divergence for N(p1, σ^2) and N(p2, σ^2)."""
+ return (p1 - p2)**2 / (2 * σ**2)
+
+cases = [("Easy", 2.0, 1.2), ("Hard", 2.0, 1.8)]
+for name, p1, p2 in cases:
+ kl = kl_normal(p1, p2, σ_p)
+ print(f"{name} case: D_KL = {kl:.4f}")
+
+n_paths = 100
+
+fig, axes = plt.subplots(1, 2, figsize=(11, 4))
+for ax, (name, p1, p2) in zip(axes, cases):
+ kl = kl_normal(p1, p2, σ_p)
+ paths = simulate_bayesian_learning(p1, p2, σ_p, T=2000,
+ h0=0.5, n_paths=n_paths, seed=42)
+ # First period with posterior >= 0.99
+ T99 = []
+ for path in paths:
+ idx = np.where(path >= 0.99)[0]
+ T99.append(idx[0] if len(idx) > 0 else 2001)
+
+ median_T = np.median(T99)
+ ax.hist(T99, bins=20, color="steelblue", edgecolor="white", alpha=0.8)
+ ax.axvline(median_T, color="crimson", lw=2,
+ label=fr"Median $T_{{0.99}} = {median_T:.0f}$")
+ ax.set_title(
+ f"{name}: $D_{{KL}} = {kl:.4f}$, "
+ fr"$C/D_{{KL}} \approx {median_T*kl:.1f}$",
+ fontsize=11
+ )
+ ax.set_xlabel(r"$T_{0.99}$", fontsize=12)
+ ax.set_ylabel("count", fontsize=11)
+ ax.legend(fontsize=10)
+
+plt.tight_layout()
+plt.show()
+```
+
+The median $T_{0.99}$ scales as approximately $C/D_{KL}$, confirming that
+learning is
+faster when the two reduced forms are more easily distinguished (large
+$D_{KL}$).
+
+```{solution-end}
+```
+
+```{exercise}
+:label: km_ex3
+
+{prf:ref}`ime_theorem_bayesian_convergence`
+assumes that the prior assigns positive probability to the reduced-form class
+$\bar\mu$ containing the true structure, so the true price distribution
+$g(\cdot \mid \bar\mu)$ lies in the support of the prior predictive.
+
+Investigate what happens when the true model is *not* in the
+prior support.
+
+Simulate $T = 1{,}000$ periods of prices from $N(2.0, 0.4^2)$ but use a prior
+that places equal weight on two *wrong* models: $N(1.5, 0.4^2)$ and
+$N(2.3, 0.4^2)$.
+
+Plot the posterior weight on each model over time.
+
+Discuss your findings.
+```
+
+```{solution-start} km_ex3
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+def simulate_misspecified(
+ T, p_bar_true, p_bar_wrong, σ_p, h0, n_paths, seed=0
+):
+ """Simulate learning under a misspecified two-model prior."""
+ rng = np.random.default_rng(seed)
+ h_paths = np.zeros((n_paths, T + 1, 2))
+ h_paths[:, 0, :] = h0
+
+ for path in range(n_paths):
+ h = np.array(h0, dtype=float)
+ prices = rng.normal(p_bar_true, σ_p, size=T)
+ for t, price in enumerate(prices):
+ likes = norm.pdf(price, loc=p_bar_wrong, scale=σ_p)
+ h = h * likes
+ h /= h.sum()
+ h_paths[path, t + 1, :] = h
+
+ return h_paths
+
+
+def predictive_density(weights, means, σ_p, p_grid):
+ """Return the predictive density under the current posterior weights."""
+ density = np.zeros_like(p_grid)
+ for weight, mean in zip(weights, means):
+ density += weight * norm.pdf(p_grid, loc=mean, scale=σ_p)
+ return density
+
+
+T = 1000
+p_true = 2.0
+p_wrong = np.array([1.5, 2.3])
+σ_p = 0.4
+h0 = np.array([0.5, 0.5])
+n_paths = 30
+
+h_misspec = simulate_misspecified(T, p_true, p_wrong, σ_p, h0, n_paths)
+
+kl_vals = (p_true - p_wrong)**2 / (2 * σ_p**2)
+for mean, kl in zip(p_wrong, kl_vals):
+ print(f"KL(true || N({mean:.1f}, σ^2)) = {kl:.4f}")
+
+t_grid = np.arange(T + 1)
+fig, axes = plt.subplots(1, 2, figsize=(12, 4))
+
+labels = [r"$N(1.5, \sigma^2)$", r"$N(2.3, \sigma^2)$"]
+for ax, k, label in zip(axes, [0, 1], labels):
+ for path in h_misspec:
+ ax.plot(t_grid, path[:, k], alpha=0.2, lw=0.8, color="steelblue")
+ ax.plot(t_grid, np.median(h_misspec[:, :, k], axis=0),
+ color="navy", lw=2, label="median")
+ ax.set_title(f"Posterior weight on {label}", fontsize=11)
+ ax.set_xlabel("period $t$", fontsize=11)
+ ax.set_ylabel("posterior weight", fontsize=11)
+ ax.legend(fontsize=9)
+
+plt.tight_layout()
+plt.show()
+
+# Predictive density and mean along the median posterior path
+median_path = np.median(h_misspec, axis=0)
+p_grid = np.linspace(0.0, 3.5, 300)
+closer_idx = np.argmin(kl_vals)
+
+fig, ax = plt.subplots(figsize=(8, 4))
+colors = plt.cm.Blues(np.linspace(0.3, 1.0, 4))
+for t_snap, color in zip([0, 10, 100, T], colors):
+ dens = predictive_density(median_path[t_snap], p_wrong, σ_p, p_grid)
+ ax.plot(p_grid, dens, color=color, lw=2, label=f"t = {t_snap}")
+
+ax.plot(
+ p_grid,
+ norm.pdf(p_grid, loc=p_wrong[closer_idx], scale=σ_p),
+ "k--",
+ lw=2,
+ label="KL-best wrong model",
+)
+ax.set_xlabel("price $p$", fontsize=11)
+ax.set_ylabel("density", fontsize=11)
+ax.legend(fontsize=9)
+plt.tight_layout()
+plt.show()
+
+pred_mean = np.median(
+ h_misspec[:, :, 0] * p_wrong[0] + h_misspec[:, :, 1] * p_wrong[1], axis=0
+)
+print(f"True mean: {p_true}")
+print(f"Predictive mean at T={T}: {pred_mean[-1]:.4f}")
+print(f"Closer misspecified mean: {p_wrong[np.argmin(kl_vals)]:.1f}")
+```
+
+Here
+
+$$
+D_{KL}\bigl(N(2.0, 0.4^2)\,\|\,N(2.3, 0.4^2)\bigr)
+<
+D_{KL}\bigl(N(2.0, 0.4^2)\,\|\,N(1.5, 0.4^2)\bigr),
+$$
+
+so the model with mean $2.3$ is the KL-best approximation among the two wrong
+models, and in the simulation posterior weight concentrates on that model.
+
+Posterior odds are cumulative likelihood ratios (see {doc}`likelihood_bayes`).
+
+If we compare the two wrong Gaussian models $f$ and $g$, then under the true
+distribution $h$ the cumulative likelihood ratio
+$L_t = \prod_{\tau=1}^{t} f(p^\tau) / g(p^\tau)$ satisfies
+
+$$
+\frac{1}{t} E_h[\log L_t] = D_{KL}(h \,\|\, g) - D_{KL}(h \,\|\, f).
+$$
+
+So if $f$ is KL-closer to $h$ than $g$ is, $\log L_t$ has positive drift and
+posterior odds tilt toward $f$.
+
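+A quick Monte Carlo check of this drift formula, reusing `kl_normal` from the
+previous exercise:
+
+```{code-cell} ipython3
+rng = np.random.default_rng(1)
+p = rng.normal(2.0, 0.4, size=200_000)   # draws from the true density h
+
+# f = N(2.3, 0.4^2) is KL-closer to h than g = N(1.5, 0.4^2)
+log_L = (np.log(norm.pdf(p, loc=2.3, scale=0.4))
+         - np.log(norm.pdf(p, loc=1.5, scale=0.4)))
+
+drift_mc = log_L.mean()
+drift_theory = kl_normal(2.0, 1.5, 0.4) - kl_normal(2.0, 2.3, 0.4)
+print(f"Monte Carlo drift: {drift_mc:.4f}, theoretical: {drift_theory:.4f}")
+```
+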
+```{solution-end}
+```
diff --git a/lectures/lagrangian_lqdp.md b/lectures/lagrangian_lqdp.md
index f1e680cc6..5cce10050 100644
--- a/lectures/lagrangian_lqdp.md
+++ b/lectures/lagrangian_lqdp.md
@@ -451,11 +451,16 @@ solves. See {cite}`Ljungqvist2012`, ch 12.
## Application
-Here we demonstrate the computation with an example which is the deterministic version of an example borrowed from this [quantecon lecture](https://python.quantecon.org/lqcontrol.html).
+Here we demonstrate the computation with the deterministic permanent-income example from this [quantecon lecture](https://python.quantecon.org/lqcontrol.html).
+
+Because that model is discounted, we apply the invariant-subspace method to the
+equivalent **undiscounted** system obtained from the transformed matrices
+$\hat A = \beta^{1/2} A$ and $\hat B = \beta^{1/2} B$.
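+
+(The change of variables $\hat x_t = \beta^{t/2} x_t$ and
+$\hat u_t = \beta^{t/2} u_t$ converts the discounted objective into an
+undiscounted one, since $\beta^t x_t' R x_t = \hat x_t' R \hat x_t$, while the
+law of motion becomes $\hat x_{t+1} = \hat A \hat x_t + \hat B \hat u_t$.)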
```{code-cell} ipython3
# Model parameters
r = 0.05
+β = 1 / (1 + r)
c_bar = 2
μ = 1
@@ -468,7 +473,7 @@ B = [[-1],
[0]]
# Construct an LQ instance
-lq = LQ(Q, R, A, B)
+lq = LQ(Q, R, A, B, beta=β)
```
Given matrices $A$, $B$, $Q$, $R$, we can then compute $L$, $N$, and $M=L^{-1}N$.
@@ -476,7 +481,7 @@ Given matrices $A$, $B$, $Q$, $R$, we can then compute $L$, $N$, and $M=L^{-1}N$
```{code-cell} ipython3
def construct_LNM(A, B, Q, R):
- n, k = lq.n, lq.k
+ n = A.shape[0]
# construct L and N
L = np.zeros((2*n, 2*n))
@@ -496,7 +501,10 @@ def construct_LNM(A, B, Q, R):
```
```{code-cell} ipython3
-L, N, M = construct_LNM(lq.A, lq.B, lq.Q, lq.R)
+A_bar = lq.A * lq.beta ** (1/2)
+B_bar = lq.B * lq.beta ** (1/2)
+
+L, N, M = construct_LNM(A_bar, B_bar, lq.Q, lq.R)
```
```{code-cell} ipython3
@@ -517,7 +525,7 @@ M @ J @ M.T - J
We can compute the eigenvalues of $M$ using `np.linalg.eigvals`, arranged in ascending order.
```{code-cell} ipython3
-eigvals = sorted(np.linalg.eigvals(M))
+eigvals = sorted(np.linalg.eigvals(M), key=lambda z: (abs(z), z.real, z.imag))
eigvals
```
@@ -529,18 +537,14 @@ When we apply Schur decomposition such that $M=V W V^{-1}$, we want
To get what we want, let's define a sorting function that tells `scipy.schur` to sort the corresponding eigenvalues with modulus smaller than 1 to the upper left.
```{code-cell} ipython3
-stable_eigvals = eigvals[:n]
+tol = 1e-10
def sort_fun(x):
- "Sort the eigenvalues with modules smaller than 1 to the top-left."
-
- if x in stable_eigvals:
- stable_eigvals.pop(stable_eigvals.index(x))
- return True
- else:
- return False
+ "Sort the eigenvalues with modulus smaller than 1 to the top-left."
+ return abs(x) < 1 - tol
-W, V, _ = schur(M, sort=sort_fun)
+W, V, stable_dim = schur(M, sort=sort_fun)
+stable_dim
```
```{code-cell} ipython3
@@ -584,25 +588,24 @@ def stable_solution(M, verbose=True):
The matrix represents the linear difference equations system.
"""
n = M.shape[0] // 2
- stable_eigvals = list(sorted(np.linalg.eigvals(M))[:n])
+ tol = 1e-10
def sort_fun(x):
- "Sort the eigenvalues with modules smaller than 1 to the top-left."
-
- if x in stable_eigvals:
- stable_eigvals.pop(stable_eigvals.index(x))
- return True
- else:
- return False
-
- W, V, _ = schur(M, sort=sort_fun)
+ "Sort the eigenvalues with modulus smaller than 1 to the top-left."
+ return abs(x) < 1 - tol
+
+ W, V, stable_dim = schur(M, sort=sort_fun)
+ if stable_dim != n:
+ raise ValueError(
+ f"Expected {n} stable eigenvalues inside the unit circle, found {stable_dim}."
+ )
if verbose:
print('eigenvalues:\n')
print(' W11: {}'.format(np.diag(W[:n, :n])))
print(' W22: {}'.format(np.diag(W[n:, n:])))
- # compute V21 V11^{-1}
- P = V[n:, :n] @ np.linalg.inv(V[:n, :n])
+ # compute V21 V11^{-1} without forming the inverse explicitly
+ P = np.linalg.solve(V[:n, :n].T, V[n:, :n].T).T
return W, V, P
@@ -761,11 +764,6 @@ For example, when $\beta=\frac{1}{1+r}$, we can solve for $P$ with $\hat{A}=\bet
These settings are adopted by default in the function `stationary_P` defined above.
-```{code-cell} ipython3
-β = 1 / (1 + r)
-lq.beta = β
-```
-
```{code-cell} ipython3
stationary_P(lq)
```
diff --git a/lectures/multivariate_normal.md b/lectures/multivariate_normal.md
index 7aacee6eb..96adcd647 100644
--- a/lectures/multivariate_normal.md
+++ b/lectures/multivariate_normal.md
@@ -3,8 +3,10 @@ jupytext:
text_representation:
extension: .md
format_name: myst
+ format_version: 0.13
+ jupytext_version: 1.17.1
kernelspec:
- display_name: Python 3
+ display_name: Python 3 (ipykernel)
language: python
name: python3
---
@@ -35,7 +37,7 @@ In this lecture, you will learn formulas for
* marginal distributions for all subvectors of $x$
* conditional distributions for subvectors of $x$ conditional on other subvectors of $x$
-We will use the multivariate normal distribution to formulate some useful models:
+We will use the multivariate normal distribution to formulate some useful models:
* a factor analytic model of an intelligence quotient, i.e., IQ
* a factor analytic model of two independent inherent abilities, say, mathematical and verbal.
@@ -44,7 +46,7 @@ We will use the multivariate normal distribution to formulate some useful model
* time series generated by linear stochastic difference equations
* optimal linear filtering theory
-## The Multivariate Normal Distribution
+## The multivariate normal distribution
This lecture defines a Python class `MultivariateNormal` to be used
to generate **marginal** and **conditional** distributions associated
@@ -58,13 +60,15 @@ For a multivariate normal distribution it is very convenient that
We apply our Python class to some examples.
-We use the following imports:
+We use the following imports:
-```{code-cell} ipython
+```{code-cell} ipython3
import matplotlib.pyplot as plt
import numpy as np
from numba import jit
import statsmodels.api as sm
+
+rng = np.random.default_rng(0)
```
Assume that an $N \times 1$ random vector $z$ has a
@@ -73,11 +77,11 @@ multivariate normal probability density.
This means that the probability density takes the form
$$
-f\left(z;\mu,\Sigma\right)=\left(2\pi\right)^{-\left(\frac{N}{2}\right)}\det\left(\Sigma\right)^{-\frac{1}{2}}\exp\left(-.5\left(z-\mu\right)^{\prime}\Sigma^{-1}\left(z-\mu\right)\right)
+f\left(z;\mu,\Sigma\right)=\left(2\pi\right)^{-\left(\frac{N}{2}\right)}\det\left(\Sigma\right)^{-\frac{1}{2}}\exp\left(-.5\left(z-\mu\right)^\top\Sigma^{-1}\left(z-\mu\right)\right)
$$
where $\mu=Ez$ is the mean of the random vector $z$ and
-$\Sigma=E\left(z-\mu\right)\left(z-\mu\right)^\prime$ is the
+$\Sigma=E\left(z-\mu\right)\left(z-\mu\right)^\top$ is the
covariance matrix of $z$.
The covariance matrix $\Sigma$ is symmetric and positive definite.
@@ -95,7 +99,7 @@ def f(z, μ, Σ):
μ: ndarray(float, dim=1 or 2)
the mean of z, N by 1
Σ: ndarray(float, dim=2)
- the covarianece matrix of z, N by 1
+ the covariance matrix of z, N by N
"""
z = np.atleast_2d(z)
@@ -155,7 +159,7 @@ $$
and covariance matrix
$$
-\hat{\Sigma}_{11}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}=\Sigma_{11}-\beta\Sigma_{22}\beta^{\prime}
+\hat{\Sigma}_{11}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}=\Sigma_{11}-\beta\Sigma_{22}\beta^\top
$$
where
@@ -186,7 +190,7 @@ class MultivariateNormal:
μ: ndarray(float, dim=1)
the mean of z, N by 1
Σ: ndarray(float, dim=2)
- the covarianece matrix of z, N by 1
+ the covariance matrix of z, N by N
Arguments
---------
@@ -262,7 +266,7 @@ squares regressions.
We’ll compare those linear least squares regressions for the simulated
data to their population counterparts.
-## Bivariate Example
+## Bivariate example
We start with a bivariate normal distribution pinned down by
@@ -296,7 +300,7 @@ Let's illustrate the fact that you _can regress anything on anything else_.
We have computed everything we need to compute two regression lines, one of $z_2$ on $z_1$, the other of $z_1$ on $z_2$.
-We'll represent these regressions as
+We'll represent these regressions as
$$
z_1 = a_1 + b_1 z_2 + \epsilon_1
@@ -320,17 +324,17 @@ $$
E \epsilon_2 z_1 = 0
$$
-Let's compute $a_1, a_2, b_1, b_2$.
+Let's compute $a_1, a_2, b_1, b_2$.
```{code-cell} python3
-beta = multi_normal.βs
+β = multi_normal.βs
-a1 = μ[0] - beta[0]*μ[1]
-b1 = beta[0]
+a1 = μ[0] - β[0]*μ[1]
+b1 = β[0]
-a2 = μ[1] - beta[1]*μ[0]
-b2 = beta[1]
+a2 = μ[1] - β[1]*μ[0]
+b2 = β[1]
```
Let's print out the intercepts and slopes.
@@ -356,7 +360,12 @@ Now let's plot the two regression lines and stare at them.
```{code-cell} python3
-
+---
+mystnb:
+ figure:
+ caption: two regressions
+ name: fig-two-regressions
+---
z2 = np.linspace(-4,4,100)
@@ -383,14 +392,13 @@ ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
plt.ylabel('$z_1$', loc = 'top')
plt.xlabel('$z_2$,', loc = 'right')
-plt.title('two regressions')
plt.plot(z2,z1, 'r', label = "$z_1$ on $z_2$")
plt.plot(z2,z1h, 'b', label = "$z_2$ on $z_1$")
plt.legend()
plt.show()
```
-The red line is the expectation of $z_1$ conditional on $z_2$.
+The red line is the expectation of $z_1$ conditional on $z_2$.
The intercept and slope of the red line are
@@ -410,7 +418,7 @@ print("1/b2 = ", 1/b2)
We can use these regression lines or our code to compute conditional expectations.
-Let's compute the mean and variance of the distribution of $z_2$
+Let's compute the mean and variance of the distribution of $z_2$
conditional on $z_1=5$.
After that we'll reverse what are on the left and right sides of the regression.
@@ -468,7 +476,7 @@ of $\epsilon$ will converge to $\hat{\Sigma}_1$.
n = 1_000_000 # sample size
# simulate multivariate normal random vectors
-data = np.random.multivariate_normal(μ, Σ, size=n)
+data = rng.multivariate_normal(μ, Σ, size=n)
z1_data = data[:, 0]
z2_data = data[:, 1]
@@ -502,17 +510,17 @@ Thus, in each case, for our very large sample size, the sample analogues
closely approximate their population counterparts.
A Law of Large
-Numbers explains why sample analogues approximate population objects.
+Numbers explains why sample analogues approximate population objects.
-## Trivariate Example
+## Trivariate example
Let’s apply our code to a trivariate example.
We’ll specify the mean vector and the covariance matrix as follows.
```{code-cell} python3
-μ = np.random.random(3)
-C = np.random.random((3, 3))
+μ = rng.random(3)
+C = rng.random((3, 3))
Σ = C @ C.T # positive semi-definite
multi_normal = MultivariateNormal(μ, Σ)
@@ -539,7 +547,7 @@ z2 = np.array([2., 5.])
```{code-cell} python3
n = 1_000_000
-data = np.random.multivariate_normal(μ, Σ, size=n)
+data = rng.multivariate_normal(μ, Σ, size=n)
z1_data = data[:, :k]
z2_data = data[:, k:]
```
@@ -568,7 +576,7 @@ multi_normal.βs[0], results.params
Once again, sample analogues do a good job of approximating their
-populations counterparts.
+population counterparts.
-## One Dimensional Intelligence (IQ)
+## One dimensional intelligence (IQ)
Let’s move closer to a real-life example, namely, inferring a
one-dimensional measure of intelligence called IQ from a list of test
@@ -708,7 +716,7 @@ $\theta$ conditional on our test scores.
Let’s do that and then print out some pertinent quantities.
```{code-cell} python3
-x = np.random.multivariate_normal(μ_IQ, Σ_IQ)
+x = rng.multivariate_normal(μ_IQ, Σ_IQ)
y = x[:-1] # test scores
θ = x[-1] # IQ
```
@@ -723,7 +731,7 @@ conditional normal distribution of the IQ $\theta$.
In the following code, `ind` sets the variables on the right side of the regression.
-Given the way we have defined the vector $X$, we want to set `ind=1` in order to make $\theta$ the left side variable in the
+Given the way we have defined the vector $X$, we want to set `ind=1` in order to make $\theta$ the left side variable in the
population regression.
```{code-cell} python3
@@ -809,9 +817,9 @@ Thus, each $y_{i}$ adds information about $\theta$.
If we were to drive the number of tests $n \rightarrow + \infty$, the
conditional standard deviation $\hat{\sigma}_{\theta}$ would
-converge to $0$ at rate $\frac{1}{n^{.5}}$.
+converge to $0$ at rate $\frac{1}{\sqrt{n}}$.
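+
+Let's verify that rate numerically, reusing the `construct_moments_IQ` helper and the `MultivariateNormal` class from above; because the posterior variance does not depend on the realized scores, we can condition on zeros.
+
+```{code-cell} python3
+# quadrupling n should roughly halve the posterior standard deviation
+ns = [10, 40, 160, 640]
+σ_hats = []
+for n_i in ns:
+    μ_n, Σ_n, _ = construct_moments_IQ(n_i, 100., 10., 10.)
+    mn_n = MultivariateNormal(μ_n, Σ_n)
+    mn_n.partition(n_i)
+    _, Σθ_n = mn_n.cond_dist(1, np.zeros(n_i))
+    σ_hats.append(np.sqrt(Σθ_n[0, 0]))
+
+print(np.array(σ_hats[:-1]) / np.array(σ_hats[1:]))   # each ratio ≈ 2
+```
+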
-## Information as Surprise
+## Information as surprise
By using a different representation, let’s look at things from a
different perspective.
@@ -826,13 +834,13 @@ where $C$ is a lower triangular **Cholesky factor** of
$\Sigma$ so that
$$
-\Sigma \equiv DD^{\prime} = C C^\prime
+\Sigma \equiv DD^\top = C C^\top
$$
and
$$
-E \epsilon \epsilon' = I .
+E \epsilon \epsilon^\top = I .
$$
It follows that
@@ -926,13 +934,13 @@ np.max(np.abs(μθ_hat_arr - μθ_hat_arr_C)) < 1e-10
np.max(np.abs(Σθ_hat_arr - Σθ_hat_arr_C)) < 1e-10
```
-## Cholesky Factor Magic
+## Cholesky factor magic
-Evidently, the Cholesky factorizations automatically computes the
+Evidently, the Cholesky factorization automatically computes the
population **regression coefficients** and associated statistics
that are produced by our `MultivariateNormal` class.
-The Cholesky factorization computes these things **recursively**.
+The Cholesky factorization computes these things **recursively**.
Indeed, in formula {eq}`mnv_1`,
@@ -942,7 +950,7 @@ Indeed, in formula {eq}`mnv_1`,
- the coefficient $c_i$ is the simple population regression
coefficient of $\theta - \mu_\theta$ on $\epsilon_i$
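+
+Here is a minimal bivariate sketch of that recursive property: with
+$\Sigma = \begin{bmatrix} 1 & \rho \cr \rho & 1 \end{bmatrix}$, the
+subdiagonal ratio of the lower-triangular Cholesky factor equals the
+population regression coefficient of $z_2$ on $z_1$.
+
+```{code-cell} python3
+ρ_demo = 0.6
+Σ_demo = np.array([[1., ρ_demo], [ρ_demo, 1.]])
+L_demo = np.linalg.cholesky(Σ_demo)   # lower triangular factor
+
+# the ratio reproduces the population regression coefficient ρ
+print(L_demo[1, 0] / L_demo[0, 0], ρ_demo)
+```
+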
-## Math and Verbal Intelligence
+## Math and verbal intelligence
We can alter the preceding example to be more realistic.
@@ -997,7 +1005,7 @@ w_{6}
$$
where
-$w \begin{bmatrix} w_1 \cr w_2 \cr \vdots \cr w_6 \end{bmatrix}$
+$w = \begin{bmatrix} w_1 \cr w_2 \cr \vdots \cr w_6 \end{bmatrix}$
is a standard normal random vector.
We construct a Python function `construct_moments_IQ2d` to construct
@@ -1038,7 +1046,7 @@ n = 2
```{code-cell} python3
# take one draw
-x = np.random.multivariate_normal(μ_IQ2d, Σ_IQ2d)
+x = rng.multivariate_normal(μ_IQ2d, Σ_IQ2d)
y1 = x[:n]
y2 = x[n:2*n]
θ = x[2*n]
@@ -1060,7 +1068,7 @@ multi_normal_IQ2d.partition(k)
multi_normal_IQ2d.cond_dist(1, [*y1, *y2])
```
-Now let’s compute distributions of $\theta$ and $\mu$
+Now let’s compute distributions of $\theta$ and $\eta$
separately conditional on various subsets of test scores.
It will be fun to compare outcomes with the help of an auxiliary function
@@ -1093,10 +1101,10 @@ for indices, IQ, conditions in [([*range(2*n), 2*n], 'θ', 'y1, y2, y3, y4'),
f'{μ_hat[0]:1.2f} and {Σ_hat[0, 0]:1.2f} respectively')
```
-Evidently, math tests provide no information about $\mu$ and
-language tests provide no information about $\eta$.
+Evidently, math tests provide no information about $\eta$ and
+language tests provide no information about $\theta$.
-## Univariate Time Series Analysis
+## Univariate time series analysis
We can use the multivariate normal distribution and a little matrix
algebra to present foundations of univariate linear time series
@@ -1108,7 +1116,7 @@ Consider the following model:
$$
\begin{aligned}
-x_0 & \sim N\left(0, \sigma_0^2\right) \\
+x_0 & \sim N\left(0, \sigma_0^2\right) \\
x_{t+1} & = a x_{t} + b w_{t+1}, \quad w_{t+1} \sim N\left(0, 1\right), t \geq 0 \\
y_{t} & = c x_{t} + d v_{t}, \quad v_{t} \sim N\left(0, 1\right), t \geq 0
\end{aligned}
@@ -1164,7 +1172,7 @@ $c$ and $d$ as diagonal respectively.
Consequently, the covariance matrix of $Y$ is
$$
-\Sigma_{y} = E Y Y^{\prime} = C \Sigma_{x} C^{\prime} + D D^{\prime}
+\Sigma_{y} = E Y Y^\top = C \Sigma_{x} C^\top + D D^\top
$$
By stacking $X$ and $Y$, we can write
@@ -1179,8 +1187,8 @@ $$
and
$$
-\Sigma_{z} = EZZ^{\prime}=\left[\begin{array}{cc}
-\Sigma_{x} & \Sigma_{x}C^{\prime}\\
+\Sigma_{z} = EZZ^\top=\left[\begin{array}{cc}
+\Sigma_{x} & \Sigma_{x}C^\top\\
C\Sigma_{x} & \Sigma_{y}
\end{array}\right]
$$
@@ -1255,13 +1263,13 @@ This is going to be very useful for doing the conditioning to be used in
the fun exercises below.
```{code-cell} python3
-z = np.random.multivariate_normal(μz, Σz)
+z = rng.multivariate_normal(μz, Σz)
x = z[:T+1]
y = z[T+1:]
```
-### Smoothing Example
+### Smoothing example
This is an instance of a classic `smoothing` calculation whose purpose
is to compute $E X \mid Y$.
@@ -1295,7 +1303,7 @@ print(" E [ X | Y] = ", )
multi_normal_ex1.cond_dist(0, y)
```
-### Filtering Exercise
+### Filtering exercise
Compute $E\left[x_{t} \mid y_{t-1}, y_{t-2}, \dots, y_{0}\right]$.
@@ -1338,7 +1346,7 @@ sub_y = y[:t]
multi_normal_ex2.cond_dist(0, sub_y)
```
-### Prediction Exercise
+### Prediction exercise
Compute $E\left[y_{t} \mid y_{t-j}, \dots, y_{0} \right]$.
@@ -1378,10 +1386,10 @@ sub_y = y[:t-j+1]
multi_normal_ex3.cond_dist(0, sub_y)
```
-### Constructing a Wold Representation
+### Constructing a Wold representation
Now we’ll apply Cholesky decomposition to decompose
-$\Sigma_{y}=H H^{\prime}$ and form
+$\Sigma_{y}=H H^\top$ and form
$$
\epsilon = H^{-1} Y.
@@ -1412,12 +1420,12 @@ y
This example is an instance of what is known as a **Wold representation** in time series analysis.
-## Stochastic Difference Equation
+## Stochastic difference equation
Consider the stochastic second-order linear difference equation
$$
-y_{t} = \alpha_{0} + \alpha_{1} y_{y-1} + \alpha_{2} y_{t-2} + u_{t}
+y_{t} = \alpha_{0} + \alpha_{1} y_{t-1} + \alpha_{2} y_{t-2} + u_{t}
$$
where $u_{t} \sim N \left(0, \sigma_{u}^{2}\right)$ and
@@ -1474,8 +1482,8 @@ We have
$$
\begin{aligned}
\mu_{y} = A^{-1} \mu_{b} \\
-\Sigma_{y} &= A^{-1} E \left[\left(b - \mu_{b} + u \right) \left(b - \mu_{b} + u \right)^{\prime}\right] \left(A^{-1}\right)^{\prime} \\
- &= A^{-1} \left(\Sigma_{b} + \Sigma_{u} \right) \left(A^{-1}\right)^{\prime}
+\Sigma_{y} &= A^{-1} E \left[\left(b - \mu_{b} + u \right) \left(b - \mu_{b} + u \right)^\top\right] \left(A^{-1}\right)^\top \\
+ &= A^{-1} \left(\Sigma_{b} + \Sigma_{u} \right) \left(A^{-1}\right)^\top
\end{aligned}
$$
@@ -1493,7 +1501,7 @@ $$
$$
\Sigma_{b}=\left[\begin{array}{cc}
-C\Sigma_{\tilde{y}}C^{\prime} & \boldsymbol{0}_{N-2\times N-2}\\
+C\Sigma_{\tilde{y}}C^\top & \boldsymbol{0}_{N-2\times N-2}\\
\boldsymbol{0}_{N-2\times2} & \boldsymbol{0}_{N-2\times N-2}
\end{array}\right],\quad C=\left[\begin{array}{cc}
\alpha_{2} & \alpha_{1}\\
@@ -1512,7 +1520,6 @@ $$
```{code-cell} python3
# set parameters
-T = 80
T = 160
# coefficients of the second order difference equation
𝛼0 = 10
@@ -1520,7 +1527,6 @@ T = 160
𝛼2 = -.9
# variance of u
-σu = 1.
σu = 10.
# distribution of y_{-1} and y_{0}
@@ -1529,7 +1535,7 @@ T = 160
```
```{code-cell} python3
-# construct A and A^{\prime}
+# construct A and A^\top
A = np.zeros((T, T))
for i in range(T):
@@ -1565,7 +1571,7 @@ C = np.array([[𝛼2, 𝛼1], [0, 𝛼2]])
Σy = A_inv @ (Σb + Σu) @ A_inv.T
```
-## Application to Stock Price Model
+## Application to stock price model
Let
@@ -1602,7 +1608,7 @@ we have
$$
\begin{aligned}
\mu_{p} = B \mu_{y} \\
-\Sigma_{p} = B \Sigma_{y} B^{\prime}
+\Sigma_{p} = B \Sigma_{y} B^\top
\end{aligned}
$$
@@ -1639,7 +1645,7 @@ $$
$$
$$
-\Sigma_{z}=D\Sigma_{y}D^{\prime}
+\Sigma_{z}=D\Sigma_{y}D^\top
$$
```{code-cell} python3
@@ -1656,7 +1662,7 @@ conditional mean $E \left[p_{t} \mid y_{t-1}, y_{t}\right]$ using
the `MultivariateNormal` class.
```{code-cell} python3
-z = np.random.multivariate_normal(μz, Σz)
+z = rng.multivariate_normal(μz, Σz)
y, p = z[:T], z[T:]
```
@@ -1688,12 +1694,12 @@ plt.show()
In the above graph, the green line is what the price of the stock would
be if people had perfect foresight about the path of dividends while the
-green line is the conditional expectation $E p_t | y_t, y_{t-1}$, which is what the price would
+red line is the conditional expectation $E p_t | y_t, y_{t-1}$, which is what the price would
be if people did not have perfect foresight but were optimally
predicting future dividends on the basis of the information
$y_t, y_{t-1}$ at time $t$.
-## Filtering Foundations
+## Filtering foundations
Assume that $x_0$ is an $n \times 1$ random vector and that
$y_0$ is a $p \times 1$ random vector determined by the
@@ -1711,7 +1717,7 @@ We consider the problem of someone who
* *observes* $y_0$
* does not observe $x_0$,
-* knows $\hat x_0, \Sigma_0, G, R$ and therefore the joint probability distribution of the vector $\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$
+* knows $\hat x_0, \Sigma_0, G, R$ and therefore the joint probability distribution of the vector $\begin{bmatrix} x_0 \cr y_0 \end{bmatrix}$
* wants to infer $x_0$ from $y_0$ in light of what he knows about that
joint probability distribution.
@@ -1728,7 +1734,7 @@ $$
G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix}
$$
-By applying an appropriate instance of the above formulas for the mean vector $\hat \mu_1$ and covariance matrix
+By applying an appropriate instance of the above formulas for the mean vector $\hat \mu_1$ and covariance matrix
$\hat \Sigma_{11}$ of $z_1$ conditional on $z_2$, we find that the probability distribution of
$x_0$ conditional on $y_0$ is
${\mathcal N}(\tilde x_0, \tilde \Sigma_0)$ where
@@ -1834,7 +1840,7 @@ of $x_t$ conditional on
$y_0, y_1, \ldots , y_{t-1} = y^{t-1}$ is
$$
-x_t | y^{t-1} \sim {\mathcal N}(A \tilde x_t , A \tilde \Sigma_t A' + C C' )
+x_t | y^{t-1} \sim {\mathcal N}(A \tilde x_{t-1} , A \tilde \Sigma_{t-1} A' + C C' )
$$
where $\{\tilde x_t, \tilde \Sigma_t\}_{t=1}^\infty$ can be
@@ -1858,7 +1864,7 @@ $$
\Sigma_{t+1}= C C' + A \Sigma_t A' - A \Sigma_t G' (G \Sigma_t G' +R)^{-1} G \Sigma_t A' .
$$
-This is a matrix Riccati difference equation that is closely related to another matrix Riccati difference equation that appears in a quantecon lecture on the basics of linear quadratic control theory.
+This is a matrix Riccati difference equation that is closely related to another matrix Riccati difference equation that appears in a quantecon lecture on the basics of linear quadratic control theory.
That equation has the form
@@ -1874,7 +1880,7 @@ P_{t-1} =R + A' P_t A - A' P_t B
Stare at the two preceding equations for a moment or two, the first being a matrix difference equation for a conditional covariance matrix, the
-second being a matrix difference equation in the matrix appearing in a quadratic form for an intertemporal cost of value function.
+second being a matrix difference equation in the matrix appearing in a quadratic form for an intertemporal cost or value function.
-Although the two equations are not identical, they display striking family resemblences.
+Although the two equations are not identical, they display striking family resemblances.
* the first equation tells dynamics that work **forward** in time
* the second equation tells dynamics that work **backward** in time
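+
+To make the forward equation concrete, here is a small sketch that iterates it to an approximate fixed point; the parameter values are chosen only for this illustration.
+
+```{code-cell} python3
+# iterate Σ_{t+1} = CC' + A Σ_t A' - A Σ_t G'(G Σ_t G' + R)^{-1} G Σ_t A'
+A_r = np.array([[0.9, 0.], [0., 0.5]])
+C_r = np.array([[1.], [1.]])
+G_r = np.array([[1., 0.]])
+R_r = np.array([[1.]])
+
+Σ_r = np.eye(2)
+for _ in range(200):
+    Σ_r = (C_r @ C_r.T + A_r @ Σ_r @ A_r.T
+           - A_r @ Σ_r @ G_r.T
+           @ np.linalg.inv(G_r @ Σ_r @ G_r.T + R_r)
+           @ G_r @ Σ_r @ A_r.T)
+
+print(Σ_r)   # approximate steady-state covariance matrix
+```
+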
@@ -1895,7 +1901,7 @@ G = np.array([[1., 3.]])
R = np.array([[1.]])
x0_hat = np.array([0., 1.])
-Σ0 = np.array([[1., .5], [.3, 2.]])
+Σ0 = np.array([[1., .5], [.5, 2.]])
μ = np.hstack([x0_hat, G @ x0_hat])
Σ = np.block([[Σ0, Σ0 @ G.T], [G @ Σ0, G @ Σ0 @ G.T + R]])
@@ -1929,7 +1935,7 @@ x1_cond = A @ μ1_hat
x1_cond, Σ1_cond
```
-### Code for Iterating
+### Code for iterating
Here is code for solving a dynamic filtering problem by iterating on our
equations, followed by an example.
@@ -1970,12 +1976,12 @@ iterate(x0_hat, Σ0, A, C, G, R, [2.3, 1.2, 3.2])
The iterative algorithm just described is a version of the celebrated **Kalman filter**.
-We describe the Kalman filter and some applications of it in {doc}`A First Look at the Kalman Filter `
+We describe the Kalman filter and some applications of it in {doc}`A First Look at the Kalman Filter `
-## Classic Factor Analysis Model
+## Classic factor analysis model
-The factor analysis model widely used in psychology and other fields can
+The factor analysis model can
be represented as
$$
@@ -1985,31 +1991,31 @@ $$
where
-1. $Y$ is $n \times 1$ random vector,
+1. $Y$ is an $n \times 1$ random vector,
- $E U U^{\prime} = D$ is a diagonal matrix,
-1. $\Lambda$ is $n \times k$ coefficient matrix,
-1. $f$ is $k \times 1$ random vector,
- $E f f^{\prime} = I$,
-1. $U$ is $n \times 1$ random vector, and $U \perp f$ (i.e., $E U f' = 0 $ )
-1. It is presumed that $k$ is small relative to $n$; often
+ $E U U^\top = D$ is a diagonal matrix,
+2. $\Lambda$ is an $n \times k$ coefficient matrix,
+3. $f$ is a $k \times 1$ random vector,
+ $E f f^\top = I$,
+4. $U$ is an $n \times 1$ random vector, and $U \perp f$ (i.e., $E U f^\top = 0$)
+5. It is presumed that $k$ is small relative to $n$; often
$k$ is only $1$ or $2$, as in our IQ examples.
This implies that
$$
\begin{aligned}
-\Sigma_y = E Y Y^{\prime} = \Lambda \Lambda^{\prime} + D \\
-E Y f^{\prime} = \Lambda \\
-E f Y^{\prime} = \Lambda^{\prime}
+\Sigma_y = E Y Y^\top = \Lambda \Lambda^\top + D \\
+E Y f^\top = \Lambda \\
+E f Y^\top = \Lambda^\top
\end{aligned}
$$
Thus, the covariance matrix $\Sigma_Y$ is the sum of a diagonal
matrix $D$ and a positive semi-definite matrix
-$\Lambda \Lambda^{\prime}$ of rank $k$.
+$\Lambda \Lambda^\top$ of rank $k$.
This means that all covariances among the $n$ components of the
$Y$ vector are intermediated by their common dependencies on the
-$k<$ factors.
+$k$ factors.
Form
@@ -2024,9 +2030,9 @@ the covariance matrix of the expanded random vector $Z$ can be
computed as
$$
-\Sigma_{z} = EZZ^{\prime}=\left(\begin{array}{cc}
-I & \Lambda^{\prime}\\
-\Lambda & \Lambda\Lambda^{\prime}+D
+\Sigma_{z} = EZZ^\top=\left(\begin{array}{cc}
+I & \Lambda^\top\\
+\Lambda & \Lambda\Lambda^\top+D
\end{array}\right)
$$
@@ -2093,7 +2099,7 @@ $Z$.
```
```{code-cell} python3
-z = np.random.multivariate_normal(μz, Σz)
+z = rng.multivariate_normal(μz, Σz)
f = z[:k]
y = z[k:]
@@ -2113,7 +2119,7 @@ multi_normal_factor.cond_dist(0, y)
We can verify that the conditional mean
$E \left[f \mid Y=y\right] = B Y$ where
-$B = \Lambda^{\prime} \Sigma_{y}^{-1}$.
+$B = \Lambda^\top \Sigma_{y}^{-1}$.
```{code-cell} python3
B = Λ.T @ np.linalg.inv(Σy)
@@ -2134,7 +2140,7 @@ $\Lambda I^{-1} f = \Lambda f$.
Λ @ f
```
-## PCA and Factor Analysis
+## PCA and factor analysis
To learn about Principal Components Analysis (PCA), please see this lecture {doc}`Singular Value Decompositions `.
@@ -2144,8 +2150,9 @@ model.
-Technically, this means that the PCA model is misspecified. (Can you
-explain why?)
+Technically, this means that the PCA model is misspecified.
+
+(Can you explain why?)
Nevertheless, this exercise will let us study how well the first two
principal components from a PCA can approximate the conditional
@@ -2156,7 +2163,7 @@ governs the data on $Y$ we have generated.
So we compute the PCA decomposition
$$
-\Sigma_{y} = P \tilde{\Lambda} P^{\prime}
+\Sigma_{y} = P \tilde{\Lambda} P^\top
$$
where $\tilde{\Lambda}$ is a diagonal matrix.
@@ -2170,7 +2177,7 @@ $$
and
$$
-\epsilon = P^\prime Y
+\epsilon = P^\top Y
$$
Note that we will arrange the eigenvectors in $P$ in the
@@ -2246,51 +2253,144 @@ Let’s look at them, after which we’ll look at $E f | y = B y$
B @ y
```
-The fraction of variance in $y_{t}$ explained by the first two
-principal components can be computed as below.
+```{note}
+The two largest eigenvalues are both $5.25$ in this example.
+
+When an
+eigenvalue is repeated, the associated principal components are not
+individually pinned down: any orthonormal basis for the same
+two-dimensional eigenspace is valid.
+
+For that reason, it is not meaningful to compare $\epsilon_1$ and
+$\epsilon_2$ component-by-component with $E[f \mid Y]$.
+
+The PC scores
+live in a PCA coordinate system, while $E[f \mid Y]$ lives in factor
+space.
+
+Even within the common two-dimensional subspace, the PCA basis can
+be rotated or sign-flipped, and its coordinates need not use the same
+scaling as the factor coordinates.
+
+What is uniquely determined is the two-dimensional subspace spanned by
+the first two columns of $P$.
+
+In this symmetric example, that subspace is
+exactly the column space of $\Lambda$.
+```
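+
+We can verify that last claim numerically by comparing orthogonal projections onto the two subspaces, reusing the `P` and `Λ` computed above.
+
+```{code-cell} python3
+# the span of the first two PCs equals the column space of Λ
+# exactly when the two orthogonal projection matrices coincide
+proj_pca = P[:, :2] @ P[:, :2].T
+proj_Λ = Λ @ np.linalg.inv(Λ.T @ Λ) @ Λ.T
+print(np.allclose(proj_pca, proj_Λ))
+```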
+
+The fraction of variance in $y_t$ explained by the first two principal
+components is
```{code-cell} python3
𝜆_tilde[:2].sum() / 𝜆_tilde.sum()
```
-Compute
+To compare PCA with the factor model in observation space, compute
$$
\hat{Y} = P_{j} \epsilon_{j} + P_{k} \epsilon_{k}
$$
-where $P_{j}$ and $P_{k}$ correspond to the largest two
-eigenvalues.
+where $P_j$ and $P_k$ are the eigenvectors associated with the two
+largest eigenvalues.
```{code-cell} python3
y_hat = P[:, :2] @ ε[:2]
```
-In this example, it turns out that the projection $\hat{Y}$ of
-$Y$ on the first two principal components does a good job of
-approximating $Ef \mid y$.
+$\hat{Y}$ is the rank-2 PCA approximation to $Y$ in observation space,
+so it is a 10-vector rather than a 2-vector.
+
+The natural observation-space
+counterpart from the factor model is $\Lambda E[f \mid Y]$, which is
+also a 10-vector.
+
+In this symmetric example, both vectors lie in the same two-dimensional
+subspace, namely the column space of $\Lambda$.
+
+They are therefore close,
+but not identical.
+
+The PCA reconstruction uses the block means directly,
+while $\Lambda E[f \mid Y]$ shrinks those block means toward zero by the
+factor $5/(5+\sigma_u^2) \approx 0.952$.
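+
+A quick check of this shrinkage factor on the current draw, reusing `B`, `y`, and `N` from above:
+
+```{code-cell} python3
+block_means = np.array([y[:N//2].mean(), y[N//2:].mean()])
+print((B @ y) / block_means)   # both entries ≈ 5/5.25 ≈ 0.952
+```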
+
+The next plot makes this comparison concrete.
+
+The two scatter plots, $E[Y \mid f] = \Lambda f$ and $\hat{Y}$, are both
+10-vectors in observation space, so they can be compared directly.
+
+The horizontal lines show the factor values $f_1$ and $f_2$, together
+with their posterior means $E[f_i \mid Y]$.
+
+These are 2-dimensional
+factor-space quantities, drawn over the relevant half of the index set to
+match the block structure of $\Lambda$.
+
+This uses the same idea as the earlier formula
+$E[Y \mid f] = \Lambda f$: the matrix $\Lambda$ maps a 2-vector in factor
+space into a 10-vector in observation space.
-We confirm this in the following plot of $f$,
-$E y \mid f$, $E f \mid y$, and $\hat{y}$ on the
-coordinate axis versus $y$ on the ordinate axis.
+In our example,
+
+$$
+\Lambda a
+=
+\begin{bmatrix}
+a_1 \\
+\vdots \\
+a_1 \\
+a_2 \\
+\vdots \\
+a_2
+\end{bmatrix}
+\quad \text{for any } a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix},
+$$
+
+because the first five rows of $\Lambda$ are $(1,0)$ and the last five
+rows are $(0,1)$.
+
+Therefore, once we observe $Y=y$, the posterior mean
+$E[f \mid Y=y] = \begin{bmatrix} E[f_1 \mid y] \\ E[f_2 \mid y] \end{bmatrix}$
+is converted into the observation-space vector
+
+$$
+\Lambda E[f \mid Y=y]
+=
+\begin{bmatrix}
+E[f_1 \mid y] \\
+\vdots \\
+E[f_1 \mid y] \\
+E[f_2 \mid y] \\
+\vdots \\
+E[f_2 \mid y]
+\end{bmatrix}.
+$$
+
+So the horizontal line at height $E[f_1 \mid y]$ over the first five
+indices, together with the horizontal line at height $E[f_2 \mid y]$
+over the last five indices, is exactly a picture of
+$\Lambda E[f \mid Y=y]$.
```{code-cell} python3
-plt.scatter(range(N), Λ @ f, label='$Ey|f$')
-plt.scatter(range(N), y_hat, label=r'$\hat{y}$')
+plt.scatter(range(N), Λ @ f, label=r'$E[Y \mid f]$')
+plt.scatter(range(N), y_hat, label=r'$\hat{Y}$')
plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$')
plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$')
Efy = B @ y
-plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label='$Ef_{1}|y$')
-plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label='$Ef_{2}|y$')
+plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label=r'$E[f_1 \mid y]$')
+plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label=r'$E[f_2 \mid y]$')
plt.legend()
plt.show()
```
-The covariance matrix of $\hat{Y}$ can be computed by first
-constructing the covariance matrix of $\epsilon$ and then use the
-upper left block for $\epsilon_{1}$ and $\epsilon_{2}$.
+To compute the covariance matrix of $\hat{Y}$, first form the covariance
+matrix of $\epsilon$ and then extract the upper-left block corresponding
+to $\epsilon_1$ and $\epsilon_2$.
```{code-cell} python3
Σεjk = (P.T @ Σy @ P)[:2, :2]
@@ -2300,3 +2400,439 @@ Pjk = P[:, :2]
Σy_hat = Pjk @ Σεjk @ Pjk.T
print('Σy_hat = \n', Σy_hat)
```
+
+## Exercises
+
+```{exercise}
+:label: mv_normal_ex1
+
+**Verify conditional mean and variance by simulation**
+
+For the bivariate normal with
+
+$$
+\mu = \begin{bmatrix} 0.5 \\ 1.0 \end{bmatrix}, \quad
+\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}
+$$
+
+fix $z_2 = 2$.
+
+1. Use `MultivariateNormal` to compute the analytical conditional mean
+$\hat{\mu}_1$ and variance $\hat{\Sigma}_{11}$ of $z_1 \mid z_2 = 2$.
+
+1. Draw $10^6$ samples from the joint distribution.
+
+ Retain only those for which $|z_2 - 2| < 0.05$.
+
+ Compute the sample mean and variance of the retained $z_1$ values.
+
+1. Confirm that the sample estimates are close to the analytical values.
+```
+
+```{solution-start} mv_normal_ex1
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} python3
+μ = np.array([.5, 1.])
+Σ = np.array([[1., .5], [.5, 1.]])
+
+mn = MultivariateNormal(μ, Σ)
+mn.partition(1)
+μ1_hat, Σ11_hat = mn.cond_dist(0, np.array([2.]))
+print(f"Analytical μ1_hat = {μ1_hat[0]:.4f}, Σ11_hat = {Σ11_hat[0,0]:.4f}")
+
+n = 1_000_000
+data = rng.multivariate_normal(μ, Σ, size=n)
+z1_all, z2_all = data[:, 0], data[:, 1]
+
+mask = np.abs(z2_all - 2.) < 0.05
+z1_cond = z1_all[mask]
+print(f"Sample size in band: {mask.sum()}")
+print(f"Sample μ1_hat = {np.mean(z1_cond):.4f}, Σ11_hat = {np.var(z1_cond, ddof=1):.4f}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: mv_normal_ex2
+
+**Product of regression slopes equals squared correlation**
+
+For a bivariate normal with standard deviations $\sigma_1 = \sigma_2 = 1$ and
+correlation $\rho$, show analytically that $b_1 b_2 = \rho^2$, where
+$b_1$ is the slope of $z_1$ on $z_2$ and $b_2$ is the slope of $z_2$
+on $z_1$.
+
+Then verify numerically for $\rho \in \{0.2, 0.5, 0.9\}$ that
+`βs[0] * βs[1]` $= \rho^2$ by constructing the appropriate
+`MultivariateNormal` instances.
+```
+
+```{solution-start} mv_normal_ex2
+:class: dropdown
+```
+
+The regression slopes are
+
+$$
+b_1 = \frac{\Sigma_{12}}{\Sigma_{22}} = \frac{\rho \sigma_1 \sigma_2}{\sigma_2^2}
+= \rho \frac{\sigma_1}{\sigma_2}, \qquad
+b_2 = \frac{\Sigma_{21}}{\Sigma_{11}} = \rho \frac{\sigma_2}{\sigma_1}
+$$
+
+so $b_1 b_2 = \rho^2$.
+
+```{code-cell} python3
+for ρ in [0.2, 0.5, 0.9]:
+ Σ = np.array([[1., ρ], [ρ, 1.]])
+ mn = MultivariateNormal(np.zeros(2), Σ)
+ mn.partition(1)
+ product = mn.βs[0].item() * mn.βs[1].item()
+ print(f"ρ={ρ:.1f}: b1*b2 = {product:.4f}")
+ print(f"ρ^2 = {ρ**2:.4f}, match: {np.isclose(product, ρ**2)}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: mv_normal_ex3
+
+**IQ inference: effect of the signal-to-noise ratio**
+
+Using the one-dimensional IQ model with $n = 50$ test scores and
+$\mu_\theta = 100$, $\sigma_\theta = 10$:
+
+1. Vary the test-score noise $\sigma_y \in \{1, 5, 10, 20, 50\}$.
+
+   - For each value, plot the posterior standard deviation
+     $\hat{\sigma}_\theta$ as a function of the number of test scores
+     included (from 1 to 50), with all curves on the same axes.
+
+2. Explain intuitively why a larger $\sigma_y$ leads to a slower
+decline of posterior uncertainty.
+```
+
+```{solution-start} mv_normal_ex3
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} python3
+n_max = 50
+μθ_val, σθ_val = 100., 10.
+
+fig, ax = plt.subplots()
+for σy_val in [1., 5., 10., 20., 50.]:
+ σθ_hat_arr = np.empty(n_max)
+ for i in range(1, n_max + 1):
+ μ_i, Σ_i, _ = construct_moments_IQ(i, μθ_val, σθ_val, σy_val)
+ mn_i = MultivariateNormal(μ_i, Σ_i)
+ mn_i.partition(i)
+ _, Σθ_i = mn_i.cond_dist(1, np.zeros(i))
+ σθ_hat_arr[i - 1] = np.sqrt(Σθ_i[0, 0])
+ ax.plot(range(1, n_max + 1), σθ_hat_arr, label=f'σy={σy_val:.0f}')
+
+ax.set_xlabel('number of test scores')
+ax.set_ylabel(r'posterior $\hat{\sigma}_\theta$')
+ax.legend()
+plt.show()
+```
+
+When $\sigma_y$ is large each test score is a noisy signal about $\theta$,
+so many more observations are required before the posterior variance falls
+appreciably.
+
+In the limit $\sigma_y \to 0$ a single observation pins down
+$\theta$ exactly.
+
+```{solution-end}
+```
+
+````{exercise}
+:label: mv_normal_ex4
+
+**Prior vs. likelihood in IQ inference**
+
+Using the one-dimensional IQ model with $n = 20$ test scores and
+$\mu_\theta = 100$:
+
+1. Fix $\sigma_y = 10$ and vary the prior spread
+$\sigma_\theta \in \{1, 5, 10, 50, 500\}$.
+
+ - For each value compute the
+ posterior mean $\hat{\mu}_\theta$ given the same set of $n = 20$ test
+ scores and plot $\hat{\mu}_\theta$ against $\sigma_\theta$.
+
+1. Show analytically (or verify numerically) that
+
+ - as $\sigma_\theta \to \infty$ the posterior mean converges to the
+ sample mean $\bar{y}$ (the data dominate the prior), and
+ - as $\sigma_\theta \to 0$ the posterior mean converges to the prior
+ mean $\mu_\theta$ (the prior dominates the data).
+
+```{hint}
+The posterior mean formula is
+$\hat{\mu}_\theta = \bigl(\mu_\theta/\sigma_\theta^2 + n\bar{y}/\sigma_y^2\bigr)
+\big/ \bigl(1/\sigma_\theta^2 + n/\sigma_y^2\bigr)$.
+```
+
+Examine each limit by letting $\sigma_\theta$ go to $\infty$ or $0$.
+````
+
+```{solution-start} mv_normal_ex4
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} python3
+n_scores = 20
+μθ_val, σy_val = 100., 10.
+
+rng = np.random.default_rng(42)
+true_θ = 108.
+y_obs = true_θ + σy_val * rng.standard_normal(n_scores)
+y_bar = np.mean(y_obs)
+
+σθ_vals = [1., 5., 10., 50., 500.]
+μθ_hat_vals = []
+
+for σθ_val in σθ_vals:
+ μ_i, Σ_i, _ = construct_moments_IQ(n_scores, μθ_val, σθ_val, σy_val)
+ mn_i = MultivariateNormal(μ_i, Σ_i)
+ mn_i.partition(n_scores)
+ μθ_hat, _ = mn_i.cond_dist(1, y_obs)
+ μθ_hat_vals.append(μθ_hat.item())
+
+def posterior_mean(σθ_val):
+ μ_i, Σ_i, _ = construct_moments_IQ(n_scores, μθ_val, σθ_val, σy_val)
+ mn_i = MultivariateNormal(μ_i, Σ_i)
+ mn_i.partition(n_scores)
+ μθ_hat, _ = mn_i.cond_dist(1, y_obs)
+ return μθ_hat.item()
+
+fig, ax = plt.subplots()
+ax.semilogx(σθ_vals, μθ_hat_vals, 'o-',
+ label=r'$\hat{\mu}_\theta$')
+ax.axhline(y_bar, ls='--', color='r',
+ label=f'sample mean y_bar = {y_bar:.1f}')
+ax.axhline(μθ_val, ls=':', color='g',
+ label=f'prior mean μθ = {μθ_val:.0f}')
+ax.set_xlabel(r'$\sigma_\theta$')
+ax.set_ylabel(r'posterior mean $\hat{\mu}_\theta$')
+ax.legend()
+plt.show()
+
+σθ_small = 1e-2
+σθ_large = 1e4
+
+print(f"y_bar = {y_bar:.4f}")
+print(f"Posterior mean with σθ={σθ_large:.0e}: {posterior_mean(σθ_large):.4f}")
+print(f"Posterior mean with σθ={σθ_small:.0e}: {posterior_mean(σθ_small):.4f}")
+print(f"Prior mean μθ = {μθ_val:.4f}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: mv_normal_ex5
+
+**Kalman filter convergence**
+
+Using the `iterate` function from the Filtering Foundations section with
+
+$$
+A = \begin{bmatrix} 0.9 & 0 \\ 0 & 0.5 \end{bmatrix}, \quad
+C = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
+G = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad
+R = \begin{bmatrix} 1 \end{bmatrix}
+$$
+
+and initial conditions $\hat{x}_0 = [0, 0]'$, $\Sigma_0 = I_2$:
+
+1. Simulate $T = 60$ periods of $\{x_t, y_t\}$ and run the filter.
+
+1. Plot the sequences of conditional variances $\Sigma_t[0,0]$ and
+$\Sigma_t[1,1]$ over time.
+
+ Verify that they converge to a steady state.
+
+1. Plot the filtered state estimates $\tilde{x}_t[0]$ together with the
+true $x_t[0]$ and the raw observations $y_t$ on a single figure.
+```
+
+```{solution-start} mv_normal_ex5
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} python3
+A_ex = np.array([[0.9, 0.], [0., 0.5]])
+C_ex = np.array([[1.], [1.]])
+G_ex = np.array([[1., 0.]])
+R_ex = np.array([[1.]])
+
+T_ex = 60
+x0_hat_ex = np.zeros(2)
+Σ0_ex = np.eye(2)
+
+rng = np.random.default_rng(7)
+x_true = np.zeros((T_ex + 1, 2))
+y_seq_ex = np.zeros(T_ex)
+for t in range(T_ex):
+ x_true[t + 1] = A_ex @ x_true[t] + C_ex[:, 0] * rng.standard_normal()
+ y_seq_ex[t] = (G_ex @ x_true[t]).item() + rng.standard_normal()
+
+x_hat_seq, Σ_hat_seq = iterate(
+ x0_hat_ex, Σ0_ex, A_ex, C_ex, G_ex, R_ex, y_seq_ex)
+
+# x_hat_seq[t] = E[x_t | y^{t-1}] (one-step-ahead prediction)
+# Σ_hat_seq[t] = corresponding prediction-error covariance
+fig, ax = plt.subplots()
+ax.plot(Σ_hat_seq[:, 0, 0], label=r'$\Sigma_t[0,0]$')
+ax.plot(Σ_hat_seq[:, 1, 1], label=r'$\Sigma_t[1,1]$')
+ax.set_xlabel('t')
+ax.set_ylabel('prediction-error variance')
+ax.legend()
+plt.show()
+
+# The `iterate` function stores one-step-ahead predictions.
+# We recover the filtered estimates E[x_t | y^t] by re-applying
+# the measurement-update step at each t.
+n_state = 2
+x_filt_seq = np.empty((T_ex, n_state))
+for t in range(T_ex):
+ xt_hat = x_hat_seq[t]
+ Σt = Σ_hat_seq[t]
+ μ_k = np.hstack([xt_hat, G_ex @ xt_hat])
+ Σ_k = np.block([[Σt, Σt @ G_ex.T ],
+ [G_ex @ Σt, G_ex @ Σt @ G_ex.T + R_ex]])
+ mn_k = MultivariateNormal(μ_k, Σ_k)
+ mn_k.partition(n_state)
+ x_filt_seq[t], _ = mn_k.cond_dist(0, y_seq_ex[t:t+1])
+
+fig, ax = plt.subplots()
+ax.plot(x_true[:-1, 0], label='true $x_t[0]$', alpha=0.7)
+ax.plot(x_filt_seq[:, 0], label=r'filtered $\tilde{x}_t[0]$', ls='--')
+ax.plot(y_seq_ex, label='observations $y_t$', alpha=0.4, lw=0.8)
+ax.set_xlabel('t')
+ax.legend()
+plt.show()
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: mv_normal_ex6
+
+**PCA vs. factor analysis**
+
+In the classic factor analysis model at the end of the lecture the true
+covariance is $\Sigma_y = \Lambda \Lambda^\top + D$.
+
+1. Set $\sigma_u = 2$ (instead of $0.5$).
+
+ - Recompute the fraction of
+ variance explained by the first two principal components and compare
+ it with the $\sigma_u = 0.5$ result.
+ - Explain the change.
+
+1. Show that the observation-space factor-analytic posterior
+ $\Lambda E[f \mid Y] = \Lambda B Y$ (an $N$-vector) is **not** equal to
+ the two-component PCA reconstruction
+ $\hat{Y} = P_{:,1:2}\,\epsilon_{1:2}$ (also an $N$-vector).
+ - Plot both on the same axes.
+
+ *Note:* $E[f \mid Y] = BY$ is a $k$-vector and $\hat{Y}$ is an
+ $N$-vector, so they cannot be compared directly; the comparison must be
+ made in observation space via $\Lambda E[f \mid Y]$.
+
+1. In one or two sentences, explain why PCA is misspecified for
+factor-analytic data.
+```
+
+```{solution-start} mv_normal_ex6
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} python3
+rng = np.random.default_rng(42)
+
+N_fa = 10
+k_fa = 2
+
+Λ_fa = np.zeros((N_fa, k_fa))
+Λ_fa[:N_fa//2, 0] = 1
+Λ_fa[N_fa//2:, 1] = 1
+
+for σu_val in [0.5, 2.0]:
+ D_fa = np.eye(N_fa) * σu_val ** 2
+ Σy_fa = Λ_fa @ Λ_fa.T + D_fa
+
+ λ_fa, P_fa = np.linalg.eigh(Σy_fa)
+ ind_fa = sorted(range(N_fa), key=lambda x: λ_fa[x], reverse=True)
+ P_fa = P_fa[:, ind_fa]
+ λ_fa = λ_fa[ind_fa]
+
+ frac = λ_fa[:2].sum() / λ_fa.sum()
+ print(f"σu={σu_val}: fraction explained by first 2 PCs = {frac:.4f}")
+
+σu_b = 0.5
+D_b = np.eye(N_fa) * σu_b ** 2
+Σy_b = Λ_fa @ Λ_fa.T + D_b
+
+μz_b = np.zeros(k_fa + N_fa)
+Σz_b = np.block([[np.eye(k_fa), Λ_fa.T], [Λ_fa, Σy_b]])
+z_b = rng.multivariate_normal(μz_b, Σz_b)
+f_b = z_b[:k_fa]
+y_b = z_b[k_fa:]
+
+B_b = Λ_fa.T @ np.linalg.inv(Σy_b)
+Efy_b = B_b @ y_b
+
+λ_b, P_b = np.linalg.eigh(Σy_b)
+ind_b = sorted(range(N_fa), key=lambda x: λ_b[x], reverse=True)
+P_b = P_b[:, ind_b]
+ε_b = P_b.T @ y_b
+y_hat_b = P_b[:, :2] @ ε_b[:2]
+
+fig, ax = plt.subplots(figsize=(8, 4))
+ax.scatter(range(N_fa),
+ Λ_fa @ Efy_b, label=r'Factor-analytic $\Lambda E[f\mid y]$')
+ax.scatter(range(N_fa),
+ y_hat_b, marker='x', label=r'PCA projection $\hat{y}$')
+ax.scatter(range(N_fa),
+ Λ_fa @ f_b, marker='^', alpha=0.6, label=r'True signal $\Lambda f$')
+ax.set_xlabel('observation index')
+ax.legend()
+plt.show()
+```
+
+In this symmetric example, PCA does recover the same two-dimensional
+observation-space subspace as the factor model, namely the column space
+of $\Lambda$.
+
+But PCA is still misspecified for factor-analytic data, because it
+treats the covariance matrix as an arbitrary matrix to be approximated
+and does not use the special decomposition
+$\Sigma_y = \Lambda \Lambda^\top + D$ into a common part and an
+idiosyncratic noise part.
+
+So the two methods are solving different problems.
+
+PCA forms $\hat{Y}$ as the best rank-2 approximation to the observed
+data vector $Y$, which in this example amounts to using the block means.
+
+The factor model instead computes $\Lambda E[f \mid Y]$, the conditional
+mean of the latent common component $\Lambda f$ given the data, and
+because it accounts for noise it shrinks those block means toward zero.
+
+```{solution-end}
+```
diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md
index b142b9e39..b29820c20 100644
--- a/lectures/prob_matrix.md
+++ b/lectures/prob_matrix.md
@@ -1,10 +1,10 @@
---
jupytext:
text_representation:
- extension: .myst
+ extension: .md
format_name: myst
format_version: 0.13
- jupytext_version: 1.13.8
+ jupytext_version: 1.17.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
@@ -17,7 +17,7 @@ kernelspec:
This lecture uses matrix algebra to illustrate some basic ideas about probability theory.
-After introducing underlying objects, we'll use matrices and vectors to describe probability distributions.
+After introducing underlying objects, we'll use matrices and vectors to describe probability distributions.
-Among concepts that we'll be studying include
+Concepts that we'll be studying include
@@ -29,13 +29,13 @@ Among concepts that we'll be studying include
- couplings
- copulas
- the probability distribution of a sum of two independent random variables
- - convolution of marginal distributions
+ - convolution of marginal distributions
- parameters that define a probability distribution
- sufficient statistics as data summaries
-We'll use a matrix to represent a bivariate or multivariate probability distribution and a vector to represent a univariate probability distribution
+We'll use a matrix to represent a bivariate or multivariate probability distribution and a vector to represent a univariate probability distribution.
-This {doc}`companion lecture ` describes some popular probability distributions and describes how to use Python to sample from them.
+This {doc}`companion lecture ` describes some popular probability distributions and describes how to use Python to sample from them.
In addition to what's in Anaconda, this lecture will need the following libraries:
@@ -53,79 +53,94 @@ As usual, we'll start with some imports
import numpy as np
import matplotlib.pyplot as plt
import prettytable as pt
+from scipy import stats
+from scipy.special import comb
from mpl_toolkits.mplot3d import Axes3D
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')
+
+rng = np.random.default_rng(0)
```
-## Sketch of Basic Concepts
+## Sketch of basic concepts
We'll briefly define what we mean by a **probability space**, a **probability measure**, and a **random variable**.
For most of this lecture, we sweep these objects into the background
```{note}
-Nevertheless, they'll be lurking beneath **induced distributions** of random variables that we'll focus on here. These deeper objects are essential for defining and analysing the concepts of stationarity and ergodicity that underly laws of large numbers. For a relatively
-nontechnical presentation of some of these results see this chapter from Lars Peter Hansen and Thomas J. Sargent's online monograph titled "Risk, Uncertainty, and Values":.
+Nevertheless, they'll be lurking beneath **induced distributions** of random variables that we'll focus on here.
+
+These deeper objects are essential for defining and analysing the concepts of stationarity and ergodicity that underly laws of large numbers.
+
+For a relatively
+nontechnical presentation of some of these results see this chapter from Lars Peter Hansen and Thomas J. Sargent's online monograph titled [*Risk, Uncertainty, and Values*](https://lphansen.github.io/QuantMFR/book/1_stochastic_processes.html).
```
-Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcomes.
+Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcome.
-Let $\mathcal{G} \subset \Omega$ be a subset of $\Omega$.
+Let $\mathcal{F}$ be a collection of subsets of $\Omega$ that we call **events**.
-Let $\mathcal{F}$ be a collection of such subsets $\mathcal{G} \subset \Omega$.
+(Technically, $\mathcal{F}$ is a [$\sigma$-algebra](https://en.wikipedia.org/wiki/Sigma-algebra).)
-The pair $\Omega,\mathcal{F}$ forms our **probability space** on which we want to put a probability measure.
+A **probability measure** $\mu$ maps each event $\mathcal{G} \in \mathcal{F}$ into a scalar number $\mu(\mathcal{G})$ between $0$ and $1$, with $\mu(\Omega)=1$.
-A **probability measure** $\mu$ maps a set of possible underlying outcomes $\mathcal{G} \in \mathcal{F}$ into a scalar number between $0$ and $1$
+The triple $\Omega,\mathcal{F},\mu$ forms our **probability space**.
-- this is the "probability" that $X$ belongs to $A$, denoted by $ \textrm{Prob}\{X\in A\}$.
+A **random variable** $X(\omega)$ is a function of the underlying outcome $\omega \in \Omega$ that assigns a value in some set of possible values.
-A **random variable** $X(\omega)$ is a function of the underlying outcome $\omega \in \Omega$.
+If $A$ is a set of possible values of $X$, then the event that $X$ lies in $A$ is
+
+$$
+\mathcal{G} = \{\omega \in \Omega : X(\omega) \in A\}.
+$$
+
-The random variable $X(\omega)$ has a **probability distribution** that is induced by the underlying probability measure $\mu$ and the function
-$X(\omega)$:
+The random variable $X(\omega)$ has a **probability distribution** induced by the probability measure $\mu$:
$$
-\textrm{Prob} (X \in A ) = \int_{\mathcal{G}} \mu(\omega) d \omega
-$$ (eq:CDFfromdensity)
+\textrm{Prob}(X \in A) = \mu(\mathcal{G}).
+$$
-where ${\mathcal G}$ is the subset of $\Omega$ for which $X(\omega) \in A$.
+If $\mu$ has a density $p(\omega)$, then we can also write
+
+$$
+\textrm{Prob}(X \in A) = \int_{\mathcal{G}} p(\omega)\, d \omega
+$$ (eq:CDFfromdensity)
We call this the induced probability distribution of random variable $X$.
-Instead of working explicitly with an underlying probability space $\Omega,\mathcal{F}$ and probability measure $\mu$,
-applied statisticians often proceed simply by specifying a form for an induced distribution for a random variable $X$.
+Instead of working explicitly with an underlying probability space $\Omega,\mathcal{F}$ and probability measure $\mu$,
+applied statisticians often proceed simply by specifying a form for an induced distribution for a random variable $X$.
-That is how we'll proceed in this lecture and in many subsequent lectures.
+That is how we'll proceed in this lecture and in many subsequent lectures.
-## What Does Probability Mean?
+## What does probability mean?
Before diving in, we'll say a few words about what probability theory means and how it connects to statistics.
-We also touch on these topics in the quantecon lectures and .
+We also touch on these topics in {doc}`prob_meaning` and {doc}`navy_captain`.
-For much of this lecture we'll be discussing fixed "population" probabilities.
+For much of this lecture we'll be discussing fixed "population" probabilities.
These are purely mathematical objects.
To appreciate how statisticians connect probabilities to data, the key is to understand the following concepts:
* A single draw from a probability distribution
-* Repeated independently and identically distributed (i.i.d.) draws of "samples" or "realizations" from the same probability distribution
-* A **statistic** defined as a function of a sequence of samples
-* An **empirical distribution** or **histogram** (a binned empirical distribution) that records observed **relative frequencies**
-* The idea that a population probability distribution is what we anticipate **relative frequencies** will be in a long sequence of i.i.d. draws. Here the following mathematical machinery makes precise what is meant by **anticipated relative frequencies**
+* Repeated independently and identically distributed (i.i.d.) draws of "samples" or "realizations" from the same probability distribution
+* A **statistic** defined as a function of a sequence of samples
+* An **empirical distribution** or **histogram** (a binned empirical distribution) that records observed **relative frequencies**
+* The idea that a population probability distribution is what we anticipate **relative frequencies** will be in a long sequence of i.i.d. draws. Here the following mathematical machinery makes precise what is meant by **anticipated relative frequencies**
- **Law of Large Numbers (LLN)**
- - **Central Limit Theorem (CLT)**
+ - **Central Limit Theorem (CLT)**
+### A discrete random variable example
-**Scalar example**
+#### Scalar example
Let $X$ be a scalar random variable that takes on the $I$ possible values
$0, 1, 2, \ldots, I-1$ with probabilities
@@ -147,23 +162,23 @@ $$
as a short-hand way of saying that the random variable $X$ is described by the probability distribution $ \{{f_i}\}_{i=0}^{I-1}$.
-Consider drawing a sample $x_0, x_1, \dots , x_{N-1}$ of $N$ independent and identically distributoed draws of $X$.
+Consider drawing a sample $x_0, x_1, \dots , x_{N-1}$ of $N$ independent and identically distributed draws of $X$.
-What do the "identical" and "independent" mean in IID or iid ("identically and independently distributed")?
+What do "identical" and "independent" mean in IID or iid ("identically and independently distributed")?
- "identical" means that each draw is from the same distribution.
-- "independent" means that joint distribution equal products of marginal distributions, i.e.,
+- "independent" means that the joint distribution equals the product of marginal distributions, i.e.,
$$
\begin{aligned}
-\textrm{Prob}\{x_0 = i_0, x_1 = i_1, \dots , x_{N-1} = i_{N-1}\} &= \textrm{Prob}\{x_0 = i_0\} \cdot \dots \cdot \textrm{Prob}\{x_{I-1} = i_{I-1}\}\\
+\textrm{Prob}\{x_0 = i_0, x_1 = i_1, \dots , x_{N-1} = i_{N-1}\} &= \textrm{Prob}\{x_0 = i_0\} \cdot \dots \cdot \textrm{Prob}\{x_{N-1} = i_{N-1}\}\\
&= f_{i_0} f_{i_1} \cdot \dots \cdot f_{i_{N-1}}\\
\end{aligned}
$$
-We define an **empirical distribution** as follows.
+We define an **empirical distribution** as follows.
-For each $i = 0,\dots,I-1$, let
+For each $i = 0,\dots,I-1$, let
$$
\begin{aligned}
@@ -174,35 +189,31 @@ N & = \sum^{I-1}_{i=0} N_i \quad \text{total number of draws},\\
$$
-Key concepts that connect probability theory with statistics are laws of large numbers and central limit theorems
-
-**LLN:**
+Key concepts that connect probability theory with statistics are laws of large numbers and central limit theorems.
-- A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i \text{ as } N \to \infty$
+A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i$ as $N \to \infty$.
-**CLT:**
+A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$.
-- A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$
+See {doc}`lln_clt` for a detailed treatment of both results.
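+
+Here is a small simulation sketch of the LLN at work, for a three-point distribution chosen only for this illustration.
+
+```{code-cell} ipython3
+f_pop = np.array([0.2, 0.5, 0.3])
+draws = rng.choice(len(f_pop), size=100_000, p=f_pop)
+f_emp = np.bincount(draws, minlength=len(f_pop)) / len(draws)
+print(f_emp)   # empirical frequencies ≈ f_pop
+```
+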
+### Understanding probability: frequentist vs. Bayesian
-**Remarks**
+For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means.
-- For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means.
+But for a Bayesian it means something else -- something partly subjective and purely personal.
-- But for a Bayesian it means something else -- something partly subjective and purely personal.
-
- * we say "partly" because a Bayesian also pays attention to relative frequencies
+We say "partly" because a Bayesian also pays attention to relative frequencies.
+## Representing probability distributions
-## Representing Probability Distributions
-
-A probability distribution $\textrm{Prob} (X \in A)$ can be described by its **cumulative distribution function (CDF)**
+A probability distribution $\textrm{Prob} (X \in A)$ can be described by its **cumulative distribution function (CDF)**
$$
F_{X}(x) = \textrm{Prob}\{X\leq x\}.
$$
-Sometimes, but not always, a random variable can also be described by **density function** $f(x)$
+Sometimes, but not always, a random variable can also be described by a **density function** $f(x)$
that is related to its CDF by
$$
@@ -215,13 +226,13 @@ $$
Here $B$ is a set of possible $X$'s whose probability of occurring we want to compute.
-When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.
+When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.
For a **discrete-valued** random variable
* the number of possible values of $X$ is finite or countably infinite
* we replace a **density** with a **probability mass function**, a non-negative sequence that sums to one
-* we replace integration with summation in the formula like {eq}`eq:CDFfromdensity` that relates a CDF to a probability mass function
+* when a density exists, we replace integration with summation in formulas like {eq}`eq:CDFfromdensity`
In this lecture, we mostly discuss discrete random variables.
@@ -231,7 +242,7 @@ Doing this enables us to confine our tool set basically to linear algebra.
Later we'll briefly discuss how to approximate a continuous random variable with a discrete random variable.
-## Univariate Probability Distributions
+## Univariate probability distributions
We'll devote most of this lecture to discrete-valued random variables, but we'll say a few things
about continuous-valued random variables.
@@ -281,15 +292,19 @@ $$
where $\theta $ is a vector of parameters that is of much smaller dimension than $I$.
-**Remarks:**
+A **statistical model** is a joint probability distribution characterized by a list of **parameters**.
+
+The concept of **parameter** is intimately related to the notion of **sufficient statistic**.
+
+A **statistic** is a nonlinear function of a data set.
+
+**Sufficient statistics** summarize all **information** that a data set contains about parameters of a statistical model.
+
+Note that a sufficient statistic corresponds to a particular statistical model.
-- A **statistical model** is a joint probability distribution characterized by a list of **parameters**
-- The concept of **parameter** is intimately related to the notion of **sufficient statistic**.
-- A **statistic** is a nonlinear function of a data set.
-- **Sufficient statistics** summarize all **information** that a data set contains about parameters of statistical model.
- * Note that a sufficient statistic corresponds to a particular statistical model.
- * Sufficient statistics are key tools that AI uses to summarize or compress a **big data** set.
-- R. A. Fisher provided a rigorous definition of **information** -- see
+Sufficient statistics are key tools that AI uses to summarize or compress a **big data** set.
+
+R. A. Fisher provided a rigorous definition of **information** -- see [Fisher information](https://en.wikipedia.org/wiki/Fisher_information).
@@ -298,7 +313,7 @@ An example of a parametric probability distribution is a **geometric distributi
It is described by
$$
-f_{i} = \textrm{Prob}\{X=i\} = (1-\lambda)\lambda^{i},\quad \lambda \in [0,1], \quad i = 0, 1, 2, \ldots
+f_{i} = \textrm{Prob}\{X=i\} = (1-\lambda)\lambda^{i},\quad \lambda \in [0,1), \quad i = 0, 1, 2, \ldots
$$
Evidently, $\sum_{i=0}^{\infty}f_i=1$.
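+
+As a quick numerical check, we can truncate the infinite sum, taking $\lambda = 0.5$ for illustration.
+
+```{code-cell} ipython3
+λ = 0.5
+i_grid = np.arange(200)              # truncate the infinite sum
+f_geom = (1 - λ) * λ**i_grid
+print(f_geom.sum())                  # ≈ 1
+print(f_geom @ i_grid)               # mean ≈ λ/(1-λ) = 1 here
+```
+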
@@ -311,7 +326,7 @@ $$
### Continuous random variable
-Let $X$ be a continous random variable that takes values $X \in \tilde{X}\equiv[X_U,X_L]$ whose distributions have parameters $\theta$.
+Let $X$ be a continuous random variable that takes values in a set $\tilde{X} \subseteq \mathbb{R}$ and whose distribution has parameters $\theta$.
$$
\textrm{Prob}\{X\in A\} = \int_{x\in A} f(x;\theta)\,dx; \quad f(x;\theta)\ge0
@@ -323,7 +338,7 @@ $$
\textrm{Prob}\{X\in \tilde{X}\} =1
$$
-## Bivariate Probability Distributions
+## Bivariate probability distributions
We'll now discuss a bivariate **joint distribution**.
@@ -357,9 +372,9 @@ $$
\sum_{i}\sum_{j}f_{ij}=1
$$
-## Marginal Probability Distributions
+## Marginal probability distributions
-The joint distribution induce marginal distributions
+The joint distribution induces marginal distributions
$$
\textrm{Prob}\{X=i\}= \sum_{j=0}^{J-1}f_{ij} = \mu_i, \quad i=0,\ldots,I-1
@@ -391,7 +406,7 @@ $$
\end{aligned}
$$
-**Digression:** If two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by
+As a digression, if two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by
$$
\begin{aligned}
@@ -400,7 +415,7 @@ f(y)& = \int_{\mathbb{R}} f(x,y) dx
\end{aligned}
$$
-## Conditional Probability Distributions
+## Conditional probability distributions
Conditional probabilities are defined according to
@@ -422,18 +437,22 @@ where $i=0, \ldots,I-1, \quad j=0,\ldots,J-1$.
Note that
$$
-\sum_{i}\textrm{Prob}\{X_i=i|Y_j=j\}
+\sum_{i}\textrm{Prob}\{X=i|Y=j\}
=\frac{ \sum_{i}f_{ij} }{ \sum_{i}f_{ij}}=1
$$
-**Remark:** The mathematics of conditional probability implies:
+The mathematics of conditional probability implies:
$$
\textrm{Prob}\{X=i|Y=j\} =\frac{\textrm{Prob}\{X=i,Y=j\}}{\textrm{Prob}\{Y=j\}}=\frac{\textrm{Prob}\{Y=j|X=i\}\textrm{Prob}\{X=i\}}{\textrm{Prob}\{Y=j\}}
$$ (eq:condprobbayes)
```{note}
-Formula {eq}`eq:condprobbayes` is also what a Bayesian calls **Bayes' Law**. A Bayesian statistician regards marginal probability distribution $\textrm{Prob}({X=i}), i = 1, \ldots, J$ as a **prior** distribution that describes his personal subjective beliefs about $X$.
+Formula {eq}`eq:condprobbayes` is also what a Bayesian calls **Bayes' Law**.
+
+A Bayesian statistician regards the marginal probability distribution $\textrm{Prob}({X=i}), i = 0, \ldots, I-1$ as a **prior** distribution that describes his personal subjective beliefs about $X$.
+
He then interprets formula {eq}`eq:condprobbayes` as a procedure for constructing a **posterior** distribution that describes how he would revise his subjective beliefs after observing that $Y$ equals $j$.
```
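+
+Here is a tiny numerical instance of Bayes' Law, using the same $2 \times 2$ joint distribution that appears in the sampling code below.
+
+```{code-cell} ipython3
+f_joint = np.array([[0.3, 0.2],
+                    [0.1, 0.4]])    # f_joint[i, j] = Prob{X=i, Y=j}
+prior_X = f_joint.sum(axis=1)       # Prob{X=i}
+like_Y = f_joint / f_joint.sum(axis=1, keepdims=True)   # Prob{Y=j | X=i}
+marg_Y = f_joint.sum(axis=0)        # Prob{Y=j}
+
+# posterior Prob{X=i | Y=1}, computed two equivalent ways
+print(f_joint[:, 1] / marg_Y[1])
+print(like_Y[:, 1] * prior_X / marg_Y[1])
+```
+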
@@ -446,7 +465,7 @@ $$
$$
-## Transition Probability Matrix
+## Transition probability matrix
Consider the following joint probability distribution of two random variables.
@@ -465,7 +484,7 @@ $$
An associated conditional distribution is
$$
-\textrm{Prob}\{Y=i\vert X=j\} = \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}}
+\textrm{Prob}\{Y=j\vert X=i\} = \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}}
= \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}}
$$
@@ -480,8 +499,8 @@ where
$$
\left[
\begin{matrix}
- p_{11} & p_{12}\\
- p_{21} & p_{22}
+ p_{00} & p_{01}\\
+ p_{10} & p_{11}
\end{matrix}
\right]
$$
@@ -491,11 +510,11 @@ The first row is the probability that $Y=j, j=0,1$ conditional on $X=0$.
The second row is the probability that $Y=j, j=0,1$ conditional on $X=1$.
Note that
-- $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix $P$ is a probability distribution (not so for each column).
+- $\sum_{j}p_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of the transition matrix $P$ is a probability distribution (not so for each column).
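+
+Here is a small sketch that constructs such a transition matrix by normalizing the rows of a joint distribution; the numbers are chosen only for this illustration.
+
+```{code-cell} ipython3
+ρ_joint = np.array([[0.3, 0.2],
+                    [0.1, 0.4]])    # ρ_joint[i, j] = Prob{X=i, Y=j}
+P_trans = ρ_joint / ρ_joint.sum(axis=1, keepdims=True)
+print(P_trans)
+print(P_trans.sum(axis=1))          # each row sums to one
+```
+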
-## Application: Forecasting a Time Series
+## Application: forecasting a time series
Suppose that there are two time periods.
@@ -508,7 +527,7 @@ Suppose that
$$
\begin{aligned}
-\text{Prob} \{X(0)=i,X(1)=j\} &=f_{ij}≥0,i=0,\cdots,I-1\\
+\text{Prob} \{X(0)=i,X(1)=j\} &=f_{ij}\geq 0, \quad i=0,\cdots,I-1, \quad j=0,\cdots,J-1\\
\sum_{i}\sum_{j}f_{ij}&=1
\end{aligned}
$$
@@ -519,11 +538,10 @@ A conditional distribution is
$$\text{Prob} \{X(1)=j|X(0)=i\}= \frac{f_{ij}}{ \sum_{j}f_{ij}}$$
-**Remark:**
-- This formula is a workhorse for applied economic forecasters.
+This formula is a workhorse for applied economic forecasters.
-## Statistical Independence
+## Statistical independence
Random variables X and Y are statistically **independent** if
@@ -535,8 +553,8 @@ where
$$
\begin{aligned}
-\textrm{Prob}\{X=i\} &=f_i\ge0, \sum{f_i}=1 \cr
-\textrm{Prob}\{Y=j\} & =g_j\ge0, \sum{g_j}=1
+\textrm{Prob}\{X=i\} &=f_i\ge 0, \quad \sum_{i}{f_i}=1 \cr
+\textrm{Prob}\{Y=j\} & =g_j\ge 0, \quad \sum_{j}{g_j}=1
\end{aligned}
$$
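+
+Here is a small sketch that builds an independent joint distribution as the outer product of two marginals chosen for this illustration.
+
+```{code-cell} ipython3
+f_marg = np.array([0.25, 0.75])     # Prob{X=i}
+g_marg = np.array([0.4, 0.1, 0.5])  # Prob{Y=j}
+joint = np.outer(f_marg, g_marg)    # Prob{X=i, Y=j} = f_i g_j
+print(joint)
+print(np.allclose(joint.sum(axis=1), f_marg),
+      np.allclose(joint.sum(axis=0), g_marg))
+```
+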
@@ -550,7 +568,7 @@ $$
$$
-## Means and Variances
+## Means and variances
The mean and variance of a discrete random variable $X$ are
@@ -562,7 +580,7 @@ $$
\end{aligned}
$$
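+
+Here is a short vectorized computation of these two moments for a small discrete distribution chosen for this illustration.
+
+```{code-cell} ipython3
+x_vals = np.arange(4)
+f_pmf = np.array([0.1, 0.2, 0.3, 0.4])
+μ_X = x_vals @ f_pmf
+σ2_X = (x_vals - μ_X)**2 @ f_pmf
+print(μ_X, σ2_X)
+```
+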
-A continuous random variable having density $f_{X}(x)$) has mean and variance
+A continuous random variable having density $f_{X}(x)$ has mean and variance
$$
\begin{aligned}
@@ -571,7 +589,7 @@ $$
\end{aligned}
$$
-## Matrix Representations of Some Bivariate Distributions
+## Matrix representations of some bivariate distributions
Let's use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and variance of a bivariate random variable.
@@ -590,12 +608,9 @@ $$ \textrm{Prob}(X=i)=\sum_j{f_{ij}}=u_i $$
$$ \textrm{Prob}(Y=j)=\sum_i{f_{ij}}=v_j $$
-**Sampling:**
-
-Let's write some Python code that let's us draw some long samples and compute relative frequencies.
+Let's write some Python code that lets us draw some long samples and compute relative frequencies.
-The code will let us check whether the "sampling" distribution agrees with the "population" distribution - confirming that
-the population distribution correctly tells us the relative frequencies that we should expect in a large sample.
+The code lets us check whether the "sampling" distribution agrees with the "population" distribution -- confirming that the population distribution correctly tells us the relative frequencies that we should expect in a large sample.
@@ -607,7 +622,7 @@ f = np.array([[0.3, 0.2], [0.1, 0.4]])
f_cum = np.cumsum(f)
# draw random numbers
-p = np.random.rand(1_000_000)
+p = rng.random(1_000_000)
x = np.vstack([xs[1]*np.ones(p.shape), ys[1]*np.ones(p.shape)])
# map to the bivariate distribution
@@ -764,7 +779,7 @@ class discrete_bijoint:
xs = self.xs
ys = self.ys
f_cum = np.cumsum(self.f)
- p = np.random.rand(n)
+ p = rng.random(n)
x = np.empty([2, p.shape[0]])
lf = len(f_cum)
lx = len(xs)-1
@@ -844,7 +859,9 @@ class discrete_bijoint:
Let's apply our code to some examples.
-**Example 1**
+### Numerical examples
+
+#### Example 1
```{code-cell} ipython3
# joint
@@ -863,7 +880,7 @@ d.marg_dist()
d.cond_dist()
```
-**Example 2**
+#### Example 2
```{code-cell} ipython3
xs_new = np.array([10, 20, 30])
@@ -882,7 +899,7 @@ d_new.marg_dist()
d_new.cond_dist()
```
-## A Continuous Bivariate Random Vector
+## A continuous bivariate random vector
A two-dimensional Gaussian distribution has joint density
@@ -891,11 +908,6 @@ $$
f(x,y) =(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2})^{-1}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}-\frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}+\frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right]
$$
-
-$$
-\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}-\frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}+\frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right]
-$$
-
We start with a bivariate normal distribution pinned down by
$$
@@ -934,9 +946,11 @@ y = np.linspace(-10, 10, 1_000)
x_mesh, y_mesh = np.meshgrid(x, y, indexing="ij")
```
-**Joint Distribution**
+### Joint, marginal, and conditional distributions
+
+#### Joint distribution
-Let's plot the **population** joint density.
+Let's plot the **population** joint density.
```{code-cell} ipython3
# %matplotlib notebook
@@ -967,18 +981,18 @@ Next we can use a built-in `numpy` function to draw random samples, then calc
μ= np.array([0, 5])
σ= np.array([[5, .2], [.2, 1]])
n = 1_000_000
-data = np.random.multivariate_normal(μ, σ, n)
+data = rng.multivariate_normal(μ, σ, n)
x = data[:, 0]
y = data[:, 1]
```
-**Marginal distribution**
+#### Marginal distribution
```{code-cell} ipython3
plt.hist(x, bins=1_000, alpha=0.6)
μx_hat, σx_hat = np.mean(x), np.std(x)
print(μx_hat, σx_hat)
-x_sim = np.random.normal(μx_hat, σx_hat, 1_000_000)
+x_sim = rng.normal(μx_hat, σx_hat, 1_000_000)
plt.hist(x_sim, bins=1_000, alpha=0.4, histtype="step")
plt.show()
```
@@ -987,19 +1001,19 @@ plt.show()
plt.hist(y, bins=1_000, density=True, alpha=0.6)
μy_hat, σy_hat = np.mean(y), np.std(y)
print(μy_hat, σy_hat)
-y_sim = np.random.normal(μy_hat, σy_hat, 1_000_000)
+y_sim = rng.normal(μy_hat, σy_hat, 1_000_000)
plt.hist(y_sim, bins=1_000, density=True, alpha=0.4, histtype="step")
plt.show()
```
-**Conditional distribution**
+#### Conditional distribution
For a bivariate normal population distribution, the conditional distributions are also normal:
$$
-\begin{aligned} \\
-[X|Y &= y ]\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\
-[Y|X &= x ]\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg]
+\begin{aligned}
+X \mid Y = y &\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\
+Y \mid X = x &\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg]
\end{aligned}
$$
@@ -1007,30 +1021,33 @@ $$
Please see this {doc}`quantecon lecture ` for more details.
```
-Let's approximate the joint density by discretizing and mapping the approximating joint density into a matrix.
+Let's approximate the joint density by discretizing and mapping the approximating joint density into a matrix.
+
+On an evenly spaced grid, we can approximate the conditional distribution by assigning probability weights proportional to a slice of the joint density.
-We can compute the discretized marginal density by just using matrix algebra and noting that
+For fixed $y$, this means that
$$
-\textrm{Prob}\{X=i|Y=j\}=\frac{f_{ij}}{\sum_{i}f_{ij}}
+z_i
+\equiv \frac{f(x_i,y)}{\sum_k f(x_k,y)}
$$
Fix $y=0$.
```{code-cell} ipython3
-# discretized marginal density
+# discretized conditional distribution of X given Y = 0
x = np.linspace(-10, 10, 1_000_000)
z = func(x, y=0) / np.sum(func(x, y=0))
plt.plot(x, z)
plt.show()
```
-The mean and variance are computed by
+The conditional mean and variance are then approximated by
$$
\begin{aligned}
-\mathbb{E}\left[X\vert Y=j\right] & =\sum_{i}iProb\{X=i\vert Y=j\}=\sum_{i}i\frac{f_{ij}}{\sum_{i}f_{ij}} \\
-\mathbb{D}\left[X\vert Y=j\right] &=\sum_{i}\left(i-\mu_{X\vert Y=j}\right)^{2}\frac{f_{ij}}{\sum_{i}f_{ij}}
+\mathbb{E}\left[X\vert Y=y\right] & \approx \sum_i x_i z_i \\
+\mathbb{D}\left[X\vert Y=y\right] & \approx \sum_i\left(x_i-\mu_{X\vert Y=y}\right)^{2} z_i
\end{aligned}
$$
@@ -1044,7 +1061,7 @@ Let's draw from a normal distribution with above mean and variance and check how
σx = np.sqrt(np.dot((x - μx)**2, z))
# sample
-zz = np.random.normal(μx, σx, 1_000_000)
+zz = rng.normal(μx, σx, 1_000_000)
plt.hist(zz, bins=300, density=True, alpha=0.3, range=[-10, 10])
plt.show()
```
@@ -1052,19 +1069,19 @@ plt.show()
Fix $x=1$.
```{code-cell} ipython3
-y = np.linspace(0, 10, 1_000_000)
+y = np.linspace(-10, 10, 1_000_000)
z = func(x=1, y=y) / np.sum(func(x=1, y=y))
plt.plot(y,z)
plt.show()
```
```{code-cell} ipython3
-# discretized mean and standard deviation
+# discretized conditional mean and standard deviation
μy = np.dot(y,z)
σy = np.sqrt(np.dot((y - μy)**2, z))
# sample
-zz = np.random.normal(μy,σy,1_000_000)
+zz = rng.normal(μy, σy, 1_000_000)
plt.hist(zz, bins=100, density=True, alpha=0.3)
plt.show()
```
@@ -1079,7 +1096,7 @@ print(μy, σy)
print(μ2 + ρ * σ2 * (1 - μ1) / σ1, np.sqrt(σ2**2 * (1 - ρ**2)))
```
-## Sum of Two Independently Distributed Random Variables
+## Sum of two independently distributed random variables
Let $X, Y$ be two independent discrete random variables that take values in $\bar{X}, \bar{Y}$, respectively.
@@ -1127,10 +1144,10 @@ Start with a joint distribution
$$
\begin{aligned}
f_{ij} & =\textrm{Prob}\{X=i,Y=j\}\\
-i& =0, \cdots,I-1\\
-j& =0, \cdots,J-1\\
-& \text{stacked to an }I×J\text{ matrix}\\
-& e.g. \quad I=1, J=1
+i& =0, \cdots, I-1\\
+j& =0, \cdots, J-1\\
+& \text{stacked to an }I\times J\text{ matrix}\\
+& e.g. \quad I=2, J=2
\end{aligned}
$$
@@ -1139,8 +1156,8 @@ where
$$
\left[
\begin{matrix}
- f_{11} & f_{12}\\
- f_{21} & f_{22}
+ f_{00} & f_{01}\\
+ f_{10} & f_{11}
\end{matrix}
\right]
$$
@@ -1149,7 +1166,7 @@ From the joint distribution, we have shown above that we obtain **unique** marg
Now we'll try to go in a reverse direction.
-We'll find that from two marginal distributions, can we usually construct more than one joint distribution that verifies these marginals.
+We'll find that from two marginal distributions we can usually construct more than one joint distribution that satisfies these marginals.
Each of these joint distributions is called a **coupling** of the two marginal distributions.
@@ -1162,9 +1179,7 @@ $$
\end{aligned}
$$
-Given two marginal distribution, $\mu$ for $X$ and $\nu$ for $Y$, a joint distribution $f_{ij}$ is said to be a **coupling** of $\mu$ and $\nu$.
-
-**Example:**
+Given two marginal distributions, $\mu$ for $X$ and $\nu$ for $Y$, a joint distribution $f_{ij}$ whose marginals are $\mu$ and $\nu$ is said to be a **coupling** of $\mu$ and $\nu$.
Consider the following bivariate example.
@@ -1174,13 +1189,13 @@ $$
\text{Prob} \{X=1\}=& q =\mu_{1}\\
\text{Prob} \{Y=0\}=& 1-r =\nu_{0}\\
\text{Prob} \{Y=1\}= & r =\nu_{1}\\
-\text{where } 0 \leq q < r \leq 1
+\text{where } 0 \leq q \leq r \leq 1
\end{aligned}
$$
We construct two couplings.
-The first coupling if our two marginal distributions is the joint distribution
+The first coupling of our two marginal distributions is the joint distribution
$$f_{ij}=
\left[
@@ -1199,7 +1214,7 @@ $$
\mu_{0}= (1-q)(1-r)+(1-q)r & =1-q\\
\mu_{1}= q(1-r)+qr & =q\\
\nu_{0}= (1-q)(1-r)+(1-r)q& =1-r\\
-\mu_{1}= r(1-q)+qr& =r
+\nu_{1}= r(1-q)+qr& =r
\end{aligned}
$$
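+
+This first coupling is the independent joint distribution, i.e., the outer product of the two marginal vectors; here is a quick numerical check with hypothetical values $q = 0.4$ and $r = 0.7$:
+
+```{code-cell} ipython3
+q_demo, r_demo = 0.4, 0.7  # illustrative values only
+
+# independent coupling: outer product of the two marginal vectors
+f_indep = np.outer([1 - q_demo, q_demo], [1 - r_demo, r_demo])
+
+print(f_indep.sum(axis=1))  # marginal of X: [1-q, q]
+print(f_indep.sum(axis=0))  # marginal of Y: [1-r, r]
+```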
@@ -1216,7 +1231,7 @@ f_{ij}=
\right]
$$
-The verify that this is a coupling, note that
+To verify that this is a coupling, note that
$$
\begin{aligned}
@@ -1234,12 +1249,11 @@ But the joint distributions differ.
Thus, multiple joint distributions $[f_{ij}]$ can have the same marginals.
-**Remark:**
-- Couplings are important in optimal transport problems and in Markov processes. Please see this {doc}`lecture about optimal transport `
+Couplings are important in optimal transport problems and in Markov processes. Please see this {doc}`lecture about optimal transport `.
-## Copula Functions
+## Copula functions
-Suppose that $X_1, X_2, \dots, X_n$ are $N$ random variables and that
+Suppose that $X_1, X_2, \dots, X_N$ are $N$ random variables and that
* their marginal distributions are $F_1(x_1), F_2(x_2),\dots, F_N(x_N)$, and
@@ -1251,12 +1265,16 @@ $$
H(x_1,x_2,\dots,x_N) = C(F_1(x_1), F_2(x_2),\dots,F_N(x_N)).
$$
-We can obtain
+If the marginal distributions are continuous, then the copula is unique.
+
+In that case, we can recover it from the marginal inverses:
$$
-C(u_1,u_2,\dots,u_n) = H[F^{-1}_1(u_1),F^{-1}_2(u_2),\dots,F^{-1}_N(u_N)]
+C(u_1,u_2,\dots,u_N) = H(F^{-1}_1(u_1),F^{-1}_2(u_2),\dots,F^{-1}_N(u_N))
$$
+When marginal distributions are not continuous, one uses generalized inverses, and the copula is uniquely determined only on $\textrm{Ran}(F_1)\times \cdots \times \textrm{Ran}(F_N)$.
+
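+A small sketch of this recovery, assuming a bivariate normal $H$ with a hypothetical correlation of $0.5$ and standard normal marginals:
+
+```{code-cell} ipython3
+# recover C(u, v) = H(Φ^{-1}(u), Φ^{-1}(v)) at one point
+ρ_demo = 0.5
+H_biv = stats.multivariate_normal([0, 0], [[1, ρ_demo], [ρ_demo, 1]])
+u, v = 0.3, 0.6
+print(H_biv.cdf([stats.norm.ppf(u), stats.norm.ppf(v)]))
+
+# sanity check: with zero correlation the copula is the independence copula u*v
+H_ind = stats.multivariate_normal([0, 0], np.eye(2))
+print(H_ind.cdf([stats.norm.ppf(u), stats.norm.ppf(v)]), u * v)
+```
+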
In a reverse direction of logic, given univariate **marginal distributions**
$F_1(x_1), F_2(x_2),\dots,F_N(x_N)$ and a copula function $C(\cdot)$, the function $H(x_1,x_2,\dots,x_N) = C(F_1(x_1), F_2(x_2),\dots,F_N(x_N))$ is a **coupling** of $F_1(x_1), F_2(x_2),\dots,F_N(x_N)$.
@@ -1265,9 +1283,11 @@ Thus, for given marginal distributions, we can use a copula function to determi
Copula functions are often used to characterize **dependence** of random variables.
-**Discrete marginal distribution**
+### Bivariate examples with discrete and continuous distributions
+
+#### Discrete marginal distribution
-As mentioned above, for two given marginal distributions there can be more than one coupling.
+As mentioned above, for two given marginal distributions there can be more than one coupling.
For example, consider two random variables $X, Y$ with distributions
@@ -1285,23 +1305,23 @@ For these two random variables there can be more than one coupling.
Let's first generate X and Y.
```{code-cell} ipython3
-# define parameters
-mu = np.array([0.6, 0.4])
-nu = np.array([0.3, 0.7])
+μ = np.array([0.6, 0.4])
+ν = np.array([0.3, 0.7])
# number of draws
draws = 1_000_000
-# generate draws from uniform distribution
-p = np.random.rand(draws)
+# generate independent draws from uniform distribution for X and Y
+p_x = rng.random(draws)
+p_y = rng.random(draws)
-# generate draws of X and Y via uniform distribution
+# generate draws of X and Y via independent uniform draws
x = np.ones(draws)
y = np.ones(draws)
-x[p <= mu[0]] = 0
-x[p > mu[0]] = 1
-y[p <= nu[0]] = 0
-y[p > nu[0]] = 1
+x[p_x <= μ[0]] = 0
+x[p_x > μ[0]] = 1
+y[p_y <= ν[0]] = 0
+y[p_y > ν[0]] = 1
```
```{code-cell} ipython3
@@ -1353,9 +1373,9 @@ f1_cum = np.cumsum(f1)
draws1 = 1_000_000
# generate draws from uniform distribution
-p = np.random.rand(draws1)
+p = rng.random(draws1)
-# generate draws of first copuling via uniform distribution
+# generate draws of first coupling via uniform distribution
c1 = np.vstack([np.ones(draws1), np.ones(draws1)])
# X=0, Y=0
c1[0, p <= f1_cum[0]] = 0
@@ -1428,9 +1448,9 @@ f2_cum = np.cumsum(f2)
draws2 = 1_000_000
# generate draws from uniform distribution
-p = np.random.rand(draws2)
+p = rng.random(draws2)
-# generate draws of first coupling via uniform distribution
+# generate draws of second coupling via uniform distribution
c2 = np.vstack([np.ones(draws2), np.ones(draws2)])
# X=0, Y=0
c2[0, p <= f2_cum[0]] = 0
@@ -1454,7 +1474,7 @@ f2_10 = sum((c2[0, :] == 1)*(c2[1, :] == 0))/draws2
f2_11 = sum((c2[0, :] == 1)*(c2[1, :] == 1))/draws2
# print output of second joint distribution
-print("first joint distribution for c2")
+print("second joint distribution for c2")
c2_mtb = pt.PrettyTable()
c2_mtb.field_names = ['c2_x_value', 'c2_y_value', 'c2_prob']
c2_mtb.add_row([0, 0, f2_00])
@@ -1488,3 +1508,337 @@ print(c2_ymtb)
We have verified that both joint distributions, $c_1$ and $c_2$, have identical marginal distributions of $X$ and $Y$, respectively.
So they are both couplings of $X$ and $Y$.
+
+### Gaussian copula example
+
+A **Gaussian copula** uses the bivariate normal distribution to induce dependence between
+arbitrary marginal distributions.
+
+The construction has three steps:
+
+1. Draw $(Z_1, Z_2)$ from a bivariate standard normal with correlation $\rho$.
+2. Apply the standard normal CDF: $U_k = \Phi(Z_k)$.
+ - The pair $(U_1, U_2)$ has uniform marginals but retains the dependence structure of $(Z_1, Z_2)$ --- this is the copula.
+3. Apply the inverse CDF of any desired marginal: $X_k = F_k^{-1}(U_k)$.
+
+The following code illustrates this with exponential marginals.
+
+```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: gaussian copula with exponential marginals
+ name: fig-gaussian-copula
+---
+
+# Gaussian copula parameters
+ρ_cop = 0.8
+n_cop = 100_000
+
+# Draw from bivariate standard normal with correlation ρ_cop
+z = rng.multivariate_normal(
+ [0, 0], [[1, ρ_cop], [ρ_cop, 1]], n_cop
+)
+
+# Apply normal CDF -> uniform marginals (the copula itself)
+u1 = stats.norm.cdf(z[:, 0])
+u2 = stats.norm.cdf(z[:, 1])
+
+# Apply inverse CDFs of desired marginals (here: Exponential)
+x1 = stats.expon.ppf(u1, scale=1.0) # Exp with mean 1
+x2 = stats.expon.ppf(u2, scale=0.5) # Exp with mean 0.5
+
+fig, axes = plt.subplots(1, 2, figsize=(10, 4))
+axes[0].scatter(u1[:3000], u2[:3000], alpha=0.2, s=2)
+axes[0].set_xlabel('$u_1$')
+axes[0].set_ylabel('$u_2$')
+axes[1].scatter(x1[:3000], x2[:3000], alpha=0.2, s=2)
+axes[1].set_xlabel('$x_1$ (Exp, mean=1)')
+axes[1].set_ylabel('$x_2$ (Exp, mean=0.5)')
+plt.show()
+
+print(f"Sample correlation of (x1, x2): {np.corrcoef(x1, x2)[0, 1]:.3f}")
+print(f"Sample correlation of (u1, u2): {np.corrcoef(u1, u2)[0, 1]:.3f}")
+```
+
+The left panel shows the copula itself -- the dependence structure in uniform coordinates, drawn from a bivariate normal with correlation $\rho = 0.8$.
+
+The right panel shows the same dependence translated to exponential marginals.
+
+Changing $\rho$ controls the strength of dependence while the marginals remain unchanged.
+
+## Exercises
+
+```{exercise}
+:label: prob_matrix_ex1
+
+**Independence Test**
+
+Consider the joint distribution
+
+$$
+F = \begin{bmatrix} 0.3 & 0.2 \\ 0.1 & 0.4 \end{bmatrix}
+$$
+
+where $X \in \{0,1\}$ and $Y \in \{10, 20\}$.
+
+1. Compute the marginal distributions $\mu_i = \text{Prob}\{X=i\}$ and $\nu_j = \text{Prob}\{Y=j\}$.
+
+1. Form the independence matrix $f^{\perp}_{ij} = \mu_i \nu_j$ (the outer product of the two marginal vectors).
+
+1. Compare $F$ with $f^{\perp}$ and determine whether $X$ and $Y$ are independent.
+
+1. Verify your conclusion by computing $\text{Prob}\{X=0|Y=10\}$ and checking whether it equals $\text{Prob}\{X=0\}$.
+```
+
+```{solution-start} prob_matrix_ex1
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+F = np.array([[0.3, 0.2],
+ [0.1, 0.4]])
+
+μ = F.sum(axis=1)
+ν = F.sum(axis=0)
+print("μ (marginal of X):", μ)
+print("ν (marginal of Y):", ν)
+
+F_indep = np.outer(μ, ν)
+print("\nIndependence matrix (outer product):\n", F_indep)
+print("\nActual joint F:\n", F)
+
+print("\nIndependent (F == μ times ν)?", np.allclose(F, F_indep))
+
+prob_X0_given_Y10 = F[0, 0] / ν[0]
+print(f"\nProb(X=0 | Y=10) = {prob_X0_given_Y10:.4f}")
+print(f"Prob(X=0) = {μ[0]:.4f}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: prob_matrix_ex2
+
+**Covariance and Correlation**
+
+Using the same joint distribution $F$ and values $X \in \{0,1\}$, $Y \in \{10, 20\}$ as in Exercise 1:
+
+1. Compute $\mathbb{E}[X]$, $\mathbb{E}[Y]$, and $\mathbb{E}[XY] = \sum_i \sum_j x_i y_j f_{ij}$.
+
+1. Compute $\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$.
+
+1. Compute $\text{Cor}(X,Y) = \text{Cov}(X,Y) / (\sigma_X \sigma_Y)$.
+
+1. Show analytically that $X \perp Y$ implies $\text{Cov}(X,Y) = 0$.
+```
+
+```{solution-start} prob_matrix_ex2
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+xs = np.array([0, 1])
+ys = np.array([10, 20])
+F = np.array([[0.3, 0.2],
+ [0.1, 0.4]])
+
+μ = F.sum(axis=1)
+ν = F.sum(axis=0)
+
+E_X = xs @ μ
+E_Y = ys @ ν
+E_XY = sum(xs[i] * ys[j] * F[i, j] for i in range(2) for j in range(2))
+print(f"E[X] = {E_X}, E[Y] = {E_Y}, E[XY] = {E_XY}")
+
+cov_XY = E_XY - E_X * E_Y
+print(f"Cov(X,Y) = {cov_XY:.4f}")
+
+var_X = ((xs - E_X)**2) @ μ
+var_Y = ((ys - E_Y)**2) @ ν
+cor_XY = cov_XY / np.sqrt(var_X * var_Y)
+print(f"Cor(X,Y) = {cor_XY:.4f}")
+```
+
+For part 4: if $X \perp Y$ then $f_{ij} = \mu_i \nu_j$, so
+
+$$
+\mathbb{E}[XY] = \sum_i \sum_j x_i y_j \mu_i \nu_j
+= \left(\sum_i x_i \mu_i\right)\!\left(\sum_j y_j \nu_j\right)
+= \mathbb{E}[X]\,\mathbb{E}[Y]
+$$
+
+and therefore $\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0$.
+
+```{solution-end}
+```
+
+```{exercise}
+:label: prob_matrix_ex3
+
+**Sum of Two Dice**
+
+Let $X$ and $Y$ be **independent** random variables, each uniformly distributed on $\{1,2,3,4,5,6\}$, and let $Z = X + Y$.
+
+1. Use the convolution formula $h_k = \sum_i f_i g_{k-i}$ to compute the distribution of $Z$.
+
+1. Plot the distribution of $Z$ implied by the formula.
+
+1. Simulate $10^6$ rolls and overlay the empirical histogram on the plot.
+
+1. Compute $\mathbb{E}[Z]$ and $\text{Var}(Z)$ both from the theoretical distribution and from the simulation.
+```
+
+```{solution-start} prob_matrix_ex3
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+f = np.ones(6) / 6
+g = np.ones(6) / 6
+h = [
+    sum(f[i]*g[k-i] for i in range(
+        max(0, k-len(g)+1),  # ensures g_{k-i} exists
+        min(len(f), k+1))    # ensures f_i exists and k-i >= 0
+    )
+    for k in range(len(f) + len(g) - 1)]
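+
+# cross-check: the manual loop matches numpy's full convolution
+assert np.allclose(h, np.convolve(f, g))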
+z_vals = np.arange(2, 13)
+
+n = 1_000_000
+z_sim = rng.integers(1, 7, n) + rng.integers(1, 7, n)
+counts = np.bincount(z_sim, minlength=13)[2:]
+
+fig, ax = plt.subplots()
+ax.bar(z_vals - 0.2, h, 0.4, alpha=0.7, label='Theoretical')
+ax.bar(z_vals + 0.2, counts / n, 0.4, alpha=0.7, label='Empirical')
+ax.set_xlabel('Z = X + Y')
+ax.set_ylabel('Probability')
+ax.legend()
+plt.show()
+
+E_Z = z_vals @ h
+Var_Z = ((z_vals - E_Z)**2) @ h
+print(f"Theory: E[Z] = {E_Z:.2f}, Var(Z) = {Var_Z:.4f}")
+print(f"Simulation: E[Z] = {np.mean(z_sim):.2f}, Var(Z) = {np.var(z_sim):.4f}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: prob_matrix_ex4
+
+**Multi-Step Transition Probabilities**
+
+Consider a two-state Markov chain with transition matrix
+
+$$
+P = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}
+$$
+
+where $p_{ij} = \text{Prob}\{X(t+1)=j \mid X(t)=i\}$.
+
+1. Starting from $\psi_0 = [1, 0]$, compute $\psi_n = \psi_0 P^n$ for $n = 1, 5, 20, 100$.
+
+1. Find the stationary distribution $\psi^*$ satisfying $\psi^* P = \psi^*$ and $\sum_i \psi^*_i = 1$.
+
+1. Verify numerically that $\psi_n \to \psi^*$ as $n$ grows.
+```
+
+```{solution-start} prob_matrix_ex4
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+P = np.array([[0.9, 0.1],
+ [0.2, 0.8]])
+ψ0 = np.array([1.0, 0.0])
+
+for n in [1, 5, 20, 100]:
+ print(f"ψ_{n:3d} = {ψ0 @ np.linalg.matrix_power(P, n)}")
+
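+# stationary distribution: solve ψ(P - I) = 0 together with Σ ψ_i = 1 via least squares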
+A = np.vstack([P.T - np.eye(2), np.ones(2)])
+b = np.array([0.0, 0.0, 1.0])
+ψ_star, *_ = np.linalg.lstsq(A, b, rcond=None)
+print(f"\nStationary distribution: {ψ_star}")
+
+ψ_100 = ψ0 @ np.linalg.matrix_power(P, 100)
+print(f"ψ_100 close to stationary? {np.allclose(ψ_100, ψ_star, atol=1e-6)}")
+```
+
+```{solution-end}
+```
+
+```{exercise}
+:label: prob_matrix_ex5
+
+**Bayes' Law with a Discrete Prior**
+
+A coin has unknown bias $\theta \in \{0.2,\, 0.5,\, 0.8\}$ with prior $\pi = [0.25,\, 0.50,\, 0.25]$.
+
+Assume that, conditional on $\theta$, the coin flips are i.i.d. Bernoulli($\theta$).
+
+1. After observing $k = 7$ heads in $n = 10$ flips, compute the likelihood
+
+ $$
+ \mathcal{L}(\theta \mid \text{data}) = \binom{10}{7}\,\theta^7\,(1-\theta)^3
+ $$
+
+ for each $\theta$.
+
+2. Apply equation {eq}`eq:condprobbayes` to compute the posterior $\pi(\theta \mid \text{data})$.
+
+3. Plot the prior and posterior side by side.
+
+4. Repeat for $k = 3$ heads and describe how the posterior shifts.
+```
+
+```{solution-start} prob_matrix_ex5
+:class: dropdown
+```
+
+Here is one solution:
+
+```{code-cell} ipython3
+from math import comb  # binomial coefficient; added in case it is not imported in the preamble
+
+θ_vals = np.array([0.2, 0.5, 0.8])
+π = np.array([0.25, 0.50, 0.25])
+
+def compute_posterior(k, n, θ_vals, π):
+ likelihood = comb(n, k) * θ_vals**k * (1 - θ_vals)**(n - k)
+ unnorm = likelihood * π
+ return unnorm / unnorm.sum(), likelihood
+
+post7, lik7 = compute_posterior(7, 10, θ_vals, π)
+post3, lik3 = compute_posterior(3, 10, θ_vals, π)
+
+print("k=7: likelihood =", lik7.round(4),
+ " posterior =", post7.round(4))
+print("k=3: likelihood =", lik3.round(4),
+ " posterior =", post3.round(4))
+
+x = np.arange(len(θ_vals))
+w = 0.3
+fig, axes = plt.subplots(1, 2, figsize=(10, 4))
+for ax, post, title in zip(
+ axes, [post7, post3], ['k=7 heads', 'k=3 heads']):
+ ax.bar(x - w/2, π, w, label='Prior', alpha=0.7)
+ ax.bar(x + w/2, post, w, label='Posterior', alpha=0.7)
+ ax.set_xticks(x)
+ ax.set_xticklabels([f'θ={t}' for t in θ_vals])
+ ax.set_ylabel('Probability')
+ ax.set_title(title)
+ ax.legend()
+plt.show()
+```
+
+```{solution-end}
+```