
use case based cookbooks#461

Open
KarthikAvinashFI wants to merge 52 commits into astro from feature/th-3418-use-case-based-cookbooks

Conversation

@KarthikAvinashFI
Contributor

Pull Request

Description

Describe the changes in this pull request:

  • What feature/bug does this PR address?
  • Provide any relevant links or screenshots.

Checklist

  • Code compiles correctly.
  • Created/updated tests.
  • Linting and formatting applied.
  • Documentation updated.

Related Issues

Closes #<issue_number>

@linear

linear bot commented Mar 11, 2026

@entelligence-ai-pr-reviews

⚠️ Trial Period Expired ⚠️

Your trial period has expired. To continue using this feature, please upgrade to a paid plan here or book a time to chat here.

@KarthikAvinashFI changed the title from "initial commit" to "use case based cookbooks" on Mar 11, 2026
KarthikAvinashFI and others added 25 commits March 23, 2026 20:29
- text_to_sql passes all 5 cases (doesn't catch subtle logic error)
- Updated similarity scores to match real values (0.95, 0.87, 0.58, 0.54)
- Updated narrative: multiple layers needed since intent validation alone misses bugs
- Removed redundant paragraph in execution testing section
- Updated decision matrix to gate on ground_truth_match + execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
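The text-to-sql commit gates passing on ground-truth similarity plus execution, since textual similarity alone misses subtle logic errors. A minimal sketch of that idea, using stdlib `difflib` rather than the cookbook's actual FutureAGI eval (function names and the 0.8 threshold here are illustrative assumptions):

```python
from difflib import SequenceMatcher

def sql_similarity(generated: str, ground_truth: str) -> float:
    """Crude textual similarity between two SQL strings (0.0 to 1.0)."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(generated), norm(ground_truth)).ratio()

def passes(generated: str, ground_truth: str, threshold: float = 0.8) -> bool:
    # Gate on ground-truth match; a real pipeline would also execute
    # both queries and compare result sets, since textual similarity
    # alone misses subtle logic errors (e.g. > vs >=).
    return sql_similarity(generated, ground_truth) >= threshold
```

Note how a query that swaps `>` for `>=` still scores near 1.0 textually, which is exactly why the decision matrix adds an execution gate on top.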
- Replace broken completeness eval (SDK class lookup bug) with working scanners
- Replace duplicate answer_relevancy with threshold tuning
- Use SecretsScanner + InvisibleCharScanner (local) instead of PIIScanner + ToxicityScanner (broken EvalDelegate 400 errors)
- All sample outputs match real notebook results
- Explain faithfulness catches pricing bug, answer_relevancy local model limitation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
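The swap to local SecretsScanner + InvisibleCharScanner can be pictured with a small stdlib-only sketch. This is a hypothetical illustration of what such local scanners check for, not the FutureAGI SDK's actual API (the pattern list and function names are assumptions):

```python
import re
import unicodedata

# Illustrative patterns only; a real scanner ships a much larger set.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9-]{20,}"),  # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]

def scan_secrets(text: str) -> bool:
    """Return True if the text appears to contain a hardcoded secret."""
    return any(p.search(text) for p in SECRET_PATTERNS)

def scan_invisible_chars(text: str) -> bool:
    """Return True if the text contains zero-width/format characters (Unicode Cf)."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)
```

Because both checks run locally, they avoid the remote EvalDelegate 400 errors the commit mentions for the hosted scanners.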
- security -> prompt_injection, content_moderation -> toxicity
- Remove stale sample outputs, use generic truncated examples
- Update quality score prose to not reference exact scores
- Fix factual_accuracy -> context_adherence
- All code and prose consistent between MDX and notebook
- Remove hardcoded score ranges from prose
- Fix wrong character count (16 -> 15) and ratio (133% -> generic)
- Remove duplicate legal disclaimer paragraph
- Generic interpretive prose instead of specific run observations
- text-to-sql: "All five cases pass" -> "In our test run, all five cases pass"
- coding-agent-eval: "All six scenarios pass" -> conditional language
coding-agent-eval:
- Fix fact_result/fact_score -> adh_result/adh_score in MDX Step 3
- Remove duplicate paragraph in Step 4

translation-eval:
- Remove exact 125%/250%/30chars from prose
- Remove duplicate paragraphs in Steps 4 and 5
- Fix "130% threshold" -> "per-string-type threshold"
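The move from a single "130% threshold" to a per-string-type threshold could look like the sketch below. The threshold values and type names are assumptions for illustration, not the cookbook's actual configuration:

```python
# Hypothetical per-string-type expansion limits: UI buttons tolerate
# little growth, body text tolerates more.
THRESHOLDS = {"button": 1.3, "paragraph": 2.0}
DEFAULT_THRESHOLD = 1.5

def length_ratio_ok(source: str, translated: str, string_type: str) -> bool:
    """Flag translations whose length expansion exceeds the per-type limit."""
    if not source:
        return True
    ratio = len(translated) / len(source)
    return ratio <= THRESHOLDS.get(string_type, DEFAULT_THRESHOLD)
```

Keying the limit on string type catches a German button label that triples in length without rejecting legitimately longer translated paragraphs.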
- Fix deprecated metric names: security -> prompt_injection, content_moderation -> toxicity
- Red-teaming: naive v1 prompt (role only), is_pass fix, synthetic data narrative, EduBright framing
- Compliance: toxicity instead of content_moderation in code and prose
- All sample outputs and prose references updated
…optimization-loop

- compliance-hipaa-gdpr: fix INPUT/OUTPUT_RULES alignment
- domain-hallucination-detection: real classification results, turing_small fix
- end-to-end-agent-testing: critical analysis, FMA, optimization trials
- red-teaming-llm: real Protect results (7/10 blocked), RT-007 fix narrative
- Remove simulation-optimization-loop (merged into end-to-end)
- Update navigation to remove deleted page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace sk-proj-* pattern with your-openai-api-key placeholder in the
hardcoded_secret test snippet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep both copy-button script and FastNav component.
…e prod-quality-monitoring

- All 10 use-case intros now explicitly name the FutureAGI features used
  (Simulate, Evals, Optimize, Protect, Observe, Agent Compass, Knowledge Base,
  Prompt Management, Experimentation, Datasets, AutoEvalPipeline)
- production-quality-monitoring: split Step 1 into Step 1 (Define agent)
  and Step 2 (Trace every call), renumber subsequent steps. Replace old
  trace-spans video with three screenshots that show the trace detail,
  the eval columns in the trace table, and the populated Evals tab
- end-to-end-agent-testing: bump default to 100 conversations in step
  title and dashboard config, add scale-flex paragraph, polish duplicate
  Chat Details sentence with bolded eval names
- coding-agent-eval: replace sk-proj fake key in test data with placeholder
… cookbooks

secure-ai-evals-guardrails:
- Update intro to name FutureAGI Protect and Evals as bold proper nouns
- Add "before guardrails" demo in Step 1 showing the chatbot failing at
  prompt injection and PII leakage before any enforcement is added
- Add cross-links to end-to-end-agent-testing and production-quality-monitoring
  cookbooks in Steps 5 and 6
- Replace Explore further cards with valid sibling + quickstart links

domain-hallucination-detection:
- Add new Step 1 "Meet the chatbot you are evaluating" introducing the
  MediSafe pharma chatbot agent and the three hallucination patterns
- Reframe KB step intro to explain how Knowledge Base enables grounded
  evaluation (cross-references responses against source documents)
- Reframe test cases as real production interactions (not hypothetical),
  with cross-link to Simulate for generating them at scale
- Add cross-link to end-to-end-agent-testing in intro for readers who
  haven't built their agent yet
…nt fixes

Apply reviewer feedback across all 10 remaining use-case cookbooks:

- Intros now name exact FutureAGI features with bold proper nouns
  (Protect metrics by canonical name, Evals with evaluate() vs Evaluator
  distinction, Knowledge Base with indexing details, Simulate with
  scenario types, Prompt Management with version/label system)
- Cross-links added at natural points in narrative (not as lists):
  links to End-to-End Agent Testing, Production Quality Monitoring,
  Secure AI Evals, Protect Guardrails, Custom Eval Metrics, etc.
- Explore further CardGroups replaced with valid links using
  confirmed sidebar icons (flask, gauge, shield, zap, rocket, etc.)
- No fabricated analysis or data changes
- No em-dashes