Skip to content

Fix EFA field naming#392

Open
FarhanTejani wants to merge 1 commit intoaws:mainfrom
FarhanTejani:fix/node-count-resource-fields
Open

Fix EFA field naming#392
FarhanTejani wants to merge 1 commit intoaws:mainfrom
FarhanTejani:fix/node-count-resource-fields

Conversation

@FarhanTejani
Copy link
Member

@FarhanTejani FarhanTejani commented Mar 18, 2026

What's changing and why?

Closes #306

The v1.1 schema and model defined EFA fields as efa_interfaces/efa_interfaces_limit, but the jinja template used efa/efa_limit. This caused:

  • efa in config YAML to be rejected by validation ("Extra inputs are not permitted")
  • efa_interfaces in config YAML to pass validation but not render in the template

Standardized on efa/efa_limit to match the naming pattern of other resource fields (accelerators, vcpu, memory) and the Kubernetes resource key (vpc.amazonaws.com/efa).

Before/After UX

Before:

Specifying efa in config.yaml fails validation:

efa: 4
# Config validation errors:
#   efa: Extra inputs are not permitted

Specifying efa_interfaces passes validation but is silently ignored in the rendered YAML.

After:

EFA renders correctly in the K8s YAML:

efa: 4
efa_limit: 4
resources:
  requests:
    vpc.amazonaws.com/efa: 4
  limits:
    vpc.amazonaws.com/efa: 4

How was this change tested?

  • All unit tests pass
  • Added jinja template rendering tests to catch schema / template field name mismatches

Are unit tests added?

Yes. Added TestJinjaTemplateRendering with two tests that render the jinja template directly and verify all resource fields (accelerators, vcpu, memory, efa, accelerator partitions, etc.) appear correctly in the output. This catches the class of bug where schema field names drift from template variable names.

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 18, 2026 04:59
return {k: v for k, v in kwargs.items() if v is not None}

# Build resources
if self.instance_type is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add this:

   if self.accelerators or self.accelerators_limit:
        raise ValueError("instance_type is required when specifying accelerator resources")

This should be part of PR #393 since it is fixing issue #317

@mollyheamazon mollyheamazon self-requested a review March 19, 2026 18:36
Copy link
Collaborator

@mollyheamazon mollyheamazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The efa field naming change looks good.
The other fix in the PR is that it removes the node_count vs resources mutual exclusivity check. This was discussed in thread link — please confirm with the stakeholders to align with this approach given the quota implications, or if you'd prefer we introduce --num-replicas instead.

@FarhanTejani FarhanTejani changed the title Fix node_count resource fields mutual exclusivity and EFA field naming Fix EFA field naming Mar 26, 2026
@FarhanTejani
Copy link
Member Author

Scoped this PR down to just the EFA field rename (#306) and template rendering tests. I'll open a separate PR for the node_count work after syncing with the stakeholders.

@FarhanTejani FarhanTejani force-pushed the fix/node-count-resource-fields branch from 21382e9 to 80b5443 Compare March 26, 2026 02:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HyperPod PyTorch job does not request EFA resources

2 participants