Environments are configured by creating a `config.mjs` or `config.js` file that exposes an object
that satisfies the `EnvironmentConfig` interface. This document covers all options in
`EnvironmentConfig` and what they do.
All of the following properties have to be specified in order for the environment to function.
Human-readable name that is shown in eval reports about this environment.
Unique ID for the environment. If omitted, one is generated from the displayName.
ID of the client-side framework that the environment runs, for example `angular`.
An array defining the ratings that are executed as a part of the evaluation. The ratings determine the score assigned for the test run. Currently, the tool supports the following built-in ratings:
| Rating Name | Description |
|---|---|
| `PerBuildRating` | Assigns a score based on the build result of the generated code, e.g. "Does it build on the first run?" or "Does it build after X repair attempts?" |
| `PerFileRating` | Assigns a score based on the content of individual files generated by the LLM. Can be run either against all file types, by setting the filter to `PerFileRatingContentType.UNKNOWN`, or against specific files. |
| `LLMBasedRating` | Rates the generated code by asking an LLM to assign a score to it, e.g. "Does this app match the specified prompts?" |
Name of the package manager to use to install dependencies for the evaluated code.
Supports `npm`, `pnpm` and `yarn`. Defaults to `npm`.
Relative path to the system instructions that should be passed to the LLM when generating code.
Relative path to the system instructions that should be passed to the LLM when repairing failures.
Configures the prompts that should be evaluated against the environment. Can contain either strings
representing glob patterns that point to text files with the prompt's text
(e.g. `./prompts/**/*.md`), or `MultiStepPrompt` objects (see below).
Prompts can be shared between environments
(e.g. `executablePrompts: ['../some-other-env/prompts/**/*.md']`).
When enabled, the system prompts for this environment won't be included in the final report. This is useful when evaluating confidential code.
Whether to skip installing dependencies during the eval run. This is useful if you've already installed dependencies through something like pnpm workspaces.
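Putting the required properties together, a config file might look something like the sketch below. The key names for the package manager and the two system instruction paths are assumptions based on the descriptions above; check the `EnvironmentConfig` interface for the authoritative names:

```js
// config.mjs — minimal sketch of an environment configuration.
export default {
  displayName: 'My Angular environment',
  clientSideFramework: 'angular',
  // Array of PerBuildRating / PerFileRating / LLMBasedRating instances.
  ratings: myRatings,
  // The key names below are assumptions based on the property descriptions:
  packageManager: 'pnpm',
  generationSystemPrompt: './system-instructions.md',
  repairSystemPrompt: './repair-instructions.md',
  executablePrompts: ['./prompts/**/*.md'],
};
```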
Prompts are typically stored in `.md` files. The tool supports the following template syntax inside
these files in order to augment the prompt and reduce boilerplate:

| Helper / Variable | Description |
|---|---|
| `{{> embed file='../path/to/file.md' }}` | Embeds the content of the specified file in the current one. |
| `{{> contextFiles '**/*.foo' }}` | Specifies files that should be passed to the LLM as context when the prompt is executed. Takes a comma-separated string of glob patterns within the environment's project code, e.g. `{{> contextFiles '**/*.ts, **/*.html' }}` passes all `.ts` and `.html` files as context. |
| `{{CLIENT_SIDE_FRAMEWORK_NAME}}` | Inserts the name of the client-side framework of the current environment. |
| `{{FULL_STACK_FRAMEWORK_NAME}}` | Inserts the name of the full-stack framework of the current environment. |
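For example, a prompt file could combine several of these helpers (the file paths here are illustrative):

```md
{{> embed file='../shared/style-guide.md' }}
{{> contextFiles '**/*.ts, **/*.html' }}

Create a login page with {{CLIENT_SIDE_FRAMEWORK_NAME}} that follows the
style guide above.
```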
If you want to run a set of ratings against a specific prompt, add an object literal to the
`executablePrompts` array instead of a string:

```js
executablePrompts: [
  // Runs only with the environment-level ratings.
  './prompts/foo/*.md',

  // Runs the ratings specific to `contact-form.md`, as well as the environment-level ones.
  {
    path: './prompts/bar/contact-form.md',
    ratings: contactFormSpecificRatings,
  },
];
```

Multistep prompts evaluate workflows composed of one or more steps.
Steps execute one after another inside the same directory, but are rated individually. The tool
takes snapshots after each step and includes them in the final report. You can create a multistep
prompt by passing an instance of the `MultiStepPrompt` class into the `executablePrompts` array,
for example:
```js
executablePrompts: [
  new MultiStepPrompt('./prompts/about-page', {
    'step-1': ratingsForFirstStep,
    'step-2': [...ratingsForFirstStep, ...ratingsForSecondStep],
  }),
];
```

The first parameter is the directory from which to resolve the individual step prompts.
All files in the directory have to be named `step-{number}.md`, for example:

`my-env/prompts/about-page/step-1.md`:

```
Create an "About us" page.
```

`my-env/prompts/about-page/step-2.md`:

```
Add a contact form to the "About us" page.
```

`my-env/prompts/about-page/step-3.md`:

```
Make it so submitting the contact form redirects the user back to the homepage.
```
The second parameter of `MultiStepPrompt` defines ratings that should run only against specific
steps. The key is the name of the step (e.g. `step-2`), while the value is the set of ratings that
should run against it.
These properties aren't required for the environment to run, but can be used to configure it further.
Directory into which the LLM-generated files are written, and in which they are built, executed,
and evaluated. Can contain an entire project or a handful of files to be merged with the
`projectTemplate` (see below).
Used to reduce boilerplate when setting up an environment, `projectTemplate` specifies the
path of a project template to be merged with the files from `sourceDirectory`, creating
the final project structure against which the evaluation runs.
For example, if the config has
`projectTemplate: './templates/angular'` and `sourceDirectory: './project'`,
the eval runner copies the files from `./templates/angular` into the output directory
and then applies the files from `./project` on top of them, merging directories and replacing
overlapping files.
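As an illustration of the merge, with hypothetical file names, given these inputs:

```
templates/angular/       project/
├── package.json         └── src/
├── angular.json             ├── app.ts
└── src/                     └── main.ts
    └── main.ts
```

the output directory would contain `package.json` and `angular.json` from the template, `src/app.ts` from the source directory, and the source directory's `src/main.ts` (replacing the template's copy).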
Name of the full-stack framework that is used in the evaluation, in addition to the
`clientSideFramework`. If omitted, the `fullStackFramework` is set to the same value as
the `clientSideFramework`.
IDs of Model Context Protocol (MCP) servers that are started and exposed to the LLM as a part of the evaluation.
Command used to build the generated code as a part of the evaluation.
Defaults to `<package manager> run build`.
Command used to start a local dev server as a part of the evaluation.
Defaults to `<package manager> run start --port 0`.
Command used to run tests against the generated code. If this property is not provided, tests will not be run. The command should exit with code 0 on success and a non-zero exit code on failure. The output from the command (both stdout and stderr) is captured and used for repair attempts if the tests fail. The test command will time out after 4 minutes.