Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmake: support generator expression in yaml module #80007

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

pillo79
Copy link
Collaborator

@pillo79 pillo79 commented Oct 17, 2024

This PR adds support for generator expressions (genexes) in values and lists to the yaml module, along with associated tests. As discussed in #78727, this is a prerequisite for cleaning up and expanding the LLEXT EDK information.

Since the yaml module will also be used from llext-edk.cmake in script mode, this PR expands the current YAML test suite so that it is executed both in CMake project and script modes.

This new functionality can be seen in operation in PR #83705, which rewrites the EDK to use the YAML import.

Implementation details

Generator expression semantics

Values that contain genexes must be stored in the YAML context passing the new GENEX flag to yaml_set() to enable genex expansion on the whole context.

Generator expressions can only be expanded by CMake after all configuration code has been executed and the final values of the project properties are defined. This is a challenge for current yaml_save(), which is expected to write out the YAML file immediately.
To achieve both goals, on contexts that use generator expressions the YAML file is written twice:

  • a first partial file is written at config time, from inside yaml_save(), excluding all values that contain a generator expression;
  • a fully-formed file is written at build time, as a pre-build step of a specified target.

To enable this two-step build, the TARGET parameter must be passed to the yaml_save() call; otherwise, only the partial file is written and a warning is issued if the file contains genexes.

Lists expanded from a single generator expression

By the time the generator expressions are expanded by CMake, the structure of the YAML file would be already defined, and all items inside a list would get expanded into a single element - definitely not the desired behavior.

To overcome this limitation, in YAML contexts that use generator expressions, lists are not directly expanded into proper JSON lists but stored in their CMake string format with a special marker.
When the yaml_save() function is called, the internal JSON representation is written using file(GENERATE ...) to a temporary file to obtain a seamlessly expanded JSON object. The temporary file is then processed (as a pre-build step of the specified target) by a custom command that reads the intermediate JSON, replaces the marked lists-as-string with the correct list format, and writes the final YAML file.

Review notes

The first 2 commits are preparatory refactors that do not change the functionality of the existing code:

Commits 3 and 4 expand the yaml module tests:

Commits 5 and 6 introduce generator expression logic and tests, and may be best viewed ignoring whitespace due to indentation changes for large blocks of existing code:

Copy link
Collaborator

@tejlmand tejlmand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing some prototyping of this, for sure this can be beneficial.

Haven't been into all details, but tried to read enough details of the code to understand the flow.

Some overall comments.
I think we should avoid both yaml_save() and yaml_generate().
Users should not care when writing the file whether it contains genex'es or not.

And it should be possible to do multiple writes, for example like this:

yaml_set(...)
yaml_set(...)
yaml_save(...)
yaml_set(... GENEX ...)
yaml_save(...)
yaml_set(...)
yaml_save(...)

I believe we can accomplish that by having each save writing the yaml file with only clean content, and then writing the json with the content and the genex'es.

If there are any genex'es, then the file(GENERATE ...) will be used to transform the internal genex json to the file with evaluated expressions.
The pre-build custom command is then used to convert the final json to final yaml file.
As I don't think file(GENERATE ...) allows to generate the same file multiple times, then we probably need to increment, like <name>_<n>.json.
But this is a detail in this context.

Also remember proper DEPENDS on the custom command.

But I like the idea and direction 👍

Comment on lines 420 to 422
message(FATAL_ERROR "YAML context '${ARG_YAML_NAME}' uses generator expressions."
"Use 'yaml_generate()' to write its contents to a file."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really what we want ?

I could see cases where we would like to write parts of a yaml file (such as build_info) during configure time even though we don't have all knowledge yet.
A partial yaml file may be useful in cases where CMake generator fails.

Secondly, the caller of yaml_save() may not know if other parts of the build system has decided to use genex'es, for example a genex could suddenly be used by a hal, Zephyr module, etc and thus only be active when said feature is enabled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to write intermediate, "broken" output files for the various reasons in the review comment. However, I think it would be fine to write to a different file (.partial_N.yaml?) when genexes are detected. This would still provide debugging info, but avoid most of the above breakages.

Copy link
Collaborator

@tejlmand tejlmand Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to write intermediate, "broken" output files for the various reasons in the review comment.

I said nothing about writing broken files. I said partial files.

For example like this:

build_info(zephyr version VALUE 1.2.3)
build_info(foo lib files VALUE $<TARGET_PROPERTY:foo,SOURCES>)
yaml_save(NAME build_info)    # Out file with known content is written immediately
                              # genex'es are evaluated (and written) at generator time.

... more code ...
build_info(foo bar VALUE baz)
yaml_save(NAME build_info)

So during CMake configure and generator time the following happens:

  • First yaml_save() (configure time)
    zephyr:
      version: 1.2.3
    
  • Second yaml_save() (configure time)
    zephyr:
      version: 1.2.3
    foo:
      bar: baz
    
  • Generator time,
    genex from the property evaluation
    zephyr:
      version: 1.2.3
    foo:
      bar: baz
      lib:
        - 'file1.c'
        - 'file2.c'
    

Comment on lines 483 to 484
add_custom_target(${ARG_YAML_NAME}_yaml_filter ALL DEPENDS ${yaml_file}.tmp ${yaml_target})
add_custom_command(TARGET ${yaml_target} PRE_BUILD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like something which can break in case two parts of the build system decides to call yaml_generate() on the same target independent of each other.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I only expected one "save" operation, as it is a target-generating step. What is a possible use case for multiple yaml_generate calls on the same scope?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that you could want to save known data as soon as you have them.
For example, sysbuild could want to save as much info as possible before CMake config step of individual images are executed (and potentially breaks the configure step).

This would provide knowledge to users / IDEs / other tools about configuration files, included images, etc. even if one of those fails.

# YAML needs to happen after cmake has completed processing.
#
# This scripts expects as input:
# - TMP_FILE: the name of the input file, in JSON format, that contains
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want / need a temp file, then let's at least keep the format clear in this script.

That is JSON_FILE instead of TMP_FILE.

Then it's up to the caller of this script to decide if the json file is just temporary or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do. The rationale for TMP here was that the JSON still contains peculiar stuff like lists-as-strings, and so it is not a "final" JSON format.

Copy link
Collaborator Author

@pillo79 pillo79 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the early review @tejlmand! 👍

I think we should avoid [having] both yaml_save() and yaml_generate().
Users should not care when writing the file whether it contains genex'es or not.

I tend to disagree, or more specifically, I think that users must know if they want to support genexes or not. This choice results in very different flows:

  • when no genexes are expected, the YAML may be written immediately and its available even during config time; there is no target associated to the file.
  • when genexes may be present, an intermediate non-YAML file must be written after config is done, and then post-processed during build. If the results are required for further evaluation, targets must depend on the final file.
Config time (yaml_save) Build time (yaml_generate) Target / Deps
No genex full contents (same full contents) meaningless
With genex partial, broken full contents required

Unless we drop the immediate file write in yaml_save, IMO merging them has confusing results:

  • Most importantly, the resulting "common save" would have unspecified semantics - the file would be available immediately or as a build step depending on other code. Having two separate functions makes it explicit what the expected flow is. Note that it is harmless to expect genexes but not actually have them.

  • Secondly, generation requires an associated TARGET and this must be provided to the "common save" [1] when genexes are expected. I could write it anyway and skip the generation code if no target was given, but should I warn the user (noisy in your example!) or not (a "silent" bug)?

  • Finally, the build info YAML is re-loaded from disk to support sysbuild; in this case, a partially written file may be unparsable or introduce unexpected dynamics at best. What if, for example, I read in a genex from another CMake instance?

Let me know your thoughts on these issues.


Note [1]: To avoid that I also tried adding a scoped property early via yaml_create(TARGET ...), but this would require doing the same to yaml_load(), and a yaml_load(TARGET ...) is confusing at best.

Comment on lines 420 to 422
message(FATAL_ERROR "YAML context '${ARG_YAML_NAME}' uses generator expressions."
"Use 'yaml_generate()' to write its contents to a file."
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to write intermediate, "broken" output files for the various reasons in the review comment. However, I think it would be fine to write to a different file (.partial_N.yaml?) when genexes are detected. This would still provide debugging info, but avoid most of the above breakages.

Comment on lines 483 to 484
add_custom_target(${ARG_YAML_NAME}_yaml_filter ALL DEPENDS ${yaml_file}.tmp ${yaml_target})
add_custom_command(TARGET ${yaml_target} PRE_BUILD
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I only expected one "save" operation, as it is a target-generating step. What is a possible use case for multiple yaml_generate calls on the same scope?

# YAML needs to happen after cmake has completed processing.
#
# This scripts expects as input:
# - TMP_FILE: the name of the input file, in JSON format, that contains
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do. The rationale for TMP here was that the JSON still contains peculiar stuff like lists-as-strings, and so it is not a "final" JSON format.

@tejlmand
Copy link
Collaborator

I think that users must know if they want to support genexes or not.

How can the Zephyr top-level CMakeLists.txt here:

yaml_save(NAME build_info)

know if some arbitrary HAL decides to call build_info() that includes a genex ?

And the same go for any other yaml save usage.
Code should be flexible enough that you can just decide to add a genex to your yaml call without having to change some other location where the yaml_save() is called.

@pillo79 pillo79 force-pushed the pr-yaml-genex branch 4 times, most recently from f929631 to 8978910 Compare December 20, 2024 10:52
This introductory commit refactors the `yaml_set` function separating
the bodies of list operations into internal functions while preserving
their original behavior.

The conditions that cause the value to be interpreted as a list are also
verified only once, avoiding multiple checks.

Signed-off-by: Luca Burelli <[email protected]>
This introductory commit removes the ignored 'actual' variable from the
'yaml_set()' calls in the tests to make the code cleaner. Also, a badly
indented block is fixed in the 'message()' macro.

No functional change is introduced by this commit.

Signed-off-by: Luca Burelli <[email protected]>
This test verifies that the APPEND LIST operation in yaml_set() works as
expected.

Signed-off-by: Luca Burelli <[email protected]>
To verify proper functionality, the YAML module test suite is now run
twice: once in the regular CMake project mode and once in the CMake
script mode.

Signed-off-by: Luca Burelli <[email protected]>
This commit adds support for generator expressions in values and lists
to the yaml module.

Generator expressions can only be expanded by CMake after all
configuration code has been executed and the final values of the project
properties are defined. This means that contexts that contain generator
expressions are written twice:

 - the first time, during yaml_save(), a comment with the raw unexpanded
   string is saved instead of the key that uses generator expressions in
   the YAML file;

 - the second time, as a pre-build step of a target, the generator
   expressions are expanded, and the final YAML file is written.

This two-step process also allows to overcome the issue of lists that
are extracted from generator expressions, whose elements would be
expanded into a single string if written directly to the YAML file.
Instead, the lists are stored in their CMake string format with a
special marker, expanded by CMake into a temporary JSON file, and the
conversion to a proper list is performed during the pre-build step.

Note that when generator expressions are used in the context:

 - the GENEX keyword must be provided to yaml_set(), to mark the context
   as containing generator expressions; this is necessary to avoid
   storing the genexes as raw strings in the YAML.

 - the YAML file must be written providing the TARGET parameter to
   yaml_save().  If it is not provided, a warning will be issued and
   only the partial file will be written.

Signed-off-by: Luca Burelli <[email protected]>
Add tests to the YAML suite for generator expressions. The test checks
that generator expressions are handled as expected at different stages:

 - immediately after yaml_save(), the file must contain only non-genex
   entries;
 - at a later time, after the target has been built, the file must
   contain all the expanded genex values.

Signed-off-by: Luca Burelli <[email protected]>
@pillo79
Copy link
Collaborator Author

pillo79 commented Jan 9, 2025

@tejlmand Here's v2, review ready:

  • rebased to take advantage of YAML script mode support
  • dropped yaml_generate() and implemented the required functionality for yaml_save():
    • immediately writes a file with comments in place of genex entries
    • full file is generated as a pre-build step of the specified TARGET
  • added full tests for the new yaml_save() functionality

This code is used in #83705 to rewrite the EDK to read the same data from build_info.yml and zephyr/.config instead of the command line arguments.

@pillo79 pillo79 marked this pull request as ready for review January 9, 2025 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants