correct best practices links, be more specific in table descriptions
rbavery committed Dec 13, 2024
1 parent 2c3c44e commit 66ca180
Showing 3 changed files with 47 additions and 42 deletions.
27 changes: 17 additions & 10 deletions README.md
@@ -101,7 +101,7 @@ connectors, please refer to the [STAC Model](./README_STAC_MODEL.md) document.
- [Collection example](examples/collection.json): Shows the basic usage of the extension in a STAC Collection
- [JSON Schema](https://stac-extensions.github.io/mlm/)
- [Changelog](./CHANGELOG.md)
- [Open access paper](https://dl.acm.org/doi/10.1145/3681769.3698586) describing version 1.3.0 of the extension
- [SigSpatial 2024 GeoSearch Workshop presentation](/docs/static/sigspatial_2024_mlm.pdf)

## Item Properties and Collection Fields
@@ -340,13 +340,13 @@ defined at the "Band Object" level, but at the [Model Input](#model-input-object
This is because, in machine learning, it is common to need overall statistics for the dataset used to train the model
to normalize all bands, rather than normalizing the values over a single product. Furthermore, statistics could be
applied differently for distinct [Model Input](#model-input-object) definitions, in order to adjust for intrinsic
properties of the model.

Another distinction is that, depending on the model, statistics could apply to some inputs that have no reference to
any `bands` definition. In such a case, defining statistics under `bands` would not be possible, or would introduce
ambiguous definitions.

Finally, contrary to the "`statistics`" property name employed by [Band Statistics][stac-1.1-stats], MLM employs the
distinct name `value_scaling`, although similar `minimum`, `maximum`, etc. sub-fields are employed.
This is done explicitly to disambiguate "informative" band statistics from "applied scaling operations" required
by the model inputs. This highlights the fact that `value_scaling` are not *necessarily* equal
@@ -449,7 +449,7 @@ Select one option from:
| `scale` | `value` | $data / value$ |
| `processing` | [Processing Expression](#processing-expression) | *according to `processing:expression`* |

When a scaling `type` approach is specified, it is expected that the parameters necessary
to perform their calculation are provided for the corresponding input dimension data.
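For instance, a hypothetical two-band input could declare per-dimension scaling as follows. This is only a sketch: the band values are invented, and it assumes the `min-max` and `z-score` scaling types with `minimum`/`maximum` and `mean`/`stddev` parameters from the full scaling table.

```json
"value_scaling": [
  {"type": "min-max", "minimum": 0, "maximum": 255},
  {"type": "z-score", "mean": 120.5, "stddev": 33.2}
]
```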

If none of the above values applies for a given dimension, `type: null` (literal `null`, not string) should be
@@ -463,7 +463,7 @@ dimensions. In such case, implicit broadcasting of the unique [Value Scaling Obj
performed for all applicable dimensions when running inference with the model.

If a custom scaling operation, or a combination of more complex operations (with or without [Resize](#resize-enum)),
must be defined instead, a [Processing Expression](#processing-expression) reference can be specified in place of
the [Value Scaling Object](#value-scaling-object) of the respective input dimension, as shown below.

```json
@@ -478,7 +478,7 @@ the [Value Scaling Object](#value-scaling-object) of the respective input dimens

For operations such as L1 or L2 normalization, [Processing Expression](#processing-expression) should also be employed.
This is because, depending on the [Model Input](#model-input-object) dimensions and reference data, there is an
ambiguity regarding "how" and "where" such normalization functions must be applied against the input data.
A custom mathematical expression should provide explicitly the data manipulation and normalization strategy.
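As a sketch, an L2 normalization applied across a flattened input could be expressed with a processing-based entry. The `format` and `expression` values here are illustrative assumptions, not prescribed by the specification:

```json
"value_scaling": [
  {
    "type": "processing",
    "format": "python",
    "expression": "data / numpy.linalg.norm(data)"
  }
]
```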

In situations of very complex `value_scaling` operations, which cannot be represented by any of the previous definitions,
@@ -667,8 +667,8 @@ In order to provide more context, the following roles are also recommended were
| href | string | URI to the model artifact. |
| type | string | The media type of the artifact (see [Model Artifact Media-Type](#model-artifact-media-type)). |
| roles | \[string] | **REQUIRED** Specify `mlm:model`. Can include `["mlm:weights", "mlm:checkpoint"]` as applicable. |
| mlm:artifact_type | [Artifact Type](./best-practices.md#framework-specific-artifact-types) | Specifies the kind of model artifact; any string is allowed. Typically related to a particular ML framework. See [Best Practices - Framework Specific Artifact Types](./best-practices.md#framework-specific-artifact-types) for **RECOMMENDED** values. This field is **REQUIRED** if the `mlm:model` role is specified. |
| mlm:compile_method | [Compile Method](#compile-method) \| null | Describes the method used to compile the ML model either when the model is saved or at model runtime prior to inference. |

Recommended Asset `roles` include `mlm:weights` or `mlm:checkpoint` for model weights that need to be loaded by a
model definition and `mlm:compiled` for models that can be loaded directly without an intermediate model definition.
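A minimal sketch of a Model Asset combining these fields is shown next. The `href` is a placeholder, and the media type and `torch.export.save` artifact type are assumptions drawn from the PyTorch-related recommendations in the best-practices document:

```json
"model": {
  "href": "https://example.com/model/model.pt2",
  "type": "application/octet-stream; application=pytorch",
  "roles": ["mlm:model", "mlm:weights"],
  "mlm:artifact_type": "torch.export.save",
  "mlm:compile_method": "aot"
}
```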
@@ -701,12 +701,19 @@ is used for the artifact described by the media-type. However, users need to rem
official. In order to validate the specific framework and artifact type employed by the model, the MLM properties
`mlm:framework` (see [MLM Fields](#item-properties-and-collection-fields)) and
`mlm:artifact_type` (see [Model Asset](#model-asset)) should be employed instead to perform this validation if needed.
See [Best Practices - Framework Specific Artifact Types](./best-practices.md#framework-specific-artifact-types) for
suggested fields for framework-specific artifact types.

[iana-media-type]: https://www.iana.org/assignments/media-types/media-types.xhtml
[pytorch-aot-inductor]: https://pytorch.org/docs/main/torch.compiler_aot_inductor.html

#### Compile Method

| Compile Method | Description |
|:--------------:|-------------|
| `aot` | [Ahead-of-Time Compilation](https://en.wikipedia.org/wiki/Ahead-of-time_compilation). Converts a higher level code description of a model and a model's learned weights to a lower level representation prior to executing the model. This compiled model may be more portable by having fewer runtime dependencies and optimized for specific hardware. |
| `jit` | [Just-in-Time Compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation). Converts a higher level code description of a model and a model's learned weights to a lower level representation while executing the model. JIT provides more flexibility in the optimization approaches that can be applied to a model compared to AOT, but sacrifices portability and performance. |
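To illustrate the distinction, a hypothetical TorchScript asset intended for JIT execution might be described as follows. All values are placeholders, and `torch.jit.save` is an assumed artifact type consistent with the framework-specific naming convention used in the best-practices document:

```json
"model": {
  "href": "https://example.com/model/scripted_model.pt",
  "type": "application/octet-stream; application=pytorch",
  "roles": ["mlm:model", "mlm:weights"],
  "mlm:artifact_type": "torch.jit.save",
  "mlm:compile_method": "jit"
}
```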

### Source Code Asset

| Field Name | Type | Description |
54 changes: 28 additions & 26 deletions best-practices.md
@@ -3,18 +3,20 @@
This document makes a number of recommendations for creating real world ML Model Extensions.
None of them are required to meet the core specification, but following these practices will improve the documentation
of your model and make life easier for client tooling and users. They come about from practical experience of
implementors and introduce a bit more 'constraint' for those who are creating STAC objects representing their
models or creating tools to work with STAC.

- [ML Model Extension Best Practices](#ml-model-extension-best-practices)
  - [Using STAC Common Metadata Fields for the ML Model Extension](#using-stac-common-metadata-fields-for-the-ml-model-extension)
  - [Recommended Extensions to Compose with the ML Model Extension](#recommended-extensions-to-compose-with-the-ml-model-extension)
    - [Processing Extension](#processing-extension)
    - [ML-AOI and Label Extensions](#ml-aoi-and-label-extensions)
    - [Classification Extension](#classification-extension)
    - [Scientific Extension](#scientific-extension)
    - [File Extension](#file-extension)
    - [Example Extension](#example-extension)
    - [Version Extension](#version-extension)
  - [Framework Specific Artifact Types](#framework-specific-artifact-types)

## Using STAC Common Metadata Fields for the ML Model Extension

@@ -68,8 +70,8 @@ information regarding these references, see the [ML-AOI and Label Extensions](#m

### Processing Extension

It is recommended to use at least the `processing:lineage` and `processing:level` fields from
the [Processing Extension](https://github.com/stac-extensions/processing) to make it clear
how [Model Input Objects](./README.md#model-input-object) are processed by the data provider prior to an
inference preprocessing pipeline. This can help users locate the correct version of the dataset used during model
inference or help them reproduce the data processing pipeline.
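A sketch of these fields on an MLM Item's properties follows; the level and lineage values are hypothetical examples, not recommendations:

```json
"properties": {
  "processing:level": "L2",
  "processing:lineage": "Sentinel-2 L2A surface reflectance, cloud-masked and mosaicked prior to inference"
}
```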
@@ -99,7 +101,7 @@ Furthermore, the [`processing:expression`](https://github.com/stac-extensions/pr
should be specified with a reference to the STAC Item employing the MLM extension to provide full context of the source
of the derived product.

A potential representation of a STAC Asset could be as follows:
```json
{
"model-output": {
@@ -186,7 +188,7 @@ leading to a new MLM STAC Item definition (see also [STAC Version Extension](#ve
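The truncated `model-output` asset above might look roughly like this sketch. Every value, including the linked STAC Item URL and the expression format, is a hypothetical illustration of pointing `processing:expression` back at the MLM Item:

```json
"model-output": {
  "href": "https://example.com/predictions.tif",
  "type": "image/tiff; application=geotiff",
  "roles": ["data"],
  "processing:expression": {
    "format": "python",
    "expression": "predict('https://example.com/items/mlm-item.json')"
  }
}
```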

### Classification Extension

Since it is expected that a model will provide some kind of classification values as output, the
[Classification Extension](https://github.com/stac-extensions/classification) can be leveraged inside
MLM definition to indicate which class values can be contained in the resulting output from the model prediction.
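For example, a hypothetical segmentation model output could embed its classes as follows; the class names, values, and task are illustrative only:

```json
"mlm:output": [
  {
    "name": "segmentation",
    "tasks": ["semantic-segmentation"],
    "classification:classes": [
      {"value": 1, "name": "water", "description": "Open water"},
      {"value": 2, "name": "forest", "description": "Tree cover"}
    ]
  }
]
```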

@@ -201,7 +203,7 @@ For more details, see the [Model Output Object](README.md#model-output-object) d

### Scientific Extension

Provided that most models derive from previous scientific work, it is strongly recommended to employ the
[Scientific Extension][stac-ext-sci] to provide references corresponding to the
original source of the model (`sci:doi`, `sci:citation`). This can help users find more information about the model,
its underlying architecture, or ways to improve it by piecing together the related work (`sci:publications`) that
@@ -285,17 +287,17 @@ educational purposes only.
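A minimal sketch of the Scientific Extension fields on an MLM Item; the DOI and citation are placeholders, not real references:

```json
"properties": {
  "sci:doi": "10.0000/example.mlm.model",
  "sci:citation": "Doe, J. (2024). An Example Geospatial ML Model. Example Venue."
}
```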

## Framework Specific Artifact Types

The `mlm:artifact_type` field can be used to clarify how the model was saved, which can help users understand how to
load it or in which runtime contexts it should be used. Applying this artifact type definition should explicitly
restrict its use to a specific runtime. For example, PyTorch offers [various strategies][pytorch-frameworks] for
exporting models, such as Pickle (`.pt`), [TorchScript][pytorch-jit-script], and
[PyTorch Ahead-of-Time Compilation][pytorch-aot-inductor] (`.pt2`). Since each approach is associated with the same
ML framework, the [Model Artifact Media-Type](./README.md#model-artifact-media-type) can be insufficient in this case
to detect which strategy should be used to deploy the model artifact.

The following are some proposed *Artifact Type* values for the Model Asset's
[`mlm:artifact_type` field](./README.md#model-asset). Other names are
permitted, as these values are not validated by the schema. Note that the names are selected using the
framework-specific definitions to help the users understand how the model artifact was created, although these exact
names are not strictly required either.

@@ -306,7 +308,7 @@ names are not strictly required either.
| `torch.export.save` | A model artifact storing an [ExportedProgram][exported-program] obtained by [`torch.export.export`][pytorch-export] (i.e.: `.pt2`). |
| `tf.keras.Model.save` | Saves a [.keras model file][keras-model], a unified zip archive format containing the architecture, weights, optimizer, losses, and metrics. |
| `tf.keras.Model.save_weights` | A [.weights.h5][keras-save-weights] file containing only model weights for use by Tensorflow or Keras. |
| `tf.keras.Model.export` | [TF Saved Model][tf-saved-model] is the [recommended format][tf-keras-recommended] by the Tensorflow team for whole model saving/loading for inference. See the docs for [different save methods][keras-methods] in TF and Keras. |

[exported-program]: https://pytorch.org/docs/main/export.html#serialization
[pytorch-aot-inductor]: https://pytorch.org/docs/main/torch.compiler_aot_inductor.html
Expand All @@ -315,7 +317,7 @@ names are not strictly required either.
[pytorch-jit-script]: https://pytorch.org/docs/stable/jit.html
[pytorch-save]: https://pytorch.org/tutorials/beginner/saving_loading_models.html
[keras-save-weights]: https://keras.io/api/models/model_saving_apis/weights_saving_and_loading/#save_weights-method
[keras-example]: https://keras.io/guides/serialization_and_saving/
[tf-saved-model]: https://keras.io/api/models/model_saving_apis/export/
[tf-keras-recommended]: https://www.tensorflow.org/guide/saved_model#creating_a_savedmodel_from_keras
[keras-methods]: https://keras.io/2.16/api/models/model_saving_apis/
[keras-model]: https://keras.io/api/models/model_saving_apis/model_saving_and_loading/
8 changes: 2 additions & 6 deletions json-schema/schema.json
@@ -356,12 +356,8 @@
"$comment": "Particularity of the 'not/required' approach: they must be tested one by one. Otherwise, it validates that they are all (simultaneously) not present.",
"not": {
"anyOf": [
{"required": ["mlm:artifact_type"]},
{"required": ["mlm:compile_method"]}
]
}
},
