diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index 866573dca9..0f51d3cb60 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -62,6 +62,27 @@ tokenizer.save_pretrained(save_dir)
 
 The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
 
+### Weights compression
+
+For large language models (LLMs), it is often beneficial to quantize only the weights and keep the activations in floating-point precision. This method does not require a calibration dataset. To enable weights compression, set `weights_only=True` when calling the `quantize()` method of `OVQuantizer`:
+
+```python
+from optimum.intel.openvino import OVQuantizer, OVModelForCausalLM
+from transformers import AutoModelForCausalLM
+
+save_dir = "int8_weights_compressed_model"
+model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
+quantizer = OVQuantizer.from_pretrained(model, task="text-generation")
+quantizer.quantize(save_directory=save_dir, weights_only=True)
+```
+
+To load the optimized model for inference:
+
+```python
+optimized_model = OVModelForCausalLM.from_pretrained(save_dir)
+```
+
+Weights compression is supported for both PyTorch and OpenVINO models: the starting model can be an `AutoModelForCausalLM` or an `OVModelForCausalLM` instance.
 
 ## Training-time optimization
 
@@ -221,4 +242,4 @@ text = "He's a dreadful magician."
 
 outputs = cls_pipe(text)
 [{'label': 'NEGATIVE', 'score': 0.9840195178985596}]
-```
\ No newline at end of file
+```
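
As the last added paragraph notes, the starting point can also be an `OVModelForCausalLM`. A minimal sketch of that path, assuming the checkpoint is first exported to the OpenVINO IR with `export=True` (the model ID and save directory below are illustrative):

```python
from optimum.intel.openvino import OVModelForCausalLM, OVQuantizer

save_dir = "int8_weights_compressed_ov_model"
# Export the PyTorch checkpoint to the OpenVINO IR, then compress its weights
model = OVModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", export=True)
quantizer = OVQuantizer.from_pretrained(model, task="text-generation")
quantizer.quantize(save_directory=save_dir, weights_only=True)
```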
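
The compressed model loaded from `save_dir` behaves like any other `OVModelForCausalLM`, so it can be used with a `transformers` pipeline. A brief usage sketch, assuming the tokenizer is fetched from the original checkpoint and the prompt is illustrative:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel.openvino import OVModelForCausalLM

save_dir = "int8_weights_compressed_model"
optimized_model = OVModelForCausalLM.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

# Run text generation on the weight-compressed OpenVINO model
pipe = pipeline("text-generation", model=optimized_model, tokenizer=tokenizer)
print(pipe("OpenVINO is", max_new_tokens=30))
```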