# Adapters

Adapters are small model fragments that can be loaded on top of base models in LoRAX, either during server initialization or at runtime as part of the request parameters.
## Types

### LoRA

LoRA is a popular parameter-efficient fine-tuning method used to improve response quality.
LoRAX can load any LoRA adapter dynamically at runtime per request, and batch many different LoRAs together at once for high throughput.
Usage:

```json
"parameters": {
    "adapter_id": "predibase/conllpp"
}
```
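Putting the snippet above into a complete request body might look like the following sketch. The prompt and the `max_new_tokens` cap are illustrative, and a LoRAX server is assumed to be listening on `localhost:8080`:

```python
import json

# Request body for the /generate endpoint: the prompt plus adapter parameters.
payload = {
    "inputs": "What is the capital of France?",  # illustrative prompt
    "parameters": {
        "adapter_id": "predibase/conllpp",
        "max_new_tokens": 64,  # assumed generation cap, for illustration
    },
}

body = json.dumps(payload)
# Send `body` to a running LoRAX server, e.g.:
#   curl http://127.0.0.1:8080/generate -X POST \
#     -d "$body" -H 'Content-Type: application/json'
```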
### Medusa
Medusa is a speculative decoding method that speeds up next-token generation by attempting to generate more than one token at a time.
LoRAX can load Medusa adapters dynamically at runtime per request provided that the LoRAX server was initialized with a default Medusa adapter.
Usage:

```json
"parameters": {
    "adapter_id": "predibase/Mistral-7B-Instruct-v0.2-magicoder-medusa"
}
```
## Source
You can provide an adapter from the HuggingFace Hub, a local file path, or S3.
Just make sure that the adapter was trained on the same base model used in the deployment. LoRAX only supports one base model at a time, but any number of adapters derived from it!
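One way to sanity-check this before sending a request: PEFT-style adapters record their base model in `adapter_config.json` under `base_model_name_or_path`, so you can compare that field against your deployment's base model. A minimal sketch (the helper function is hypothetical, not part of LoRAX):

```python
import json
from pathlib import Path

def check_base_model(adapter_dir: str, deployment_base: str) -> bool:
    """Return True if the adapter's recorded base model matches the deployment's."""
    config = json.loads(Path(adapter_dir, "adapter_config.json").read_text())
    return config.get("base_model_name_or_path") == deployment_base
```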
### HuggingFace Hub

By default, LoRAX will load adapters from the HuggingFace Hub.
Usage:

```json
"parameters": {
    "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
    "adapter_source": "hub"
}
```
### Predibase
Any adapter hosted in Predibase can be used in LoRAX by setting `adapter_source="pbase"`.

When using Predibase-hosted adapters, the `adapter_id` format is `<model_repo>/<model_version>`. If the `model_version` is omitted, the latest version in the Model Repository will be used.
Usage:

```json
"parameters": {
    "adapter_id": "model_repo/model_version",
    "adapter_source": "pbase"
}
```
### Local

When specifying an adapter in a local path, the `adapter_id` should correspond to the root directory of the adapter containing the following files:

```
root_adapter_path/
    adapter_config.json
    adapter_model.bin
    adapter_model.safetensors
```
The weights must be in either `adapter_model.bin` (pickle) or `adapter_model.safetensors` (safetensors) format. If both are provided, the safetensors file will be used.
See the PEFT library for detailed examples showing how to save adapters in this format.
Usage:

```json
"parameters": {
    "adapter_id": "/data/adapters/vineetsharma--qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
    "adapter_source": "local"
}
```
### S3

Similar to a local path, an S3 path can be provided. Just make sure you have the appropriate environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` set so you can authenticate to AWS.
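For example (placeholder values; set these in the environment where the LoRAX server runs):

```shell
export AWS_ACCESS_KEY_ID=AKIA...        # placeholder
export AWS_SECRET_ACCESS_KEY=...        # placeholder
```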
Usage:

```json
"parameters": {
    "adapter_id": "s3://adapters_bucket/vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
    "adapter_source": "s3"
}
```
## Merging Adapters
Multiple adapters can be mixed and merged together per request to create powerful ensembles of different specialized adapters.
This is particularly useful when you want your LLM to be capable of handling multiple types of tasks based on the user's prompt without requiring them to specify the type of task they wish to perform.
See Merging Adapters for details.
## Private Adapter Repositories
For hosted adapter repositories like HuggingFace Hub and Predibase, you can perform inference using private adapters per request.
Usage:

```json
"parameters": {
    "adapter_id": "my-repo/private-adapter",
    "api_token": "<auth_token>"
}
```
If you prefer not to include the token in the request body, you can include it in the "Authorization" header:
```python
from lorax import Client

headers = {
    "Authorization": f"Bearer {api_token}"
}
client = Client("http://127.0.0.1:8080", headers=headers)
```
```shell
curl http://127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "..."
    }' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer ${API_TOKEN}"
```
The authorization check is performed per request in the background (prior to batching, so as not to slow down inference) every time. Even if the adapter is cached locally or the authorization token has since been invalidated, the check will still be performed and handled appropriately.
For details on generating API tokens, see the HuggingFace Hub and Predibase documentation.
### Fallback to Global HuggingFace Token

In some cases, you may want LoRAX to use the global `HUGGING_FACE_HUB_TOKEN` set in the environment for all adapter requests by default. To enable this, set the following in your environment:
```shell
export LORAX_USE_GLOBAL_HF_TOKEN=1
```
It is recommended to enable this only if you are comfortable with external callers to your server being able to access all of the adapters in your private HF repos.