# OpenAI Compatible API

LoRAX supports OpenAI Chat Completions v1 compatible endpoints that serve as a drop-in replacement for the OpenAI SDK. It supports multi-turn chat conversations while retaining dynamic adapter loading.
## Chat Completions v1
Using the existing OpenAI Python SDK, replace the `base_url` with your LoRAX endpoint with `/v1` appended. The `api_key` can be set to anything, as it is unused. The `model` parameter can be set to the empty string `""` to use the base model, or to any adapter ID on the HuggingFace Hub.
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)
```
### Streaming

The streaming API is supported with the `stream=True` parameter:
```python
messages = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
    stream=True,
)

for message in messages:
    print(message)
```
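Each streamed chunk carries an incremental delta rather than a full message. Reusing the `client` from above, here is a minimal sketch of reassembling the response text, assuming the standard OpenAI SDK chunk shape (`choices[0].delta.content`, which is `None` for chunks without text):

```python
# Reassemble the streamed deltas into the full response text.
# Assumes the standard OpenAI SDK chunk shape: each chunk exposes
# choices[0].delta.content, which is None for chunks without text.
stream = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
    stream=True,
)

response_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        response_text += delta
        print(delta, end="", flush=True)
```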
### REST API
The REST API can be used directly in addition to the Python SDK:
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "alignment-handbook/zephyr-7b-dpo-lora",
        "messages": [
            {
                "role": "system",
                "content": "You are a friendly chatbot who always responds in the style of a pirate"
            },
            {
                "role": "user",
                "content": "How many helicopters can a human eat in one sitting?"
            }
        ],
        "max_tokens": 100
    }'
```
### Chat Templates

Multi-turn chat conversations are supported through HuggingFace chat templates.

If the adapter selected with the `model` parameter has its own tokenizer and chat template, LoRAX will apply the adapter's chat template to the request during inference. If the adapter does not have its own chat template, LoRAX will fall back to the base model's chat template. If neither exists, an error will be raised, as chat templates are required for multi-turn conversations.
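To see what LoRAX constructs server-side, you can render the template locally with the HuggingFace tokenizer's `apply_chat_template` method. This sketch is for inspection only and assumes the adapter repo ships its own tokenizer; if it does not, load the base model's tokenizer instead:

```python
# Render the chat template locally to inspect the prompt that LoRAX
# would construct server-side. Assumes the adapter repo includes a
# tokenizer with a chat template; otherwise use the base model's.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alignment-handbook/zephyr-7b-dpo-lora")
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```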
### Structured Output (JSON)
See here for an example.
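As a rough sketch of what this looks like through the OpenAI SDK, JSON mode is requested via the `response_format` parameter. The `response_format` parameter itself is part of the OpenAI SDK; attaching a JSON schema under a `schema` key is an assumption here, so treat the linked example as authoritative:

```python
# A minimal sketch of JSON mode. response_format is part of the OpenAI
# SDK; passing a JSON schema under the "schema" key is an assumption --
# see the linked example for the exact shape LoRAX expects.
resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {"role": "user", "content": "Describe a pirate as JSON with keys name and ship."},
    ],
    max_tokens=100,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "ship": {"type": "string"}},
            "required": ["name", "ship"],
        },
    },
)
print(resp.choices[0].message.content)
```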
## Completions v1

The legacy Completions v1 API can be used as well. This is useful when the model does not have a chat template, or when you do not wish to interact with the model in a multi-turn conversation.

Note, however, that you will need to provide any template boilerplate as part of the `prompt`: unlike the `v1/chat/completions` API, it will **not** be inserted automatically.

> **Note:** Structured Output (JSON mode) is not supported in the legacy Completions API. Please use the Chat Completions API above instead.
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

adapter_id = "alignment-handbook/zephyr-7b-dpo-lora"

# Any template boilerplate must be included in the prompt itself.
prompt = "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:"

# synchronous completions
completion = client.completions.create(
    model=adapter_id,
    prompt=prompt,
)
print("Completion result:", completion.choices[0].text)

# streaming completions
completion_stream = client.completions.create(
    model=adapter_id,
    prompt=prompt,
    stream=True,
)

for message in completion_stream:
    print("Completion message:", message)
```
### REST API
```bash
curl http://127.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "",
        "prompt": "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:",
        "max_tokens": 100
    }'
```
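The REST endpoints also accept the same streaming flag the SDK sets when `stream=True` is used: adding `"stream": true` to the request body returns the response incrementally as server-sent events. For example:

```bash
curl http://127.0.0.1:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "",
        "prompt": "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:",
        "max_tokens": 100,
        "stream": true
    }'
```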