lorax.client
Client Objects
class Client()
Client to make calls to a LoRAX instance
Example:
>>> from lorax import Client

>>> client = Client("http://127.0.0.1:8080")
>>> client.generate("Why is the sky blue?", adapter_id="some/adapter").generated_text
' Rayleigh scattering'

>>> result = ""
>>> for response in client.generate_stream("Why is the sky blue?", adapter_id="some/adapter"):
...     if not response.token.special:
...         result += response.token.text
>>> result
' Rayleigh scattering'
__init__
def __init__(base_url: str,
headers: Optional[Dict[str, str]] = None,
cookies: Optional[Dict[str, str]] = None,
timeout: int = 60)
Arguments:
- base_url (str): LoRAX instance base URL
- headers (Optional[Dict[str, str]]): Additional headers
- cookies (Optional[Dict[str, str]]): Cookies to include in the requests
- timeout (int): Timeout in seconds
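As a minimal construction sketch (the header, cookie, and token values below are placeholders, not part of the reference), a client pointed at a local LoRAX instance with a longer timeout might look like this:

from lorax import Client

# Placeholder values: adjust the URL, header, and cookie to your deployment.
client = Client(
    "http://127.0.0.1:8080",                      # LoRAX instance base URL
    headers={"Authorization": "Bearer <token>"},  # additional headers sent with every request
    cookies={"session": "<session-id>"},          # cookies included in the requests
    timeout=120,                                  # request timeout in seconds
)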
generate
def generate(prompt: str,
adapter_id: Optional[str] = None,
adapter_source: Optional[str] = None,
merged_adapters: Optional[MergedAdapters] = None,
api_token: Optional[str] = None,
do_sample: bool = False,
max_new_tokens: int = 20,
best_of: Optional[int] = None,
repetition_penalty: Optional[float] = None,
return_full_text: bool = False,
seed: Optional[int] = None,
stop_sequences: Optional[List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
truncate: Optional[int] = None,
typical_p: Optional[float] = None,
watermark: bool = False,
response_format: Optional[Union[Dict[str, Any],
ResponseFormat]] = None,
decoder_input_details: bool = False,
details: bool = True) -> Response
Given a prompt, generate the following text
Arguments:
- prompt (str): Input text
- adapter_id (Optional[str]): Adapter ID to apply to the base model for the request
- adapter_source (Optional[str]): Source of the adapter ("hub", "local", "s3", "pbase")
- merged_adapters (Optional[MergedAdapters]): Merged adapters to apply to the base model for the request
- api_token (Optional[str]): API token for accessing private adapters
- do_sample (bool): Activate logits sampling
- max_new_tokens (int): Maximum number of generated tokens
- best_of (int): Generate best_of sequences and return the one with the highest token logprobs
- repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- return_full_text (bool): Whether to prepend the prompt to the generated text
- seed (int): Random sampling seed
- stop_sequences (List[str]): Stop generating tokens if a member of stop_sequences is generated
- temperature (float): The value used to modulate the logits distribution
- top_k (int): The number of highest probability vocabulary tokens to keep for top-k filtering
- top_p (float): If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation
- truncate (int): Truncate input tokens to the given size
- typical_p (float): Typical decoding mass. See Typical Decoding for Natural Language Generation for more information
- watermark (bool): Watermarking with A Watermark for Large Language Models
- response_format (Optional[Union[Dict[str, Any], ResponseFormat]]): Optional specification of a format to impose upon the generated text, e.g.: { "type": "json_object", "schema": { "type": "string", "title": "response" } } (see the sketch below)
- decoder_input_details (bool): Return the decoder input token logprobs and ids
- details (bool): Return the token logprobs and ids for generated tokens
Returns:
Response: generated response
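As an illustrative sketch (the adapter ID and JSON schema are placeholders, shown only to exercise the parameters above), a sampled call that constrains the output to a JSON object could look like this:

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Placeholder adapter ID and schema, for illustration only.
response = client.generate(
    "Classify the sentiment of: 'I love this movie!'",
    adapter_id="some/adapter",
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=32,
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"sentiment": {"type": "string"}},
            "required": ["sentiment"],
        },
    },
)

print(response.generated_text)          # the generated text
print(response.details.finish_reason)   # generation details are included since details=True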
generate_stream
def generate_stream(prompt: str,
adapter_id: Optional[str] = None,
adapter_source: Optional[str] = None,
merged_adapters: Optional[MergedAdapters] = None,
api_token: Optional[str] = None,
do_sample: bool = False,
max_new_tokens: int = 20,
repetition_penalty: Optional[float] = None,
return_full_text: bool = False,
seed: Optional[int] = None,
stop_sequences: Optional[List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
truncate: Optional[int] = None,
typical_p: Optional[float] = None,
watermark: bool = False,
response_format: Optional[Union[Dict[str, Any],
ResponseFormat]] = None,
details: bool = True) -> Iterator[StreamResponse]
Given a prompt, generate the following stream of tokens
Arguments:
- prompt (str): Input text
- adapter_id (Optional[str]): Adapter ID to apply to the base model for the request
- adapter_source (Optional[str]): Source of the adapter ("hub", "local", "s3")
- merged_adapters (Optional[MergedAdapters]): Merged adapters to apply to the base model for the request
- api_token (Optional[str]): API token for accessing private adapters
- do_sample (bool): Activate logits sampling
- max_new_tokens (int): Maximum number of generated tokens
- repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- return_full_text (bool): Whether to prepend the prompt to the generated text
- seed (int): Random sampling seed
- stop_sequences (List[str]): Stop generating tokens if a member of stop_sequences is generated
- temperature (float): The value used to modulate the logits distribution
- top_k (int): The number of highest probability vocabulary tokens to keep for top-k filtering
- top_p (float): If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation
- truncate (int): Truncate input tokens to the given size
- typical_p (float): Typical decoding mass. See Typical Decoding for Natural Language Generation for more information
- watermark (bool): Watermarking with A Watermark for Large Language Models
- response_format (Optional[Union[Dict[str, Any], ResponseFormat]]): Optional specification of a format to impose upon the generated text, e.g.: { "type": "json_object", "schema": { "type": "string", "title": "response" } }
- details (bool): Return the token logprobs and ids for generated tokens
Returns:
Iterator[StreamResponse]: stream of generated tokens
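A minimal streaming sketch (prompt and adapter ID are placeholders): tokens arrive one at a time, and with details=True the final StreamResponse also carries the generation details.

from lorax import Client

client = Client("http://127.0.0.1:8080")

text = ""
for response in client.generate_stream("Why is the sky blue?", adapter_id="some/adapter"):
    if not response.token.special:       # skip special tokens such as EOS
        text += response.token.text
    if response.details is not None:     # only the last chunk carries details
        print(response.details.finish_reason)

print(text)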
AsyncClient Objects
class AsyncClient()
Asynchronous Client to make calls to a LoRAX instance
Example:
>>> from lorax import AsyncClient

>>> client = AsyncClient("http://127.0.0.1:8080")
>>> response = await client.generate("Why is the sky blue?", adapter_id="some/adapter")
>>> response.generated_text
' Rayleigh scattering'

>>> result = ""
>>> async for response in client.generate_stream("Why is the sky blue?", adapter_id="some/adapter"):
...     if not response.token.special:
...         result += response.token.text
>>> result
' Rayleigh scattering'
__init__
def __init__(base_url: str,
headers: Optional[Dict[str, str]] = None,
cookies: Optional[Dict[str, str]] = None,
timeout: int = 60)
Arguments:
- base_url (str): LoRAX instance base URL
- headers (Optional[Dict[str, str]]): Additional headers
- cookies (Optional[Dict[str, str]]): Cookies to include in the requests
- timeout (int): Timeout in seconds
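A small sketch of driving the asynchronous client from asyncio (the base URL and header value are placeholders):

import asyncio

from lorax import AsyncClient

async def main():
    # Placeholder base URL and header; adjust to your deployment.
    client = AsyncClient(
        "http://127.0.0.1:8080",
        headers={"Authorization": "Bearer <token>"},
        timeout=120,
    )
    response = await client.generate("Why is the sky blue?")
    print(response.generated_text)

asyncio.run(main())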
generate
async def generate(prompt: str,
adapter_id: Optional[str] = None,
adapter_source: Optional[str] = None,
merged_adapters: Optional[MergedAdapters] = None,
api_token: Optional[str] = None,
do_sample: bool = False,
max_new_tokens: int = 20,
best_of: Optional[int] = None,
repetition_penalty: Optional[float] = None,
return_full_text: bool = False,
seed: Optional[int] = None,
stop_sequences: Optional[List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
truncate: Optional[int] = None,
typical_p: Optional[float] = None,
watermark: bool = False,
response_format: Optional[Union[Dict[str, Any],
ResponseFormat]] = None,
decoder_input_details: bool = False,
details: bool = True) -> Response
Given a prompt, generate the following text asynchronously
Arguments:
- prompt (str): Input text
- adapter_id (Optional[str]): Adapter ID to apply to the base model for the request
- adapter_source (Optional[str]): Source of the adapter ("hub", "local", "s3")
- merged_adapters (Optional[MergedAdapters]): Merged adapters to apply to the base model for the request
- api_token (Optional[str]): API token for accessing private adapters
- do_sample (bool): Activate logits sampling
- max_new_tokens (int): Maximum number of generated tokens
- best_of (int): Generate best_of sequences and return the one with the highest token logprobs
- repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- return_full_text (bool): Whether to prepend the prompt to the generated text
- seed (int): Random sampling seed
- stop_sequences (List[str]): Stop generating tokens if a member of stop_sequences is generated
- temperature (float): The value used to modulate the logits distribution
- top_k (int): The number of highest probability vocabulary tokens to keep for top-k filtering
- top_p (float): If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation
- truncate (int): Truncate input tokens to the given size
- typical_p (float): Typical decoding mass. See Typical Decoding for Natural Language Generation for more information
- watermark (bool): Watermarking with A Watermark for Large Language Models
- response_format (Optional[Union[Dict[str, Any], ResponseFormat]]): Optional specification of a format to impose upon the generated text, e.g.: { "type": "json_object", "schema": { "type": "string", "title": "response" } }
- decoder_input_details (bool): Return the decoder input token logprobs and ids
- details (bool): Return the token logprobs and ids for generated tokens
Returns:
Response: generated response
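An illustrative sketch (the adapter ID is a placeholder) of issuing several generate calls concurrently with asyncio.gather, which is the main reason to prefer AsyncClient over Client:

import asyncio

from lorax import AsyncClient

async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    prompts = ["Why is the sky blue?", "Why is the grass green?"]

    # Fire both requests concurrently; each call resolves to a Response.
    responses = await asyncio.gather(
        *(client.generate(p, adapter_id="some/adapter", max_new_tokens=32) for p in prompts)
    )
    for prompt, response in zip(prompts, responses):
        print(prompt, "->", response.generated_text)

asyncio.run(main())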
generate_stream
async def generate_stream(
prompt: str,
adapter_id: Optional[str] = None,
adapter_source: Optional[str] = None,
merged_adapters: Optional[MergedAdapters] = None,
api_token: Optional[str] = None,
do_sample: bool = False,
max_new_tokens: int = 20,
repetition_penalty: Optional[float] = None,
return_full_text: bool = False,
seed: Optional[int] = None,
stop_sequences: Optional[List[str]] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
truncate: Optional[int] = None,
typical_p: Optional[float] = None,
watermark: bool = False,
response_format: Optional[Union[Dict[str, Any],
ResponseFormat]] = None,
details: bool = True) -> AsyncIterator[StreamResponse]
Given a prompt, generate the following stream of tokens asynchronously
Arguments:
- prompt (str): Input text
- adapter_id (Optional[str]): Adapter ID to apply to the base model for the request
- adapter_source (Optional[str]): Source of the adapter ("hub", "local", "s3")
- merged_adapters (Optional[MergedAdapters]): Merged adapters to apply to the base model for the request
- api_token (Optional[str]): API token for accessing private adapters
- do_sample (bool): Activate logits sampling
- max_new_tokens (int): Maximum number of generated tokens
- repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- return_full_text (bool): Whether to prepend the prompt to the generated text
- seed (int): Random sampling seed
- stop_sequences (List[str]): Stop generating tokens if a member of stop_sequences is generated
- temperature (float): The value used to modulate the logits distribution
- top_k (int): The number of highest probability vocabulary tokens to keep for top-k filtering
- top_p (float): If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation
- truncate (int): Truncate input tokens to the given size
- typical_p (float): Typical decoding mass. See Typical Decoding for Natural Language Generation for more information
- watermark (bool): Watermarking with A Watermark for Large Language Models
- response_format (Optional[Union[Dict[str, Any], ResponseFormat]]): Optional specification of a format to impose upon the generated text, e.g.: { "type": "json_object", "schema": { "type": "string", "title": "response" } }
- details (bool): Return the token logprobs and ids for generated tokens
Returns:
AsyncIterator[StreamResponse]: stream of generated tokens
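A minimal asynchronous streaming sketch (prompt and adapter ID are placeholders), mirroring the synchronous generate_stream example above:

import asyncio

from lorax import AsyncClient

async def main():
    client = AsyncClient("http://127.0.0.1:8080")

    text = ""
    async for response in client.generate_stream("Why is the sky blue?", adapter_id="some/adapter"):
        if not response.token.special:   # skip special tokens such as EOS
            text += response.token.text
    print(text)

asyncio.run(main())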