ModelSync provides two primary endpoints for generation: completions and chat completions.
Both are OpenAI-compatible and powered by vLLM.
/completions
Best for simple generation tasks. Takes a single text prompt and continues it.
/chat/completions
Better for conversational interactions. Supports multiple messages with roles (user/assistant) and maintains conversation context.
curl -X POST "https://api.modelsync.ai/v1/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_API_KEY" \
-d '{
"model": "Llama-3.3-70B-Instruct",
"prompt": "Write a poem about coding in Python:",
"max_tokens": 400,
"temperature": 0.7,
"repetition_penalty": 1.1
}'
Response:
{
  "id": "cmpl-2h38kn309nvb1",
  "object": "text_completion",
  "created": 1703262150,
  "model": "Llama-3.3-70B-Instruct",
  "choices": [{
    "text": "In lines of indented grace,\nPython slithers into place.\nWith modules, functions, and loops galore,\nBuilding programs we can't ignore.\n\nSimple syntax, readable and clear,\nMaking complex tasks appear near...",
    "index": 0,
    "logprobs": null,
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 400,
    "total_tokens": 408
  }
}
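Because the API is OpenAI-compatible, the same request can be made with the official OpenAI Python client by pointing base_url at the ModelSync endpoint. A minimal sketch, assuming the standard client behavior of passing non-OpenAI parameters (such as repetition_penalty) through extra_body; the API key value is a placeholder:

from openai import OpenAI

# Point the OpenAI client at the ModelSync API (OpenAI-compatible).
client = OpenAI(
    base_url="https://api.modelsync.ai/v1",
    api_key="YOUR_API_KEY",  # placeholder: your ModelSync API key
)

response = client.completions.create(
    model="Llama-3.3-70B-Instruct",
    prompt="Write a poem about coding in Python:",
    max_tokens=400,
    temperature=0.7,
    # Parameters outside the OpenAI schema (e.g. repetition_penalty)
    # can be passed through the client's extra_body argument.
    extra_body={"repetition_penalty": 1.1},
)

print(response.choices[0].text)
print(response.choices[0].finish_reason)  # "length" when max_tokens was reached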
curl -X POST "https://api.modelsync.ai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_API_KEY" \
-d '{
"model": "Llama-3.3-70B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful coding assistant."
},
{
"role": "user",
"content": "Write a Python function to calculate Fibonacci numbers."
}
],
"temperature": 0.7,
"max_tokens": 400,
"repetition_penalty": 1.1
}'
Response:
{
  "id": "chatcmpl-20osd348shx1",
  "object": "chat.completion",
  "created": 1703262150,
  "model": "Llama-3.3-70B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Here's a Python function to calculate Fibonacci numbers recursively:\n\ndef fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n\n# Example usage:\nfor i in range(10):\n    print(f'fibonacci({i}) = {fibonacci(i)}')"
    },
    "logprobs": null,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 33,
    "completion_tokens": 89,
    "total_tokens": 122
  }
}
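The chat endpoint works the same way with the OpenAI Python client. A sketch under the same assumptions as the completions example above:

from openai import OpenAI

client = OpenAI(base_url="https://api.modelsync.ai/v1", api_key="YOUR_API_KEY")

chat = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."},
    ],
    temperature=0.7,
    max_tokens=400,
    extra_body={"repetition_penalty": 1.1},
)

print(chat.choices[0].message.content)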
Request Parameters
prompt
The prompt to generate completions for.
messages
Array of message objects containing:
role: "system", "user", or "assistant"
content: message text
model
ID of the model to use. See the models endpoint to list available models.
max_tokens
Maximum number of completion tokens to generate per output sequence. Defaults to 16.
temperature
Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. Defaults to 1.0.
repetition_penalty
Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values greater than 1 encourage the model to use new tokens, while values less than 1 encourage the model to repeat tokens. Defaults to 1.0.
top_p
Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. Defaults to 1.0.
stream
Whether to stream back partial progress. Defaults to false.
Check out the vLLM documentation for a full list of sampling parameters.
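For example, setting stream to true returns the completion incrementally instead of as a single payload. A sketch with the OpenAI Python client; the prompt and token limit are illustrative:

from openai import OpenAI

client = OpenAI(base_url="https://api.modelsync.ai/v1", api_key="YOUR_API_KEY")

# With stream=True the client yields chunks as tokens are generated.
stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries finish_reason and no content
        print(delta, end="", flush=True)
print()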
When working with models, there are two important token limits to consider:
Context Window
Each model has a maximum context window that limits the combined length of your input and output. Exceeding this limit will result in input truncation, so careful context management is essential.
Generation Length (Output Tokens)
Each request can generate up to 400 tokens. You can set a lower limit using max_tokens if you need shorter responses. When a generation reaches this limit, you can continue the generation using the methods below.
Continuing Long Generations
When a response is cut off (finish_reason = "length"), you can continue the generation in two ways:
For completions: Append the previous response to your prompt.
For chat completions: Keep the full message history and add a new user message with "continue", as in the sketch below.
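A sketch of the chat-completion approach using the OpenAI Python client; the iteration cap and the example prompt are illustrative:

from openai import OpenAI

client = OpenAI(base_url="https://api.modelsync.ai/v1", api_key="YOUR_API_KEY")

messages = [{"role": "user", "content": "Explain how transformers work in detail."}]
parts = []

for _ in range(5):  # illustrative cap on the number of continuation rounds
    response = client.chat.completions.create(
        model="Llama-3.3-70B-Instruct",
        messages=messages,
        max_tokens=400,
    )
    choice = response.choices[0]
    parts.append(choice.message.content)
    if choice.finish_reason != "length":
        break  # the model finished on its own
    # Keep the full history and ask the model to pick up where it left off.
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "continue"})

full_text = "".join(parts)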
Monitor token usage via the usage field in responses to track your token consumption.
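For instance, a quick way to see the cost of a single request (the field names match the response examples above):

from openai import OpenAI

client = OpenAI(base_url="https://api.modelsync.ai/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize what a context window is."}],
    max_tokens=200,
)

# usage mirrors the JSON usage block shown in the examples above.
usage = response.usage
print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens} total={usage.total_tokens}")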