Generate Content

ModelSync provides two primary endpoints for generation: completions and chat completions.

Both are OpenAI-compatible and powered by vLLM.
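
Because the endpoints follow the OpenAI API schema, the examples below use curl, but the official OpenAI Python client can also be pointed at the ModelSync base URL. The following is a minimal sketch of that pattern, not an official SDK integration; the MODELSYNC_API_KEY environment variable name is only an illustration.

# Minimal sketch: calling ModelSync through the OpenAI Python client.
# Assumes `pip install openai` and an API key stored in MODELSYNC_API_KEY (illustrative name).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelsync.ai/v1",      # OpenAI-compatible base URL from the examples below
    api_key=os.environ["MODELSYNC_API_KEY"],     # your ModelSync API key
)

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)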

POST /completions

Best for simple generation tasks. Takes a single text prompt and continues it.

POST /chat/completions

Better for conversational interactions. Supports multiple messages with roles (user/assistant) and maintains conversation context.

📝 Practical Examples

Completion Example

curl -X POST "https://api.modelsync.ai/v1/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $YOUR_API_KEY" \ -d '{ "model": "Llama-3.3-70B-Instruct", "prompt": "Write a poem about coding in Python:", "max_tokens": 400, "temperature": 0.7, "repetition_penalty": 1.1 }'

Response:

{ "id": "cmpl-2h38kn309nvb1", "object": "text_completion", "created": 1703262150, "model": "Llama-3.3-70B-Instruct", "choices": [{ "text": "In lines of indented grace,\nPython slithers into place.\nWith modules, functions, and loops galore,\nBuilding programs we can't ignore.\n\nSimple syntax, readable and clear,\nMaking complex tasks appear near...", "index": 0, "logprobs": null, "finish_reason": "length" }], "usage": { "prompt_tokens": 8, "completion_tokens": 400, "total_tokens": 408 } }

Chat Completion Example

curl -X POST "https://api.modelsync.ai/v1/chat/completions" \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $YOUR_API_KEY" \ -d '{ "model": "Llama-3.3-70B-Instruct", "messages": [ { "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "Write a Python function to calculate Fibonacci numbers." } ], "temperature": 0.7, "max_tokens": 400, "repetition_penalty": 1.1 }'

Response:

{ "id": "chatcmpl-20osd348shx1", "object": "chat.completion", "created": 1703262150, "model": "Llama-3.3-70B-Instruct", "choices": [{ "message": { "role": "assistant", "content": "Here's a Python function to calculate Fibonacci numbers recursively:\n\ndef fibonacci(n):\nif n <= 1:\nreturn n\nreturn fibonacci(n-1) + fibonacci(n-2)\n\n# Example usage:\nfor i in range(10):\nprint(f'fibonacci({i}) = {fibonacci(i)}')" }, "logprobs": null, "finish_reason": "stop" }], "usage": { "prompt_tokens": 33, "completion_tokens": 89, "total_tokens": 122 } }

⚙️ Request Parameters

Required Parameters for /completions:

prompt

The prompt to generate completions for.

Required Parameters for /chat/completions:

messages

Array of message objects containing:

  • role: "system", "user", or "assistant"
  • content: message text

Required Parameters for both:

model

ID of the model to use. See the models endpoint to list available models.
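
As a quick sketch (assuming the models endpoint is exposed at the OpenAI-style /models route under the same base URL), you can enumerate available model IDs with the OpenAI Python client:

# Sketch: listing available model IDs (assumes an OpenAI-style /models route).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelsync.ai/v1",
    api_key=os.environ["MODELSYNC_API_KEY"],   # illustrative env var name
)

for model in client.models.list():
    print(model.id)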

Optional Parameters for both:

max_tokens

Maximum number of completion tokens to generate per output sequence. Defaults to 16.

temperature

Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. Defaults to 1.0.

repetition_penalty

Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values greater than 1 encourage the model to use new tokens, while values less than 1 encourage the model to repeat tokens. Defaults to 1.0.

top_p

Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. Defaults to 1.0.

stream

Whether to stream back partial progress. Defaults to false.
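
Streaming changes how you read the response: instead of a single JSON object, the server returns incremental chunks. Here is a minimal sketch using the OpenAI Python client against the same base URL (the environment variable name is illustrative):

# Sketch: streaming a chat completion from the OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelsync.ai/v1",
    api_key=os.environ["MODELSYNC_API_KEY"],   # illustrative env var name
)

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain Python list comprehensions briefly."}],
    max_tokens=200,
    temperature=0.7,
    stream=True,                               # server sends incremental chunks
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                                  # some chunks carry no text (e.g. the final one)
        print(delta, end="", flush=True)
print()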

Check out the vLLM documentation for a full list of sampling parameters.

📏 Managing Context Length

When working with models, there are two important token limits to consider:

Context Window

Each model has a maximum context window that limits the combined length of your input and output. Exceeding this limit will result in input truncation, so careful context management is essential.

Generation Length (Output Tokens)

Each request can generate up to 400 tokens. You can set a lower limit using max_tokens if you need shorter responses. When a generation reaches this limit, you can continue the generation using the methods below.

Continuing Long Generations

When a response is cut off (finish_reason = "length"), you can continue the generation in two ways:

For completions: Append the previous response to your prompt.

For chat completions: Keep the full message history and add a new user message with "continue".

Monitor token usage via the usage field in responses to track your token consumption.
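
Below is a rough sketch of the chat-completion approach using the OpenAI Python client: it keeps the full message history, appends a "continue" turn whenever finish_reason is "length", and reads per-request token counts from the usage field. The client setup and environment variable name are illustrative, as in the earlier sketches.

# Sketch: continuing a chat completion until it no longer stops on the length limit.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelsync.ai/v1",
    api_key=os.environ["MODELSYNC_API_KEY"],   # illustrative env var name
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a detailed tutorial on Python decorators."},
]
parts = []

for _ in range(5):                             # cap the number of continuation rounds
    response = client.chat.completions.create(
        model="Llama-3.3-70B-Instruct",
        messages=messages,
        max_tokens=400,
        temperature=0.7,
    )
    choice = response.choices[0]
    parts.append(choice.message.content)
    print("tokens this request:", response.usage.total_tokens)

    if choice.finish_reason != "length":       # finished normally, stop looping
        break

    # Keep the full history and ask the model to continue where it left off.
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "continue"})

full_text = "".join(parts)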

💡 Tips for Better Results

  • Use lower temperature values for more factual and focused responses
  • Use higher temperature values for more creative and varied outputs
  • Use repetition_penalty to reduce repetitive text in longer generations
  • Be specific and clear in your prompts - the better the context you provide, the better the results