TensorSharp.Server exposes three API styles plus a few utility endpoints:
- Ollama-compatible (`/api/generate`, `/api/chat/ollama`, `/api/tags`, `/api/show`)
- OpenAI-compatible (`/v1/chat/completions`, `/v1/models`)
- Web UI (`/api/chat`, `/api/models`, `/api/models/load`, `/api/upload`)
- Utility endpoints (`/api/version`, `/api/queue/status`)
Start the server with `--model` pointing at the model file to host, and add `--mmproj` for a multimodal projector when needed. The Web UI and the compatibility endpoints expose only this one hosted model; `/api/models/load` can reload it, but it will not switch to an arbitrary other file at runtime.
```bash
# Text-only model
./TensorSharp.Server --model ~/work/model/Qwen3-4B-Q8_0.gguf --backend ggml_metal

# Multimodal model (projector specified explicitly)
./TensorSharp.Server --model ~/work/model/gemma-4-E4B-it-Q8_0.gguf \
  --mmproj ~/work/model/gemma-4-mmproj-F16.gguf --backend ggml_metal

# Override the default per-request token limit (used when a request omits max_tokens / num_predict)
./TensorSharp.Server --model ~/work/model/Qwen3-4B-Q8_0.gguf --backend ggml_metal --max-tokens 4096
```

The server listens on http://localhost:5000 by default (override via the standard ASP.NET Core `PORT` / `ASPNETCORE_URLS` environment variables).
```bash
curl http://localhost:5000/api/tags
```

Response:
```json
{
  "models": [
    {"name": "Qwen3-4B-Q8_0", "model": "Qwen3-4B-Q8_0.gguf", "size": 4530000000, "modified_at": "2025-03-15T10:00:00Z"}
  ]
}
```

```bash
curl -X POST http://localhost:5000/api/show \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf"}'
```

```bash
curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "What is 1+1?",
    "stream": false,
    "options": {
      "num_predict": 50,
      "temperature": 0.7,
      "top_p": 0.9
    }
  }'
```

Response:
```json
{
  "model": "Qwen3-4B-Q8_0.gguf",
  "created_at": "2025-03-15T10:00:00Z",
  "response": "1+1 equals 2.",
  "done": true,
  "done_reason": "stop",
  "total_duration": 1500000000,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 300000000,
  "eval_count": 10,
  "eval_duration": 1200000000
}
```

```bash
curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "Tell me a joke.",
    "stream": true,
    "options": {"num_predict": 100}
  }'
```

Each line of the response is a standalone JSON object (newline-delimited JSON):
```
{"model":"Qwen3-4B-Q8_0.gguf","created_at":"...","response":"Why","done":false}
{"model":"Qwen3-4B-Q8_0.gguf","created_at":"...","response":" did","done":false}
...
{"model":"Qwen3-4B-Q8_0.gguf","created_at":"...","response":"","done":true,"done_reason":"stop","total_duration":...,"eval_count":...}
```
Images are passed as base64-encoded bytes in the `images` array:
```bash
IMG_B64=$(base64 < photo.png)
curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
    \"prompt\": \"What is in this image?\",
    \"images\": [\"$IMG_B64\"],
    \"stream\": false,
    \"options\": {\"num_predict\": 200}
  }"
```

```bash
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "stream": false,
    "options": {"num_predict": 100}
  }'
```

Response:
```json
{
  "model": "Qwen3-4B-Q8_0.gguf",
  "created_at": "2025-03-15T10:00:00Z",
  "message": {"role": "assistant", "content": "The capital of France is Paris."},
  "done": true,
  "done_reason": "stop",
  "total_duration": 2000000000,
  "prompt_eval_count": 20,
  "prompt_eval_duration": 500000000,
  "eval_count": 15,
  "eval_duration": 1500000000
}
```

```bash
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "options": {"num_predict": 50}
  }'
```

```bash
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "user", "content": "My name is Alice."},
      {"role": "assistant", "content": "Nice to meet you, Alice!"},
      {"role": "user", "content": "What is my name?"}
    ],
    "stream": false,
    "options": {"num_predict": 50}
  }'
```

```bash
IMG_B64=$(base64 < photo.png)
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": \"Describe this image.\",
      \"images\": [\"$IMG_B64\"]
    }],
    \"stream\": false,
    \"options\": {\"num_predict\": 200}
  }"
```

Architectures with chain-of-thought support (Qwen 3, Qwen 3.5, Gemma 4, GPT OSS, Nemotron-H) accept `"think": true` and return the reasoning separately from the visible answer:
```bash
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "Solve 17 * 23 step by step."}],
    "think": true,
    "stream": false,
    "options": {"num_predict": 200}
  }'
```

In the response, the reasoning is returned in `message.thinking`:

```json
{
  "message": {
    "role": "assistant",
    "content": "17 * 23 = 391.",
    "thinking": "17 * 20 = 340. 17 * 3 = 51. 340 + 51 = 391."
  },
  "done": true,
  "done_reason": "stop"
}
```

Tools are defined in the Ollama tool API format. The server recognizes the tool-call wire protocol of the current architecture (e.g. `<tool_call>...</tool_call>` for Qwen / Nemotron-H, `<|tool_call>...<tool_call|>` for Gemma 4) and parses calls into structured `tool_calls`:
```bash
curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "Target city"},
            "units": {"type": "string", "enum": ["c", "f"]}
          },
          "required": ["city"]
        }
      }
    }],
    "stream": false,
    "options": {"num_predict": 200}
  }'
```

Response when the model decides to call the tool:
```json
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
      "function": {
        "name": "get_weather",
        "arguments": {"city": "Paris", "units": "c"}
      }
    }]
  },
  "done": true,
  "done_reason": "tool_calls"
}
```

To continue the conversation, append the assistant's tool call plus a `role: "tool"` message containing the function result to `messages`, then call `/api/chat/ollama` again.
```bash
curl http://localhost:5000/v1/models
```

Response:
```json
{
  "object": "list",
  "data": [
    {"id": "Qwen3-4B-Q8_0", "object": "model", "owned_by": "local"}
  ]
}
```

```bash
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is 2+3?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Response:
```json
{
  "id": "chatcmpl-abc123...",
  "object": "chat.completion",
  "created": 1710500000,
  "model": "Qwen3-4B-Q8_0.gguf",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "2 + 3 = 5."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
```

```bash
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50,
    "stream": true
  }'
```

Each chunk is sent as a server-sent event (SSE):
```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"...","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"...","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{...}}
data: [DONE]
```

```bash
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "user", "content": "Return a JSON object with keys answer and confidence for 2+3."}
    ],
    "response_format": {"type": "json_object"},
    "max_tokens": 80
  }'
```

Response:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "{\"answer\":5,\"confidence\":\"high\"}"
    },
    "finish_reason": "stop"
  }]
}
```

TensorSharp.Server accepts the OpenAI Chat Completions `response_format` field, injects a strict-JSON instruction into the prompt, and validates the final output before returning it.
```bash
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise extraction assistant."
      },
      {
        "role": "user",
        "content": "Extract the city and country from: Paris, France."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location_extraction",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "country": { "type": "string" },
            "confidence": { "type": ["string", "null"] }
          },
          "required": ["city", "country", "confidence"],
          "additionalProperties": false
        }
      }
    },
    "max_tokens": 120
  }'
```

Response:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "{\"city\":\"Paris\",\"country\":\"France\",\"confidence\":null}"
    },
    "finish_reason": "stop"
  }]
}
```

```bash
IMG_B64=$(base64 < photo.png)
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"What is in this image?\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$IMG_B64\"}}
      ]
    }],
    \"max_tokens\": 200
  }"
```

```bash
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["c", "f"]}
          },
          "required": ["city"]
        }
      }
    }],
    "max_tokens": 200
  }'
```

When the model emits a tool call, the response uses OpenAI-style fields:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```

Append the assistant's `tool_calls` plus a `{"role": "tool", "tool_call_id": "...", "content": "..."}` message to `messages` to continue the tool loop.
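A minimal Python sketch of that continuation, assuming the same `get_weather` example: note that in the OpenAI format `arguments` is a JSON string and the tool message must echo the call's `id` as `tool_call_id`. The tool result is mocked and the HTTP call is commented out.

```python
import json

# Tool call returned by the previous /v1/chat/completions response.
tool_call = {
    "id": "call_abc123",
    "type": "function",
    "function": {"name": "get_weather",
                 "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}"},
}

# arguments is a JSON-encoded string, so it must be parsed before use.
args = json.loads(tool_call["function"]["arguments"])

messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [tool_call]},
    {"role": "tool", "tool_call_id": tool_call["id"],
     "content": json.dumps({"city": args["city"], "temp_c": 18})},
]

payload = {"model": "Qwen3-4B-Q8_0.gguf", "messages": messages, "max_tokens": 200}

# With a running server:
# import requests
# final = requests.post("http://localhost:5000/v1/chat/completions", json=payload).json()
# print(final["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```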
```bash
# Inference queue snapshot (busy flag, pending request count, total processed)
curl http://localhost:5000/api/queue/status

# Server version
curl http://localhost:5000/api/version

# Hosted model + available backends + default settings
curl http://localhost:5000/api/models
```

`/api/models` returns the single hosted GGUF (plus its projector, if any), the name of the loaded backend, the list of available backends, the resolved architecture, and the configured default `max_tokens`. The model entries in `/api/tags`, `/v1/models`, and `/api/show` always report the file actually started via `--model`.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_predict` | int | 200 | Maximum number of tokens to generate |
| `temperature` | float | 0 | Sampling temperature (0 = greedy) |
| `top_k` | int | 0 | Top-K filtering (0 = off) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `min_p` | float | 0 | Minimum-probability filtering |
| `repeat_penalty` | float | 1.0 | Repetition penalty |
| `presence_penalty` | float | 0 | Presence penalty |
| `frequency_penalty` | float | 0 | Frequency penalty |
| `seed` | int | -1 | Random seed (-1 = unset) |
| `stop` | array | null | Stop sequences |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | 200 | Maximum number of tokens to generate |
| `temperature` | float | 0 | Sampling temperature |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `presence_penalty` | float | 0 | Presence penalty |
| `frequency_penalty` | float | 0 | Frequency penalty |
| `seed` | int | -1 | Random seed |
| `stop` | string/array | null | Stop sequences |
| `response_format` | object | null | `text`, `json_object`, or `json_schema` |
```python
import requests

url = "http://localhost:5000/api/generate"
payload = {
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "What is machine learning?",
    "stream": False,
    "options": {"num_predict": 100, "temperature": 0.7}
}
resp = requests.post(url, json=payload)
print(resp.json()["response"])
```

```python
import json
import requests

url = "http://localhost:5000/api/generate"
payload = {
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "Tell me a story.",
    "stream": True,
    "options": {"num_predict": 200}
}
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            data = json.loads(line)
            if not data["done"]:
                print(data["response"], end="", flush=True)
            else:
                print(f"\n[Done: {data['eval_count']} tokens]")
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+3?"}
    ],
    max_tokens=50,
    temperature=0.7
)
print(response.choices[0].message.content)
```

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[
        {"role": "user", "content": "Extract the city and country from: Tokyo, Japan."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                    "confidence": {"type": ["string", "null"]}
                },
                "required": ["city", "country", "confidence"],
                "additionalProperties": False
            }
        }
    }
)
payload = json.loads(response.choices[0].message.content)
print(payload["city"], payload["country"], payload["confidence"])
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[{"role": "user", "content": "Tell me about Python."}],
    max_tokens=200,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Notes:
- `response_format.type = "json_schema"` cannot currently be combined with `tools` or `think`.
- Streaming structured-output requests are buffered and validated server-side first, then emitted as chunks.
- An invalid schema returns HTTP 400; model output that fails validation returns HTTP 422.
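As an illustration of that error contract, a client can branch on the status code before parsing the body. The helper below is a hypothetical sketch (`describe_structured_output_error` is not part of the server); the actual HTTP call, which would use the `requests` package, is left commented out.

```python
def describe_structured_output_error(status_code: int) -> str:
    """Hypothetical helper: explain structured-output failures by status code."""
    if status_code == 400:
        return "invalid json_schema in request"
    if status_code == 422:
        return "model output failed schema validation"
    return "ok" if status_code == 200 else f"unexpected status {status_code}"

# With a running server:
# import requests
# resp = requests.post("http://localhost:5000/v1/chat/completions", json=payload)
# if resp.status_code != 200:
#     raise RuntimeError(describe_structured_output_error(resp.status_code))
print(describe_structured_output_error(422))  # → model output failed schema validation
```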
The `test_requests.jsonl` file contains sample requests for every endpoint; run them in bulk with the script below:

```bash
while IFS= read -r line; do
  ENDPOINT=$(echo "$line" | python3 -c "import sys,json; print(json.load(sys.stdin)['endpoint'])")
  METHOD=$(echo "$line" | python3 -c "import sys,json; print(json.load(sys.stdin)['method'])")
  BODY=$(echo "$line" | python3 -c "import sys,json; b=json.load(sys.stdin).get('body'); print(json.dumps(b) if b else '')")
  echo "=== $METHOD $ENDPOINT ==="
  if [ "$METHOD" = "GET" ]; then
    curl -s "http://localhost:5000$ENDPOINT" | python3 -m json.tool
  else
    curl -s -X POST "http://localhost:5000$ENDPOINT" \
      -H "Content-Type: application/json" \
      -d "$BODY" | head -c 500
  fi
  echo -e "\n"
done < test_requests.jsonl
```