https://github.com/ollama/ollama/blob/main/docs/gpu.md
Ollama supports NVIDIA GPUs with compute capability 5.0+.
Check your card's compute capability to see if it is supported: https://developer.nvidia.com/cuda-gpus
| Compute Capability | Family | Cards |
|--------------------|--------|-------|
| 9.0 | NVIDIA | H200, H100 |
| 8.9 | GeForce RTX 40xx | RTX 4090, RTX 4080 SUPER, RTX 4080, RTX 4070 Ti SUPER, RTX 4070 Ti, RTX 4070 SUPER, RTX 4070, RTX 4060 Ti, RTX 4060 |
| 8.9 | NVIDIA Professional | L4, L40, RTX 6000 |
| 8.6 | GeForce RTX 30xx | RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050 Ti, RTX 3050 |
| 8.6 | NVIDIA Professional | A40, RTX A6000, RTX A5000, RTX A4000, RTX A3000, RTX A2000, A10, A16, A2 |
| 8.0 | NVIDIA | A100, A30 |
| 7.5 | GeForce GTX/RTX | GTX 1650 Ti, TITAN RTX, RTX 2080 Ti, RTX 2080, RTX 2070, RTX 2060 |
| 7.5 | NVIDIA Professional | T4, RTX 5000, RTX 4000, RTX 3000, T2000, T1200, T1000, T600, T500 |
| 7.5 | Quadro | RTX 8000, RTX 6000, RTX 5000, RTX 4000 |
| 7.0 | NVIDIA | TITAN V, V100, Quadro GV100 |
| 6.1 | NVIDIA TITAN | TITAN Xp, TITAN X |
| 6.1 | GeForce GTX | GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050 |
| 6.1 | Quadro | P6000, P5200, P4200, P3200, P5000, P4000, P3000, P2200, P2000, P1000, P620, P600, P500, P520 |
| 6.1 | Tesla | P40, P4 |
| 6.0 | NVIDIA | Tesla P100, Quadro GP100 |
| 5.2 | GeForce GTX | GTX TITAN X, GTX 980 Ti, GTX 980, GTX 970, GTX 960, GTX 950 |
| 5.2 | Quadro | M6000 24GB, M6000, M5000, M5500M, M4000, M2200, M2000, M620 |
| 5.2 | Tesla | M60, M40 |
| 5.0 | GeForce GTX | GTX 750 Ti, GTX 750, NVS 810 |
| 5.0 | Quadro | K2200, K1200, K620, M1200, M520, M5000M, M4000M, M3000M, M2000M, M1000M, K620M, M600M, M500M |
For building locally to support older GPUs, see developer.md
If you have multiple NVIDIA GPUs in your system and want to limit Ollama to a subset of them, set CUDA_VISIBLE_DEVICES to a comma-separated list of GPUs. Numeric IDs may be used; however, ordering may vary, so UUIDs are more reliable. You can discover the UUIDs of your GPUs by running nvidia-smi -L. If you want to ignore the GPUs and force CPU usage, use an invalid GPU ID (e.g., "-1").
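A minimal sketch of both steps (the UUID below is a placeholder; substitute one reported by nvidia-smi -L on your machine):

```shell
# List the GPUs along with their UUIDs
nvidia-smi -L

# Limit Ollama to a single GPU by UUID when starting the server
CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
```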
On Linux, after a suspend/resume cycle, Ollama may sometimes fail to discover your NVIDIA GPU and fall back to running on the CPU. You can work around this driver bug by reloading the NVIDIA UVM driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
Ollama supports the following AMD GPUs:
| Family | Cards and accelerators |
|--------|------------------------|
| AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900XT, 6800 XT, 6800, Vega 64, Vega 56 |
| AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620, V420, V340, V320, Vega II Duo, Vega II, VII, SSG |
| AMD Instinct | MI300X, MI300A, MI300, MI250X, MI250, MI210, MI200, MI100, MI60, MI50 |
With ROCm v6.1, the following GPUs are supported on Windows.
| Family | Cards and accelerators |
|--------|------------------------|
| AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900XT, 6800 XT, 6800 |
| AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620 |
Ollama leverages the AMD ROCm library, which does not support all AMD GPUs. In some cases you can force the system to try a similar LLVM target that is close. For example, the Radeon RX 5400 is gfx1034 (also known as 10.3.4); however, ROCm does not currently support this target. The closest supported target is gfx1030. You can use the environment variable HSA_OVERRIDE_GFX_VERSION with x.y.z syntax. For example, to force the system to run on the RX 5400, you would set HSA_OVERRIDE_GFX_VERSION="10.3.0" as an environment variable for the server. If you have an unsupported AMD GPU, you can experiment using the list of supported types below.
If you have multiple GPUs with different GFX versions, append the numeric device number to the environment variable to set them individually, for example HSA_OVERRIDE_GFX_VERSION_0=10.3.0 and HSA_OVERRIDE_GFX_VERSION_1=11.0.0.
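A minimal sketch of both forms when launching the server (the GFX versions shown are illustrative; match them to your own hardware):

```shell
# Single unsupported GPU: override the detected GFX version for the server
HSA_OVERRIDE_GFX_VERSION="10.3.0" ollama serve

# Two GPUs with different GFX versions: override each device individually
HSA_OVERRIDE_GFX_VERSION_0=10.3.0 HSA_OVERRIDE_GFX_VERSION_1=11.0.0 ollama serve
```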
At this time, the known supported GPU types on Linux are the following LLVM targets. This table shows some example GPUs that map to these LLVM targets:

| LLVM Target | An Example GPU |
|-------------|----------------|
| gfx900 | Radeon RX Vega 56 |
| gfx906 | Radeon Instinct MI50 |
| gfx908 | Radeon Instinct MI100 |
| gfx90a | Radeon Instinct MI210 |
| gfx940 | Radeon Instinct MI300 |
| gfx941 | |
| gfx942 | |
| gfx1030 | Radeon PRO V620 |
| gfx1100 | Radeon PRO W7900 |
| gfx1101 | Radeon PRO W7700 |
| gfx1102 | Radeon RX 7600 |
AMD is working on enhancing ROCm v6 to broaden support for additional families of GPUs in a future release.
Reach out on Discord or file an issue for additional help.
If you have multiple AMD GPUs in your system and want to limit Ollama to a subset of them, set ROCR_VISIBLE_DEVICES to a comma-separated list of GPUs. You can see the list of devices with rocminfo. If you want to ignore the GPUs and force CPU usage, use an invalid GPU ID (e.g., "-1"). When available, use the Uuid to uniquely identify a device instead of its numeric value.
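A minimal sketch (the device list below is a placeholder; take the IDs from the rocminfo output):

```shell
# Inspect the available ROCm devices
rocminfo

# Restrict Ollama to the first two devices when starting the server
ROCR_VISIBLE_DEVICES=0,1 ollama serve
```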
In some Linux distributions, SELinux can prevent containers from accessing the AMD GPU devices. On the host system you can run sudo setsebool container_use_devices=1 to allow containers to use devices.
Ollama supports GPU acceleration on Apple devices via the Metal API.
Welcome to Ollama for Windows.
No more WSL required!
Ollama now runs as a native Windows application, including NVIDIA and AMD Radeon GPU support. After installing Ollama for Windows, Ollama will run in the background and the ollama command line will be available in cmd, powershell, or your favorite terminal application. As usual, the Ollama API will be served on http://localhost:11434.
- Windows 10 22H2 or newer, Home or Pro
- NVIDIA 452.39 or newer drivers if you have an NVIDIA card
- AMD Radeon drivers from https://www.amd.com/en/support if you have a Radeon card
Ollama uses Unicode characters for progress indication, which may render as unknown squares in some older terminal fonts in Windows 10. If you see this, try changing your terminal font settings.
The Ollama install does not require Administrator, and installs in your home directory by default. You'll need at least 4GB of space for the binary install. Once you've installed Ollama, you'll need additional space for storing large language models, which can be tens to hundreds of GB in size. If your home directory doesn't have enough space, you can change where the binaries are installed and where the models are stored.
To install the Ollama application in a location different than your home directory, start the installer with the following flag:
OllamaSetup.exe /DIR="d:\some\location"
To change where Ollama stores the downloaded models instead of using your home directory, set the environment variable OLLAMA_MODELS in your user account.

1. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.
2. Click on Edit environment variables for your account.
3. Edit or create a new variable for your user account named OLLAMA_MODELS, set to where you want the models stored.
4. Click OK/Apply to save.

If Ollama is already running, quit the tray application and relaunch it from the Start menu, or from a new terminal started after you saved the environment variables.
Here's a quick example showing API access from powershell:
(Invoke-WebRequest -method POST -Body '{"model":"llama3.2", "prompt":"Why is the sky blue?", "stream": false}' -uri http://localhost:11434/api/generate ).Content | ConvertFrom-json
Ollama on Windows stores files in a few different locations. You can view them in the explorer window by hitting <Win>+R and typing in:

- explorer %LOCALAPPDATA%\Ollama contains logs and downloaded updates:
  - app.log contains the most recent logs from the GUI application
  - server.log contains the most recent server logs
  - upgrade.log contains log output for upgrades
- explorer %LOCALAPPDATA%\Programs\Ollama contains the binaries (the installer adds this to your user PATH)
- explorer %HOMEPATH%\.ollama contains models and configuration
- explorer %TEMP% contains temporary executable files in one or more ollama* directories
The Ollama Windows installer registers an Uninstaller application. Under Add or remove programs in Windows Settings, you can uninstall Ollama.
Note
If you have changed the OLLAMA_MODELS location, the installer will not remove your downloaded models.
The easiest way to install Ollama on Windows is to use the OllamaSetup.exe installer. It installs in your account without requiring Administrator rights. We update Ollama regularly to support the latest models, and this installer will help you keep up to date.
If you'd like to install or integrate Ollama as a service, a standalone ollama-windows-amd64.zip file is available containing only the Ollama CLI and GPU library dependencies for NVIDIA. If you have an AMD GPU, also download and extract the additional ROCm package ollama-windows-amd64-rocm.zip into the same directory. This allows for embedding Ollama in existing applications, or running it as a system service via ollama serve with tools such as NSSM.
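For example, a sketch of registering the standalone CLI as a Windows service with NSSM (the install path is a placeholder for wherever you extracted the zip):

```shell
# Register "ollama serve" as a Windows service named Ollama using NSSM
nssm install Ollama "C:\ollama\ollama.exe" serve

# Start the service
nssm start Ollama
```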
Note
If you are upgrading from a prior version, you should remove the old directories first.
Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version manually.
On Linux, re-run the install script:
curl -fsSL https://ollama.com/install.sh | sh
Review the Troubleshooting docs for more about using logs.
Please refer to the GPU docs.
By default, Ollama uses a context window size of 2048 tokens. This can be overridden with the OLLAMA_CONTEXT_LENGTH environment variable. For example, to set the default context length to 8K, use: OLLAMA_CONTEXT_LENGTH=8192 ollama serve
To change this when using ollama run, use /set parameter:
/set parameter num_ctx 4096
When using the API, specify the num_ctx parameter:
curl http://localhost:11434/api/generate -d '{ "model": "llama3.2", "prompt": "Why is the sky blue?", "options": { "num_ctx": 4096 }}'
Use the ollama ps command to see what models are currently loaded into memory.
ollama ps
Output:
NAME          ID              SIZE     PROCESSOR    UNTIL
llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now
The Processor column will show which memory the model was loaded into:

- 100% GPU means the model was loaded entirely into the GPU
- 100% CPU means the model was loaded entirely in system memory
- 48%/52% CPU/GPU means the model was loaded partially onto both the GPU and into system memory
Ollama server can be configured with environment variables.
If Ollama is run as a macOS application, environment variables should be set using launchctl:

1. For each environment variable, call launchctl setenv:
   launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
2. Restart the Ollama application.
If Ollama is run as a systemd service, environment variables should be set using systemctl:

1. Edit the systemd service by calling systemctl edit ollama.service. This will open an editor.
2. For each environment variable, add an Environment line under the [Service] section:
   [Service]
   Environment="OLLAMA_HOST=0.0.0.0:11434"
3. Save and exit.
4. Reload systemd and restart Ollama:
   systemctl daemon-reload
   systemctl restart ollama
On Windows, Ollama inherits your user and system environment variables.

1. First quit Ollama by clicking on it in the task bar.
2. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.
3. Click on Edit environment variables for your account.
4. Edit or create a new variable for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.
5. Click OK/Apply to save.
6. Start the Ollama application from the Windows Start menu.
Ollama pulls models from the Internet and may require a proxy server to access the models. Use HTTPS_PROXY to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.
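A minimal sketch on Linux or macOS (the proxy URL is a placeholder for your own proxy):

```shell
# Route Ollama's outbound model pulls through an HTTPS proxy
HTTPS_PROXY=https://proxy.example.com ollama serve
```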
Note
Avoid setting HTTP_PROXY. Ollama does not use HTTP for model pulls, only HTTPS. Setting HTTP_PROXY may interrupt client connections to the server.
The Ollama Docker container image can be configured to use a proxy by passing -e HTTPS_PROXY=https://proxy.example.com when starting the container.
Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on macOS, Windows, and Linux, and Docker daemon with systemd.
Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.
FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates
Build and run this image:
docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
No. Ollama runs locally, and conversation data does not leave your machine.
Ollama binds to 127.0.0.1 port 11434 by default. Change the bind address with the OLLAMA_HOST environment variable. Refer to the section above for how to set environment variables on your platform.
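For example, a sketch of exposing the server on all interfaces when launching it directly from a shell:

```shell
# Listen on all interfaces instead of only 127.0.0.1
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```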
Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:
server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
Ollama can be accessed using a range of tunneling tools. For example, with Ngrok:
ngrok http 11434 --host-header="localhost:11434"
To use Ollama with Cloudflare Tunnel, use the --url and --http-host-header flags:
cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"
Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default. Additional origins can be configured with OLLAMA_ORIGINS.
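As a sketch, allowing an additional web origin (the origin shown is a placeholder):

```shell
# Permit cross-origin requests from a specific web app
OLLAMA_ORIGINS=https://app.example.com ollama serve
```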
For browser extensions, you'll need to explicitly allow the extension's origin pattern. Set OLLAMA_ORIGINS to include chrome-extension://*, moz-extension://*, and safari-web-extension://* if you wish to allow all browser extensions access, or specific extensions as needed:
# Allow all Chrome, Firefox, and Safari extensions
OLLAMA_ORIGINS=chrome-extension://*,moz-extension://*,safari-web-extension://* ollama serve
Refer to the section above for how to set environment variables on your platform.
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\%username%\.ollama\models
If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.
Note: On Linux using the standard installer, the ollama user needs read and write access to the specified directory. To assign the directory to the ollama user, run sudo chown -R ollama:ollama <directory>.
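A sketch for the standard Linux (systemd) install, assuming a hypothetical /data/ollama/models directory:

```shell
# Create the new models directory and hand it to the ollama user
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama/models

# Add the override via `systemctl edit ollama.service`:
#   [Service]
#   Environment="OLLAMA_MODELS=/data/ollama/models"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama
```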
Refer to the section above for how to set environment variables on your platform.
There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of extensions & plugins at the bottom of the main repository readme.
The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the nvidia-container-toolkit. See ollama/ollama for more details.
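As a sketch of a typical invocation once the NVIDIA Container Toolkit is set up (the volume and port mappings follow the ollama/ollama image instructions; adjust them to taste):

```shell
# Run the Ollama container with access to all NVIDIA GPUs
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```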
GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.
This can impact both installing Ollama and downloading models.
Open Control Panel > Networking and Internet > View network status and tasks and click on Change adapter settings on the left panel. Find the vEthernet (WSL) adapter, right click and select Properties. Click on Configure and open the Advanced tab. Search through each of the properties until you find Large Send Offload Version 2 (IPv4) and Large Send Offload Version 2 (IPv6). Disable both of these properties.
If you are using the API you can preload a model by sending the Ollama server an empty request. This works with both the /api/generate and /api/chat API endpoints.
To preload the mistral model using the generate endpoint, use:
curl http://localhost:11434/api/generate -d '{"model": "mistral"}'
To use the chat completions endpoint, use:
curl http://localhost:11434/api/chat -d '{"model": "mistral"}'
To preload a model using the CLI, use the command:
ollama run llama3.2 ""
By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the ollama stop command:
ollama stop llama3.2
If you're using the API, use the keep_alive parameter with the /api/generate and /api/chat endpoints to set the amount of time that a model stays in memory. The keep_alive parameter can be set to:

- a duration string (such as "10m" or "24h")
- a number in seconds (such as 3600)
- any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
- '0', which will unload the model immediately after generating a response
For example, to preload a model and leave it in memory use:
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
To unload the model and free up memory use:
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'
Alternatively, you can change the amount of time all models are loaded into memory by setting the OLLAMA_KEEP_ALIVE environment variable when starting the Ollama server. The OLLAMA_KEEP_ALIVE variable accepts the same types of values as the keep_alive parameter mentioned above. Refer to the section explaining how to configure the Ollama server to correctly set the environment variable.
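For example, a sketch of keeping every model in memory for 24 hours when launching the server directly:

```shell
# Keep loaded models resident for 24 hours by default
OLLAMA_KEEP_ALIVE=24h ollama serve
```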
The keep_alive API parameter with the /api/generate and /api/chat API endpoints will override the OLLAMA_KEEP_ALIVE setting.
If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting OLLAMA_MAX_QUEUE.
Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
Parallel request processing for a given model multiplies the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.
The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
- OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently, provided they fit in available memory. The default is 3 * the number of GPUs, or 3 for CPU inference.
- OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
- OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
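A sketch of combining these when starting the server (the values are illustrative, not recommendations):

```shell
# Allow up to 2 loaded models, 4 parallel requests per model, and a 512-request queue
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=512 ollama serve
```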
Note: Windows with Radeon GPUs currently defaults to a maximum of 1 loaded model due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs' VRAM.
When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferred across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.
Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the OLLAMA_FLASH_ATTENTION environment variable to 1 when starting the Ollama server.
The K/V context cache can be quantized to significantly reduce memory usage when Flash Attention is enabled.
To use a quantized K/V cache with Ollama, set the following environment variable:

- OLLAMA_KV_CACHE_TYPE - The quantization type for the K/V cache. Default is f16.
Note: Currently this is a global option - meaning all models will run with the specified quantization type.
The currently available K/V cache quantization types are:
- f16 - high precision and memory usage (default).
- q8_0 - 8-bit quantization; uses approximately 1/2 the memory of f16 with a very small loss in precision. This usually has no noticeable impact on the model's quality (recommended if not using f16).
- q4_0 - 4-bit quantization; uses approximately 1/4 the memory of f16 with a small-medium loss in precision that may be more noticeable at higher context sizes.
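For example, a sketch of starting the server with Flash Attention enabled and the 8-bit quantized cache:

```shell
# Enable Flash Attention and quantize the K/V cache to q8_0
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```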
How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.
You may need to experiment with different quantization types to find the best balance between memory usage and quality.