
Ollama GPU support, Ollama for Windows, and FAQ

https://github.com/ollama/ollama/blob/main/docs/gpu.md  


GPU

Nvidia

Ollama supports Nvidia GPUs with compute capability 5.0+.

Check your compute compatibility to see if your card is supported: https://developer.nvidia.com/cuda-gpus

Compute Capability | Family | Cards
9.0 | NVIDIA | H200, H100
8.9 | GeForce RTX 40xx | RTX 4090, RTX 4080 SUPER, RTX 4080, RTX 4070 Ti SUPER, RTX 4070 Ti, RTX 4070 SUPER, RTX 4070, RTX 4060 Ti, RTX 4060
8.9 | NVIDIA Professional | L4, L40, RTX 6000
8.6 | GeForce RTX 30xx | RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050 Ti, RTX 3050
8.6 | NVIDIA Professional | A40, RTX A6000, RTX A5000, RTX A4000, RTX A3000, RTX A2000, A10, A16, A2
8.0 | NVIDIA | A100, A30
7.5 | GeForce GTX/RTX | GTX 1650 Ti, TITAN RTX, RTX 2080 Ti, RTX 2080, RTX 2070, RTX 2060
7.5 | NVIDIA Professional | T4, RTX 5000, RTX 4000, RTX 3000, T2000, T1200, T1000, T600, T500
7.5 | Quadro | RTX 8000, RTX 6000, RTX 5000, RTX 4000
7.0 | NVIDIA | TITAN V, V100, Quadro GV100
6.1 | NVIDIA TITAN | TITAN Xp, TITAN X
6.1 | GeForce GTX | GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050
6.1 | Quadro | P6000, P5200, P4200, P3200, P5000, P4000, P3000, P2200, P2000, P1000, P620, P600, P500, P520
6.1 | Tesla | P40, P4
6.0 | NVIDIA | Tesla P100, Quadro GP100
5.2 | GeForce GTX | GTX TITAN X, GTX 980 Ti, GTX 980, GTX 970, GTX 960, GTX 950
5.2 | Quadro | M6000 24GB, M6000, M5000, M5500M, M4000, M2200, M2000, M620
5.2 | Tesla | M60, M40
5.0 | GeForce GTX | GTX 750 Ti, GTX 750, NVS 810
5.0 | Quadro | K2200, K1200, K620, M1200, M520, M5000M, M4000M, M3000M, M2000M, M1000M, K620M, M600M, M500M

For building locally to support older GPUs, see developer.md

GPU Selection

If you have multiple NVIDIA GPUs in your system and want to limit Ollama to a subset of them, set CUDA_VISIBLE_DEVICES to a comma-separated list of GPUs. Numeric IDs may be used; however, ordering may vary, so UUIDs are more reliable. You can discover the UUIDs of your GPUs by running nvidia-smi -L. If you want to ignore the GPUs and force CPU usage, use an invalid GPU ID (e.g., "-1").
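As a minimal sketch (the UUIDs below are placeholders; substitute the values reported by nvidia-smi -L on your machine), you might launch the server restricted to two GPUs like this:

# discover GPU UUIDs
nvidia-smi -L

# restrict Ollama to two specific GPUs by UUID (placeholder UUIDs)
CUDA_VISIBLE_DEVICES=GPU-1a2b3c4d,GPU-5e6f7a8b ollama serve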

Linux Suspend Resume

On Linux, after a suspend/resume cycle, Ollama may sometimes fail to discover your NVIDIA GPU and fall back to running on the CPU. You can work around this driver bug by reloading the NVIDIA UVM driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm.

AMD Radeon

Ollama supports the following AMD GPUs:

Linux Support

Family | Cards and accelerators
AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900 XT, 6800 XT, 6800, Vega 64, Vega 56
AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620, V420, V340, V320, Vega II Duo, Vega II, VII, SSG
AMD Instinct | MI300X, MI300A, MI300, MI250X, MI250, MI210, MI200, MI100, MI60, MI50

Windows Support

With ROCm v6.1, the following GPUs are supported on Windows.

Family | Cards and accelerators
AMD Radeon RX | 7900 XTX, 7900 XT, 7900 GRE, 7800 XT, 7700 XT, 7600 XT, 7600, 6950 XT, 6900 XTX, 6900 XT, 6800 XT, 6800
AMD Radeon PRO | W7900, W7800, W7700, W7600, W7500, W6900X, W6800X Duo, W6800X, W6800, V620

Overrides on Linux

Ollama leverages the AMD ROCm library, which does not support all AMD GPUs. In some cases you can force the system to use a similar LLVM target that is close. For example, the Radeon RX 5400 is gfx1034 (also known as 10.3.4); however, ROCm does not currently support this target. The closest supported target is gfx1030. You can use the environment variable HSA_OVERRIDE_GFX_VERSION with x.y.z syntax. So, for example, to force the system to run on the RX 5400, you would set HSA_OVERRIDE_GFX_VERSION="10.3.0" as an environment variable for the server. If you have an unsupported AMD GPU, you can experiment using the list of supported types below.

If you have multiple GPUs with different GFX versions, append the numeric device number to the environment variable to set them individually. For example, HSA_OVERRIDE_GFX_VERSION_0=10.3.0 and HSA_OVERRIDE_GFX_VERSION_1=11.0.0.
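A minimal sketch of the override when launching the server manually (the GFX versions are simply the ones from the examples above; apply them with whatever environment-variable mechanism your install uses):

# single GPU override (RX 5400 example from above)
HSA_OVERRIDE_GFX_VERSION="10.3.0" ollama serve

# or, with two GPUs of different generations
HSA_OVERRIDE_GFX_VERSION_0=10.3.0 HSA_OVERRIDE_GFX_VERSION_1=11.0.0 ollama serve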

At this time, the known supported GPU types on Linux are the following LLVM targets. This table shows some example GPUs that map to these LLVM targets:

LLVM Target | An Example GPU
gfx900 | Radeon RX Vega 56
gfx906 | Radeon Instinct MI50
gfx908 | Radeon Instinct MI100
gfx90a | Radeon Instinct MI210
gfx940 | Radeon Instinct MI300
gfx941 | (no example listed)
gfx942 | (no example listed)
gfx1030 | Radeon PRO V620
gfx1100 | Radeon PRO W7900
gfx1101 | Radeon PRO W7700
gfx1102 | Radeon RX 7600

AMD is working on enhancing ROCm v6 to broaden support to more GPU families in a future release.

Reach out on Discord or file an issue for additional help.

GPU Selection

If you have multiple AMD GPUs in your system and want to limit Ollama to a subset of them, set ROCR_VISIBLE_DEVICES to a comma-separated list of GPUs. You can see the list of devices with rocminfo. If you want to ignore the GPUs and force CPU usage, use an invalid GPU ID (e.g., "-1"). When available, use the Uuid to uniquely identify the device instead of the numeric value.
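As an illustrative sketch (the device IDs below are placeholders; take the real numeric IDs or Uuids from rocminfo output):

# list AMD devices and their Uuids
rocminfo

# restrict Ollama to the first two devices (placeholder IDs)
ROCR_VISIBLE_DEVICES=0,1 ollama serve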

Container Permission

In some Linux distributions, SELinux can prevent containers from accessing the AMD GPU devices. On the host system you can run sudo setsebool container_use_devices=1 to allow containers to use devices.

Metal (Apple GPUs)

Ollama supports GPU acceleration on Apple devices via the Metal API.








Ollama Windows

Welcome to Ollama for Windows.

No more WSL required!

Ollama now runs as a native Windows application, including NVIDIA and AMD Radeon GPU support. After installing Ollama for Windows, Ollama will run in the background and the ollama command line will be available in cmd, powershell, or your favorite terminal application. As usual, the Ollama API will be served on http://localhost:11434.

System Requirements

  • Windows 10 22H2 or newer, Home or Pro

  • NVIDIA 452.39 or newer Drivers if you have an NVIDIA card

  • AMD Radeon Driver https://www.amd.com/en/support if you have a Radeon card

Ollama uses unicode characters for progress indication, which may render as unknown squares in some older terminal fonts in Windows 10. If you see this, try changing your terminal font settings.

Filesystem Requirements

The Ollama install does not require Administrator, and installs in your home directory by default. You'll need at least 4GB of space for the binary install. Once you've installed Ollama, you'll need additional space for storing the Large Language models, which can be tens to hundreds of GB in size. If your home directory doesn't have enough space, you can change where the binaries are installed, and where the models are stored.

Changing Install Location

To install the Ollama application in a location different than your home directory, start the installer with the following flag

OllamaSetup.exe /DIR="d:\some\location"

Changing Model Location

To change where Ollama stores the downloaded models instead of using your home directory, set the environment variable OLLAMA_MODELS in your user account.

  1. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.

  2. Click on Edit environment variables for your account.

  3. Edit or create a new variable for your user account for OLLAMA_MODELS where you want the models stored

  4. Click OK/Apply to save.

If Ollama is already running, quit the tray application and relaunch it from the Start menu, or from a new terminal started after you saved the environment variables.
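Alternatively, the same user-level variable can be set from a terminal; this is a sketch using setx with a placeholder path, after which Ollama still needs to be restarted as described above:

# set OLLAMA_MODELS for the current user (D:\ollama\models is a placeholder path)
setx OLLAMA_MODELS "D:\ollama\models"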

API Access

Here's a quick example showing API access from powershell

(Invoke-WebRequest -method POST -Body '{"model":"llama3.2", "prompt":"Why is the sky blue?", "stream": false}' -uri http://localhost:11434/api/generate ).Content | ConvertFrom-json

Troubleshooting

Ollama on Windows stores files in a few different locations. You can view them in an Explorer window by pressing <Win>+R (the Run dialog) and typing in:

  • explorer %LOCALAPPDATA%\Ollama contains logs, and downloaded updates

    • app.log contains the most recent logs from the GUI application

    • server.log contains the most recent server logs

    • upgrade.log contains log output for upgrades

  • explorer %LOCALAPPDATA%\Programs\Ollama contains the binaries (The installer adds this to your user PATH)

  • explorer %HOMEPATH%\.ollama contains models and configuration

  • explorer %TEMP% contains temporary executable files in one or more ollama* directories

Uninstall

The Ollama Windows installer registers an Uninstaller application. Under Add or remove programs in Windows Settings, you can uninstall Ollama.

Note

If you have changed the OLLAMA_MODELS location, the installer will not remove your downloaded models.

Standalone CLI

The easiest way to install Ollama on Windows is to use the OllamaSetup.exe installer. It installs in your account without requiring Administrator rights. We update Ollama regularly to support the latest models, and this installer will help you keep up to date.

If you'd like to install or integrate Ollama as a service, a standalone ollama-windows-amd64.zip file is available containing only the Ollama CLI and GPU library dependencies for Nvidia. If you have an AMD GPU, also download and extract the additional ROCm package ollama-windows-amd64-rocm.zip into the same directory. This allows embedding Ollama in existing applications, or running it as a system service via ollama serve with tools such as NSSM.
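For example, a rough sketch of registering the standalone CLI as a service with NSSM (the install path is a placeholder; adjust service options to your needs):

# register the extracted CLI as a Windows service (placeholder path)
nssm install Ollama "C:\ollama\ollama.exe" serve
nssm start Ollama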

Note

If you are upgrading from a prior version, you should remove the old directories first.







FAQ

How can I upgrade Ollama?

Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menubar item and then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version manually.

On Linux, re-run the install script:

curl -fsSL https://ollama.com/install.sh | sh

How can I view the logs?

Review the Troubleshooting docs for more about using logs.

Is my GPU compatible with Ollama?

Please refer to the GPU docs.

How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens. This can be overridden with the OLLAMA_CONTEXT_LENGTH environment variable. For example, to set the default context length to 8K, use: OLLAMA_CONTEXT_LENGTH=8192 ollama serve.

To change this when using ollama run, use /set parameter:

/set parameter num_ctx 4096

When using the API, specify the num_ctx parameter:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'

How can I tell if my model was loaded onto the GPU?

Use the ollama ps command to see what models are currently loaded into memory.

ollama ps

Output:

NAME      	ID          	SIZE 	PROCESSOR	UNTIL
llama3:70b	bcfb190ca3a7	42 GB	100% GPU 	4 minutes from now

The Processor column will show which memory the model was loaded into:

  • 100% GPU means the model was loaded entirely into the GPU

  • 100% CPU means the model was loaded entirely in system memory

  • 48%/52% CPU/GPU means the model was loaded partially onto both the GPU and into system memory

How do I configure Ollama server?

Ollama server can be configured with environment variables.

Setting environment variables on Mac

If Ollama is run as a macOS application, environment variables should be set using launchctl:

  1. For each environment variable, call launchctl setenv.

    launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
  2. Restart Ollama application.

Setting environment variables on Linux

If Ollama is run as a systemd service, environment variables should be set using systemctl:

  1. Edit the systemd service by calling systemctl edit ollama.service. This will open an editor.

  2. For each environment variable, add a line Environment under section [Service]:

    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11434"
  3. Save and exit.

  4. Reload systemd and restart Ollama:

    systemctl daemon-reload
    systemctl restart ollama

Setting environment variables on Windows

On Windows, Ollama inherits your user and system environment variables.

  1. First, quit Ollama by clicking on it in the taskbar.

  2. Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.

  3. Click on Edit environment variables for your account.

  4. Edit or create a new variable for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.

  5. Click OK/Apply to save.

  6. Start the Ollama application from the Windows Start menu.

How do I use Ollama behind a proxy?

Ollama pulls models from the Internet and may require a proxy server to access the models. Use HTTPS_PROXY to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.

Note

Avoid setting HTTP_PROXY. Ollama does not use HTTP for model pulls, only HTTPS. Setting HTTP_PROXY may interrupt client connections to the server.
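As an illustration, on a systemd-managed Linux install the proxy could be added to the service override created with systemctl edit ollama.service (the proxy address is a placeholder):

[Service]
Environment="HTTPS_PROXY=https://proxy.example.com:3128"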

How do I use Ollama behind a proxy in Docker?

The Ollama Docker container image can be configured to use a proxy by passing -e HTTPS_PROXY=https://proxy.example.com when starting the container.

Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on macOS, Windows, and Linux, and for the Docker daemon with systemd.

Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.

FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates

Build and run this image:

docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca

Does Ollama send my prompts and answers back to ollama.com?

No. Ollama runs locally, and conversation data does not leave your machine.

How can I expose Ollama on my network?

Ollama binds to 127.0.0.1 on port 11434 by default. Change the bind address with the OLLAMA_HOST environment variable.

Refer to the section above for how to set environment variables on your platform.
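For example, when launching the server manually (for installed services, use the platform-specific environment-variable steps above instead):

# listen on all interfaces instead of loopback only
OLLAMA_HOST=0.0.0.0:11434 ollama serve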

How can I use Ollama with a proxy server?

Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:

server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}

How can I use Ollama with ngrok?

Ollama can be accessed through a range of tunneling tools. For example, with ngrok:

ngrok http 11434 --host-header="localhost:11434"

How can I use Ollama with Cloudflare Tunnel?

To use Ollama with Cloudflare Tunnel, use the --url and --http-host-header flags:

cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"

How can I allow additional web origins to access Ollama?

Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default. Additional origins can be configured with OLLAMA_ORIGINS.

For browser extensions, you'll need to explicitly allow the extension's origin pattern. Set OLLAMA_ORIGINS to include chrome-extension://*, moz-extension://*, and safari-web-extension://* if you wish to allow all browser extensions access, or specific extensions as needed:

# Allow all Chrome, Firefox, and Safari extensions
OLLAMA_ORIGINS=chrome-extension://*,moz-extension://*,safari-web-extension://* ollama serve

Refer to the section above for how to set environment variables on your platform.

Where are models stored?

  • macOS: ~/.ollama/models

  • Linux: /usr/share/ollama/.ollama/models

  • Windows: C:\Users\%username%\.ollama\models

How do I set them to a different location?

If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.

Note: on Linux using the standard installer, the ollama user needs read and write access to the specified directory. To assign the directory to the ollama user run sudo chown -R ollama:ollama <directory>.

Refer to the section above for how to set environment variables on your platform.
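A minimal sketch on a standard Linux install, assuming /data/ollama/models as an illustrative target directory (the Environment line goes into the systemd override described above):

# create the directory and give the ollama user access to it
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama/models

# then, in the [Service] section of the systemd override:
# Environment="OLLAMA_MODELS=/data/ollama/models"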

How can I use Ollama in Visual Studio Code?

There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of extensions & plugins at the bottom of the main repository readme.

How do I use Ollama with GPU acceleration in Docker?

The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the nvidia-container-toolkit. See ollama/ollama for more details.
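A typical invocation looks like the following sketch, assuming the NVIDIA Container Toolkit is already installed and configured (the volume and container names are illustrative):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama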

GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.

Why is networking slow in WSL2 on Windows 10?

This can impact both installing Ollama, as well as downloading models.

Open Control Panel > Networking and Internet > View network status and tasks and click on Change adapter settings on the left panel. Find the vEthernet (WSL) adapter, right click and select Properties. Click on Configure and open the Advanced tab. Search through each of the properties until you find Large Send Offload Version 2 (IPv4) and Large Send Offload Version 2 (IPv6), and disable both of these properties.

How can I preload a model into Ollama to get faster response times?

If you are using the API you can preload a model by sending the Ollama server an empty request. This works with both the /api/generate and /api/chat API endpoints.

To preload the mistral model using the generate endpoint, use:

curl http://localhost:11434/api/generate -d '{"model": "mistral"}'

To use the chat completions endpoint, use:

curl http://localhost:11434/api/chat -d '{"model": "mistral"}'

To preload a model using the CLI, use the command:

ollama run llama3.2 ""

How do I keep a model loaded in memory or make it unload immediately?

By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the ollama stop command:

ollama stop llama3.2

If you're using the API, use the keep_alive parameter with the /api/generate and /api/chat endpoints to set the amount of time that a model stays in memory. The keep_alive parameter can be set to:

  • a duration string (such as "10m" or "24h")

  • a number in seconds (such as 3600)

  • any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")

  • '0' which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'

To unload the model and free up memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'

Alternatively, you can change the amount of time all models are loaded into memory by setting the OLLAMA_KEEP_ALIVE environment variable when starting the Ollama server. The OLLAMA_KEEP_ALIVE variable uses the same parameter types as the keep_alive parameter types mentioned above. Refer to the section explaining how to configure the Ollama server to correctly set the environment variable.
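For example, when starting the server manually (the value is illustrative):

# keep models loaded for 24 hours after their last use
OLLAMA_KEEP_ALIVE=24h ollama serve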

The keep_alive API parameter with the /api/generate and /api/chat API endpoints will override the OLLAMA_KEEP_ALIVE setting.

How do I manage the maximum number of requests the Ollama server can queue?

If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting OLLAMA_MAX_QUEUE.
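For example (the value is illustrative; the default is 512, as noted below):

# allow up to 1024 queued requests before rejecting new ones
OLLAMA_MAX_QUEUE=1024 ollama serve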

How does Ollama handle concurrent requests?

Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.

If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.

Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.

The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

  • OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.

  • OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.

  • OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512

Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM.
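A sketch combining these settings when launching the server manually (values are illustrative; make sure the models you load actually fit in available memory):

# allow two models resident at once, each serving up to 4 parallel requests
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 ollama serve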

How does Ollama load models on multiple GPUs?

When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.

How can I enable Flash Attention?

Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the OLLAMA_FLASH_ATTENTION environment variable to 1 when starting the Ollama server.
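For example, when starting the server manually:

OLLAMA_FLASH_ATTENTION=1 ollama serve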

How can I set the quantization type for the K/V cache?

The K/V context cache can be quantized to significantly reduce memory usage when Flash Attention is enabled.

To use quantized K/V cache with Ollama you can set the following environment variable:

  • OLLAMA_KV_CACHE_TYPE - The quantization type for the K/V cache. Default is f16.

Note: Currently this is a global option - meaning all models will run with the specified quantization type.

The currently available K/V cache quantization types are:

  • f16 - high precision and memory usage (default).

  • q8_0 - 8-bit quantization, uses approximately 1/2 the memory of f16 with a very small loss in precision, this usually has no noticeable impact on the model's quality (recommended if not using f16).

  • q4_0 - 4-bit quantization, uses approximately 1/4 the memory of f16 with a small-medium loss in precision that may be more noticeable at higher context sizes.
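For example, to run the server with Flash Attention enabled and an 8-bit K/V cache (a minimal sketch; if Ollama runs as a service, set these with your platform's environment-variable mechanism instead):

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve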

How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.

You may need to experiment with different quantization types to find the best balance between memory usage and quality.




