Claude Code and Microsoft Foundry Local

About 2150 wordsAbout 7 min

2026-02-05

I have a Copilot+ PC laptop with a Snapdragon X Elite in it. It has a NPU, which ironically is rarely used for running AI language models. I wanted to see if I could get a popular coding agent working offline and disconnected from power to provide any value on my laptop. Simply put, is productive AI coding assistance in airplane mode possible?

How to get Claude Code to work with locally running models using Ollama has been described by many. One of the better is here because it highlights how to do it offline.

The problem with running models using Ollama is that their execution providers are either for CPU or GPU. Using them simply drains the battery on my laptop when disconnected from power so fast that I get no value in time from them in my edge case scenario.

Enter Microsoft Foundry Local. Foundry Local is only in Public Preview, so I had to set expectations accordingly, but what is super nice with Foundry Local is that it runs ONNX models that can execute on the NPU. In theory much more power efficient compared to running on GPU/CPU.

As of writing it appears that using Claude Code together with Foundry Local is undocumented and unsupported. Not really a good sign, but hence this blog post.

Getting started using Foundry Local on a Windows PC is ridiculously easy:

winget install Microsoft.FoundryLocal

List the models that Microsoft has compiled and made available for download:

foundry model list

In my scenario I selected the largest model listed as running on NPU and allowing tools. Without support for tools a model will be relatively useless as coding agent backend.

foundry model run qwen2.5-7b

At first run the model variant most suitable for your system will automatically download and you get a simple prompt when completed: Although the model used did poor jokes, it replied instantly. Notice the short spike in NPU usage. At this point testing Foundry Local felt promising...

I will not go into detail about all the rabbit holes I went down from here. In the end, I got it somewhat working by this setup:

Claude Code originally had to run in WSL on Windows, but these days Anthropic seems to recommend using their native installer and run from Powershell. The reason I moved Claude Code from WSL to Powershell is however because Foundry Local service does not bind to a host reachable from WSL.
LiteLLM AI Gateway is used because Claude Code expects the Anthropic API in the backend and Microsoft Foundry Local expose a OpenAI compatible API. LiteLLM is primarily an interpreter in this scenario. LiteLLM can on Windows be installed either in WSL or as a Docker container. Because of the host binding issue of Foundry Local it had to be Docker.
Microsoft Foundry Local use a different port on each run, unless explicitly specified. Which one is currently used can be found by:

foundry service status

I had to do tweaking of settings back and forth a lot during troubleshooting and I honestly do not know if all I have set is even needed anymore.

Microsoft Foundry Local

Install it and run a model. Note which port running on. Done.

LiteLLM

Create a config.yaml file somewhere and put this (replace with another model if you have another in Foundry Local) in it:

model_list:
  # FoundryLocal models
  - model_name: qwen2.5-7b
    litellm_params:
      model: azure_ai/qwen2.5-7b-instruct-qnn-npu:2
      api_key: os.environ/FOUNDRYLOCAL_API_KEY
      api_base: os.environ/FOUNDRYLOCAL_API_BASE
      additional_drop_params: ["reasoning_effort"]
      allowed_openai_params: ["tool_choice"]
      tool_choice: { "type": "auto" }
      timeout: 2400
      temperature: 0.1

litellm_settings:
  enable_preview_features: true
  modify_params: true
  request_timeout: 300

Note the prefix "azure_ai". LiteLLM does not officially support Foundry Local, but they do for Microsoft Foundry in Azure. I imagined that it would work using prefix for any provider of OpenAI API compatible models, but using azure_ai is the only one that has not thrown errors for me so far. The LiteLLM transformation between Anthropic API requests to Foundry Local OpenAI compatible endpoints appear to have its closest match when azure_ai is used.

Open terminal in Docker Desktop and run:

docker pull docker.litellm.ai/berriai/litellm:main-latest

docker run -v c:\TEMP\LiteLLM\litellm_config.yaml:/app/config.yaml -e FOUNDRYLOCAL_API_KEY=dummy -e FOUNDRYLOCAL_API_BASE=http://host.docker.internal:foundry-service-current-port/v1 -p 4000:4000 docker.litellm.ai/berriai/litellm:main-latest --config /app/config.yaml --detailed_debug

Replace c:\TEMP\LiteLLM\litellm_config.yaml with path to your previously created config.yaml and replace foundry-service-current-port with whatever you have Foundry Local running on. Note the use of host.docker.internal. This is what makes LiteLLM able to connect to Foundry Local running on docker host localhost.

Claude Code

Modify settings.json for Claude Code:

{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "env": {
    "BASH_DEFAULT_TIMEOUT_MS": "1800000",
    "BASH_MAX_TIMEOUT_MS": "7200000",
    "API_TIMEOUT_MS": "3600000",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen2.5-7b",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "qwen2.5-7b",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "qwen2.5-7b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen2.5-7b",
    "CLAUDE_CODE_SUBAGENT_MODEL": "qwen2.5-7b"
  },
  "companyAnnouncements": [
    "Welcome to MEEQIT! Review our guidelines at blog.meeqit.se",
    "Reminder: Code reviews by Markus required for all PRs"
  ],
  "skipDangerousModePermissionPrompt": true
}

Set environment variables for Claude Code:

$env:CLAUDE_CODE_USE_FOUNDRY = 1
$env:ANTHROPIC_FOUNDRY_BASE_URL = "http://127.0.0.1:4000"
$env:FOUNDRYLOCAL_BASE_URL = "http://127.0.0.1:4000"
$env:FOUNDRYLOCAL_API_KEY = "dummy"
$env:CLAUDE_CODE_SKIP_FOUNDRY_AUTH = 1
$env:ANTHROPIC_FOUNDRY_API_KEY = "foundry"
$env:ANTHROPIC_BASE_URL = "http://127.0.0.1:4000"
$env:ANTHROPIC_FOUNDRY_API_KEY = "dummy"
$env:ANTHROPIC_DEFAULT_SONNET_MODEL="qwen2.5-7b"
$env:ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen2.5-7b"
$env:ANTHROPIC_DEFAULT_OPUS_MODEL="qwen2.5-7b"
$env:ANTHROPIC_SMALL_FAST_MODEL="qwen2.5-7b"
$env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1
$env:CLAUDE_CODE_SUBAGENT_MODEL="qwen2.5-7b"
$env:CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING = 1

Note the overlap of environment variables from settings.json and what I manually set in Powershell. This should not be needed and is a result of my trial and errors to get a somewhat working config. I simply have not bothered to strip it down to minimum needed.

What is then working using using Claude Code with this config? Well, the increased timeout values I set is a little telling I am afraid..

In a fresh Claude Code session, using the same simple prompt I used with the native Foundry Local interface, I got a reply in close to 7 minutes.
If you read the debug output from LiteLLM in Docker it becomes obvious why the same user prompt towards the same model takes so much longer to process using Claude Code. There is simply so much default context added by Claude Code to make the model aware of its tools and desired behaviour.

Next we try getting Claude Code to actually accomplish something by prompting it to create a file, which should trigger its skills and tools. I used "Create a TEST.txt file containing the text AWESOME TEST" as prompt. This worked fine, although it crunched for 14 minutes to do so, and then continued crunching for close to 20 minutes without producing any output for me. What was the extra 20 minutes about? Long story short, SUGGESTION MODE, also known as "Prompt suggestions" in Claude Code config settings. From the LiteLLM debug output it is possible to see what Claude Code sends back to the model after completing the user prompt. This stands out (edited for readability):

SUGGESTION MODE: Suggest what the user might naturally type next into Claude Code.
FIRST: Look at the user's recent messages and original request.
Your job is to predict what THEY would type - not what you think they should do.
THE TEST: Would they think "I was just about to type that"?
EXAMPLES: User asked "fix the bug and run tests", bug is fixed → "run the tests"
After code written → "try it out"
Claude offers options → suggest the one the user would likely pick, based on conversation
Claude asks to continue → "yes" or "go ahead"
Task complete, obvious follow-up → "commit this" or "push it"
After error or misunderstanding → silence (let them assess/correct)
Be specific: "run the tests" beats "continue".
NEVER SUGGEST:
- Evaluative ("looks good", "thanks")
- Questions ("what about...?")
- Claude-voice ("Let me...", "I\'ll...", "Here\'s...")
- New ideas they didn't ask about
- Multiple sentences
Stay silent if the next step isn't obvious from what the user said.
Format: 2-12 words, match the user's style. Or nothing.
Reply with ONLY the suggestion, no quotes or explanation.

Basically the model had to process the Claude Code instructions for 20 minutes only to conclude "Stay silent". Delightfull waste of energy. Disabling "Prompt suggestions" in Claude Code stops the post processing from happening.

Next I tried getting Claude Code to edit existing file as that should trigger multiple tools (Read, Write). This failed initially and I had to tweak a lot to get it working (included in LiteLLM config.yaml settings above). Note it says it took 52 minutes to complete this run. This is an unfair number as I was away on a technical break and missed the dialog where Claude Code asked me to confirm writing to file. Each bullet point in the output does however equal to a ~7min round trip to the model, so it probably took about 35 minutes.

Lastly, I want subagents to work. To test this I have been using the /init command. Claude Code, the Qwen model, and I, have been struggling a lot with this: The Alibaba Qwen model has tendencies to fall back to simplified Chinese output when stressed. I have even seen it throw in some Spanish here and there. Even with reduced temperature and top_p I feel it is prone to hallucinate.

I was about to give up and post my findings as is, when my settings combo tweaking and log digging produced this: Explore agent started! Having read this I figured this could be about the mentioned Glob tool issues on Windows.. However, once I added hooks I could not get the Explore agent to start again, not even after removing hooks, and most of my attempts end up with LiteLLM and Claude Code getting stuck in a loop sending the same request to Foundry Local over and over. No more time playing around with this. After all, Microsoft Foundry Local is still only in preview and LiteLLM does not officially support so let's get back to this some other time.

Summary

Can you use Microsoft Foundry Local QNN models with Claude Code? Yes.

Should you? No.

Does this mean Microsoft Foundry Local is useless? Absolutely NOT. Using it with the native prompt interface provided by Microsoft gave me a acceptable script suggestion on prompt "Create a PowerShell script that iterates through all Site Collections in a SharePoint Server Web Application." in seconds, and you can use it in VS Code using AI Toolkit and probably with the Continue extension (yet to be tested by me personally).

Correction, I have now tested Microsoft Foundry Local with the Continue VS Code extension and it works by similar principles as my Claude Code setup. Configure the extension to use Azure Open AI and point it to LiteLLM and you are good to go. Had it add the following mermaid chart to this page (which is essentially a markdown file) and it completed the mermaid markdown snippet in seconds, but actually getting it inserted at the bottom of page took +20 minutes as the model had to read the file multiple times to figure out where end was.

So to conclude, you can get productivity value in Airplane Mode by running local models on the NPU. Just do not expect to work much using any agentic workflows, especially not if disconnected from power as it will drain your battery fast.

In case you want a better experience using LiteLLM in front of Foundry Local you can upvote this suggestion.