How Do AI Systems Talk to the World? A Guide to MCP and Function Calling
Ever wondered how large language models (LLMs) connect to the outside world? We take a deep look at function calling, inference engines, weight matrices, and Model Context Protocol (MCP)—the USB-C of this ecosystem. Let's explore the architecture together.
Today we’ll look at how AI agents communicate with the outside world. I’ve been building projects with models like Claude and GPT for a while, and I wanted to understand—not just how they generate text, but how they end up running commands in my systems (reading files, talking to GitHub, and so on). I asked myself, “Do I really need this middle layer (MCP)?” After some research and peeking under the hood, I found a really elegant engineering story. Here’s how you can understand and shape your own AI stack.
LLM anatomy and function calling
When you wire AI models into a project, you usually start by expecting text output. But these models don’t have hands or a network stack in the human sense—they’re huge statistical next-token predictors.
When they need to touch the real world, function calling (tool use) steps in. The idea is simple: the model sends your system JSON that describes what should happen:
{
"tool": "run_terminal_command",
"command": "npm install",
"parameters": {
"cwd": "./frontend"
}
}

It can look almost primitive at first glance, but JSON is the lingua franca of modern software. The model doesn't execute code directly; it only expresses intent as JSON. You decide whether to run it in a safe environment and feed the result back to the model.
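To make that loop concrete, here is a minimal sketch of the receiving side, assuming a hypothetical `run_terminal_command` handler (the handler name and registry are illustrative, not part of any provider's API). It parses the model's JSON intent, routes it to a handler, and returns a result you would feed back to the model; for safety it only reports what it would run:

```python
import json

# Hypothetical tool registry; real handlers would validate inputs carefully
# and execute commands only inside a sandbox.
def run_terminal_command(command, parameters):
    cwd = parameters.get("cwd", ".")
    # Deliberately does NOT execute anything; it just reports intent.
    return f"would run {command!r} in {cwd}"

TOOLS = {"run_terminal_command": run_terminal_command}

def dispatch(raw_json):
    """Parse the model's JSON intent and route it to a registered handler."""
    call = json.loads(raw_json)
    handler = TOOLS.get(call["tool"])
    if handler is None:
        return {"error": f"unknown tool {call['tool']!r}"}
    result = handler(call["command"], call.get("parameters", {}))
    return {"tool": call["tool"], "result": result}  # fed back to the model

reply = dispatch('{"tool": "run_terminal_command", '
                 '"command": "npm install", '
                 '"parameters": {"cwd": "./frontend"}}')
print(reply["result"])
```

The key point is that execution stays on your side of the boundary: the model only ever produces the JSON string that goes into `dispatch`.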
Why MCP (Model Context Protocol)?
You can wire this JSON flow yourself by calling Claude (or another API) from your own server. The catch is that every provider (OpenAI, Anthropic, Gemini, etc.) has its own shape for “the model wants to call a function.” Your codebase can turn into a mess of adapters:
If you write one “read a GitHub repo” tool and want three different models to use it, you may need three different translators. That’s where MCP comes in.
MCP is like USB-C for AI tools. You implement your tool once against the MCP spec, and every MCP‑aware model or editor (like Cursor) can use it without custom glue. It can feel like an extra dependency at first, but over time it’s a huge win for standardization.
Inference engines and safety
Why don’t we bake “generate JSON and run the code” straight into the core inference engine (e.g. llama.cpp or vLLM)?
The main reasons are sandboxing and separation of concerns. If the inference runtime executed whatever the model emitted, a single bad line like rm -rf / could wreck the host. By moving execution out to a Node or Python process (an MCP server, your app, etc.), you keep control at the boundary.
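A minimal sketch of that control boundary (the allowlist below is hypothetical, and a real policy would also sandbox the process and constrain arguments): before executing anything the model asked for, the host process checks the command against an explicit policy.

```python
import shlex

# Hypothetical allowlist: only these binaries may be invoked at all.
ALLOWED_BINARIES = {"npm", "git", "ls"}

def is_safe(command: str) -> bool:
    """Reject any command whose first token isn't explicitly allowed."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in ALLOWED_BINARIES

print(is_safe("npm install"))  # allowed binary
print(is_safe("rm -rf /"))     # rejected at the boundary
```

Because this check lives in your Node or Python process rather than in the inference engine, a hostile or confused model output never gets closer to the host than a rejected string.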
Weight matrices and quantization
Under the hood there isn’t “code” in the weights—it’s a giant map of numbers, usually stored as tensors (often .safetensors or .gguf).
Example: downloading a local model
wget https://huggingface.co/model-path/model.safetensors
Often the bottleneck isn’t raw compute but memory bandwidth—how fast you can move those huge files from RAM to the GPU. That’s why we use quantization (compressing numeric precision).
{
"model_size": "70B Parameters",
"original_vram_required": "140 GB",
"quantized_4bit_vram": "35 GB",
"performance_loss": "~5%"
}

By mapping high-precision values (like 3.1415...) to smaller integers (e.g. 4-bit), you can take a model that wanted ~140 GB of VRAM down to ~35 GB. Neural nets tolerate noise surprisingly well, so reasoning quality often stays almost the same.
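The numbers in the table fall out of simple arithmetic, and the round-trip itself can be sketched in a few lines. This is a toy linear quantizer for illustration only; production schemes (GPTQ, AWQ, and friends) are considerably more sophisticated:

```python
# Back-of-envelope VRAM math for the table above.
params = 70e9
fp16_bytes = 2     # 16-bit weights: 2 bytes per parameter
int4_bytes = 0.5   # 4-bit weights: half a byte per parameter
print(params * fp16_bytes / 1e9)  # ~140 GB
print(params * int4_bytes / 1e9)  # ~35 GB

# A crude 4-bit quantize/dequantize round trip on a few weights.
def quantize_4bit(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15  # 4 bits -> 16 levels (0..15)
    q = [round((v - lo) / scale) for v in values]
    return q, lo, scale

def dequantize(q, lo, scale):
    return [lo + qi * scale for qi in q]

weights = [0.12, -0.53, 3.1415, 1.0, -2.2]
q, lo, scale = quantize_4bit(weights)
restored = dequantize(q, lo, scale)
# Each restored weight lands within one quantization step of the original.
```

The error per weight is bounded by the step size, which is exactly the kind of small noise the surrounding text says neural nets shrug off.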
Putting it together
So: the LLM is mainly a thinking (inference) core; acting (tools) comes from function calling and standards like MCP that keep execution on your side of the fence.
Nice work—you now have a mental model for what’s under the LLM hood and how it talks to the world safely. Don’t hesitate to experiment with local models too.
One Human, One AI
Episode 1: The Depths of AI Architecture
Audio summary of this article — listen in podcast format.
A tour of how LLMs talk to the outside world: function calling, inference engines, MCP, and quantization.