GLM-5.2 – How to Run Locally

Hacker News by TechTechTech 579 votes 2272 karma 1d ago

comments (10)

I run Q4_K_XL. All it takes to get about 6tk/sec is 512GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400MHz; 3200MHz would bring that speed up to about 9tk/sec. I also have an OK 32-core EPYC CPU; a better 64-core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware costs and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire hardware setup cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid; if you are going to grab a quant, make sure to get the K_XL variant if it can fit.
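
The RAM-speed estimate above can be sanity-checked with a rough sketch. It assumes that token generation on a CPU-offloaded MoE setup is limited by memory bandwidth and therefore scales linearly with memory speed. The function name and the linear model are illustrative assumptions, not measurements from this build.

```python
def scale_decode_rate(base_tok_s: float, base_mhz: float, new_mhz: float) -> float:
    """Estimate decode speed after a RAM speed change.

    Assumption: decoding is memory-bandwidth bound, so tokens/sec
    scales linearly with the memory clock. Real systems also depend
    on channel count, timings, and how much work stays on the GPUs.
    """
    if base_mhz <= 0 or new_mhz <= 0:
        raise ValueError("memory speeds must be positive")
    return base_tok_s * new_mhz / base_mhz


if __name__ == "__main__":
    # Figures from the comment: about 6 tok/s on DDR4 at 2400MHz.
    estimate = scale_decode_rate(6.0, 2400, 3200)
    print(f"estimated decode rate at 3200MHz: {estimate:.1f} tok/s")
```

Under this assumption the move from 2400MHz to 3200MHz lands at roughly 8 tok/s, in the same ballpark as the figure of about 9 tok/s quoted above.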

segmondy 21h ago
DwarfStar work-in-progress numbers: I see 14 tokens/sec generation, which slopes down to 10 t/s with longer contexts of 10k or more. Consider that the indexed attention requires evaluating 2048 selected rows, 2x DeepSeek and with less compression, so performance with larger contexts heads south faster here. Prefill can be 180 t/s on small contexts, dropping to 150 t/s and less with larger contexts. I have used DeepSeek v4 PRO in these conditions; it is usable, but it is far from the 35 t/s generation and 400 t/s prefill you get with DeepSeek v4 Flash 2-bit on a MacBook M5 Max. But my implementation is likely not yet optimized enough, so a bit more performance can be obtained. I'm using 4-bit quants. The model is also definitely less sparse than DeepSeek v4, so it activates a bigger percentage of parameters. If it works decently at 2-bit, that would be a win even for machines where 4-bit fits, since this would basically mean 2x (equivalent) memory bandwidth for the routed experts.
Local inference really needs a system with 1.2 to 1.5 TB/s of memory bandwidth, 512GB of memory, and 2 to 3 times the GPU compute of the Mac Studio M3 Ultra, at an affordable $10-15k price point. A variant with 1TB of memory would also be welcome at a $20k price point.
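
The point about 2-bit quants acting like doubled memory bandwidth can be illustrated with a back-of-the-envelope bound. This sketch assumes decoding is purely memory-bandwidth bound and that every active parameter is read once per token. The bandwidth and active-parameter figures are placeholders chosen for illustration, not GLM-5.2 specifications.

```python
def decode_upper_bound(bandwidth_gb_s: float,
                       active_params_billions: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/sec if each active weight is read once per token.

    Billions of parameters times bytes per parameter gives GB read per token.
    """
    gb_per_token = active_params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token


# Hypothetical figures, for illustration only.
BANDWIDTH_GB_S = 800.0
ACTIVE_PARAMS_BILLIONS = 40.0

four_bit = decode_upper_bound(BANDWIDTH_GB_S, ACTIVE_PARAMS_BILLIONS, 4)
two_bit = decode_upper_bound(BANDWIDTH_GB_S, ACTIVE_PARAMS_BILLIONS, 2)

print(f"4-bit bound: {four_bit:.0f} tok/s")
print(f"2-bit bound: {two_bit:.0f} tok/s")
print(f"speedup from halving the bit width: {two_bit / four_bit:.1f}x")
```

Whatever placeholder values are used, halving the bits per weight halves the bytes read per token, which is why the bound doubles. Attention, KV cache reads, and compute limits are ignored here, so real gains would be smaller.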

antirez 9h ago
So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970

xrd 1d ago
The most interesting part of this to me is not the benchmark table, but the packaging.
A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being "an impressive open model exists" and starts becoming "this is something a small team can actually put into its development stack."
For instance, a company buys an RX6000 setup for, say, $15k total. They could use this for data-heavy sifting that would otherwise cost a lot of Claude tokens.
It doesn't need to be as good as frontier-best. Just good enough.
I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.

draginol 9h ago
"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
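
The prompt-processing argument above can be made concrete with a simple latency model: total time is prompt tokens divided by prefill speed, plus output tokens divided by generation speed. The sketch below uses a hypothetical 20,000-token prompt and 500-token reply. The 180 t/s prefill and 10 t/s generation figures are borrowed from numbers quoted elsewhere in this thread, and the 20x factor is the low end of the range claimed above.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> float:
    """Total wait: time to process the prompt plus time to generate the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


# Hypothetical workload, for illustration only.
PROMPT_TOKENS = 20_000
OUTPUT_TOKENS = 500

# Same generation speed in both cases; only prefill speed differs.
slow_prefill = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 180.0, 10.0)
fast_prefill = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, 180.0 * 20, 10.0)

print(f"slow prefill: {slow_prefill:.0f} s total")
print(f"fast prefill: {fast_prefill:.0f} s total")
```

With these placeholder numbers, prompt processing accounts for most of the wait on the slower setup even though generation speed is identical, and the gap widens as prompts get longer.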

skiing_crawling 1d ago
There is a push from multiple directions at the same time:
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, AMD, Intel, Cerebras, etc. pushing new hardware
- OSS models getting crazy good, like GLM 5.2
- flash models getting very good, like DeepSeek V4 Flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!

Frannky 20h ago
I feel like the gap is closing on being able to run good-enough models locally, even for coding, and I would assume that could make some companies a bit nervous. Am I wrong about that?

pheggs 1d ago
I bet OpenAI and Anthropic hate the timing of GLM 5.2.
Kinda shows they have a head start rather than a magic moat.

Havoc 16h ago
So a minimum of 3x RTX Pro 6000 to run 1-bit at ~76% accuracy or MacStudio 512GB RAM to run 4-bit at ~97% accuracy.

storus 9h ago
Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand.
There should be more native 4-bit, 1.25-bit, and similar models. Those actually work great while being smaller in comparison. But I guess there is some reason they remain pretty niche.

numlock86 19h ago