comments (10)

  • I seriously don't get all this big hullabaloo about one-shot prompting.

    By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

    I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human-reviewed spec.

    I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.

    I'd also like to see it identify bugs and potential performance improvements by reading existing code and suggesting refactors based on the context it can pick up about the particular use case you are trying to build.

    These are way more valuable metrics than "hey build X"

    cultofmetatron

  • > So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

    Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

    Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, e.g., making up test results?) and steerability (does it obey my instructions or does it just do what it thinks is best?).

    meander_water
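For what it's worth, the reliability and steerability checks described above could be tallied over many delegated tasks instead of a single run. This is only a rough sketch: the `TrialResult` fields, the `summarize` helper, and the four sample trials are all made up for illustration, and how each flag gets judged (by a human reviewer or a script) is left open.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    completed: bool              # did the agent finish the delegated task?
    fabricated_results: bool     # did it report test results it never produced?
    followed_instructions: bool  # did it do what was asked, not what it preferred?


def summarize(trials):
    """Aggregate per-trial flags into reliability and steerability rates."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    # A trial counts as reliable only if it completed AND reported honestly.
    reliable = sum(t.completed and not t.fabricated_results for t in trials)
    steerable = sum(t.followed_instructions for t in trials)
    return {"trials": n, "reliability": reliable / n, "steerability": steerable / n}


# Hypothetical outcomes from four delegated tasks (not real measurements).
sample = [
    TrialResult(completed=True, fabricated_results=False, followed_instructions=True),
    TrialResult(completed=True, fabricated_results=True, followed_instructions=True),
    TrialResult(completed=False, fabricated_results=False, followed_instructions=False),
    TrialResult(completed=True, fabricated_results=False, followed_instructions=True),
]
summary = summarize(sample)
print(summary)
```

The point of the shape is that a one-shot run contributes a single `TrialResult`, which says very little about either rate.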

  • At work we use Anthropic models and have basically no limits. So I am very familiar with what Opus can do. I also see the bills, I know what it costs.

    At home I make a point of trying other models / tools on my side projects. So I've been using OpenCode and trying tons of models via OpenRouter. I tried Kimi, Deepseek, MiMo, etc.

    GLM 5.2 is a _major_ step up from every other non-GPT/Claude/Gemini model I've tried. It's not as good as the latest Claude Opus, but it feels every bit as good as Opus from ~4 months ago at a fraction of the price.

    To me this model is the "it just works" moment for open weights models. We had this for closed weights models in late 2025 when Opus 4.5 landed. This is the same feeling I'm having with GLM 5.2. It's 90% as good as what I get from Anthropic for 1/5th of the cost and without any concern of lock-in.

    habosa

  • I've been checking out GLM 5.2 on some projects and have a few thoughts on it:

    - it takes its sweet time to get code rolling, not the fastest model by any means

    - it strays a lot during discovery/planning but then corrects

    - it's not steering friendly, as it hallucinates things that it doesn't follow later on

    - its output is quite good

    A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.

    GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. Its output pointed out that it couldn't validate its suspicions and asked for more data.

    The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than what GPT 5.5 produced on the same repo.

    I would opt to use it more, BUT GPT usually completes the same requests 5x faster.

    GLM 5.2 was the spark for preparing and running it inside isolated containers with JJ workspaces (so that multiple instances can be run in parallel).

    xlii

  • I feel like another comparison worth looking at is purely cost.

    Capability per dollar is something I care about:

        Opus API    $5/$25
        Sonnet API  $5/$15
        Haiku API   $1/$5
    
        GLM 5.2 API $1.4/$4.4
    
    So you're really getting near-Opus-level capability for the price of Haiku.

    faxmeyourcode
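To put rough numbers on the price list above, here is a small sketch that costs out one workload at each quoted rate. It assumes the pairs are USD per 1M input/output tokens (the comment does not state the unit), and the workload of 2M input and 500k output tokens is invented purely for illustration.

```python
# Prices exactly as quoted in the comment, read as (input, output) in USD
# per 1M tokens. The unit is an assumption; the comment does not state it.
PRICES = {
    "Opus": (5.0, 25.0),
    "Sonnet": (5.0, 15.0),
    "Haiku": (1.0, 5.0),
    "GLM 5.2": (1.4, 4.4),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at the quoted per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 2M input tokens, 500k output tokens.
WORKLOAD = (2_000_000, 500_000)

costs = {model: job_cost(model, *WORKLOAD) for model in PRICES}
for model, cost in costs.items():
    ratio = cost / costs["GLM 5.2"]
    print(f"{model:8s} ${cost:6.2f}  ({ratio:.2f}x GLM 5.2)")
```

On this made-up workload Opus comes out at 4.5x the GLM 5.2 cost and Haiku at 0.9x, so "the price of Haiku" holds roughly. The exact ratios shift with the input/output mix, and token counts per task can differ between models, which this sketch ignores.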

  • I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier: I don't one-shot projects, I don't vibe code. I basically use LLMs as if I were working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).

    lukaslalinsky

  • > GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.

    The only thing you measured was a single tunable: how much work should be done on a vague prompt. Now change the prompt to something that makes GLM-5.2 cost 4x the previous budget, and see whether you get something comparable.

    (And the wallclock time measures the inference provider, not the model.)

    yencabulator

  • I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying things out. GLM is the first model that I am using on a daily basis to do my coding work (alongside Claude). It's good - I've been maxing out my Ollama usage limits every day :)

    postatic

  • > Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter.

    It would be more interesting and accurate to see the comparison on the same harness if the intent is to compare the frontier models.

    Pi is relatively new and does not have many features built in compared to Claude Code. This was an intentional choice: Pi's goal is not to ship a bloated set of built-in tools most people don't use but to let users customize it to fit their needs -- similar to Neovim vs. an IDE.

    The end-user "vibe coding" experience is *heavily* swayed by the harness, because the prompt effectively drives how a model produces an answer.

    jameson

  • I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.

    toddmorey
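The "reach out to a multimodal model only when required" idea could look roughly like the router below. Everything here is a stand-in: `Task`, `make_router`, `fake_text`, and `fake_vision` are hypothetical names, and in a real setup the two model functions would be API calls to a text-only model and a vision-capable one.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    prompt: str
    images: list[str] = field(default_factory=list)  # hypothetical screenshot paths


def make_router(text_model, vision_model):
    """Return a function that calls vision_model only for tasks with images."""
    def route(task: Task) -> str:
        if not task.images:
            # Text-only task: the cheaper model handles it alone.
            return text_model(task.prompt)
        # Have the vision model describe each image, then pass the
        # descriptions to the text model as plain-text notes.
        notes = "\n".join(
            f"[image {i}] {vision_model(path)}"
            for i, path in enumerate(task.images, start=1)
        )
        return text_model(f"{task.prompt}\n\nImage notes:\n{notes}")
    return route


# Stand-in models for illustration only.
vision_calls = []


def fake_vision(path):
    vision_calls.append(path)
    return f"description of {path}"


def fake_text(prompt):
    return f"text model received: {prompt}"


route = make_router(fake_text, fake_vision)
text_only = route(Task("Refactor the camera code"))
with_image = route(Task("Why is the player clipping?", ["frame.png"]))
print(text_only)
print(with_image)
```

The vision model is billed only for tasks that actually carry images, which is where the "still significantly cheaper" hope would come from. Whether the descriptions carry enough detail for the text model to act on is an open question this sketch does not answer.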