• 1 Post
  • 4 Comments
Joined 2 years ago
cake
Cake day: March 22nd, 2024

help-circle


  • I find the overhead of docker crazy, especially for simpler apps. Like, do I really need 150GB of hard drive space, an extensive poorly documented config, and a whole nested computer running just because some project refuses to fix their dependency hell?

    Yet it’s so common. It does feel like usability has gone on the back burner, at least in some sectors of software. And it’s such a relief when I read that some project consolidated dependencies down to C++ or Rust, and it will just run and give me feedback without shipping a whole subcomputer.


  • It’s less optimal.

    On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. Its slow even at lower context, the lack of DRY sampling and some other things majorly hit quality.

    Ollama is meant to be turnkey, and thats fine, but LLMs are extremely resource intense. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

    Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

    And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.