After watching videos to see how veteran programmers – and even non-programmers!!! – are using AI to code, I decided to grab the wheel and take AI for a spin.
I heard great things this weekend about Google’s brand new Gemini 2.5 Pro, which was initially free because Google still considered it experimental. It was reportedly close to the coding performance of Anthropic’s Claude 3.7 Sonnet, which many consider the best coding LLM, but with several advantages: the paid version would be significantly cheaper than Claude – “Sonnet is 2.4x more expensive, yet not as good.” – and had a much larger “context window,” meaning Gemini can evaluate and consider much more of your data while helping you.
Why AI + Python?
I decided to (try to) use AI to build something with Python for several reasons:
- AI’s abilities are proportional to the quantity and quality of relevant code it has stolen… er, been trained on, so it does best with popular languages like Python.
- One of AI’s best use cases today is learning new languages. I’ve been doing mostly Elixir the past ten years and know JavaScript pretty well, and I want to see how well AI can teach me. I’ve been reading Python books and asking LLMs Python questions in recent weeks, so it’s definitely time to get hands-on.
- I love data analysis and have been re-learning R. (Elixir is surprisingly good at stats and ML too.) Python is extremely popular among data analysts and has fabulous tooling for stats and data visualization, so I’d love to unlock the Python data ecosystem.
(I’ve intentionally avoided Python for many years because I dislike: object-oriented programming, mutability, calling methods on objects (rather than pipelining data through functions), certain Python syntax, Python’s lousy backward compatibility, and how hard it is to manage Python environments.)
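To illustrate the methods-versus-pipelines point with a tiny, contrived example:

```python
# Illustrative only: the two styles I'm contrasting.

# Typical Python mutates objects in place via method calls:
words = ["banana", "apple", "cherry"]
words.sort()       # mutates `words`
words.reverse()    # mutates it again

# The style I prefer: pass immutable data through a pipeline of functions.
result = list(reversed(sorted(["banana", "apple", "cherry"])))
# (In Elixir: ~w(banana apple cherry) |> Enum.sort() |> Enum.reverse())
```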
A (shockingly expensive) day with Google Gemini 2.5 Pro
Monday around noon, an idea for a website struck me and I started using Gemini 2.5 to build it with Python, VS Code, and the Cline plugin. Progress was impressively smooth and fast. I built something quite similar to my vision using only language prompts. I intentionally avoided modifying the code directly.
After working a while, I started getting rate-limited. My workflow slowed way down. Since I had heard Gemini 2.5 was much cheaper than Claude, I decided to pay for it to maintain my momentum. I gave Google my credit card and got back to work.
I didn’t think a few hours of coding could possibly cost much, so I didn’t think much about money. But as my code grew, I noticed the cost numbers seemed to be creeping up. IIRC, I was initially seeing $0.20-$0.40, so I went to Google’s page to see how much I had spent so far. IT TOLD ME $0! Perhaps it gave me a free week?
To be safe, I set an alert to notify me when my spending hit $25. I figured that within a few days I would receive the alert and have a sense of how affordable this was.
I then returned to programming. I grew more suspicious as the number crept up to more like $0.70-$0.80 late in my session. My Google Cloud Billing page still told me I had spent $0. But I noticed it did say “Your costs are usually recorded within 24 hours.” “USUALLY” within “24 hours”!?! Aren’t these computers?!? Why is latency potentially > 24 hours?!?
Anyhow, I was excited about what I was building.
I felt pretty good about the app’s functionality and planned to start my day Tuesday by enhancing the UI, performance (I was using Flask’s development server and needed to productionize this with a proper Python WSGI server), and security (I have admin pages requiring authentication)…
…until I refreshed my Google Cloud Billing page and saw I owed $140!!!!
WTF!?!? Google still told me $0 when I last checked. Daaaaaaamn, Google!!!
I felt especially upset thinking back to a series of apologies Gemini had made to me when it had repeatedly failed to handle a refactoring involving CSS. It looped for a while, costing me more and more money before I told it to stop.
I suspect a big reason for the massive expense is that my VS Code / Cline setup was sending tons of data – all my code and my prompts??? – to Gemini with each request. The large context window means Gemini can evaluate more of your data, but – as far as I can tell – Gemini doesn’t cache this data between API calls, so re-sending everything on every request racked up large fees. I noticed that “frivolousfidget” warned of this a couple days ago on Reddit:
Pricing is relative, without prompt caching discount gemini 2.5 can be way more expensive.
This may change:
Other production Gemini models have prompt caching, and so will 2.5 but it isn’t prod yet. The Gemini caching discount is not as good as Claude’s.
– funbike
Lesson painfully learned! Like the first time you spin up a cloud server then discover months later that you forgot to shut it down.
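To make that concrete, here’s the kind of back-of-the-envelope math that explains the bill. The rates and request counts below are my own illustrative assumptions, not actual Gemini prices:

```python
# Why agentic coding without prompt caching gets expensive: the full context is
# re-sent (and re-billed) on every request. All numbers below are assumptions.
input_rate = 1.25        # $ per million input tokens (assumed)
output_rate = 10.00      # $ per million output tokens (assumed)

requests = 300           # agentic tools fire many API calls per session (assumed)
context_tokens = 60_000  # code + chat history re-sent each time (assumed)
output_tokens = 1_000    # typical response size (assumed)

input_cost = requests * context_tokens / 1e6 * input_rate    # $22.50
output_cost = requests * output_tokens / 1e6 * output_rate   # $3.00
print(f"${input_cost + output_cost:.2f}")  # $25.50 – and the context only grows
```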
Side note: New pricing option alert
As I was writing this blog post, Anthropic announced new high-usage subscriptions, which might actually make sense for full-time coders, given that I racked up a $140 charge in half a day of Gemini usage!
Customers can pay $100 per month for five times the amount of usage as the company’s Pro plan, or $200 per month for 20 times the amount. OpenAI’s ChatGPT Pro, which is comparable to Claude’s Max tier, costs a flat rate of $200 per month. …Claude’s Max plan is a step above Anthropic’s free offering and the $20 per month usage tier.
Anthropic steps up competition with OpenAI, rolls out $200 per month subscription
Adventures in running a coding LLM locally via Ollama and using it in VS Code
After the shock of my $0 bill becoming a $140 bill literally overnight, I set aside my project and decided to focus instead on running an open-source LLM on the desktop I built last fall with an NVIDIA RTX 4070 Ti Super, which has 16GB of VRAM. This is more than enough VRAM to run a 7 billion parameter LLM fully within the GPU and potentially enough to run a 14 billion or 32 billion parameter model with help from system RAM & CPU.
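The back-of-the-envelope math for model weights goes like this (it ignores the KV cache and other runtime overhead, so real requirements run higher):

```python
# Rough VRAM needed just for a model's weights (ignores KV cache & overhead).
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    # params * (bits / 8) bytes, expressed in GB
    return params_billions * bits_per_weight / 8

print(weight_vram_gb(7, 16))   # 14.0 GB – a 7B model at fp16 just fits in 16GB
print(weight_vram_gb(14, 4))   # 7.0 GB – a 4-bit-quantized 14B fits easily
print(weight_vram_gb(32, 4))   # 16.0 GB – a 4-bit 32B needs help from system RAM
```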
(I will complete and deploy the project I started building Monday and will announce it on my blog when I complete it. I already have a domain set up. But I can’t just burn wads of cash on AI coding.)
In Reddit forums, I see many people struggling to get locally running LLMs to work well for AI coding. It’s easy to run models locally with Ollama and query them, but it’s much harder to use them for coding.
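The easy part really is easy: getting a response out of a local model is one HTTP call to Ollama’s default local API. A minimal sketch, assuming you’ve already pulled the model:

```python
# Minimal sketch: query a local Ollama model over its default HTTP API.
# Assumes `ollama serve` is running and you've pulled a model, e.g.:
#   ollama pull qwen2.5-coder:7b
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:7b",  # any locally pulled model tag
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,  # one JSON response instead of a token stream
    },
)
print(resp.json()["response"])
```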
If I had a spare $2,200+ burning a hole in my pocket, the best current solution would be a Mac Mini M4 Pro with 64GB of RAM. Apple’s unified memory lets the GPU use system memory, so with 64GB you can run reasonably large models – not nearly as large as the top models, but pretty large ones. And I’m drooling over NVIDIA’s forthcoming DGX Spark machines, designed for powerful local ML/AI tasks!
Sadly, I’m limited to my 16GB 4070, but I am able to run decently large models on it.
I’ve now run a bunch of open-source LLMs locally, but only a couple work half decently with the two VS Code plugins I’ve been using, Cline and Roo Code. Some LLMs even work significantly better in one of these plugins than in the other. But the main problem is that these plugins are optimized for Anthropic’s Claude Sonnet and barely work with many other LLMs.
Given how expensive AI coding with the most powerful models is, and how hard it is to get less powerful open-source LLMs running locally to do what you want, a promising hybrid approach is emerging: use the large, expensive models to develop detailed project plans, then hand those plans off to a dumber, free, local LLM to implement. It’s similar to how architects and senior engineers might plan a large project and break the work into a sequence of changes implementable by more junior developers who lack the broad context and wisdom of their more experienced colleagues. For example, GosuCoder is pursuing this approach.
Another approach to separating planning from implementation is Roo Code’s new Boomerang Tasks, which I intend to try in hopes of making my local LLM more productive through better prompts developed by superior LLMs. (Tools like Cline and Roo Code enable you to switch AI models and to specify one model for planning and another for implementation.) For more info: Reddit users discussing the new Boomerang Tasks, and documentation of Boomerang Tasks.
Can a local LLM AI code something useful?
I eventually found a few open-source LLMs that worked decently with either Cline or Roo Code. maryasov has modified some open-source LLMs to work with Cline/Roo. I found qwen2.5-coder-cline worked pretty well. I’m also now having moderate success with Microsoft’s Phi-4.
I just downloaded Gemma 3:27b, which runs slowly on my hardware but provides great answers to my questions. It didn’t work with Roo Code but does seem to work with Cline.
Also, even when it’s working well, it often surprises me by being simultaneously smart and dumb. For example, a moment ago it misunderstood the nature of a bug it had just introduced. I had to explain that it was a variable scope problem. Once I did, it correctly decided to pass the variable into the function as a new parameter, but then it created a whole new copy of the calling function rather than editing the existing one. This model seems poorly equipped to think several steps ahead and see the big picture, though this could be caused by poor prompting. It has been fumbling to solve what seems to me an easy-to-fix bug it created, despite my repeatedly tipping it off to what’s wrong.
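To illustrate the pattern (with hypothetical names – this isn’t my actual project code), the fix involved was roughly:

```python
# A reconstruction with hypothetical names -- not the actual project code.

# The bug: `secret_word` lives only in the caller's scope, so this raises
# NameError when called.
def check_guess(guess):
    return guess in secret_word  # NameError: name 'secret_word' is not defined

# The right fix (which the model eventually chose): pass the variable in.
def check_guess(guess, secret_word):
    return guess in secret_word

# The model's misstep: instead of editing the existing caller to pass the new
# argument, it duplicated the entire calling function.
```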
Even though my files and prompts are local, running the LLMs locally feels at least an order of magnitude slower than working with Gemini on Monday. My hardware is not remotely powerful enough to keep up with commercial AI services. But with patience – and detailed, specific, clear, logically sequenced prompts my local LLMs can tackle one small change at a time – I may still be able to make valuable use of them for coding.
I realized as I was writing up this post that I’ve been running Cline and Roo Code almost exclusively in “Act” mode, seldom in “Plan” mode. This may be part of why I haven’t gotten better results.
I managed to build a Hangman game entirely using prompts against local LLMs. It works, but it’s nothing special. You can find it here. I named it “word_games” because I hope to continue building it out to learn more about Python and using LLMs to write and test code. I’ll probably next add a word-unscrambling game and try to extract the shared random word selection logic into a separate module. I’ll also add more tests. (I presently have just one.)
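If I do that refactor, the extracted module might look something like this (a sketch with my own hypothetical names, not the repo’s actual code):

```python
# word_picker.py -- hypothetical sketch of the extracted shared module.
import random

def pick_word(words: list[str], min_length: int = 4) -> str:
    """Return a random word at least `min_length` letters long."""
    candidates = [w for w in words if len(w) >= min_length]
    return random.choice(candidates)
```

Both Hangman and the unscrambler could then share `from word_picker import pick_word`, and the selection logic becomes easy to unit-test in isolation.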
With thanks to Mohamed Nohassi for the photo shared on Unsplash