GPT Pilot Reviewed - A Glimpse into the future of development

Posted Mar 12, 2024 Updated Mar 12, 2024

By Thamara Andrade

17 min read

The year is 2040. You get out of your flying Tesla Model 74 right outside the hangar on the 62nd floor east of the skyscraper, walk by the cafeteria, grab a coffee, and sit down at your desk to start your workday. You open Jira (because it’s 2040 and yes, we still use Jira), just to be faced with an urgent task of developing an application for a customer.

Without much thinking, you open your Visual Studio Code. After all, you are an OG and don’t mess with the cool and new IDEs people use these days right on their Apple Vision 13 Pro Max Ultra Plus. You create a new directory for the new app, drag and drop the request and all the relevant extra collateral into that directory, open the terminal, and issue the command: “create-new-app”. Three minutes later, you are already deploying the application and closing the Jira ticket. After all, you are not a developer anymore, you are an AI handler now.

Jokes aside, this is not a post meant to scare you with any narrative of “the AI will steal your job”, but the advance of AI, especially generative AI, is undeniable.

As a software engineer, as much as I love coding, I wouldn’t object to a tool that would do most of the boring work for me, with little or no guidance whatsoever. And based on what we have seen related to AI these past few years, we may not be that far from a tool like this.

“The only way of discovering the limits of the possible is to venture a little way past them into the impossible.” - Arthur C. Clarke

GPT Pilot

Last December, I was numbly scrolling through Twitter (as one does between Christmas and New Year’s), and found out about GPT Pilot, an open-source project that was trending on GitHub that day. The repo read, “Dev tool that writes scalable apps from scratch while the developer oversees the implementation”. I was perplexed, extremely curious, and I was not alone.

Turns out, that was not the first time this repository was trending. Earlier in October, the repository almost doubled the number of stars in the span of a week, reaching almost 12k, just 2 months after being public.

I went through the README, watched a few videos, read some posts, and I was sold on the idea. I loaded some credits on my OpenAI account, downloaded the extension, and got to working on creating an app to see where it would land me.

The usage statement from GPT Pilot was very simple: you start by describing the application you want to build, then it asks some questions to clarify the request and pin down the specifications (currently it’s geared towards web/node-based applications). Once the specs are clear, it starts working on the application, aiming for a more autonomous execution.

In simple words, GPT Pilot works by defining AI agents for each step in the development of an application. If you are not familiar, you can think of an agent as a ChatGPT chat instance that you ask to act as a specific persona. For example, a Developer Agent takes a task as input and outputs the steps necessary to implement it, in a human-readable form. A Code Monkey Agent (as named in GPT Pilot) takes the developer’s description and any existing code, and returns the code that implements the items the Developer Agent specified. The difference from a regular ChatGPT conversation is that the agent has access to other tools beyond the LLM, being able to run the code autonomously, for example.

GPT Pilot implements an agent for most of the professionals involved in a usual development cycle. It includes, for example, a Product Owner, a Specification Writer, an Architect, a Reviewer, a Technical Writer, and so on. You can read about how it all works here.
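To make the hand-off between agents more concrete, here is a minimal sketch of the Developer to Code Monkey flow described above. It is not GPT Pilot’s actual code: the `Agent` class, `build_feature`, and `fake_llm` are hypothetical names made up for illustration, and `fake_llm` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class Agent:
    """An 'agent' here is just an LLM asked to act as a specific persona."""
    role: str
    llm: LLM

    def run(self, task: str) -> str:
        # The persona goes on the first line of the prompt, the task after it.
        prompt = f"You are a {self.role}.\n{task}"
        return self.llm(prompt)


def fake_llm(prompt: str) -> str:
    # Placeholder model: echoes the persona line so the flow is visible.
    persona_line = prompt.splitlines()[0]
    return f"[{persona_line}] response"


def build_feature(task: str, existing_code: str, llm: LLM) -> Tuple[str, str]:
    """Developer agent writes a plan; Code Monkey agent turns it into code."""
    developer = Agent("Developer", llm)
    code_monkey = Agent("Code Monkey", llm)

    plan = developer.run(f"Describe the steps needed to implement: {task}")
    code = code_monkey.run(
        f"Plan:\n{plan}\nExisting code:\n{existing_code}\nReturn the updated code."
    )
    return plan, code


plan, code = build_feature("show the net salary on the page", "", fake_llm)
print(plan)
print(code)
```

In a real setup, `fake_llm` would be swapped for a call to an actual model, and the agent would also get access to tools such as running commands, which is the part that sets it apart from a plain chat.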

So, I started playing with it, and the experience was far from optimal. I had set big, self-inflicted expectations, and what I was seeing did not match them. I found myself having to step in a lot, even just to click “next”. The AI was hitting some issues and was not able to recover from them. It was not a pleasant experience.

But the whole project was still just starting. I took off my metaphorical hat to the developers of GPT Pilot by adding a GitHub star to their collection, and carried on with my life.

GPT Pilot, now a YCombinator-backed company

Just last week, GPT Pilot popped up in my timeline again, this time with the news of its developing company, Pythagora, being backed by YCombinator. And to my surprise, GPT Pilot continues to be an open-source tool. With the exciting free time that a weekend brings, I decided to try GPT Pilot again.

I just clicked the extension that had been sitting in my VSCode for a few months, and with my OpenAI key already set up and some dollars to burn on my account, I clicked the big and purple (I love purple!) “Create New App” button, and what came next really made me think I was living in 2040.

If you are interested in a more tl;dr version, you can see the image below, and jump to the “Pros and Cons” section. But if you don’t mind the journey, let me start from the beginning.

Problem statement

For my app, I asked for a static website that would be able to calculate the net salary for an employee working under the Brazilian labor laws, starting from the gross value, and not only computing the net salary itself and displaying it but showing how the taxes and other deductions were calculated.

Initial prompt: I want to create a static and simple website that the user can use to check if their pay slip is correct. This will be based on Brazilian rules for computing taxes from the payment. The user should be able to enter their base salary, and the site will compute the tax as well as other related attributes (INSS, for example).

It followed up by asking me some questions regarding the deductibles, to which I answered that, on top of the tax and social insurance contribution it mentioned, I would also want it to consider that the employee could participate in a stock purchase plan.

With the problem stated, it asked me some more questions on the technical specification of the site: for example, whether it needed some sort of authentication and per-user data storage, or just anonymous, client-side computation. For the sake of simplicity, I chose the latter.

It then stated the overall description and goals for the app, as well as some technical details, and got to work.

Overall Description and Goals (as defined by GPT Pilot) The application is a web-based tool designed to allow Brazilian employees to verify the accuracy of their pay slips. It calculates various salary-related attributes, including INSS (National Social Security Institute) contributions, IRRF (Income Tax Retained at the Source), and FGTS (Service Time Guarantee Fund) based on the Brazilian CLT (Consolidation of Labor Laws) employment contracts. The application will provide users with the ability to enter their base salary, the number of dependents, and the percentage of income for the ESPP (Employee Stock Purchase Plan) program. The tool will process data on the client side, ensuring anonymity and data privacy.
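To give an idea of the kind of computation such a tool has to perform, here is a minimal sketch of a progressive, bracket-based deduction with a step-by-step breakdown, plus an ESPP percentage taken off the gross salary. The bracket limits and rates below are made-up placeholders, not the official INSS or IRRF tables, and the function names are hypothetical; this is not the code GPT Pilot generated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bracket:
    upper: float  # upper limit of the bracket
    rate: float   # rate applied to the slice of salary inside the bracket


# ILLUSTRATIVE numbers only; NOT the official Brazilian tables.
EXAMPLE_BRACKETS = [
    Bracket(1000.0, 0.05),
    Bracket(3000.0, 0.10),
    Bracket(float("inf"), 0.20),
]

# One breakdown row: (slice start, slice end, rate, amount deducted)
Row = Tuple[float, float, float, float]


def progressive_deduction(gross: float, brackets: List[Bracket]) -> Tuple[float, List[Row]]:
    """Apply each bracket's rate only to the slice of salary inside that bracket."""
    total = 0.0
    breakdown: List[Row] = []
    lower = 0.0
    for bracket in brackets:
        if gross <= lower:
            break  # salary does not reach this bracket
        slice_end = min(gross, bracket.upper)
        amount = round((slice_end - lower) * bracket.rate, 2)
        breakdown.append((lower, slice_end, bracket.rate, amount))
        total += amount
        lower = bracket.upper
    return round(total, 2), breakdown


def net_salary(gross: float, espp_percent: float, brackets: List[Bracket]) -> float:
    """Gross salary minus the progressive deduction and the ESPP contribution."""
    deduction, _ = progressive_deduction(gross, brackets)
    espp = round(gross * espp_percent / 100, 2)
    return round(gross - deduction - espp, 2)


total, rows = progressive_deduction(2500.0, EXAMPLE_BRACKETS)
for start, end, rate, amount in rows:
    print(f"{start:.2f} to {end:.2f} at {rate:.0%}: {amount:.2f}")
print("total deduction:", total)
print("net salary:", net_salary(2500.0, 10.0, EXAMPLE_BRACKETS))
```

The breakdown rows are what would let a site show how each deduction was calculated, rather than only the final number.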

GPT Pilot - The app is done!

Two hours and some interactions with me later, I got the application I asked for. In total, ~300 lines of JavaScript, HTML, and CSS (yes, it even included some styling that I hadn’t asked for, but I was glad it thought of it). But how was my experience?

[Image: Application completed]

Autonomy

I was astonished by the capabilities of the tool. It did require my attention in some parts, but mostly, it was able to guide me to get the information that it needed without much hassle. In total, there were fewer than 10 times where it required me to act on something. These included “allowing” it to run some npm install commands or similar, and the more “complicated” UI testing.

The way it approached the problem included steps for:

- Boilerplate code/UI components
- Implementation for the core logic (computation)
- Displaying the computed data on the interface
- Including the description/details on how everything was computed
- Styling and making things pretty

[Image: Task progression]

Whenever the flow finished a step, it instructed me to open the site locally to test, providing literal commands and step-by-step instructions. On the occasions when something was not behaving as it should, it would further instruct me to gather more information from the terminal, or to inspect the HTML itself and provide the relevant tags and data. With my input, it would carry on the troubleshooting phase until I gave the green light.

[Image: Human test request]

I suspect that there’s still a lot in this debugging process that can be automated, and I’m curious to see how GPT Pilot will carry on these automations. Compared to my experience 3 months ago, this tool has already come a long way, and I bet it will be able to cut down on the need for human oversight even more.

Correctness and efficiency

My proposal was indeed a simple task, but I was caught by surprise by how much the tool was able to accomplish with such brief requirements on my part. I didn’t include any details on what the taxes were, or how the other deductions were computed. It got all that from the LLM’s knowledge of the matter. I didn’t mention how to structure the site or how it was supposed to look. Nothing. GPT Pilot figured it all out itself.

But it’s not all flowers. After around 30 min, when it was working on displaying the computed data on the page, it bumped into an issue where things were simply not showing up. As someone who knows a bit of HTML and JavaScript, I knew the issue was that the container’s display style was never switched from “none” to “block” (in other words, the container’s visibility was not being updated). It would have been very simple to solve: just call one function on the JavaScript side! I did try to hint at this to the AI while executing the debugging steps it asked of me, but the AI completely ignored my suggestions.

This led to it rewriting most of the JavaScript for the application. It was around this time that it decided to make the code asynchronous somehow, which was totally uncalled for given the low complexity of the site. It got things working in the end and proceeded to the following step, but those decisions would cost in the long run. More specifically, they made the whole process take more time: in the subsequent steps, the Promises and whatnot that were added started causing problems, and it took a while for it to recover.

[Image: Troubleshooter in action]

I wonder how much faster it would have finished if it had just listened to me. This probably would matter less if the tool were literally doing everything by itself and running in the background while I, the handler, focused on something else. I would like to see some mechanism where I, the human overseeing the work, could actively stop it from taking wrong or more complicated paths.

Another cool feature I would like to see is some way for the user to return to a specific step, for example, by using version control to commit each step as it is implemented and tested. Not only would such a feature be helpful in scenarios where the AI veers in a direction I don’t want, but it would also document each small step of the development, as any developer (human or AI) should be doing.

Cost

I mentioned the cost of time as a negative side effect of the GPT Pilot autonomy, but that’s not all. It’s not cheap to use it either, money-wise.

This little static site with a little more than 300 lines of code cost me $21.41 USD! That’s about 7 cents per line! In the LLM world, the exchange is set on tokens, not lines of code, and this project (as shown on the GPT Pilot window) used a little over 2 million tokens! By the common rule of thumb of about 0.75 English words per token, two million tokens are roughly equivalent to 1.5 million words, which is comparable to multiple very long novels. Can you imagine a conversation with ChatGPT spanning the length of multiple very long novels?
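As a quick back-of-the-envelope check of these figures, the snippet below works out the cost per line, the tokens spent per line, and the effective price per million tokens. It assumes round numbers (300 lines and 2 million tokens), which are approximations of the “a little more than” values reported above.

```python
# Figures from this project, rounded (the real counts were slightly higher).
total_cost_usd = 21.41
lines_of_code = 300
tokens_used = 2_000_000

cost_per_line = total_cost_usd / lines_of_code
tokens_per_line = tokens_used / lines_of_code
cost_per_million_tokens = total_cost_usd / (tokens_used / 1_000_000)

print(f"cost per line: ${cost_per_line:.3f}")
print(f"tokens spent per line of final code: {tokens_per_line:.0f}")
print(f"effective price per 1M tokens: ${cost_per_million_tokens:.2f}")
```

The thousands of tokens spent per line of final code hint at how much of the usage goes into the conversation around the code rather than the code itself.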

But I’m not making a fair comparison as the tokenization of code differs a lot from narrative text. I asked ChatGPT what would be the estimated size of a codebase with this amount of tokens, and it said, “2 million tokens could easily represent a large JavaScript codebase, possibly spanning tens of thousands to hundreds of thousands of lines of code, depending on factors like code complexity and formatting”. Based on my knowledge, I think this is a fair guess.

The amount of tokens necessary is most likely due to the back-and-forth of sharing the code being worked on among the many agents involved. My educated guess is that many of these come from the review and troubleshooting iterations.

It’s worth mentioning that GPT Pilot was using the latest GPT-4-Turbo model, which is indeed a lot more expensive than its 3.5 counterpart. GPT-3.5-Turbo charges $0.50 per 1M input tokens, while GPT-4-Turbo charges $10.00 per 1M input tokens (no, there’s no extra 0 here, GPT-4-Turbo is 20x more expensive than the 3.5 version).
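To put those prices in perspective, here is a rough estimate of what roughly 2 million tokens would cost on each model. This is a simplification: it bills every token at the single per-1M price quoted above, ignoring any difference between input and output pricing, so it will not match a real invoice exactly. The dictionary and function names are just for illustration.

```python
# Prices per 1M tokens as quoted in this post (flat-rate assumption).
PRICE_PER_MILLION_USD = {
    "gpt-3.5-turbo": 0.50,
    "gpt-4-turbo": 10.00,
}


def estimate_cost(tokens: int, model: str) -> float:
    """Estimated cost in USD, assuming every token is billed at the same rate."""
    return round(tokens / 1_000_000 * PRICE_PER_MILLION_USD[model], 2)


TOKENS_USED = 2_000_000  # approximate usage for this project

for model in PRICE_PER_MILLION_USD:
    print(f"{model}: ${estimate_cost(TOKENS_USED, model):.2f}")

price_ratio = PRICE_PER_MILLION_USD["gpt-4-turbo"] / PRICE_PER_MILLION_USD["gpt-3.5-turbo"]
print(f"price ratio: {price_ratio:.0f}x")
```

Under this flat-rate assumption the GPT-4-Turbo estimate lands close to the $21.41 I actually paid, while the same token count on GPT-3.5-Turbo would have cost about a dollar.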

I may have missed something during the setup, but I could not find a way to avoid using the more expensive model. GPT Pilot does seem to offer a way to use a local LLM, but I didn’t investigate further.

Also, Pythagora provides some subscription options as alternatives to creating and using your own OpenAI key. These range from a Pay as You Go model, where it somehow offers the same GPT-4-Turbo model with a 10% discount on top of the OpenAI price, to other modes where you are charged by the hour of GPT Pilot usage with some other limits involved.

I wonder how much of their business model will support the flexibility of different LLMs and providers in the future.

Considering today’s cost, and assuming it will grow at least on par with the complexity of the solution, what may look like a huge bill to those who, like me, have the “I can totally do this myself” gene can actually be a saving for others. As the technology improves, and GPT Pilot itself enhances its workflow, we can leverage this tool as a helper, not a replacement. It does the heavy work of repetitive tasks, so developers can focus on more complicated and creative parts of development. This cooperation between AI and developers doesn’t reduce the importance of human skill but instead makes it stronger, making work more productive and scalable.

Pros and Cons

Here’s the breakdown of the advantages and drawbacks I have so verbosely laid out so far:

Autonomous Development
- Pros: GPT Pilot can handle the majority of the development process, reducing the need for human intervention.
- Cons: The user still needs to oversee the work, waiting for when the tool will need attention.
User Feedback
- Pros: Speedy generation of code with minimal input from the developer, accelerating the development cycle.
- Cons: User feedback is not considered during development, potentially steering away from the best solution.
Accuracy
- Pros: Demonstrates a good understanding of the requirements and can produce correct code based on very brief descriptions.
- Cons: Relies on the expertise of the model, which can be expensive.
Cost
- Pros: Enables the rapid creation of applications, potentially reducing the time-to-market of new products and tools.
- Cons: The usage of GPT Pilot can be expensive, particularly with the use of powerful models like GPT-4-Turbo.
Flexibility
- Pros: Offers flexibility in terms of project scope and complexity, accommodating various development needs, supporting most well-known libraries.
- Cons: Currently focused on web and node-based solutions.
Promising Future Potential
- Pros: Continuous updates and improvements suggest the tool may become even more proficient over time.
- Cons: We still have to figure out how the company’s business model will impact access to it.

Would I use it again?

In the most software-engineer way possible, the answer is: it depends. I believe the cost of the underlying technology today would be a blocker for the development of more complex applications for most individuals, especially if the objective is a side project without any funding. But it might be less of a concern for a startup that wants to quickly put together a minimum viable product.

Without a clear necessity, I can’t justify to myself the cost of using it again in a “production” setting. As someone who loves to program, try out new technology, and solve issues that I face (and believe others face as well) with solutions in the form of open-source tools, a tool like GPT Pilot would be a game-changer! Many times I do have the ideas, but lack the will to overcome the inertia of getting started. Such a tool could most definitely help with this, but at what cost?

As I do have some credits still on my account, I will most likely play with GPT Pilot a bit more, aiming not for a fully working solution, but a solution that will at least pave the ground for my human-powered knowledge and capability to conclude. It would also be very interesting to see how it relates to other AI tools like GitHub Copilot.

Final Thoughts

All that being said, I do think the work done on GPT Pilot has been nothing short of astonishing. Being able to see how much progress was made in just 3 months makes me anxious to know where this, and also generative AI as a whole, will be in 3 more months. The elegant simplicity amidst the complexity of this solution is truly inspiring, and although I’m not into the whole rooting for a company thing (besides the one I work for, for obvious reasons), I’ll make an exception and will continue following and cheering for the success of this project.

At the same time, it’s funny to see that my main complaints about the tool are the cost, which will eventually get lower, and the balance between AI autonomy and human oversight. Some will say that I’m just being overly protective of my job, but I would argue that no AI will completely replace a human’s capability of critical thinking.

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka’ but ‘That’s funny’.” - Isaac Asimov


This post is licensed under CC BY 4.0 by the author.