Why 30 Days of Agent Review Beats Prompt Engineering

SaaStr AI 2026 produced a rare artifact: a detailed, unglamorous operational playbook for running a B2B company with three humans and 21 AI agents. Jason Lemkin, SaaStr's founder, and the Replit team (CEO Amjad Masad, CRO Cody, and the company's president) laid out exactly how they built agents that compound in capability over time. The takeaway for any B2B operator is that the real moat is not prompt engineering. It is the accumulated labeling of agent outputs over weeks.

The energy in the room felt like 2015 again. Lemkin argued that this is a market signal. Pre-AI, you could measure any B2B startup by story points. The winners showed story points compounding faster than headcount, quarter after quarter. With agents, that compounding only started in January through March of 2026, after Claude 4.5 and 4.7. Before December 2025, most of mainstream tech was not taking agentic engineering seriously. The early teams, including Replit, were already showing radical productivity gains, but the broader market had not woken up. Today the most agile teams are pulling ahead exponentially, and the gap is widening every week.

If your CTO cannot show radical, compounding productivity gains right now, you are going to lose in the market. The level of competition is unprecedented. Replit is in one of the most competitive spaces in tech. The only way to survive is to amplify human output by 10x or 100x with agents.

The Replit Agent Stack: 10K, QBee, and the GTM Layer

Replit's internal agent stack is built on its own platform. The two primary agents are 10K, an autonomous VP of Marketing with 14,230 lines of code, and QBee, an autonomous VP of Customer Success built by Amelia. 10K runs Monday standups, sends campaigns, and pulls metrics. QBee manages 100+ sponsors, delivering daily QBRs (quarterly business reviews done daily) and surfacing churn risks.

The full GTM stack includes Qualified for inbound, Artisan for outbound, Agentforce for re-engagement, Momentum for revenue intel, and Monaco for deal coordination.

The results: $1M+ in revenue from AI-qualified leads, 72% open rates on Agentforce win-back campaigns, and 70% fewer customer success hours than a comparable B2B media operation.

The 10K Email That Destroyed Matching Software

Lemkin described a specific moment weeks before SaaStr AI Annual. VC attendance was light relative to overall attendance, which was tracking at 143% of prior year. He asked 10K to investigate. The agent's first reply was deflective. After pushing back, 10K returned with a material gap: 152 VCs registered vs. the prior year's count.

"I said: write the world's best email to these VCs to come. Start with Bloomberg Beta – they were an early Replit investor, let me see what you can do."

What 10K produced was not just a great email. It assembled every adjacent investor attending, every Replit competitor attending, every portfolio company of Bloomberg Beta attending, and constructed an argument that was unanswerable. Lemkin then asked it to do the same for every CEO of every company he has invested in. It went through all 8,000 attendees and built personalized matching for each one. It destroyed the existing matching software.

One output went to a founder. The email uncovered that eight members of his top competitor's management team were attending, plus every adjacent player in his market. Then 10K reasoned through whether the event was worth his time given that vertical SaaS was lightly attended. The founder wrote back: "That's the best marketing email I've ever received."

Why Product Fluency Is the New Sales Leading Indicator

Cody, Replit's CRO, correlated internal Replit usage by each sales rep against quota attainment. The correlation was 1:1. Reps using the product themselves were hitting quota. Reps who were not, were not.

The mechanism is straightforward. When the product changes weekly, a rep who built on Replit that morning can answer a customer's question that afternoon. A rep who has not touched the product in two weeks gives answers based on a version that no longer exists.

Lemkin noted that Cody is the best salesperson he has encountered in this category. Cody tells you why something works, why something does not, why something might change next quarter, and why a specific integration is weaker than another. In a world where software did not change for 10 years, you did not need a sales rep to explain why. You just needed them to take the order. In a world where the product changes weekly, "why" is the entire sales motion.

Practical rule: Most reps still say 'let me check on that.' The best ones say: 'honestly, the Snowflake integration won't be at parity with Databricks for at least two quarters, here's why architecturally, and here are the 10 customers who've made it work anyway.'

Every B2B sales leader should measure product fluency on their team as rigorously as they measure pipeline. It is the new leading indicator.
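One way to put a number on product fluency is to correlate each rep's product usage with their quota attainment, as Cody did. The sketch below is a minimal illustration under stated assumptions: the field names (product_sessions, quota_attainment) and the sample figures are hypothetical, and the source does not say how Replit actually computed its correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: sessions in the product last quarter vs. share of quota hit.
reps = [
    {"rep": "A", "product_sessions": 2, "quota_attainment": 0.45},
    {"rep": "B", "product_sessions": 15, "quota_attainment": 0.80},
    {"rep": "C", "product_sessions": 30, "quota_attainment": 1.10},
    {"rep": "D", "product_sessions": 41, "quota_attainment": 1.25},
]

usage = [r["product_sessions"] for r in reps]
attainment = [r["quota_attainment"] for r in reps]
print(f"usage vs. attainment correlation: {pearson(usage, attainment):.2f}")
```

A coefficient near 1.0 would support treating usage as a leading indicator. With only a handful of reps it says nothing about causation, so it is best read alongside pipeline rather than instead of it.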

The Labeling Moat: Why 30 Days of Review Beats Prompt Engineering

Lemkin and the Replit team disagreed on a key point during the live conversation. Lemkin said: do not rebuild any SaaS that already exists. If a great tool exists, buy it. Cody pushed back: Replit's internal teams have someone on every functional team tasked with asking "can we build this ourselves before we buy it?" That discipline forces the team to understand where the actual gaps in their stack are.

The real insight is about how agents learn. Lemkin described the process with 10K. Every time the agent writes something great, he tells it that. Every time it writes something terrible, he tells it that too, sometimes in all caps. Over hundreds of interactions, something happens in the memory layer, in the context window, in whatever Replit is doing under the hood. The agent is learning what "good" means on a per-action basis.

Key insight: We're not prompt engineering. We're inadvertently labeling training data, one interaction at a time. Every 'yes good / no bad' you give an agent is a label, in the data science sense of the word. Do it 100 times on the same type of task and the agent's output on that task crosses a threshold.

That is the moat. A one-sentence prompt today produces output that took 50 prompts to produce three months ago. The accumulated positive and negative reinforcement has compounded inside the session memory.

The practical implication for teams: spend 30 days reviewing every output before it goes out the door, marking yes/no, and fixing not the individual output but the process that produces the outputs. By day 30 you have something that reliably hits 80% of what your best person produces. By day 60, a new model drops and you are at 90%. Teams that skip the labeling phase never get there.
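The 30-day review habit can be tracked with something as simple as a log of yes/no labels per task type. The sketch below is one possible shape for that log, not a description of what Lemkin or Replit run: the Review record, the task-type names, and the sample entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    day: int          # day of the review period, 1 through 30
    task_type: str    # hypothetical label such as "sponsor_email"
    approved: bool    # the yes/no label from the human reviewer
    note: str = ""    # what to fix in the process, not just in this output


def approval_rate_by_task(reviews, first_day, last_day):
    """Share of approved outputs per task type within a window of days."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for r in reviews:
        if first_day <= r.day <= last_day:
            totals[r.task_type] += 1
            if r.approved:
                approved[r.task_type] += 1
    return {task: approved[task] / totals[task] for task in totals}


# Hypothetical review log for a single task type.
reviews = [
    Review(1, "sponsor_email", False, "too generic"),
    Review(2, "sponsor_email", False, "wrong tone"),
    Review(3, "sponsor_email", True),
    Review(28, "sponsor_email", True),
    Review(29, "sponsor_email", True),
    Review(30, "sponsor_email", False, "stale metric"),
    Review(30, "sponsor_email", True),
]

week_one = approval_rate_by_task(reviews, 1, 7)
week_four = approval_rate_by_task(reviews, 24, 30)
print("week 1:", week_one)
print("week 4:", week_four)
```

Comparing the first and last week per task type shows whether the labeling is compounding. The notes on rejected outputs are where the process fixes come from.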

N=1 Apps: When to Build vs. Buy in the Agent Era

Lemkin runs a headless Salesforce inside 10K. Not because Salesforce is bad. Logging into Salesforce to find a dashboard that has not been updated in three weeks is friction he will not tolerate. He built dashboards inside 10K that pull from Salesforce's API and show every sponsorship, every deal, every ticket in real time. For the first time in his career, he has dashboards that are current.

Another example: Visible, SaaStr's 15-year-old event CRM, has not shipped a feature since 2019. SaaStr was on the verge of churning. 10K interfaced with the Visible API and found that the documented API could not do everything needed. The undocumented endpoints could. SaaStr is renewing Visible and investing more in it. The UI is dated. The API works. That is all that matters.

The parking pass example: someone on the team used to spend a week manually preparing parking passes for 5,000 SaaStr attendees. That meant slicing a 5,000-page PDF, finding the right page, and mailing it to the right person. The Replit agent did it in a few hours. It takes the PDF, slices it, matches each page to the right attendee, and routes it correctly. That is an N=1 app no one will build as a product because the use case is too specific. For SaaStr, it is worth thousands of hours.
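The matching step in the parking pass job can be sketched in a few lines. This is an illustration only: it assumes the text of each PDF page has already been extracted by a PDF library (not shown), and that pages can be matched on the attendee's name. The source does not say how the Replit agent actually matched pages, and match_pages and the sample records are hypothetical.

```python
import re


def normalize(text):
    """Collapse whitespace and ignore case so names compare reliably."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def match_pages(page_texts, attendees):
    """Map each attendee email to the index of the one page naming them.

    Attendees with zero or multiple matching pages are returned separately
    so a human can review them instead of mailing the wrong pass.
    """
    normalized_pages = [normalize(page) for page in page_texts]
    assignments = {}
    unmatched = []
    for attendee in attendees:
        name = normalize(attendee["name"])
        hits = [i for i, page in enumerate(normalized_pages) if name in page]
        if len(hits) == 1:
            assignments[attendee["email"]] = hits[0]
        else:
            unmatched.append(attendee["email"])
    return assignments, unmatched


# Hypothetical extracted page text and attendee records.
pages = [
    "PARKING PASS\nJane  Doe\nLot B",
    "PARKING PASS\nSam Lee\nLot A",
]
attendees = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "Sam Lee", "email": "sam@example.com"},
    {"name": "Pat Kim", "email": "pat@example.com"},
]

assignments, unmatched = match_pages(pages, attendees)
print(assignments)
print(unmatched)
```

Routing anything ambiguous to a review list rather than guessing keeps the agent's output checkable, which matters when 5,000 passes go out at once.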

Social/Agentic Engagement: Where Agents Exceed Humans

Cody mentioned someone on his team was building an agent to engage with LinkedIn comments on Amjad's posts. Lemkin warned about the terrible version: someone comments "great post" and gets a templated reply that says "thanks, would you like to try our product?" That is spam.

The transformative version: someone posts "I'm vibe-coding on Replit using the non-native Clerk integration, can't get the private key to work, very frustrated." An agent detects that they are in your ICP, reads their post carefully, and responds with: "Yes, that specific Clerk integration has three known issues. Here's exactly what's happening and here are the three fixes that have worked for other builders. Happy to jump on a call if any of these don't resolve it."

Practical rule: That's not 80% of a human. That's 120% of a human. No human salesperson has the context, the time, or the technical depth to do that for every single ICP comment on every single platform every single day.

Lemkin keeps saying agents can hit 80% of your best human at most sales tasks. Social/agentic engagement is the one area where agents can exceed humans. The volume and personalization required is beyond what any human can sustain. Every vibe-coder on the planet has issues. If your agent can credibly answer their questions in real time across every platform, you have built a sales motion no competitor can match. The first B2B company to nail this owns their category.

Programming in English: The API Consumption Shift

Amjad has been saying "programming in English" for a while. Two months ago Lemkin would have said that is CEO marketing. Today he thinks it is literally true. He is building production apps on Replit with one-sentence prompts. He does not design prompts anymore. He does not write specs. He has enough context accumulated in the agent that a single sentence produces working software. That is not best practice for someone new. Once you have put in the reps with a specific agent, English is the language.

Amelia is less technical than Lemkin. She knows some HTML. She has not built what he has built on the software side. Because of APIs and agents, she can build anything. Until six months ago, APIs were for developers. Today any operator with judgment and context can build on top of any API. The democratization is not of software development. It is of API consumption. Any data that was locked in any system can be liberated and reshaped by anyone with an agent and an API key.

The next 18 to 24 months will be the best of our careers, Lemkin argued. In two years, swarms of agents will be commonplace. The line between software and "just talking to an agent" will blur. We are not heading toward the democratization of software. We are heading toward post-software, where agents invisibly spin up whatever they need behind the scenes and you never know it happened. For B2B operators, the window to build the labeling moat is now.
