Ops lead at Scale AI on using Claude Cowork & Codex for QC automation and multi-tool debugging at scale

Background

We spoke with an ops lead at Scale AI running Claude Cowork, Codex, and Cursor across QC automation, computer-use agent evals, and internal tooling at scale.

The conversation covers how agentic workflows have evolved from unreliable experiments to production-grade systems, where multi-tool chains still break, and what it would take for non-technical teammates to adopt them broadly.

Key points via Sacra AI:

Agentic coding tools are moving from reactive chat to autonomous multi-tool orchestration, but reliability degrades sharply past four or five connected systems—and the gap between power users and the rest of the org is mostly a UX and permissions problem, not a capability one. "Cowork didn't have permissions to go into GitHub and do a pull request automatically—and it asked the user to grant access through GitHub settings. That user had never used GitHub before, didn't know where to find those settings, and eventually gave up. It was a really good idea that could have saved a lot of time for many people, but because the questions were too technical for someone who had never coded before, they stopped using that automation entirely."
Claude Cowork and Codex have crossed a reliability threshold for single-tool and two-to-three-tool workflows—reaching 85%+ accuracy on QC flagging after rubric-level prompt engineering—but multi-tool chains of four or five systems (Linear, Airtable, Monday, Slack, internal hub) still break in ways that take days to debug, because one hallucination or API failure propagates downstream before any agent catches it. "When it handles four or five-plus tools at the same time, where each tool has to communicate with the others, mistakes can happen—one hallucination or error gets passed to the next tool, and the chain continues from there."
The mechanic that unlocked 85% QC accuracy wasn't better prompts alone—it was decomposing a binary pass/fail judgment into 25–30 rubrics, so the model could localize errors to a specific dimension rather than reasoning over a raw prompt-and-answer pair, then routing flagged items to specialists via Slack with task ID, error category, and spec-doc comparison attached. "With twenty-five to thirty rubrics, you can go through each one, analyze it, and figure out exactly where the error is. Then you can train it on typical rubric errors, which are easier to fix than trying to fix an entire task when you only have SFT data with just a prompt and a final answer."

Questions

Between Claude Cowork and OpenAI's Codex app, which have you actually used or tried? And what else is in your regular workflow—Claude Code, Cursor, ChatGPT, Claude web, VS Code, GitHub?
When did you first try Claude Cowork and Codex, and how often are you using them now? And what kind of work were you originally hoping they'd take over?
Could you walk me through the last few things you personally delegated to Claude Cowork or Codex end to end? What did you ask it to do, what context did you give it, and what came back?
Can you walk me through that QC workflow in a bit more detail? Where do the task submissions and audit feedback live, what does Cowork actually compare, and what does the flagged output look like when your team reviews it?
When it flags something, how reliable has that been in practice? Can you remember the last time it caught a real QC issue correctly, or a time it flagged something that turned out to be wrong?
What changed the most in getting there: better prompt instructions, cleaner CSV structure, better QC spec docs, examples of past judgments, or something else?
Walk me through the handoff after Cowork flags those rubric-level issues. What does the specialist actually see, and do they work inside your internal platform, a spreadsheet, Slack, or somewhere else?
Could you walk me through the last time a specialist disagreed with Cowork's flag? What was the mismatch, and did that lead you to change the rubric, the prompt, or the workflow?
How do you decide whether a task goes to Claude Cowork versus Codex versus Claude Code, Cursor, or just a normal ChatGPT or Claude chat? What are the signals that make you pick one over the other?
When you look across your team's usage, what kinds of tasks have moved from "we wouldn't delegate this to an agent" to "we now hand this off pretty naturally"? Use cases first, with a rough sense of frequency if that's easy.
Yes—what are one or two concrete workflows that, say, three or six months ago felt too risky or annoying to delegate, but now your team hands to Cowork or Codex pretty routinely?
Can you walk me through one recent computer-use agent eval end to end? What was the task, what did Claude Code or Codex generate or run, and what did your final cross-check catch?
Where does the agent still need the most human judgment in those evals? Is it defining the ground truth, interpreting screenshots, deciding pass/fail, or debugging when the VM run diverges?
Where have you seen these agents produce something that isn't really just code—like an operational dashboard, report, workflow, onboarding doc, or analysis—where code is more of the execution layer underneath?
On the taxonomy chatbot, can you walk me through the last change you asked it to make? What did you type in, what did it inspect, and what did it show you before you approved it?
When that agent proposes a taxonomy change, what does the review step look like for you? Are you reviewing a diff, a preview of the flow, a generated summary, or actually clicking through the workflow to verify it?
When you compare that to using a normal chatbot or Claude or ChatGPT in the browser, what feels different about using Cowork or Codex as an agentic app for this kind of workflow?
Where does that fire-and-come-back-later pattern work best for you today, and where does it still fall apart?
Can you walk me through a specific time a four- or five-tool workflow broke? At a high level is fine: what were the tools involved, where did the first mistake happen, and how did you catch it?
When you were debugging that, what would you have wanted the agent to show you? Like, a step-by-step trace, intermediate payloads between tools, confidence flags, rollback options—what would have made it feel controllable instead of chaotic?
On the context side, what does the agent usually need to know to run these workflows well? Like specs, API docs, internal terminology, examples of past runs—and where does that context live today?
Has there been a time where the agent made a mistake because it was using the wrong version of a spec, didn't know an internal term, missed a Slack decision, or didn't have access to some doc—like it changed the wrong workflow step because the taxonomy had been updated elsewhere?
Sure—for example, has the agent ever changed the wrong workflow step because the taxonomy had been updated somewhere else and it was working off an outdated version?
When that happens, how do you want the agent to behave instead? Should it pause and ask you, show possible recovery paths, retry with another tool, or escalate to a human after some threshold?
How has that changed your review process? After Cowork or Codex hands something back, what do you now inspect most carefully before you trust it?
What would an ideal audit trail look like for you there? Would you want commit-by-commit diffs, a narrative of decisions, tool-call logs, or some kind of replay where you can see exactly what changed at each step?
How much of this is still individual power-user behavior versus a normal workflow your broader team is adopting together?
What would need to change for those weekly or occasional users to adopt it more broadly? Is it templates, better debugging, safer permissions, more guided setup, or something else?
Can you walk me through one workflow a non-technical teammate tried or wanted to try, but got blocked by that technical surface area? What were they trying to accomplish, and where did they drop off?
What would the ideal version have looked like for that ops teammate? Should Cowork have handled the GitHub setup behind the scenes, asked an admin for approval, generated a shareable approval request, or presented it as "connect this dashboard backend" instead of talking about repos and PRs?

Interview

Between Claude Cowork and OpenAI's Codex app, which have you actually used or tried? And what else is in your regular workflow—Claude Code, Cursor, ChatGPT, Claude web, VS Code, GitHub?

We use Claude Code and Codex pretty similarly. Claude Code and Cowork account for about sixty percent of our usage, Codex is around thirty percent, and the remaining ten percent is Cursor.

When did you first try Claude Cowork and Codex, and how often are you using them now? And what kind of work were you originally hoping they'd take over?

We started using them, both for work and personally, in October 2025. We use Claude Cowork for several things. One is on the operations side: building materials, especially onboarding materials for new projects, creating quizzes, creating instructions, running tests, and running dashboards (QMS scores, QC scores, quality scores of taskers). We also use Claude Cowork for marketing and advertising, though that's not necessarily my team; one team at Scale uses it for that. And then we use Claude Code for a lot of agentic work. Whenever we run trajectories or computer use agents, we test and heavily manage those workflows, and the same holds for Codex. We use it for agentic trajectories as well as computer use agent training, where we work with OpenAI or Anthropic to test their models against public benchmarks. For Cursor, we use it as an additional tool alongside computer use agent projects: people can choose between Claude Code and Codex depending on the project, and those who love the Cursor IDE can use it as their interface while running evals.

Could you walk me through the last few things you personally delegated to Claude Cowork or Codex end to end? What did you ask it to do, what context did you give it, and what came back?

One thing I did was social media automation. Whenever we want to use developer platforms like Hugging Face, we tell it the typical topics we want to post about, give it a portfolio of topics, ask it to post on specific days—usually three times a week, on Tuesday, Wednesday, and Friday—pick the time, specify that it should use particular graphics and visuals to make things more appealing, and then ask Cowork to create the automation and schedule it. We give it a start time, usually 7 AM, and it automatically handles those posts from there.

Another use is QC automation. Whenever we have certain task submissions and receive QC feedback on those tasks, we use Claude Cowork to run a comparison between the attempt and the audit, automatically flagging where things don't make sense. Think of it as the first layer of audit reviews once we get initial QC feedback.

The marketing team also uses Claude Cowork to post on LinkedIn and X. They feed it topics, campaigns, graphics, and visuals, and let it pick between those and post regularly on those platforms.

Can you walk me through that QC workflow in a bit more detail? Where do the task submissions and audit feedback live, what does Cowork actually compare, and what does the flagged output look like when your team reviews it?

We have our own internal platform. Whenever tasks are submitted, we receive initial QC feedback, then transfer that into a CSV file and feed it to Claude Cowork, since it doesn't have direct access to our internal tool. We feed it the CSV file with the task submissions and the provided feedback, and we also feed it the instructions and the QC spec doc so it knows exactly what the rules are. It then applies those rules against the QC feedback. If anything in the feedback doesn't match the spec doc and instructions, it flags those things and says they don't make sense or aren't correct per the spec doc. We want to flag everything initially before our specialists and experts look deeper. Once those items are flagged, they get automatically routed to our specialists, who can either agree or disagree with Cowork's judgment. That saves us time and makes the whole QC process more efficient.
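
A minimal sketch of the deterministic skeleton around that comparison: read the exported CSV, check each feedback row against rules taken from the spec doc, and collect flags. The column names (`task_id`, `qc_category`, `qc_feedback`), the category list, and the two checks are assumptions made for illustration; in the workflow described above the judgment itself is made by the model, not by hard-coded rules.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical spec: error categories the QC spec doc allows reviewers to cite.
ALLOWED_CATEGORIES = {"formatting", "factuality", "instruction_following"}

@dataclass
class Flag:
    task_id: str
    reason: str

def check_feedback(row: dict) -> list[Flag]:
    """Stand-in for the model's judgment on one CSV row."""
    flags = []
    category = row["qc_category"].strip().lower()
    if category not in ALLOWED_CATEGORIES:
        flags.append(Flag(row["task_id"], f"category '{category}' is not in the spec doc"))
    if not row["qc_feedback"].strip():
        flags.append(Flag(row["task_id"], "feedback text is empty"))
    return flags

def flag_csv(text: str) -> list[Flag]:
    """Parse the exported CSV and return every flag raised across all rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [flag for row in reader for flag in check_feedback(row)]

# Made-up sample export, for illustration only.
SAMPLE = """task_id,qc_category,qc_feedback
T-1,formatting,Missing header row
T-2,tone,Too casual
T-3,factuality,
"""

flags = flag_csv(SAMPLE)
for f in flags:
    print(f.task_id, "->", f.reason)
```

Here T-2 is flagged because its category is not in the assumed spec, and T-3 because the feedback is empty; T-1 passes through untouched.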

When it flags something, how reliable has that been in practice? Can you remember the last time it caught a real QC issue correctly, or a time it flagged something that turned out to be wrong?

Initially the success rate (correctly flagged, without false positives or negatives) was around forty percent. Over the past three months, through a bunch of iterations on prompts and parameters, we've gotten closer to eighty-five percent. It has improved significantly, but initially the human-in-the-loop element was very critical because we only got forty, maybe fifty or sixty percent right. There were a lot of issues. Now at eighty-five percent plus, we can actually rely on it heavily.
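
One way to make a success rate like this concrete is to score the agent's flags against the specialists' final calls. The sketch below assumes each reviewed task is recorded as a pair of booleans; the sample data is invented purely to show the arithmetic and is not taken from the interview.

```python
def flag_accuracy(pairs):
    """pairs: (agent_flagged, specialist_confirms_issue) booleans, one per task."""
    tp = sum(a and s for a, s in pairs)              # flagged, and really an issue
    tn = sum((not a) and (not s) for a, s in pairs)  # not flagged, and no issue
    fp = sum(a and (not s) for a, s in pairs)        # flagged, but no issue
    fn = sum((not a) and s for a, s in pairs)        # missed a real issue
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positives": fp,
        "false_negatives": fn,
    }

# Made-up sample of 20 reviewed tasks.
sample = (
    [(True, True)] * 9
    + [(False, False)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 1
)
result = flag_accuracy(sample)
print(result)  # accuracy is 17 correct calls out of 20, i.e. 0.85
```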

What changed the most in getting there: better prompt instructions, cleaner CSV structure, better QC spec docs, examples of past judgments, or something else?

We cleaned up the spec doc, because there was some ambiguity in it—that was an easy fix. Another thing we created was rubrics. The rubrics help the core of the LLM figure out issues not just by looking at what the QC said, but by looking at the rubrics themselves to see if any have problems on their own. That helps pinpoint where the exact issue lies. If you only have a prompt and an answer and someone says the answer is wrong, it's hard to quantify. But with twenty-five to thirty rubrics, you can go through each one, analyze it, and figure out exactly where the error is. Then you can train it on typical rubric errors, which are easier to fix than trying to fix an entire task when you only have SFT data with just a prompt and a final answer.
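
The difference between a single pass/fail verdict and rubric-level results can be sketched as a small data structure. The rubric IDs and dimension names below are hypothetical, and only four rubrics are shown rather than twenty-five to thirty.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    rubric_id: str
    dimension: str
    passed: bool

def localize_errors(results):
    """Group failing rubrics by dimension instead of returning one pass/fail."""
    failures = {}
    for r in results:
        if not r.passed:
            failures.setdefault(r.dimension, []).append(r.rubric_id)
    return failures

# Hypothetical rubric outcomes for one task.
results = [
    RubricResult("R01", "formatting", True),
    RubricResult("R02", "formatting", False),
    RubricResult("R03", "reasoning", True),
    RubricResult("R04", "citations", False),
]

task_passed = all(r.passed for r in results)  # the binary view: just "fail"
failures = localize_errors(results)           # the rubric view: where it failed
print(task_passed, failures)
```

The binary view only says the task failed; the rubric view points at one formatting rubric and one citations rubric, which is the kind of localization the answer describes.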

Walk me through the handoff after Cowork flags those rubric-level issues. What does the specialist actually see, and do they work inside your internal platform, a spreadsheet, Slack, or somewhere else?

We use Slack. We have a dedicated channel where the results from the Cowork audit are automatically pasted, and it tags the appropriate team to look at it. It posts the task ID, summarizes the main issues the QC flagged, categorizes them in terms of what type of errors they are, and compares them against the spec doc, showing whether something is or isn't consistent with it. That makes it easy for our specialists to use this Slack automation and look into specific mismatches between the initial submission and what the Cowork audit found.
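
A rough sketch of what such a message could contain, built as a plain `{"text": ...}` payload. The function name, field layout, and team tag are assumptions; actually sending the message (for example through a Slack incoming webhook) is deliberately left out.

```python
import json

def build_flag_message(task_id, summary, issues, team_tag):
    """issues: list of (error_category, consistent_with_spec) tuples."""
    lines = [
        f"{team_tag} QC audit flag for task {task_id}",
        f"Summary: {summary}",
    ]
    for category, consistent in issues:
        verdict = "consistent with spec" if consistent else "NOT consistent with spec"
        lines.append(f"- {category}: {verdict}")
    return {"text": "\n".join(lines)}

payload = build_flag_message(
    task_id="T-2",
    summary="QC feedback cites a rule the spec doc does not contain",
    issues=[("formatting", True), ("tone", False)],
    team_tag="@qc-specialists",  # placeholder; real Slack mentions use IDs
)
print(json.dumps(payload, indent=2))
```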

Could you walk me through the last time a specialist disagreed with Cowork's flag? What was the mismatch, and did that lead you to change the rubric, the prompt, or the workflow?

That happens all the time—out of ten submissions, at least one or two are disagreed on by the specialist teams. These projects are very complex, so it's not always black and white; there's always a fair amount of judgment involved. The last time it falsely flagged something as an error where, per the actual spec, it actually wasn't an error—that was maybe two months ago. It's gotten pretty good at not hallucinating and not falsely flagging things that aren't actually wrong.

How do you decide whether a task goes to Claude Cowork versus Codex versus Claude Code, Cursor, or just a normal ChatGPT or Claude chat? What are the signals that make you pick one over the other?

Sometimes it's simply which project it's for. If we're running an Anthropic project, we obviously use their Cowork or coding agents. If it's for OpenAI, we use Codex. Sometimes we use benchmarks to compare, but if it's for a certain client, there's usually not much of a choice. There are cases where you can pick and choose—that's the flexibility taskers have. In those cases, if someone uses Cursor as their IDE, it's up to them whether to use Codex or Claude Code, depending on the specific project and personal preference. For refactoring, debugging, or Q&A, some people prefer Claude Code over Codex. It's hard to quantify and say Claude Code is always better for certain cases—it really depends on the person using it.

When you look across your team's usage, what kinds of tasks have moved from "we wouldn't delegate this to an agent" to "we now hand this off pretty naturally"? Use cases first, with a rough sense of frequency if that's easy.

In terms of use cases?

Yes—what are one or two concrete workflows that, say, three or six months ago felt too risky or annoying to delegate, but now your team hands to Cowork or Codex pretty routinely?

A good example is when we use computer use agents—we call them CUAs. Three months ago, whenever we wanted to run internal evals on the machine, the Codex and Claude Code results were subpar. We couldn't rely on them. Today, at least eighty to ninety percent of programmatic evals run through Claude Code or Codex because they've gotten so good. We're still using final evals to cross-check and sample, but they've improved enough that anything we do on computer use agent projects now runs on Claude Code or Codex as the first layer of evals.

Can you walk me through one recent computer-use agent eval end to end? What was the task, what did Claude Code or Codex generate or run, and what did your final cross-check catch?

These are confidential, so it's hard to deep dive. Think of typical coding, refactoring, and testing problems that you'd ask an agent to do either on a phone or in Windows. You create the initial task via the agent—be it Claude Code or Codex—then use a virtual machine to simulate it. You run the evaluation through Claude Code or Codex to see how much of the output matches the ground truth file you create, and you take screenshots along the way to confirm it's happening as expected. For confidentiality reasons, I can't go into concrete specifics.

Where does the agent still need the most human judgment in those evals? Is it defining the ground truth, interpreting screenshots, deciding pass/fail, or debugging when the VM run diverges?

It depends. Debugging always has issues. Programmatic evals, especially with four or five layers or different eval levels, are where you still catch agents making mistakes. Also, for the ground truth file, if it's very nitpicky—file naming, number formatting, things like that—you still see a lot of mistakes. You need a human in the loop to fix those.
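
The nitpicky ground-truth mismatches mentioned here (file naming, number formatting) are the kind of thing an exact comparison catches. This is a hedged sketch that assumes both the expected and actual outputs are available as a mapping from file name to contents; the file names and values are invented.

```python
def compare_to_ground_truth(expected: dict, actual: dict) -> list:
    """Both arguments map file name -> contents. Returns readable mismatches."""
    problems = []
    for name, want in expected.items():
        if name not in actual:
            # Look for near-misses such as a different letter case.
            near = [n for n in actual if n.lower() == name.lower()]
            hint = f" (found '{near[0]}' instead)" if near else ""
            problems.append(f"missing file '{name}'{hint}")
        elif actual[name] != want:
            problems.append(
                f"content differs in '{name}': expected {want!r}, got {actual[name]!r}"
            )
    return problems

# Hypothetical run: one file name differs by case, one number is formatted differently.
expected = {"report.csv": "total,1234.50\n", "summary.txt": "total: 1234.50\n"}
actual = {"Report.csv": "total,1234.50\n", "summary.txt": "total: 1,234.5\n"}

problems = compare_to_ground_truth(expected, actual)
for p in problems:
    print(p)
```

Both mismatches are ones a human might wave through as equivalent, which is why a strict check plus a human decision on each reported difference can be a reasonable split.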

Where have you seen these agents produce something that isn't really just code—like an operational dashboard, report, workflow, onboarding doc, or analysis—where code is more of the execution layer underneath?

We use a lot of dashboards and reporting at scale, usually done via Redash queries—SQL in the background—and most of those are now driven by LLMs. We use Gemini, Claude Code, and Codex depending on the complexity. For simpler topics you wouldn't use Opus—Sonnet or even Haiku would be fine. Another example is taxonomy. Whenever we have content taxonomy and want to change certain things, we used to have to do everything manually. Now we have a chatbot using whichever coding agent we have—you tell it what you want, and it automatically screens through the taxonomy steps, finds the appropriate ones, suggests them to you, and asks, "Is this what you want me to do?" You can agree or disagree. In my personal experience over the past two or three months, it's been very accurate—over ninety percent of the time, it actually does what you wanted. Those are two examples where we use code not necessarily as running code, but as the underlying baseline to produce reporting, dashboards, or taxonomy text changes.

On the taxonomy chatbot, can you walk me through the last change you asked it to make? What did you type in, what did it inspect, and what did it show you before you approved it?

It was part of a computer use agent project: we wanted to add two new text boxes and upload options for the ground truth file. I can't go much deeper because it's confidential, but the agent created that upload functionality plus the text boxes at exactly the right step within the whole taxonomy flow. It did it perfectly.

When that agent proposes a taxonomy change, what does the review step look like for you? Are you reviewing a diff, a preview of the flow, a generated summary, or actually clicking through the workflow to verify it?

It shows me, not visually but via text, what it wants to do. It essentially says, "This is before and this is after," and shows the exact location where the change will be made. I can then click on that, and it opens a preview window showing the suggested change live. I can then decide whether to keep it or not. If we like it, we say "implement." If not, we say we need to redefine that step.
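
A text-based before/after view of this kind can be approximated with a unified diff. The step names below are hypothetical stand-ins, not the actual taxonomy, and the `approved` flag only illustrates that nothing is applied until a human says "implement".

```python
import difflib

# Hypothetical taxonomy flow before and after the proposed change.
before = [
    "step 1: upload task",
    "step 2: review",
    "step 3: submit",
]
after = [
    "step 1: upload task",
    "step 2: review",
    "step 2a: upload ground truth file",
    "step 3: submit",
]

diff = list(
    difflib.unified_diff(
        before,
        after,
        fromfile="taxonomy (before)",
        tofile="taxonomy (after)",
        lineterm="",
    )
)
print("\n".join(diff))

approved = False  # stays False until the reviewer explicitly approves the change
```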

When you compare that to using a normal chatbot or Claude or ChatGPT in the browser, what feels different about using Cowork or Codex as an agentic app for this kind of workflow?

It definitely feels more active because it actually executes on its own. It's not purely reactive, where you have to force-feed it a prompt and then it gives you an answer you implement. You tell it what you want to achieve as the end result, let it run on its own and figure out the best path, and then it suggests that path to you. With a normal ChatGPT window, it's purely reactive: you have to force-feed everything and then implement it yourself.

Where does that fire-and-come-back-later pattern work best for you today, and where does it still fall apart?

Anything that uses a single tool works fairly well. When multiple tools have to talk to each other for a workflow to work, personally I'd say any time it's more than four or five tools simultaneously, that's where things break and mistakes happen. With one or two, maybe three tools, we usually don't see issues. When it handles four or five-plus tools at the same time, where each tool has to communicate with the others, mistakes can happen: one hallucination or error gets passed to the next tool, and the chain continues from there.
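
A back-of-the-envelope way to see why longer chains feel so much less reliable: if each tool handoff succeeds independently with the same probability, the whole chain succeeds with that probability raised to the number of steps. The 0.95 figure is an invented illustration, not a measurement from the interview, and real failures are rarely independent.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

for n in (1, 3, 5):
    print(n, round(chain_success(0.95, n), 3))
# 1 0.95
# 3 0.857
# 5 0.774
```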

Can you walk me through a specific time a four- or five-tool workflow broke? At a high level is fine: what were the tools involved, where did the first mistake happen, and how did you catch it?

The tools were Linear, Airtable, Monday, our internal GenAI ops hub, and Slack. We wanted an automation based on issue tracking and project management—all these tools working with each other, creating the issue tracker, adding it to a project management tool, notifying someone on Slack, and creating a dashboard on Airtable. That workflow broke many times, and we had to figure out which tool was causing it—it wasn't clear whether it was Airtable or Slack. We had multiple errors, and it took almost four or five days to find the exact root cause and fix it. Even the fixing part wasn't easy: each tool has a different UI, different requirements, different APIs. The agents talking to each other had to not lose any of the initial content, since information travels from tool one to tool five and you have to make sure nothing breaks or gets lost along the way.

When you were debugging that, what would you have wanted the agent to show you? Like, a step-by-step trace, intermediate payloads between tools, confidence flags, rollback options—what would have made it feel controllable instead of chaotic?

It wasn't exactly chaotic, but we want step-by-step tracing and ideally a visual that shows exactly where in the chain the issue is. And ideally, an automated suggestion for how to fix it—rather than constantly having to ask it what the error is, watch it investigate, go down a rabbit hole, and realize it wasn't actually that. We need more reliability from the agent: when it flags something, that should actually be what's causing the issue. It needs to better visualize where in the chain things are breaking and then ideally come up with automated suggestions for how to solve it, so for the user it's essentially just a couple of clicks to fix, rather than having to enter debugging mode, bring in a developer, and go through each single step to find where the chain broke.
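
The step-by-step tracing asked for here can be sketched as a runner that validates the payload after every tool and stops at the first failure, so the trace itself names the broken link. The step names and the lambda "tools" are hypothetical stand-ins, not real integrations.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class StepRecord:
    tool: str
    ok: bool
    payload: Any
    error: str = ""

def run_chain(steps, payload):
    """steps: list of (tool_name, function, validator). Stops at the first failure."""
    trace = []
    for tool, fn, validate in steps:
        try:
            payload = fn(payload)
            problem = validate(payload)  # empty string means the payload is fine
        except Exception as exc:
            problem = f"{type(exc).__name__}: {exc}"
        trace.append(StepRecord(tool, not problem, payload, problem))
        if problem:
            break  # do not pass a bad payload downstream
    return trace

def require(key):
    """Validator factory: the payload must still carry a non-empty `key`."""
    return lambda p: "" if p.get(key) else f"missing '{key}'"

steps = [
    ("issue tracker", lambda p: {**p, "issue_id": "ISSUE-1"}, require("issue_id")),
    # Deliberate bug for the demo: this step drops the title.
    ("dashboard", lambda p: {k: v for k, v in p.items() if k != "title"}, require("title")),
    ("chat notification", lambda p: {**p, "notified": True}, require("notified")),
]

trace = run_chain(steps, {"title": "Broken export"})
for rec in trace:
    print(rec.tool, "OK" if rec.ok else f"FAILED: {rec.error}")
```

Because the run halts at the dashboard step, the notification never fires with incomplete data, and the trace shows both which tool failed and what the payload looked like at that point.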

On the context side, what does the agent usually need to know to run these workflows well? Like specs, API docs, internal terminology, examples of past runs—and where does that context live today?

All of those things, plus internal documentation. That's also an issue sometimes—not everything is in a user-friendly central place. Sometimes it's siloed between different tools, and the formatting is different: sometimes it's a sheet, a CSV, a presentation. The agent has to normalize all those formats so it can rely on one consistent format that can travel between tools. And the APIs have different requirements and different UX, so the agent figuring out where to click, what to type, where the bounding box is—all of that takes time, and there's a lot of trial and error before you can get something like this working reliably.

Has there been a time where the agent made a mistake because it was using the wrong version of a spec, didn't know an internal term, missed a Slack decision, or didn't have access to some doc—like it changed the wrong workflow step because the taxonomy had been updated elsewhere?

Could you rephrase the question? I'm not a hundred percent sure I understand it.

Sure—for example, has the agent ever changed the wrong workflow step because the taxonomy had been updated somewhere else and it was working off an outdated version?

The taxonomy example is a frequent case, and the same happens with onboarding specs. We constantly update materials, but sometimes people use a different link—they might have created a copy of the original onboarding document and by accident updated the other version, so if you use the old link, the information is outdated there.
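
One possible guard against stale copies is to fingerprint the canonical version of each document and have the agent check what it is about to use against that fingerprint. The registry, document name, and contents below are all hypothetical; nothing in the interview says such a registry exists.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to tell document versions apart."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical registry of the current canonical version of each document.
canonical = {"onboarding_spec": fingerprint("v3: review 10 tasks per day")}

def is_current(doc_name: str, text: str) -> bool:
    """True only if the text matches the registered canonical version."""
    return canonical.get(doc_name) == fingerprint(text)

print(is_current("onboarding_spec", "v3: review 10 tasks per day"))  # True
print(is_current("onboarding_spec", "v2: review 5 tasks per day"))   # False
```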

The other thing we see is our internal tool breaking sometimes because it's overloaded—we have tens of thousands of people using it every day. During mini-outages, when the agent is trying to perform a connection and one action doesn't work, agents are sometimes not good at recovery. A key metric you always look for in agentic trajectories is how good the recovery is when a certain step doesn't work—where information isn't there, the wrong information was provided, a link doesn't work, or an API is broken at that specific moment. How good is the agent at going back and finding path B when path A clearly doesn't work? Sometimes the agent just gets hung up and stops, and then you have to figure out why it's not moving. Sometimes it assumes information because the actual information isn't there, rather than being honest and saying it doesn't know what to do. Because it wants to be helpful, it sometimes hallucinates or comes up with alternative information and passes that to the next agent, which then causes downstream issues.

When that happens, how do you want the agent to behave instead? Should it pause and ask you, show possible recovery paths, retry with another tool, or escalate to a human after some threshold?

We essentially have three levels. For a simple action, we want it to come up with its own solution. For a medium-complexity task, we want it to clarify with you and say, "I'd like to offer this solution—is this what you want?" In those cases we want human in the loop, so it should let you know what the issue is and how it wants to resolve it. If it can't—if something is highly complex or broken—we want it to be completely honest and say, "I lost the information at exactly this step. I don't know what other path to take. Can you guide me?" Obviously you don't want this to happen in production, because then what's the purpose of the agent? But this step is important: before it starts hallucinating or suggesting useless options, we want it to do a hard stop and tell the human it has reached the end of what it can do on its own, so we can course-correct. Those are the three levels of escalation we like to see.
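
The three levels can be written down as a small policy function. The complexity labels and the `has_recovery_path` input are assumptions about how an agent might represent the situation; how complexity would actually be judged is outside this sketch.

```python
from enum import Enum

class Action(Enum):
    SELF_RESOLVE = "resolve on its own"
    PROPOSE_AND_CONFIRM = "propose a fix and ask the human to confirm"
    HARD_STOP = "stop and report the exact step where it got stuck"

def escalate(complexity: str, has_recovery_path: bool) -> Action:
    """complexity is one of 'simple', 'medium', 'high' (hypothetical labels)."""
    if not has_recovery_path or complexity == "high":
        # Stop before guessing: no invented data gets passed downstream.
        return Action.HARD_STOP
    if complexity == "medium":
        return Action.PROPOSE_AND_CONFIRM
    return Action.SELF_RESOLVE

for case in [("simple", True), ("medium", True), ("high", True), ("simple", False)]:
    print(case, "->", escalate(*case).value)
```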

How has that changed your review process? After Cowork or Codex hands something back, what do you now inspect most carefully before you trust it?

Anything involving information being handed from one tool to another, especially when the formatting and versioning are different: we want people to double-check those things. The other area is code. Code can get broken or manipulated from one step to another because the agent may have started down one path, decided it wasn't good, and switched to another path, but may have already modified part of the code in its initial attempt. When it decided to go a different route, it didn't reverse the initial changes, so now you have different versions of the code and don't know exactly which part belongs to which step. That can cause a lot of issues. We need a versioning history of code changes so a developer can go in after the agentic workflow and figure out exactly what modification happened at each step, so we can identify where something broke.

What would an ideal audit trail look like for you there? Would you want commit-by-commit diffs, a narrative of decisions, tool-call logs, or some kind of replay where you can see exactly what changed at each step?

We definitely want commit-by-commit diffs and tool-call logs as the standard initial approach, because that helps our developers the most.
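
As an in-memory stand-in for commit-by-commit history, the sketch below records a snapshot per agent step, can diff any step against the previous one, and can revert an abandoned attempt before a new one starts. In practice this role would normally be played by real version control; the class and step names are hypothetical.

```python
import difflib

class ChangeJournal:
    def __init__(self, initial: str):
        self.snapshots = [("initial", initial)]

    def record(self, step: str, new_text: str) -> None:
        """Store the full text produced by one agent step."""
        self.snapshots.append((step, new_text))

    def diff(self, index: int) -> list:
        """Unified diff between snapshot index-1 and snapshot index."""
        (prev_step, prev), (step, cur) = self.snapshots[index - 1], self.snapshots[index]
        return list(
            difflib.unified_diff(
                prev.splitlines(), cur.splitlines(),
                fromfile=prev_step, tofile=step, lineterm="",
            )
        )

    def revert_to(self, index: int) -> str:
        """Drop every snapshot after index and return the restored text."""
        self.snapshots = self.snapshots[: index + 1]
        return self.snapshots[-1][1]

original = "def total(xs):\n    return sum(xs)\n"
journal = ChangeJournal(original)
journal.record("attempt A", "def total(xs):\n    return sum(x for x in xs if x)\n")
restored = journal.revert_to(0)  # attempt A is abandoned, so its edit is undone
journal.record("attempt B", "def total(xs):\n    return float(sum(xs))\n")
print("\n".join(journal.diff(1)))
```

Because attempt A is reverted before attempt B is recorded, the final diff contains only attempt B's change, which avoids the mixed-version state described in the previous answer.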

How much of this is still individual power-user behavior versus a normal workflow your broader team is adopting together?

Still mostly power users. It follows something like an 80/20 rule, though for us it's probably even less than twenty percent: maybe fifteen percent of users produce seventy-five to eighty percent of all the output in Claude Cowork. These are project managers, product managers, or developers who heavily automate processes. Because it relies on multiple tools, you need someone who's technically savvy and can course-correct at any given time. Most people give up because it's not as easy as just asking a question and getting an answer: you have to create a whole agentic workflow, and the more complex the workflow, the more things can break. Most people don't have the patience or expertise to figure out what went wrong.

With weekly users, whatever works for them, they stick with it and might modify it slightly but don't go outside of that. Occasional users have two or three main workflows they always use and never diverge from. Daily users are the ones who constantly come up with new automations, new workflows, new tool connections, additions, or subtractions.

What would need to change for those weekly or occasional users to adopt it more broadly? Is it templates, better debugging, safer permissions, more guided setup, or something else?

One thing is the questions Cowork asks. If someone works in marketing or finance, you can't ask them about GitHub repositories or pull requests—they've never heard those terms and wouldn't know what they mean. Maybe they want to build a dashboard, a report, or a simple website, but you have to speak their language. You can't ask them to do a pull request or set up a GitHub repository. So one thing is appropriate language.

Another thing I've already seen improve in Cowork is the visualization of different steps—almost like a checklist you can visually cross-reference and track. It needs to be more visually appealing, almost like a project management tool that works on its own, requiring as few decisions from non-technical users as possible. If you ask too many questions and require too many decisions, people get tired and wonder why they're using it at all. But you also have to find the balance—you don't want to overdo autonomy to the point that if something isn't what the user wanted, they don't have the time or capability to go back and fix it. Those are probably the most obvious factors that would need to change for weekly and occasional users to start adopting it more broadly.

Can you walk me through one workflow a non-technical teammate tried or wanted to try, but got blocked by that technical surface area? What were they trying to accomplish, and where did they drop off?

One example was an ops member who wanted to create a dashboard for what we internally call a ramp plan. Whenever you want to launch a new project, you create a ramp plan, usually in a Google Sheet or Excel file, where you set up how many attempts, how many reviews, how many per day, what the error rate is, and so on, so you can reliably set something up that works. This member wanted to create the ramp plan as a living web dashboard where you only enter certain figures and it automatically calculates all the other numbers and keeps updating them every day.

Initially the front end looked good. But then when things started breaking or the numbers didn't show up with the right graphics, the feedback from Cowork was too technical. It told them to go to GitHub and update a pull request. They just said, "You do it for me." At that point, Cowork didn't have permissions to go into GitHub and do a pull request automatically—and it asked the user to grant access through GitHub settings. That user had never used GitHub before, didn't know where to find those settings, and eventually gave up. It was a really good idea that could have saved a lot of time for many people, but because the questions were too technical for someone who had never coded before, they stopped using that automation entirely.
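To make the ramp plan in this answer concrete, here is a minimal Python sketch of the "enter a few figures, derive the rest" calculation such a dashboard might run. The interview does not give the actual formulas, so every field name, the `derive_ramp_plan` function, and the rule that each failed attempt is redone exactly once are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class RampPlanInputs:
    """Figures a user would type in. All field names are hypothetical."""
    tasks: int                # tasks the project must deliver
    attempts_per_task: int    # attempts collected per task
    reviews_per_attempt: int  # reviews applied to each attempt
    attempts_per_day: int     # daily attempt capacity
    reviews_per_day: int      # daily review capacity
    error_rate: float         # share of attempts assumed to fail review


def derive_ramp_plan(p: RampPlanInputs) -> dict:
    """Compute the derived numbers a dashboard could refresh each day."""
    if not 0 <= p.error_rate < 1:
        raise ValueError("error_rate must be in [0, 1)")
    base_attempts = p.tasks * p.attempts_per_task
    # Assumption: every failed attempt is redone exactly once.
    rework_attempts = math.ceil(base_attempts * p.error_rate)
    total_attempts = base_attempts + rework_attempts
    total_reviews = total_attempts * p.reviews_per_attempt
    days_for_attempts = math.ceil(total_attempts / p.attempts_per_day)
    days_for_reviews = math.ceil(total_reviews / p.reviews_per_day)
    return {
        "total_attempts": total_attempts,
        "total_reviews": total_reviews,
        # The slower of the two pipelines sets the schedule.
        "days_needed": max(days_for_attempts, days_for_reviews),
    }


plan = derive_ramp_plan(RampPlanInputs(1000, 3, 2, 400, 700, 0.25))
print(plan)
```

A real ramp plan would likely track more dimensions than this, but the shape is the same: a handful of inputs and a pure function that recomputes everything else whenever one of them changes.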

What would the ideal version have looked like for that ops teammate? Should Cowork have handled the GitHub setup behind the scenes, asked an admin for approval, generated a shareable approval request, or presented it as "connect this dashboard backend" instead of talking about repos and PRs?

Exactly. It should have asked for permission to use the tool on the back end and done everything it needed to do within GitHub, but kept all of that out of the user's hands, because they don't understand what's happening there. It should have simply asked, "Are you okay with me running this on the back end of GitHub, adjusting the pull request and giving myself the necessary access?" That would have been the easier way, because the user would have said, "Yeah, whatever, I don't know what that stuff is anyway, just do it," and it could have moved on and continued.
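The flow described in this answer, one plain-language yes/no question followed by the agent doing the technical steps itself, could be sketched roughly as below. This is not how Cowork works internally. The action names, the `PLAIN_LANGUAGE` table, and both functions are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical mapping from technical actions to wording a non-coder can follow.
PLAIN_LANGUAGE = {
    "github.grant_repo_access": "give myself the access I need",
    "github.update_pull_request": "save the fix to your dashboard's code",
}


@dataclass
class ConsentRequest:
    question: str       # the single question shown to the user
    actions: list[str]  # technical steps hidden behind that question


def build_consent_request(actions: list[str]) -> ConsentRequest:
    """Fold several technical actions into one yes/no question."""
    if not actions:
        raise ValueError("at least one action is required")
    # Unknown actions fall back to a generic, non-technical phrase.
    phrases = [PLAIN_LANGUAGE.get(a, "make a behind-the-scenes change") for a in actions]
    question = (
        "Are you okay with me handling this on the back end? "
        f"I will {' and '.join(phrases)}."
    )
    return ConsentRequest(question=question, actions=list(actions))


def run_if_approved(
    request: ConsentRequest, approved: bool, execute: Callable[[str], None]
) -> list[str]:
    """Run every hidden action only after one approval; return what was run."""
    if not approved:
        return []
    for action in request.actions:
        execute(action)
    return list(request.actions)


request = build_consent_request(
    ["github.grant_repo_access", "github.update_pull_request"]
)
print(request.question)
```

The design choice is that the user makes one decision in their own vocabulary, while the list of technical actions stays attached to the request so an admin or audit log can still see exactly what was approved.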

Disclaimers

This transcript is for information purposes only and does not constitute advice of any type or trade recommendation and should not form the basis of any investment decision. Sacra accepts no liability for the transcript or for any errors, omissions or inaccuracies in respect of it. The views of the experts expressed in the transcript are those of the experts and they are not endorsed by, nor do they represent the opinion of Sacra. Sacra reserves all copyright, intellectual property rights in the transcript. Any modification, copying, displaying, distributing, transmitting, publishing, licensing, creating derivative works from, or selling any transcript is strictly prohibited.
