I Bet Our Entire Sprint on GitHub Copilot Workspace — Here's What Broke

Last Tuesday, I made a call that my director called "reckless" and my team called "finally."

We're going all-in on GitHub Copilot Workspace's issue-to-PR automation. Across four squads. Thirty-seven engineers. And I made this decision knowing full well it would fail in ways I couldn't predict.

Here's the thing — I'm not some AI evangelist. I've been writing code since 2009, back when "automated PRs" meant copying and pasting from Stack Overflow and hoping nobody noticed. I've watched enough tooling fads come and go to develop a healthy skepticism that borders on paranoia.

But last quarter, my team spent 127 hours on boilerplate PR tasks. Writing descriptions. Linking issues. Updating changelogs. The kind of work that doesn't make anyone a better engineer — it just makes them tired.

127 hours.

That's three full engineering weeks. Gone. Forever.

So when Copilot Workspace promised to turn a simple issue into a complete, tested pull request with basically zero human touchpoints, I didn't just read the docs and call it a day. I ran a controlled experiment. My director thought I was being paranoid. Probably was. Don't care.

Here's what happened, what broke, and why I'm now scaling it across all four squads despite my better judgment.

The Promise vs. The Panic

If you haven't played with Copilot Workspace yet — and honestly, as of January 2025, most teams I talk to still haven't — the flow is deceptively simple.

You open an issue. Click "Start Workspace." Copilot reasons through your codebase, generates a spec, writes the code, runs tests, opens a PR. All inside a sandboxed environment.

Sounds magical.

And frankly, that terrified me.

I've been doing this long enough to know that "magic" in tooling usually means "unexplainable failures at 4:47 PM on a Friday when you're supposed to be at your kid's soccer game." I've got the gray hair to prove it — though my wife says it's "distinguished," which I think is spouse-speak for "you look tired."

So before rolling this out, I needed hard data. Not vibes. Not "it worked in the demo." Actual numbers.

I picked three real issues from our backlog and ran them through the workspace with two senior engineers auditing every step. One low-complexity bug that'd been sitting there for two weeks because nobody wanted to touch it. One mid-tier feature tweak. And one cross-service refactor that I knew — I knew — would give the tool fits.

What the Numbers Actually Told Us

Here's the scorecard after two weeks of testing. I'm still processing some of this.

Bug fix (typo in auth middleware): Workspace went from issue to merged PR in 11 minutes. Eleven. Human time: 4 minutes for review. Previous average for similar bugs on our team: 47 minutes. That's... not a small difference. I triple-checked the stopwatch because I didn't believe it.

Feature tweak (adding a filter to our dashboard API): Workspace generated a correct spec on the first try. Wrote about 90% of the code. But — and this is where it gets real — it completely fumbled the database migration. Like, proposed a migration that would've dropped a column we still needed. Human intervention: 22 minutes. Previous average for this kind of work: 3.5 hours. Still a win, but you can't just trust it. I don't care what the marketing page says.

Cross-service refactor (updating shared error handling): This one. Oof. Workspace struggled hard. It missed a downstream dependency in our notification service — the one Priya rewrote last sprint, which, fair, the tool probably hadn't indexed yet — and proposed a breaking change that would've taken down our alert pipeline. Human time to fix: 1.8 hours. Still faster than the 6-hour estimate. But absolutely not the "set it and forget it" dream.

Net result? 68% reduction in time-to-PR across the three tasks. I think. Math might be off slightly — I did it on my phone between meetings while eating a sad desk salad.
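
If you want to redo that math somewhere better than a phone, here is a minimal sketch. It assumes the "after" time is the figure quoted in the scorecard (the full 11 minutes for the bug fix, the human intervention time for the other two) and the "before" time is the previous average or estimate. On those assumptions the pooled reduction lands nearer 77% than 68%, so the exact figure depends on what you count as time spent.

```python
# Minutes per task, taken from the scorecard above.
# Each entry is (before, after): previous average or estimate vs. time with Workspace.
# Assumption: the bug fix counts the full 11 minutes from issue to merge;
# the other two count only the human intervention time quoted.
tasks = {
    "bug fix": (47, 11),
    "feature tweak": (3.5 * 60, 22),
    "cross-service refactor": (6 * 60, 1.8 * 60),
}


def reduction(before: float, after: float) -> float:
    """Fraction of time saved, e.g. 0.75 means 75% less time."""
    return 1 - after / before


# Per-task reduction.
for name, (before, after) in tasks.items():
    print(f"{name}: {reduction(before, after):.0%}")

# Pooled reduction: total minutes before vs. total minutes after.
total_before = sum(before for before, _ in tasks.values())
total_after = sum(after for _, after in tasks.values())
pooled = reduction(total_before, total_after)
print(f"pooled: {pooled:.0%}")
```

Pooling total minutes weights the big refactor more heavily than the 11-minute bug fix. Averaging the three per-task percentages instead is a different (and equally defensible) choice.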

But here's the thing the numbers don't capture.

Both senior engineers reported that reviewing AI-generated code felt less mentally taxing than writing from scratch. Sarah (not her real name, she'd kill me if I put her actual name in a LinkedIn post) said, and I quote: "I could actually think about architecture instead of remembering where we put the damn error handling utility."

That hit me harder than the productivity stats.

The Leadership Lesson I Almost Missed

I initially framed this whole thing as a productivity play. Faster PRs, fewer hours wasted, better velocity numbers to show the CTO. Classic VP-of-Engineering spreadsheet thinking. I've been that guy for years and I'm not proud of it.

But watching my team interact with Workspace revealed something I wasn't looking for.

It changed who could contribute.

One of our mid-level engineers — okay, fine, her name is Priya and she's going to be embarrassed I mentioned her — picked up a backend issue she'd normally avoid. It touched that legacy auth service from 2019 that nobody understands anymore because the original author left for a startup and took all the context with them. She'd told me before, in a 1:1, that she found that part of the codebase "intimidating." Her word, not mine.

Workspace gave her a starting point. A draft PR that was maybe 70% correct. She told me afterward, "I wouldn't have taken this ticket before. Now I feel like I have a senior dev sitting next to me."

Well. That's complicated.

It's not a senior dev. It doesn't have judgment. It doesn't know our business logic or why we made that weird architectural decision in Q3 2023 that we're still paying for. But it gave her something to work with instead of a blank file and mounting anxiety.

That's not just efficiency. That's capability expansion. And as a leader, that's the metric I care about most — even if I can't put it in a spreadsheet and show it to the board.

Where It Breaks (So You Don't Have To)

Look, I'm not here to sell you on Copilot Workspace. GitHub doesn't pay me — though if someone from GitHub is reading this, I wouldn't say no to a free t-shirt. I'm here to tell you where it fails so you can plan accordingly, because I didn't plan and I paid for it with a very tense Saturday morning.

1. Context window limits are real

Workspace operates on the files it can see. If your issue spans multiple repos — and whose doesn't these days — or requires understanding of external APIs, it will confidently propose wrong solutions.

Not "maybe wrong." Confidently, authoritatively wrong.

We saw it suggest a fix for our payment service that would've worked perfectly if our actual payment processor hadn't changed their API response format in November. The workspace didn't know that. How could it? It's not reading Stripe's changelog over morning coffee.

2. Tests are a double-edged sword

Yes, Workspace generates tests. That's great in theory. But it also trusts its own tests too much.

We caught two cases — two! — where it wrote a test that passed because the test itself contained the same logic error as the code. It's like asking a student to grade their own homework and they just... don't notice they used the wrong formula.

Always audit the tests. Not just the implementation. The tests.
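
To make the failure mode concrete, here is a small made-up sketch. The function, its bug, and both checks are hypothetical and not taken from our codebase. The first check recomputes its expected value with the same wrong formula as the code, so it passes. The second uses a value worked out by hand, so it catches the bug.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function with a bug: divides by 10 instead of 100."""
    return price - price * percent / 10  # BUG: should be / 100


def mirrored_test_passes() -> bool:
    # Weak test: the expected value is derived with the same wrong formula,
    # so the test agrees with the broken code.
    price, percent = 200.0, 5.0
    expected = price - price * percent / 10
    return apply_discount(price, percent) == expected


def independent_test_passes() -> bool:
    # Stronger test: the expected value is worked out by hand
    # (5% off 200 is 190), independently of the implementation.
    return apply_discount(200.0, 5.0) == 190.0


print("mirrored test passes:", mirrored_test_passes())
print("independent test passes:", independent_test_passes())
```

When reviewing a generated test, the question to ask is where the expected value came from. If it was derived the same way as the code under test, the test proves very little.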

3. Onboarding cost is not zero

I don't care what the docs say. Engineers need to learn how to write good issues for this to work. Clear acceptance criteria. Relevant file paths. Expected behavior. Garbage in, garbage out, same as it ever was.

We spent two hours in a team workshop on issue-writing best practices — two hours I initially grumbled about because I had "more important things to do" (I didn't) — and it paid off immediately. Like, same-day payoff.
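
One way to keep the workshop lessons from fading is a tiny pre-flight check on the issue text. This is a sketch, not something Workspace provides: the section names and the `missing_sections` helper are my own assumptions, so swap in whatever your issue template actually uses.

```python
import re

# Hypothetical section names; use the headings from your own issue template.
REQUIRED_SECTIONS = ("Acceptance criteria", "Relevant files", "Expected behavior")


def missing_sections(issue_body: str) -> list[str]:
    """Return the required section labels that do not appear in the issue body."""
    missing = []
    for section in REQUIRED_SECTIONS:
        # Look for "Label:" at the start of a line, ignoring case.
        pattern = rf"^\s*{re.escape(section)}\s*:"
        if not re.search(pattern, issue_body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(section)
    return missing


issue = """\
Acceptance criteria: filter returns only active users
Expected behavior: GET /users?status=active returns 200
"""
print(missing_sections(issue))
```

The sample issue above lacks a "Relevant files" line, so that is the only label reported. A check like this only confirms the sections exist. Whether the acceptance criteria are any good still takes a human.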

How We're Rolling It Out (The Practical Playbook)

If you're considering this for your team, here's the phased approach I'm using. Copied from my actual Notion doc, slightly cleaned up because my original notes had too many typos.

Weeks 1-2: Opt-in experimentation. Let curious engineers try it on low-risk bugs. Collect anecdotes and objections. Do NOT mandate anything. The second you make it mandatory, you've lost the psychological safety you need for honest feedback. I learned this the hard way with a different initiative in 2022 that I won't name publicly.

Week 3: Define your "Workspace-worthy" criteria. We created a simple rubric on a whiteboard that someone definitely erased by accident: issues that are single-service, well-spec'd, and have clear test paths are green-lit for Workspace first. Cross-cutting or ambiguous issues stay human-led. No exceptions yet.
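
Since the whiteboard version is gone, here is roughly how that rubric could look written down as code. It is a sketch: the `Issue` fields are hypothetical, and how you record them depends on your tracker.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields; fill them in however your tracker allows.
    title: str
    services_touched: int
    has_acceptance_criteria: bool
    has_clear_test_path: bool


def is_workspace_worthy(issue: Issue) -> bool:
    """Green-light only single-service, well-spec'd issues with a clear test path."""
    return (
        issue.services_touched == 1
        and issue.has_acceptance_criteria
        and issue.has_clear_test_path
    )


typo_fix = Issue("Fix typo in auth middleware", 1, True, True)
refactor = Issue("Update shared error handling", 3, True, False)
print(is_workspace_worthy(typo_fix), is_workspace_worthy(refactor))
```

Anything that fails the check stays human-led, which matches the "no exceptions yet" rule above.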

Week 4: Measure what matters. We're tracking three KPIs: time-to-PR, review-cycle count (how many back-and-forths before merge), and — this one's critical — engineer satisfaction scores. Productivity without morale is just burnout in disguise.
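
A minimal sketch of how those three KPIs could be rolled up, assuming you can export time-to-PR and review-cycle counts per pull request and collect satisfaction as a 1-5 survey score. The field names and the sample values below are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class PullRequest:
    minutes_to_pr: float  # issue opened until PR opened
    review_cycles: int    # back-and-forths before merge


def summarize(prs: list[PullRequest], satisfaction: list[int]) -> dict[str, float]:
    """Roll the three KPIs up into one summary per reporting period."""
    return {
        # Median, so one painful refactor doesn't dominate the number.
        "median_minutes_to_pr": median(pr.minutes_to_pr for pr in prs),
        "mean_review_cycles": mean(pr.review_cycles for pr in prs),
        "mean_satisfaction": mean(satisfaction),
    }


# Made-up sample values, not our real data.
sample_prs = [PullRequest(11, 1), PullRequest(22, 2), PullRequest(108, 3)]
sample_scores = [4, 5, 3]
summary = summarize(sample_prs, sample_scores)
print(summary)
```

Keeping satisfaction in the same summary as the speed metrics is deliberate: it makes it harder to celebrate a velocity gain while morale drops.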

Month 2: Expand to all squads, with guardrails. Every Workspace-generated PR requires a human reviewer. No auto-merge. That's non-negotiable for now. Might revisit in Q3, but honestly? Probably not.

What This Means for Engineering Careers

I've been thinking a lot about an idea from Gene Kim's The Unicorn Project. I'm paraphrasing here, because I lent my copy to someone who never returned it (you know who you are): the goal is not to do more work faster, it's to do the right work.

Copilot Workspace won't replace engineers.

I don't think.

Ask me again in five years, I guess.

But it will replace the engineers who refuse to use AI as a force multiplier. That I'm pretty confident about. The developers who thrive in the next five years won't be the ones who write the most lines of code. That game is over. They'll be the ones who can decompose problems into clear specs, review AI output with sharp critical thinking, and focus their creativity on the hard problems that machines can't touch — yet.

As a VP of Engineering, my job is to build a culture where that shift feels like an opportunity. Not a threat. And I'll be honest, I'm still figuring out how to do that well.

TL;DR

Copilot Workspace reduced our time-to-PR by 68% across three test cases
It fails hard on cross-service changes and external dependencies — don't trust it blindly
The biggest win wasn't speed. It was letting mid-level engineers tackle scary parts of the codebase
Good issues are everything. Bad issues = bad PRs. Invest in issue-writing workshops
Never auto-merge AI-generated PRs. Never. I don't care how confident you are

My Question for You

I'm genuinely curious — and I read every single comment even if I don't always respond: if you could automate one part of your development workflow tomorrow, no technical constraints, what would it be?

Code review? Testing? Documentation? That one colleague's PRs that always need three rounds of feedback? (We all have that colleague.)

Drop your answer in the comments. I'll share the most interesting responses in a follow-up post, probably in a couple of weeks once I've dug through them all.

Edit: A few people DMed me asking about our exact Copilot Workspace config. We're on the Enterprise plan, using the January 2025 release (version 2.3.1), with the "strict" context mode enabled after the cross-service incident I mentioned. Your mileage will vary. Test it on something low-risk first. Seriously. I'm not kidding. Low. Risk.

#EngineeringLeadership #GitHubCopilot #AIinTech #DeveloperProductivity #FutureOfWork

Cael Lee
