
I stopped counting Chrome processes and started owning them

In the last post I spent an embarrassing amount of time getting one formula right: (sessions × 5) + 8, the number of Chrome processes my scraper would tolerate before deciding something had leaked and killing them. It worked. It ran in production for about four months. And it was the wrong thing to be doing.

The formula was a workaround for a single missing capability: ChromeDriver would not tell me which Chrome processes belonged to which session. So I counted them, I guessed, and when the guess said "too many" I killed all of them and hoped the live ones weren't in the pile. Every problem in that post (the swarm-counting, the shared-profile lock, the nuclear cleanup) grew out of that one blind spot.

At some point the obvious question finally landed: what if I just owned the browser?

The thing ChromeDriver sits in the middle of

When you drive Chrome through Wallaby, the stack is taller than it looks. Your code talks to Wallaby, Wallaby talks HTTP to ChromeDriver, ChromeDriver is a separate daemon that launches and supervises Chrome, and Chrome is the swarm of processes from last time. ChromeDriver owns the browser. You get a session id and some polite requests.

That middle layer is where my blind spot lived. ChromeDriver launched the Chrome processes, so ChromeDriver knew their PIDs. I didn't. And because every session pointed at the same --user-data-dir, a leftover process from one session could lock the profile and lock out the next one. I was downstream of a daemon that wasn't going to tell me anything useful.

But Chrome doesn't actually need ChromeDriver. ChromeDriver is a WebDriver-to-CDP translator, and CDP, the Chrome DevTools Protocol, is the thing I wanted to talk to anyway. It's the same protocol the devtools panel in your browser uses. You launch Chrome with --remote-debugging-port, it prints a WebSocket URL to stderr, and over that socket you can navigate, evaluate JavaScript, take screenshots, watch the network, all of it.

The important part, for me, was the launch. If I spawn Chrome, I hold the OS process. I connect to it over the WebSocket. When I'm done, I kill the process I started. No daemon in between deciding things on my behalf. No shared profile, because each launch gets its own throwaway one. The "which PID is mine" question stops being a question, because all of them are mine and I have a handle to the thing that owns them.
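The launch step hinges on reading that WebSocket URL out of Chrome's startup output. What follows is a minimal sketch of that one piece, not cdp_ex's actual code: the module name DebugUrlSketch and the parse/1 function are hypothetical, and the error atom is borrowed from the failure described later in this post. It assumes the launcher has already captured Chrome's stderr as a string.

```elixir
defmodule DebugUrlSketch do
  @moduledoc "Illustrative only: finds the DevTools WebSocket URL in Chrome's startup output."

  # Chrome writes a line shaped like this to stderr once the debug port is open:
  #   DevTools listening on ws://127.0.0.1:9222/devtools/browser/<id>
  def parse(output) when is_binary(output) do
    case Regex.run(~r{DevTools listening on (ws://\S+)}, output, capture: :all_but_first) do
      [url] -> {:ok, url}
      # Assumption: reuse the atom this post mentions for "never saw the URL".
      nil -> {:error, :debug_url_not_found}
    end
  end
end
```

Everything after that is a WebSocket connect to the returned URL, and because the caller spawned the process it also holds the handle needed to stop it.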

Why I wrote a library instead of using one

This is the part you should be most suspicious of, so I'll lead with the least flattering reason: I wanted to. On a side project, "wrap someone else's library" and "write the library" are not equally fun, and that's a real input even if it never makes it onto an architecture diagram.

But there was a gap underneath the fun, and it's the actual argument. None of the existing options fit. Wallaby is built for feature tests, and I'd already forked it once (last post). Playwright, Puppeteer, and chrome-remote-interface are mature, but they live in Node. A couple of older Elixir CDP experiments exist, but nothing I'd hang production scraping on.

Here's the part that decided it. The CDP transport was never the hard bit. Talking the protocol is JSON over a WebSocket with request IDs and an event stream: fiddly, but a solved, well-documented problem, and if a solid Elixir client had existed I'd have used it. The thing I actually needed sat one layer up: a browser as a supervised resource, a process that owns the OS Chrome and is guaranteed to reap it when that process or its owner dies. That layer (the terminate/2 guarantee and the with_page wrapper below) was the real work, and I'd have had to build it on top of any of those options anyway. The transport is the easy 20% I'd have gotten for free. The lifecycle was the 80% that was the whole point.
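To make "request IDs and an event stream" concrete, here is a small sketch of the message shapes CDP uses. It works on already-decoded maps so it needs no JSON library. The module and function names (CdpFrameSketch, request/3, classify/1) are made up for illustration and are not cdp_ex's API.

```elixir
defmodule CdpFrameSketch do
  @moduledoc "Illustrative only: the three message shapes on a CDP socket."

  # A command carries a caller-chosen id so the reply can be matched to it.
  # A real client would JSON-encode this map before sending it.
  def request(id, method, params \\ %{}) do
    %{"id" => id, "method" => method, "params" => params}
  end

  # Replies echo the id and carry either "result" or "error".
  def classify(%{"id" => id, "error" => error}), do: {:error, id, error}
  def classify(%{"id" => id, "result" => result}), do: {:reply, id, result}

  # Events have a method but no id: nobody asked for them, they just arrive.
  def classify(%{"method" => method} = msg) when not is_map_key(msg, "id") do
    {:event, method, Map.get(msg, "params", %{})}
  end
end
```

A client keeps a map of pending ids to callers, resolves each {:reply, id, _} against it, and fans events out to whoever subscribed. That is fiddly bookkeeping, but it is all the transport amounts to.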

So: part preference, part a real gap, with the preference carrying me across it quickly. I'd make the same call again. It's called cdp_ex, it's on Hex, and it's CDP over Mint.WebSocket with no ChromeDriver and no Node anywhere in the picture.

The core idea is one guarantee

A cdp_ex browser is a GenServer that owns the Chrome OS process and its connections. Its terminate/2 always runs Chrome.stop/1. That's the no-orphan guarantee, and almost everything else is built on it: if the browser process dies, for any reason, the OS process dies with it.
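The shape of that guarantee can be sketched in a few lines of plain OTP. This is not cdp_ex's source: OwnerSketch is a hypothetical module, `resource` stands in for the Chrome OS process, and `stop_fun` stands in for Chrome.stop/1. The one assumption doing real work is that the server traps exits, since without that a linked owner dying would kill the server before terminate/2 could run.

```elixir
defmodule OwnerSketch do
  @moduledoc "Illustrative only: a GenServer that releases what it owns in terminate/2."
  use GenServer

  # `resource` stands in for the OS Chrome; `stop_fun` stands in for Chrome.stop/1.
  def start_link(resource, stop_fun) do
    GenServer.start_link(__MODULE__, {resource, stop_fun})
  end

  @impl true
  def init({resource, stop_fun}) do
    # Trap exits so that the death of the linked owner reaches us as a
    # message, which GenServer turns into a terminate/2 call.
    Process.flag(:trap_exit, true)
    {:ok, %{resource: resource, stop_fun: stop_fun}}
  end

  @impl true
  def terminate(_reason, %{resource: resource, stop_fun: stop_fun}) do
    # The single place the resource is released, whatever the exit reason.
    stop_fun.(resource)
    :ok
  end
end
```

In plain OTP, terminate/2 is skipped when a process is killed with an untrappable :kill, so a sketch like this covers ordinary stops, crashes inside callbacks, and owner death. How cdp_ex treats the brutal-kill case is not shown here.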

In practice you rarely touch the GenServer directly. The common shape is a throwaway browser per unit of work, which is exactly one function:

CDPEx.with_page([], fn page ->
  {:ok, _} = CDPEx.Page.navigate(page, "https://example.com")
  CDPEx.Page.text(page, "h1")
end)

with_page/3 launches Chrome, hands you a page, runs your function, and tears the whole thing down afterwards, even if your function raises.

The teardown is the part worth slowing down on, because it is the no-orphan claim, and it depends on two things being true at once. First, a browser that crashes mid-call must not take your process down with it. So with_page traps exits for the duration and turns a browser crash into an ordinary {:error, _} you can match on, instead of a link exit that kills your caller. Second, if your process is the one that dies, the browser still has to be reaped. So with_page keeps the link to the browser rather than downgrading to a monitor, which means the browser's own terminate/2 fires and takes Chrome with it. Trap the exit so a crash can't propagate up, keep the link so a caller's death still cleans up. You get resource safety in both directions, and you get it without thinking about it.
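The caller-side half of that (trap the exit, keep the link) can be sketched generically. This is a rough illustration under assumptions, not how with_page is written: WithResourceSketch and with_resource/2 are hypothetical names, and any linked process (an Agent in the usage below) stands in for the browser.

```elixir
defmodule WithResourceSketch do
  @moduledoc "Illustrative only: run a function against a linked process, then stop it."

  # `start_fun` must return {:ok, pid} for a process linked to the caller.
  def with_resource(start_fun, fun) do
    # Trap exits so a resource crash arrives as a message, not a fatal signal.
    was_trapping = Process.flag(:trap_exit, true)

    try do
      {:ok, pid} = start_fun.()

      try do
        {:ok, fun.(pid)}
      catch
        # A call into a dead or dying resource exits; hand that back as a value.
        :exit, reason -> {:error, {:exit, reason}}
      after
        stop(pid)
      end
    after
      # Restore whatever the caller had before.
      Process.flag(:trap_exit, was_trapping)
    end
  end

  defp stop(pid) do
    try do
      GenServer.stop(pid)
    catch
      # Already gone, which is fine.
      :exit, _ -> :ok
    end

    # Drain the exit message the link delivered while we were trapping.
    receive do
      {:EXIT, ^pid, _reason} -> :ok
    after
      100 -> :ok
    end
  end
end
```

For example, `WithResourceSketch.with_resource(fn -> Agent.start_link(fn -> 41 end) end, fn pid -> Agent.get(pid, &(&1 + 1)) end)` returns `{:ok, 42}`, and killing the Agent inside the function yields an `{:error, _}` tuple while the caller stays alive. Because the link is never removed, a caller that dies abnormally still takes the resource down with it.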

My scraper's fetch path is essentially that, with a navigation and a wait wrapped around it:

defp run_in_page(url, opts) do
  CDPEx.with_page(
    CdpConfig.launch_opts(opts),
    fn page -> fetch_page(page, url, opts) end,
    prevent_alerts: true
  )
end

Chrome is launched, driven, and reaped per fetch. There is no pool to babysit, no session registry, no idea of a "leaked" browser, because a browser that isn't currently inside a with_page call doesn't exist.

What this deletes

Here's the part that made the whole detour worth it. Go back and look at last post's ChromeManager: the two timers, the (sessions × 5) + 8 threshold, the zero-session special case, the snapshot-before-mutate dance, the nuclear cleanup that killed every Chrome because it couldn't tell mine from leaked. All of it.

Gone. The entire module, plus the Oban sweep, plus the WallabyRestarter, plus the ChromeDriver supervisor. Deleted.

Not "improved." There is nothing left to count, because there's no shared pool of ambiguous processes. There's no shared --user-data-dir to lock, because every fetch gets its own temp profile. The check I used to SSH in and run looks different now:

$ pgrep -f chrome | wc -l
0

Zero between fetches, and a small handful during one. Not because a sweep ran on a timer, but because there's nothing to sweep. The class of bug from the last post didn't get tuned down. It stopped being possible.

How the cutover actually went

I didn't rip and replace, and "I rewrote the browser layer and everything was great" would not be a true sentence. The bumps are the useful part.

cdp_ex ran alongside Wallaby behind a per-host environment toggle, so I could send one ticketing site through the new engine while everything else stayed on the old one, with Wallaby as the fallback the moment anything looked wrong. I cut over one host at a time and watched. Three things were worth the scars.
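A per-host toggle like that can be very small. The sketch below is an assumption about the shape, not the code from my scraper: the module name, the engine_for/2 function, and the CDP_HOSTS variable are all hypothetical. It takes the raw environment value as an argument so it stays a pure function.

```elixir
defmodule EngineToggleSketch do
  @moduledoc "Illustrative only: choose a browser engine per host from a comma-separated list."

  # `hosts_csv` is the raw value of a hypothetical env var such as CDP_HOSTS,
  # e.g. "tickets.example.com, other.example.org". nil means nothing is opted in.
  def engine_for(url, hosts_csv) do
    host = URI.parse(url).host

    enabled =
      hosts_csv
      |> to_string()
      |> String.split(",", trim: true)
      |> Enum.map(&String.trim/1)

    # Anything not explicitly listed stays on the old engine.
    if host in enabled, do: :cdp, else: :wallaby
  end
end
```

At the call site this would be something like `EngineToggleSketch.engine_for(url, System.get_env("CDP_HOSTS"))`. Defaulting to :wallaby means an unset or mistyped variable falls back to the old path rather than the new one.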

Production is not my laptop. A single Chrome launch that took about a second locally took six or more on the cold Fly machine under load, and sometimes blew past cdp_ex's launch timeout entirely. The failure surfaced as :debug_url_not_found: Chrome was still coming up when I gave up waiting for its debug URL. The fix was unglamorous, a 45-second launch ceiling instead of the optimistic default, but the lesson is the usual one. Your timeouts are calibrated for the machine you wrote them on.
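The failure mode is easy to reproduce with a generic "poll until a deadline" helper. This is a sketch under assumptions, not cdp_ex's launcher: LaunchWaitSketch and await/3 are hypothetical, and `check_fun` stands in for "has Chrome printed its debug URL yet".

```elixir
defmodule LaunchWaitSketch do
  @moduledoc "Illustrative only: poll a check until it succeeds or a deadline passes."

  # `check_fun` returns {:ok, value} once ready, or :pending while still starting.
  def await(check_fun, timeout_ms, interval_ms \\ 50) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(check_fun, deadline, interval_ms)
  end

  defp poll(check_fun, deadline, interval_ms) do
    case check_fun.() do
      {:ok, value} ->
        {:ok, value}

      :pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          # Chrome may be perfectly healthy and simply slow; we just stopped waiting.
          {:error, :debug_url_not_found}
        else
          Process.sleep(interval_ms)
          poll(check_fun, deadline, interval_ms)
        end
    end
  end
end
```

With a ceiling tuned on a laptop, a six-second cold start on the server comes back as an error. Passing 45_000 for `timeout_ms` is the kind of change the fix amounted to.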

The reaper from the last post tried to murder its own replacement. This is my favorite bug of the whole project. The new CDP browsers registered no Wallaby session, so as far as ChromeManager was concerned, session_count was 0. And zombie_threshold(0) is 0. So the instant a sweep fired, the nuclear cleanup looked at the brand-new CDP Chrome doing real work, decided that any Chrome with zero sessions was by definition a zombie, and SIGTERM'd it mid-fetch. The WebSocket dropped, the fetch failed with a connection-closed error, and it took me longer than I'd like to admit to realize the call was coming from inside the house.
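The arithmetic behind that bug fits in a few lines. This is a reconstruction from the formula and the zero-session special case described in these posts, not the deleted ChromeManager itself. The module name and the zombies?/2 helper are hypothetical, and the strict greater-than comparison is an assumption.

```elixir
defmodule ThresholdSketch do
  @moduledoc "Illustrative only: the old threshold rule and why it flagged CDP browsers."

  # Zero-session special case: with no sessions, no Chrome is expected at all.
  def zombie_threshold(0), do: 0

  # Otherwise tolerate (sessions x 5) + 8 Chrome processes.
  def zombie_threshold(sessions) when is_integer(sessions) and sessions > 0 do
    sessions * 5 + 8
  end

  # Assumption: the sweep fires when the live count exceeds the threshold.
  def zombies?(chrome_count, sessions), do: chrome_count > zombie_threshold(sessions)
end
```

With two Wallaby sessions the threshold is 18, so 18 processes pass. With zero Wallaby sessions and one CDP browser running its handful of processes, `zombies?(6, 0)` is true, and the sweep treats working browsers as leaks.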

The stopgap was to teach the old reaper to leave the new browsers alone. cdp_ex launches with its own temp profile, so it's identifiable by its --user-data-dir, and I added it to the exclude list with a comment that is basically this post in miniature:

 @excluded_process_patterns [
   "claude", "electron", "cursor.app", "vscode", "code helper",
   "chrome_crashpad_handler",
+  # cdp_ex Chrome reaps itself (CDPEx.with_page on teardown), so the sweep must
+  # skip it: a cdp fetch registers no session, so session_count is 0, so
+  # zombie_threshold/1 is 0, and the nuclear cleanup would SIGTERM every Chrome.
+  "cdp_ex"
 ]
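For a sense of how a pattern list like this gets applied, here is a minimal sketch. It assumes a case-insensitive substring match against each process's full command line, which may not be exactly what the old module did. ExcludeSketch, excluded?/1, and the example paths in the test are hypothetical.

```elixir
defmodule ExcludeSketch do
  @moduledoc "Illustrative only: skip processes whose command line matches a known pattern."

  @patterns [
    "claude", "electron", "cursor.app", "vscode", "code helper",
    "chrome_crashpad_handler",
    "cdp_ex"
  ]

  # Assumption: a case-insensitive substring match on the whole command line,
  # so a --user-data-dir path containing "cdp_ex" is enough to be skipped.
  def excluded?(command_line) when is_binary(command_line) do
    line = String.downcase(command_line)
    Enum.any?(@patterns, &String.contains?(line, &1))
  end
end
```

The match only works because the temp profile path carries the library's name, which is why the --user-data-dir doubles as an ownership tag here.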

Once every host was cut over, the exclude line went away with the rest of the module.

Cloudflare was an anticlimax. The site I was most nervous about sits behind Cloudflare's bot checks, and I'd budgeted time for the usual cat-and-mouse: stealth plugins, fingerprint patching, the works. The first real CDP fetch rendered the page in full, no challenge. The durable lesson here isn't "CDP beats Cloudflare," because it doesn't, and Cloudflare's posture shifts month to month. It's that a lot of basic bot checks are really checks for whether you're a real browser running the page's JavaScript, and CDP drives exactly that: real Chrome, real page. For my sites that was enough on its own. For a hardened target it won't be, and cdp_ex is deliberately not a stealth toolkit. If you try this and hit a challenge wall, that's the expected outcome, not a regression.

Where it landed

The browser layer is roughly half the moving parts it used to be. The manager and everything around it went with the cutover, and the failure mode that started this whole thing, zombie Chrome eating a 2 GB box, can't recur, because nobody is accumulating browsers anymore.

cdp_ex is open source and on Hex. It does the things I needed: launch and a warm Pool, navigation with real readiness waits, JavaScript evaluation, screenshots and PDFs, network observation and request interception, HTTP and proxy auth (including auto-answering an authenticated proxy), and :telemetry so you can see what it's doing in production.

It is also young, and I'd rather tell you what it isn't. It's Chrome and Chromium only, because it speaks CDP. It is not a stealth or anti-detection framework, and I have no plans to make it one. If you need cross-browser support or a mature feature-test DSL, Wallaby and Playwright are still the right tools and I'm not trying to talk you out of them. But if you want to drive Chrome from Elixir as a supervised, self-reaping resource, with no ChromeDriver daemon and no Node runtime hanging off the side of your release, that's the entire reason it exists.

The lesson from the last post was "respect the process tree you live in." This was the logical end of that. Once the browser is just another process you own, all the counting and guessing and nuclear cleanup quietly stops being your problem. You delete it, and nothing misses it.

cdp_ex on GitHub · docs