Thirty-seven Chromes: taming zombie Chrome on the BEAM

I run livein.city, a concert-listings site for Bangkok. A chunk of its plumbing is a scraper that drives headless Chrome to render JavaScript-heavy ticketing pages, after which an LLM pulls out the event data. The whole thing lives on a single 2 GB Fly.io machine.

For a while, scraping would just… stop. New concerts dried up. The logs were a wall of invalid session id. And when I SSH'd in:

$ ps aux | grep -c chrome
37

Thirty-seven Chrome processes (give or take the grep itself) on a box that should have had a handful. They'd eaten the memory, the kernel had started OOM-killing things, and every new browser session died on arrival.

There were a few root causes: shared-memory limits, Chrome's memory appetite, a startup race. But the most embarrassing one was self-inflicted. The code I'd written to clean up zombie Chrome processes was killing my live sessions.

This is the story of getting that code right, and of everything I had to do before I even earned the right to have that problem. So it goes in that order: first the toll I paid just to keep Chrome alive at all, then the bug I caused trying to keep it tidy.

Why Wallaby, anyway?

Honest answer: I don't really remember deciding.

Wallaby is the browser-automation library in Elixir, the one built for feature tests, the one every forum thread points at. I wasn't writing feature tests. I needed to load a ticketing page, let its JavaScript run, and read the rendered DOM. But Wallaby already knew how to talk to ChromeDriver, and I had zero interest in hand-rolling that part. So I reached for the thing that existed, bent a testing tool into a scraper, and got back to the actual problem.

That mismatch, a tool designed for short-lived test sessions doing long-running production scraping, is the seam everything else in this post tore along. I don't think it was the wrong call at the time. It just came with a bill, and this is me paying it.

The one fork I couldn't avoid

The ticketing sites I scrape are not fast. Some pages take the better part of a minute to settle: slow backends, a lot of client-side rendering, the occasional anti-bot interstitial.

Wallaby talks to ChromeDriver over HTTP, and somewhere along the way it moved from HTTPoison to Erlang's built-in :httpc. The default timeouts there are tighter than the old ones, and on a slow render the request to ChromeDriver would give up before the page had finished loading. From the outside it looked like a scrape failure. Underneath it was an HTTP client timing out on its own driver.

I couldn't configure my way out of it, so I forked Wallaby. The change is eleven lines in one file:

  defp httpc_http_options(url) do
-    [
-      autoredirect: false,
-      ssl: ssl_options(url)
-    ]
+    user_opts = Application.get_env(:wallaby, :httpc_options, [])
+
+    Keyword.merge(
+      [
+        autoredirect: false,
+        timeout: 240_000,
+        connect_timeout: 30_000,
+        ssl: ssl_options(url)
+      ],
+      user_opts
+    )
  end

240 seconds for the request, 30 for the connect, generous enough that a slow page is just slow rather than a failure. I made it overridable with config :wallaby, :httpc_options so I wasn't maintaining a hard fork for the sake of one knob, and so the change had a shot at being useful to someone else.
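
As a rough sketch of how that override behaves: Keyword.merge/2 lets the second list win on duplicate keys, so anything set under :httpc_options replaces the fork's default while untouched keys keep theirs. The 60-second value below is made up for illustration, and the ssl entry is left out because it depends on the URL.

```elixir
# Defaults from the fork, minus the ssl entry (which depends on the URL).
defaults = [autoredirect: false, timeout: 240_000, connect_timeout: 30_000]

# Hypothetical override, the kind of thing you'd put in config as:
#   config :wallaby, httpc_options: [timeout: 60_000]
user_opts = [timeout: 60_000]

# Keyword.merge/2: on duplicate keys, the second list wins.
merged = Keyword.merge(defaults, user_opts)

IO.inspect(merged, label: "effective httpc options")
```

Only the request timeout changes here; the connect timeout and autoredirect keep the fork's values.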

In mix.exs it's just a pinned git dep:

{:wallaby, github: "patrols/wallaby", ref: "ed04b6e"},

Forking a dependency always feels heavier than it is. In practice it's a ref in a file and a note-to-self to watch upstream.

Getting Chrome to survive on a 2 GB box

Before any of the zombie drama, I had to get Chrome to stay alive in a container at all. Several separate fights, all of which ended up as flags or config. You can skim this part if you came for the concurrency bug. None of it is clever. It's just the toll.

Shared memory and OOM. Chrome uses /dev/shm for a lot of its scratch space, containers hand it a tiny 64 MB default, Chrome blows past it, and the renderer crashes (which surfaces to you as, yes, invalid session id). On a heavy page the renderer will also happily try to eat the whole machine until the kernel ends the argument. The fix for both was a pile of flags: send shared memory to ordinary temp storage, cap V8's heap, limit renderers, shrink the caches to almost nothing.

--disable-dev-shm-usage
--renderer-process-limit=1
--js-flags=--max-old-space-size=512
--disk-cache-size=1
--media-cache-size=1

I deliberately skipped --single-process and disabling site isolation. Tempting on a small box, but they trade away Chrome's security boundaries, and I'd rather buy the memory back some other way.

A second batch of flags exists purely to stop Chrome spawning processes I don't need. This is also what later keeps the "how many Chromes is too many" math sane:

--no-zygote
--disable-breakpad
--disable-component-update
--disable-background-networking
--disable-extensions
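
For reference, a minimal sketch of how flags like these can be bundled for ChromeDriver, assuming the usual chromeOptions capability with an args list. The module name is hypothetical, and the exact option that carries this map depends on your Wallaby version, so treat the shape as an assumption to check rather than a recipe.

```elixir
defmodule ChromeFlagsSketch do
  # Flags copied verbatim from the two lists above.
  @memory_flags [
    "--disable-dev-shm-usage",
    "--renderer-process-limit=1",
    "--js-flags=--max-old-space-size=512",
    "--disk-cache-size=1",
    "--media-cache-size=1"
  ]

  @process_flags [
    "--no-zygote",
    "--disable-breakpad",
    "--disable-component-update",
    "--disable-background-networking",
    "--disable-extensions"
  ]

  def args, do: @memory_flags ++ @process_flags

  # ASSUMPTION: capabilities carry the flags under chromeOptions.args.
  # Verify the key names against the Wallaby/ChromeDriver version you run.
  def capabilities, do: %{chromeOptions: %{args: args()}}
end

IO.inspect(ChromeFlagsSketch.capabilities())
```

Keeping the two batches as separate lists makes it easier to remember which flags are there for memory and which are there to cut the process count.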

A writable HOME. This one cost me an afternoon. Chrome wants a writable home directory for ~/.pki/nssdb, and the nobody user it runs as has HOME=/nonexistent, so it falls over before doing anything interesting. One line, in two places:

# Dockerfile
ENV HOME=/tmp

# fly.toml
[env]
HOME = '/tmp'

The 4:00:00 PM race. Some scrapes run on Oban cron at exact times, and the ones scheduled on the dot, like 16:00:00, would fail with invalid session id. The cause: my "is Wallaby up?" check only verified that Wallaby.Driver.Supervisor existed, not that ChromeDriver was actually initialized and ready. Start a session in that window and you lose before you've done anything wrong. I added an explicit readiness wait, and because Wallaby-as-a-long-running-service isn't really what Wallaby is for, I also wrapped it in a small supervisor that health-checks it every couple of minutes and restarts it if it wedges, capped at five restarts an hour so a broken state can't turn into a loop.

# Before: a supervisor existing is not the same as it being ready
_pid -> :ok

# After: actually wait for ChromeDriver to come up
_pid ->
  case BrowserPool.wait_until_ready() do
    :ok -> :ok
    error -> error
  end
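
The body of wait_until_ready isn't shown above, so here is one plausible shape for it, not the real implementation: poll a check function until it passes or a deadline expires. The module name, the injected check function, and the timeout and interval defaults are all assumptions made for this sketch.

```elixir
defmodule ReadinessSketch do
  @doc """
  Polls `check` (a zero-arity function returning a boolean) every
  `interval_ms` until it returns true or `timeout_ms` has elapsed.
  """
  def wait_until_ready(check, timeout_ms \\ 10_000, interval_ms \\ 200) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(check, deadline, interval_ms)
  end

  defp poll(check, deadline, interval_ms) do
    cond do
      check.() ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(interval_ms)
        poll(check, deadline, interval_ms)
    end
  end
end

# Stand-in for "is ChromeDriver answering yet?": ready on the third poll.
{:ok, counter} = Agent.start_link(fn -> 0 end)
check = fn -> Agent.get_and_update(counter, fn n -> {n + 1 >= 3, n + 1} end) end

result = ReadinessSketch.wait_until_ready(check, 2_000, 10)
never = ReadinessSketch.wait_until_ready(fn -> false end, 50, 10)
```

Using the monotonic clock for the deadline keeps the wait honest if the system clock jumps. In a real check you would probably hit ChromeDriver's status endpoint instead of a counter.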

And then the change that mattered more than all the flags combined. I'd started on a single shared CPU to save a few dollars. The fix was one number:

# fly.toml
[[vm]]
memory = '2gb'
cpus = 2

Going from one CPU to two took scroll batches from about 72 seconds down to about 5.

I'd been blaming the sites.

One "Chrome" is not one process

Now the actual subject.

The first thing that breaks your intuition: a single headless Chrome isn't a process, it's a swarm. Main process, renderer, GPU process, a couple of utility processes, crashpad handler, zygote, network service. One browser legitimately spawns 8–12 OS processes. Fewer once you start cutting them with the flags above, but never one.

So the obvious zombie check, "if there are more than N Chrome processes, something leaked," has no fixed N. The healthy number scales with how many sessions are actually live. Set the threshold too high and zombies pile up until you OOM. Set it too low and you execute your own working browser mid-scrape.

And before you can threshold the count, you have to take it, which is its own small adventure. pgrep -f matches against the full command line, so anything with "chrome" anywhere in its arguments turns up as a candidate. The manager therefore runs pgrep for candidates, then ps to read each full command line, then filters down to actual browser processes. (That filter is load-bearing. Hold that thought.)

# Simplified; the real one has the error handling you'd expect.
# count_chrome_processes/0 is just length(chrome_browser_pids()).
defp chrome_browser_pids do
  # pgrep exits 1 when nothing matches; treat any failure as "no Chrome"
  with {output, 0} <- System.cmd("pgrep", ["-f", "chrome"]),
       pids = String.split(output, "\n", trim: true),
       # pgrep only gives PIDs, so re-query ps for each full command line
       {ps_output, 0} <-
         System.cmd("ps", ["-p", Enum.join(pids, ","), "-o", "pid=,command="]) do
    ps_output
    |> String.split("\n", trim: true)
    |> Enum.reject(&should_exclude_process?/1)      # drop the impostors
    |> Enum.map(fn line ->                          # the PID is the first column
      line |> String.trim() |> String.split(~r/\s+/, parts: 2) |> List.first()
    end)
  else
    _no_match -> []
  end
end

Attempt #1: clean up after each session (this made it worse)

My first instinct was the tidy one: when a scraping session ends, check for leftover Chrome and reap it. Clean up your own mess.

It was a disaster, and the reason is pure concurrency. The scraping queue runs jobs back-to-back, so the sequence was:

1. Session A finishes. A cleanup task spawns to check that A's processes are gone.
2. Session B starts immediately, and its Chrome processes begin spawning.
3. A's cleanup task looks at the process table and sees a pile of Chrome: A's still dying, plus B's just being born.
4. "That's way too many." It kills all of them, including B, which was perfectly healthy.
5. Session B: invalid session id.

I eventually wrote the epitaph straight into the module doc, so I'd never be tempted again:

Post-session verification was removed because it caused race conditions when scraping jobs ran back-to-back. The verification task would see overlapping Chrome processes (old ones still dying + new ones starting) and kill ALL processes, including the active session.

The lesson that stuck: event-triggered cleanup races with whatever happens next. What I actually wanted was a periodic loop that reconciles reality against known state on its own clock, closer to how a Kubernetes controller thinks than to a finally {} block.

Attempt #2: a threshold that scales with sessions

So cleanup moved into a ChromeManager GenServer that tracks how many sessions are registered and runs two timers: a stale-session sweep every minute and a health check every ten. The one-minute sweep does the real work; the ten-minute check is a coarser backstop. In the worst case a leak sits for up to ten minutes before the slow timer notices, but the fast sweep almost always gets there first, and for a background scraper I could live with that. Both timers ask the same question: given how many sessions are live, are there too many Chrome processes? The answer is a formula.

The GenServer arms both timers at startup and re-arms each as it fires. A plain loop that never blocks:

def init(_opts) do
  schedule_cleanup()       # stale-session sweep, every 1 min
  schedule_health_check()  # zombie check, every 10 min
  {:ok, %__MODULE__{}}
end

def handle_info(:health_check, state) do
  perform_health_check(state)
  schedule_health_check()   # re-arm
  {:noreply, state}
end

defp perform_health_check(state) do
  chrome_count  = count_chrome_processes()
  session_count = map_size(state.sessions)

  if chrome_count > zombie_threshold(session_count) do
    spawn_cleanup_task(state.sessions)
  end
end

defp schedule_health_check do
  Process.send_after(self(), :health_check, to_timeout(minute: 10))
end
# schedule_cleanup/0 is the 1-minute twin (it also ages out stale sessions).

That zombie_threshold/1 is the whole ballgame, and I got it wrong before I got it right. My first version was (sessions × 3) + 3. Still too aggressive: a single healthy session can transiently sit at 10–12 processes, which sails past a budget of 6 and triggers a nuke. Back to invalid session id, just less often.

The version that finally held:

# Chrome legitimately spawns 8-12 processes per instance
# (main, renderer, GPU, utility×2, crashpad, zygote, network service, …).
# Previous 3x was too aggressive and killed active sessions.
@health_check_zombie_multiplier 5
@health_check_min_processes 8

def zombie_threshold(session_count) when session_count > 0 do
  session_count * @health_check_zombie_multiplier + @health_check_min_processes
end

Which works out to:

Active sessions	Cleanup fires above
1	13 processes
2	18 processes
3	23 processes
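
To make the comparison with the first formula concrete, here is the same arithmetic as a small sketch. The module and function names are made up for illustration; the two formulas and the 12-process spike are the ones described above.

```elixir
defmodule ThresholdSketch do
  # First attempt: (sessions × 3) + 3
  def old_threshold(sessions) when sessions > 0, do: sessions * 3 + 3

  # The version that held: (sessions × 5) + 8
  def new_threshold(sessions) when sessions > 0, do: sessions * 5 + 8

  # Cleanup fires only when the count is strictly above the threshold.
  def fires?(count, threshold), do: count > threshold
end

# One healthy session spiking to 12 processes during launch:
old_fires = ThresholdSketch.fires?(12, ThresholdSketch.old_threshold(1))  # 12 > 6
new_fires = ThresholdSketch.fires?(12, ThresholdSketch.new_threshold(1))  # 12 > 13

IO.inspect({old_fires, new_fires})
```

The old budget kills a healthy browser on a launch spike; the new one leaves it exactly one process of slack at the top of the observed range.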

I should be straight about where those numbers come from: they aren't derived from the 8–12 figure. With the reduction flags from earlier, a browser actually sits closer to five processes in steady state, but it spikes higher during launch and teardown. So I tuned the 8 and the 5× to the worst transient I actually watched happen, not to a clean model of the swarm. With at least one session live, that gives a browser its full footprint plus generous headroom. It detects zombies a little more slowly, but it never guillotines live work. For a background scraper, "slightly slow" beats "destroys the thing it's protecting."

And it held. The invalid session id storms stopped, and I never had to touch the numbers again. (More on exactly how long "held" turned out to be at the end.)

The zero-session case is the whole game

There's one clause that matters more than the rest:

def zombie_threshold(0), do: 0

When no sessions are registered, the tolerance is zero. Any lingering Chrome is, by definition, a zombie. Skip this and the math quietly betrays you: with zero sessions the threshold would still be 8, so up to eight orphaned processes can sit there forever, never crossing the line. Worse, they hold the lock on the shared --user-data-dir, so the next session can't even start. The accumulation is invisible right up until everything is wedged.
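
A tiny sketch of that failure mode, with hypothetical function names: eight orphans checked against the bare formula and against the version with the zero clause.

```elixir
defmodule ZeroCaseSketch do
  # Without the special case: the formula alone, for any session count.
  def formula_only(sessions), do: sessions * 5 + 8

  # With it: nothing registered means nothing is tolerated.
  def with_zero_clause(0), do: 0
  def with_zero_clause(sessions) when sessions > 0, do: sessions * 5 + 8
end

orphans = 8

# Cleanup fires only when the count is strictly above the threshold.
fires_without_clause = orphans > ZeroCaseSketch.formula_only(0)      # 8 > 8
fires_with_clause = orphans > ZeroCaseSketch.with_zero_clause(0)     # 8 > 0

IO.inspect({fires_without_clause, fires_with_clause})
```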

The nuclear option, owned honestly

Here's the ugly part I had to make peace with: Wallaby and ChromeDriver don't expose a session-to-PID mapping. I know I have two live sessions. I cannot tell you which of the 37 Chrome PIDs belong to them.

So when the threshold trips, cleanup kills every Chrome browser process, not just the leaked ones:

# DESIGN LIMITATION: We cannot identify which specific Chrome PIDs belong to
# which sessions (Wallaby/ChromeDriver don't expose session-to-PID mappings).
# Therefore this uses a "nuclear option" approach — if the threshold is
# exceeded, ALL Chrome processes are killed, including potentially active ones.
# zombie_threshold/1 is the trade-off between false-positive kills and leaving
# zombies behind. Future enhancement: track PIDs per session.

The generous threshold is what buys the right to be this blunt. By the time you're over budget, the odds you're looking at zombies rather than legitimate load are high. I'd rather have one honest, well-commented blunt instrument than a clever PID-tracking scheme that's subtly wrong. (Per-session PID tracking is the obvious better answer, but it's a real project, not a patch. More on that at the end.)

The BEAM-specific bits

This is where running Chrome from Elixir stops being incidental and starts mattering.

Don't kill ChromeDriver. Cleanup reaps Chrome browser processes and deliberately leaves ChromeDriver alone. ChromeDriver is owned by a Wallaby Port, which is owned by a GenServer in my supervision tree. SIGTERM it and that GenServer crashes with {:exit_status, 143}, which trips the restarter into a full Application.stop/start cycle. I'd be nuking my own app to tidy up a browser. The processes that actually block new sessions are the Chrome ones squatting on the --user-data-dir lock, so those are all I touch.

Don't block the GenServer. Killing means SIGTERM, a 3-second grace wait, then SIGKILL on the survivors:

Enum.each(pids, &System.cmd("kill", ["-15", &1]))   # SIGTERM all
Process.sleep(3000)                                  # one grace period

survivors =
  Enum.filter(pids, fn pid ->
    match?({_, 0}, System.cmd("ps", ["-p", pid]))    # exit 0 = still running
  end)

Enum.each(survivors, &System.cmd("kill", ["-9", &1]))

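That sleep(3000) is poison inside a GenServer. Every session_count call would queue up behind it, and with GenServer.call's default 5-second timeout it doesn't take many slow kills before callers start timing out. So the ChromeManager only ever decides. The actual killing happens in a supervised Task: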

defp spawn_cleanup_task(tracked_sessions) do
  task_fn = fn ->
    before = count_chrome_processes()
    killed = kill_orphaned_processes(tracked_sessions)
    if killed > 0 do
      Logger.info("cleanup: #{before} -> #{count_chrome_processes()} Chrome procs")
    end
  end

  # On a supervised Task so the 3s+ kill path never blocks the GenServer
  # (and a crash in the kill path can't take the manager down with it).
  Task.Supervisor.start_child(Pulse.TaskSupervisor, task_fn)
end

The manager stays responsive; the slow, dirty syscalls happen off to the side. (kill_orphaned_processes/1, the threshold re-check plus the bulk kill above, is elided here for length, along with its force-cleanup twin.)
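
If you want to convince yourself of the "decide in the process, act in a Task" split, here is a self-contained sketch. Everything in it is a stand-in: DeciderSketch is a hypothetical module, the one-second sleep plays the role of the kill path, and the Task.Supervisor is started inline rather than under an application tree.

```elixir
defmodule DeciderSketch do
  use GenServer

  def start_link(task_sup), do: GenServer.start_link(__MODULE__, task_sup)

  # 500 ms call timeout: shorter than the slow work, on purpose.
  def ping(pid), do: GenServer.call(pid, :ping, 500)
  def trigger(pid, notify), do: GenServer.cast(pid, {:trigger, notify})

  @impl true
  def init(task_sup), do: {:ok, task_sup}

  @impl true
  def handle_call(:ping, _from, task_sup), do: {:reply, :pong, task_sup}

  @impl true
  def handle_cast({:trigger, notify}, task_sup) do
    # The GenServer only decides; the slow part runs in a supervised Task.
    {:ok, _task_pid} =
      Task.Supervisor.start_child(task_sup, fn ->
        Process.sleep(1_000)          # stand-in for SIGTERM, wait, SIGKILL
        send(notify, :cleanup_done)
      end)

    {:noreply, task_sup}
  end
end

{:ok, sup} = Task.Supervisor.start_link()
{:ok, manager} = DeciderSketch.start_link(sup)

DeciderSketch.trigger(manager, self())
reply = DeciderSketch.ping(manager)   # answered while the Task is still sleeping
IO.inspect(reply)
```

Move the sleep into handle_cast itself and the same ping exits with a timeout, which is the failure described above.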

Snapshot before you mutate. The manual force_cleanup_all grabs the PIDs to kill before clearing its session map. Otherwise a session that registers during the async cleanup would have its brand-new processes swept by a task that started life believing the world was empty. Attempt #1's race, in miniature:

def handle_call(:force_cleanup_all, _from, state) do
  # Snapshot PIDs *before* clearing sessions. If we cleared first and fetched
  # PIDs inside the async task, any session that registers mid-cleanup would
  # get its fresh Chrome processes killed.
  chrome_pids = chrome_browser_pids()
  spawn_force_cleanup_task(chrome_pids)
  {:reply, :ok, %{state | sessions: %{}}}
end

Read the world, then forget it. In that order.

The footgun that cost me a few editor windows

One last thing, because it bit me on my own laptop. The naive way to reap Chrome is pkill -f chrome. The -f flag matches against the full command line, and on a dev machine a startling number of things have "chrome" or "electron" in theirs. VS Code. Cursor. Claude. "Code Helper." I closed my own editor more than once before adding an exclude list:

@excluded_process_patterns [
  "claude", "electron", "cursor.app",
  "vscode", "code helper", "chrome_crashpad_handler"
]

That's the filter chrome_browser_pids reached for earlier. It's also where ChromeDriver gets spared. Both the editor apps and the driver I must not touch are excluded by one predicate:

def should_exclude_process?(command_line) do
  downcased = String.downcase(command_line)

  String.contains?(downcased, "chromedriver") ||
    Enum.any?(@excluded_process_patterns, &String.contains?(downcased, &1))
end
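
To see the predicate at work, and where substring matching can bite, here is a runnable sketch. The patterns and predicate are repeated from above so it stands alone; the module name and the sample command lines are hypothetical, shaped like ps -o pid=,command= output.

```elixir
defmodule ExcludeSketch do
  # Same patterns and predicate as above, repeated so this runs on its own.
  @excluded_process_patterns [
    "claude", "electron", "cursor.app",
    "vscode", "code helper", "chrome_crashpad_handler"
  ]

  def should_exclude_process?(command_line) do
    downcased = String.downcase(command_line)

    String.contains?(downcased, "chromedriver") ||
      Enum.any?(@excluded_process_patterns, &String.contains?(downcased, &1))
  end
end

lines = [
  "4101 /usr/bin/google-chrome --headless --user-data-dir=/tmp/chrome-profile",
  "4102 /usr/local/bin/chromedriver --port=9515",
  "4103 /Applications/Cursor.app/Contents/MacOS/Cursor",
  # Substring matching cuts both ways: this is a real browser, but its
  # profile path happens to contain "claude", so it would be spared.
  "4104 /usr/bin/google-chrome --headless --user-data-dir=/tmp/claude-scrape"
]

kept = Enum.reject(lines, &ExcludeSketch.should_exclude_process?/1)
IO.inspect(kept, label: "treated as browser processes")
```

Only the first line survives the filter. The last one is the kind of miss a denylist invites: a zombie that happens to match an excluded pattern never gets counted or killed.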

It's a denylist, with the fragility that implies. If you're going to pkill by pattern, enumerate what you must never match first, and accept that you'll forget one. (There's also a separate periodic Oban sweep as a cruder backstop, on the theory that two dumb safety nets beat one clever one.)

What I'd take to the next system

When you can't attribute a resource precisely, budget generously, and police the empty case ruthlessly. (sessions × 5) + 8 for the normal case, hard zero when nothing should be running.
Prefer periodic reconciliation over event-triggered cleanup. Reacting to "a session ended" races with "a session started." A loop that reconciles against known state doesn't.
Keep slow, blocking, failure-prone syscalls out of your GenServers. Decide in the process, act in a Task.
Respect the process tree you live in. On the BEAM, the thing you're tempted to kill might be held by a Port that's held by a supervisor that's holding up your whole app.

The constants are calibrated for my workload, one session at a time on a small box, so measure your own. But the shape of the answer (budget for the swarm, zero-tolerance the void, reconcile on a timer, never block the process keeping score) held up.

None of this was wasted, even though I later deleted every line of it. The formula shipped in February and ran in production for about four months without me touching it again. It was a patch, not a rewrite, and that's the point: it kept the scraper alive long enough that I could fix the actual problem calmly instead of at 3 AM.

Because the real problem was never the threshold. It was that ChromeDriver wouldn't tell me which PIDs were mine, so I was stuck counting processes and guessing. The actual fix was to stop guessing: drive Chrome over the DevTools Protocol directly, where each browser is something I launch, own, and reap myself, with no shared profile lock and nothing to sweep. That meant leaving Wallaby behind entirely, fork and all.

That's the next post.