Patrick Rendal's Blog

Hot-deploying Phoenix on Fly.io without restarting the world

Tue, 16 Jun 2026 13:29:42 +0700

A while back I watched Chris McCord's demo where he pushes a change to a running Phoenix app in production and it just... takes effect. No restart, no reconnect, the LiveView session sitting right there keeps its state. I filed it under "neat, someday" and moved on.

Then I actually set it up on the app I run, and it's quietly become how I ship small changes. So here's the whole thing: install, config, daily use, and the couple of footguns that cost me an afternoon so they don't cost you one.

Everything below is checked against fly_deploy 0.4.2 (the version I'm running). One heads-up worth its own mention: at the time of writing the library's README lags its own code in a few places — it still documents an older setup function and an old version pin. I'll flag those as I go, but in general trust the moduledoc / @deprecated annotations over the README, and double-check against whatever version you install.

What this actually buys you

A normal deploy on Fly builds a new image and swaps your machines out. The BEAM restarts, in-flight requests get cut, and every connected LiveView reconnects from zero. For most apps that's fine. For a LiveView-heavy one it's a visible little hiccup every single time you ship.

A hot deploy skips all of that. It builds a release, ships the compiled .beam files to the VM that's already running, and the machine loads the modules that changed in place — briefly suspending only the processes that use them (typically under a second). The OS process never dies. Open LiveViews keep their state. From the outside nothing happened, except your fix is live.

So it's a great fit for the boring 95%: a template tweak, some LiveView logic, a context function, a bug fix in a module that already exists. It is not for structural stuff, but I'll get to that.

Getting it running

You'll need an app already on Fly (with fly deploy working) and a bucket to hold the compiled releases. On Fly that's Tigris, and it's one command.

Add the dep:

# mix.exs
{:fly_deploy, "~> 0.4"}

(The README example still shows ~> 0.1.0 — that's one of the stale spots; 0.4.x is current.)

Provision the bucket and set the secrets:

# Creates a Tigris bucket and sets AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
# BUCKET_NAME, AWS_ENDPOINT_URL_S3 and AWS_REGION as app secrets.
fly storage create

# The orchestrator needs a token to list machines and trigger the upgrade RPC.
# This one is easy to miss and the hot deploy fails without it.
fly secrets set FLY_API_TOKEN="$(fly tokens create machine-exec)"

Then turn it on in config/runtime.exs. I gate it behind the bucket being present so it stays completely dormant in dev and anywhere the storage isn't wired up:

if System.get_env("BUCKET_NAME") do
  config :fly_deploy,
    bucket: System.get_env("BUCKET_NAME"),
    # Make sure this matches YOUR bucket's endpoint. The library defaults to
    # https://t3.storage.dev; my Tigris bucket lives at fly.storage.tigris.dev,
    # and leaving the default gave me 403s on the orchestrator's deploy lock.
    # `fly storage create` sets AWS_ENDPOINT_URL_S3 — pin it here so the two
    # never drift.
    aws_endpoint_url_s3: System.get_env("AWS_ENDPOINT_URL_S3")

  config :my_app, fly_deploy: true
end

Last piece — start the poller under your supervision tree. Put it at the top of your children:

# lib/my_app/application.ex
defp fly_deploy_children do
  if Application.get_env(:my_app, :fly_deploy, false) do
    [{FlyDeploy, otp_app: :my_app}]
  else
    []
  end
end

and prepend fly_deploy_children() to your children list. The "top" matters: the poller blocks in its init/1 to apply any pending upgrade before the rest of the tree starts, so a machine that restarts (after a crash, scaling, or a deploy) loads the current code before any of your processes come up.

The README tells you to call FlyDeploy.startup_reapply_current/1 from Application.start/2 instead. That function still exists but is @deprecated in 0.4.2 in favor of the {FlyDeploy, otp_app: :my_app} child spec above — a good example of trusting the code over the README.
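For concreteness, here is a minimal sketch of how the wiring might look in a typical Phoenix application module. MyApp.Repo and MyAppWeb.Endpoint are placeholders for whatever your tree already starts, and children/0 is public only so the ordering is easy to inspect. The one thing taken from the setup above is that the FlyDeploy child spec comes first.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    Supervisor.start_link(children(), strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # Public so the ordering can be checked from IEx or a test.
  # MyApp.Repo and MyAppWeb.Endpoint are placeholder children.
  def children do
    fly_deploy_children() ++ [MyApp.Repo, MyAppWeb.Endpoint]
  end

  # Empty unless runtime.exs switched the feature on for this environment.
  defp fly_deploy_children do
    if Application.get_env(:my_app, :fly_deploy, false) do
      [{FlyDeploy, otp_app: :my_app}]
    else
      []
    end
  end
end
```

With the flag off, the list is just your usual children, so dev and test never start the poller.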

That's the install. Because it touches mix.exs and the supervision tree, you need one more regular fly deploy to get it onto your machines. After that, you're hot.

Using it day to day

From your laptop:

mix fly_deploy.hot

It builds a release, uploads the tarball, flips the "current" pointer in the bucket, and tells each running machine to load the new code. No restart. If you run more than one environment, point it at the right config:

mix fly_deploy.hot --config fly.staging.toml
mix fly_deploy.hot --dry-run   # see what it'd do first

And to see what's actually running where:

mix fly_deploy.status

I don't really trust a deploy until I've looked, so from a remote console I check:

FlyDeploy.get_current_hot_upgrade_info(:my_app)   # %{hot_upgrade_applied:, deployed_at:, ...}
FlyDeploy.current_vsn()                           # %{base_image_ref:, hot_ref:, version:, fingerprint:}

deployed_at should be a few seconds ago, and the image ref should be the one you expect. You can also subscribe to the lifecycle (FlyDeploy.subscribe/0, then handle {:fly_deploy, :hot_upgrade_complete, meta}) and forward that to Slack or a #deploys channel, which is nicer than remembering to check.
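A subscriber can be a very small GenServer. This is a sketch under a few assumptions: that FlyDeploy.subscribe/0 subscribes the calling process, that meta is safe to inspect as an opaque term, and that logging stands in for whatever Slack or webhook call you would actually make. MyApp.DeployNotifier is a made-up name.

```elixir
defmodule MyApp.DeployNotifier do
  @moduledoc "Logs completed hot upgrades. A sketch; swap Logger for a real notifier."
  use GenServer
  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Assumption: subscribe/0 registers the calling process for lifecycle messages.
    FlyDeploy.subscribe()
    {:ok, %{last_meta: nil}}
  end

  @impl true
  def handle_info({:fly_deploy, :hot_upgrade_complete, meta}, state) do
    # meta is treated as opaque here; its exact shape is not assumed.
    Logger.info("hot upgrade complete: #{inspect(meta)}")
    {:noreply, %{state | last_meta: meta}}
  end

  # Ignore any other lifecycle events or stray messages.
  def handle_info(_other, state), do: {:noreply, state}
end
```

Start it anywhere below the FlyDeploy child in your supervision tree so the subscription happens after the library is up.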

When to reach for it, and when not

The rule of thumb is simple: if your change can be expressed as "load a newer version of this module," hot-deploy it. If it can't, do a normal deploy.

Hot is great for templates and HEEx, LiveView and component code, context and business logic, and bug fixes in existing modules. Skip it and just fly deploy for anything that changes the shape of the system: new or removed deps, runtime config, the supervision tree (the library can't add/remove supervised children at runtime), background-job queues, database migrations, an Elixir or OTP bump, anything with NIFs/ports. And static assets, which I'll come back to.

When I'm not sure, I cold deploy. It's never the wrong answer, just the slower one.

The stuff that actually bit me

The setup is easy. These are the things I learned the annoying way.

The big one: your normal deploys need unique image tags. This cost me most of an afternoon and a genuinely confused debugging session, so let me spell it out.

On boot, fly_deploy re-applies the "current" hot upgrade recorded in the bucket — that's the restart-resilience feature, and it's good. To decide whether a booting machine is the same generation (re-apply the upgrade) or a fresh cold deploy (forget it and run the new image), it compares the machine's image ref (FLY_IMAGE_REF) against the one stored in the bucket metadata. Match → apply the upgrade; mismatch → reset.

If all your cold deploys ship the same tag, the classic :latest, those refs always match. So a brand-new image looks exactly like a restart, and on boot the machine cheerfully re-applies an old hot tarball on top of your fresh deploy, silently reverting your code. The worst part is that your release version still reports the new build while the code actually running is old. I stared at "but I literally just deployed this" for far too long.

The fix is boring: tag every deploy image uniquely (the git SHA works fine) so each cold deploy has a distinct ref and fly_deploy can tell a new generation from a restart. If you only ever deploy with fly deploy from the CLI you're probably already getting a unique image. This mostly bites hand-rolled CI that pushes :latest.
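The failure is easier to see as a toy model. This is not fly_deploy's code, only the comparison as I described it above, with made-up image refs: one function that decides between re-applying and resetting based on whether the two refs are equal.

```elixir
defmodule BootDecisionSketch do
  # Toy model of the boot-time check described in the text, not the library's code.
  # machine_image_ref stands for FLY_IMAGE_REF on the booting machine;
  # stored_image_ref stands for the ref recorded in the bucket metadata.
  def boot_action(machine_image_ref, stored_image_ref) do
    if machine_image_ref == stored_image_ref do
      :reapply_hot_upgrade
    else
      :reset
    end
  end
end

# Hypothetical refs. The hot upgrade was recorded against the first image.
stored_latest = "registry.fly.io/my-app:latest"
stored_sha = "registry.fly.io/my-app:3f2c1ab"

# A new cold deploy that reuses :latest is indistinguishable from a restart.
BootDecisionSketch.boot_action("registry.fly.io/my-app:latest", stored_latest)
# => :reapply_hot_upgrade

# A new cold deploy with its own tag is recognised as a new generation.
BootDecisionSketch.boot_action("registry.fly.io/my-app:9d41e07", stored_sha)
# => :reset
```

The first call is the bug: the machine is running a fresh image, but the equality check has no way to know that.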

Two smaller ones:

Hot deploys ship .beam files, not your compiled assets. The tarball is built from your release's .beam files; your priv/static bundle (the fingerprinted CSS/JS, images) isn't in it, so changed asset bytes don't ride along. The library does ship a FlyDeploy.Components.hot_reload_css component that nudges connected clients to re-fetch the stylesheet when the asset manifest changes — handy for the CSS case — but for an actual asset rebuild a cold deploy is the reliable path. Markup in your HEEx updates fine.

Your error tracker's release version goes stale. A hot deploy doesn't rebuild the image, so anything stamped in at build time, like a Sentry release or an OTel tag, keeps pointing at the last cold deploy. Errors after a hot deploy get filed under the old version. Not a big deal once you know, but it'll confuse you in the dashboard otherwise.

Worth it?

For me, easily. Shipping a small fix went from a restart-and-watch ritual to something instant and invisible, and not blowing away every connected LiveView on each deploy is genuinely nice. The setup is a dependency, a bucket, a couple of secrets, and one child at the top of your tree. Keep the hot/cold split in your head, give your deploys unique tags, and it mostly disappears into the background, which is exactly what you want from infrastructure.

I stopped counting Chrome processes and started owning them

Sun, 07 Jun 2026 11:28:30 +0700

In the last post I spent an embarrassing amount of time getting one formula right: (sessions × 5) + 8, the number of Chrome processes my scraper would tolerate before deciding something had leaked and killing them. It worked. It ran in production for about four months. And it was the wrong thing to be doing.

The formula was a workaround for a single missing capability. ChromeDriver would not tell me which Chrome processes belonged to which session. So I counted them, I guessed, and when the guess said "too many" I killed all of them and hoped the live ones weren't in the pile. Every problem in that post, the swarm-counting, the shared-profile lock, the nuclear cleanup, grew out of that one blind spot.

At some point the obvious question finally landed: what if I just owned the browser?

The thing ChromeDriver sits in the middle of

When you drive Chrome through Wallaby, the stack is taller than it looks. Your code talks to Wallaby, Wallaby talks HTTP to ChromeDriver, ChromeDriver is a separate daemon that launches and supervises Chrome, and Chrome is the swarm of processes from last time. ChromeDriver owns the browser. You get a session id and some polite requests.

That middle layer is where my blind spot lived. ChromeDriver launched the Chrome processes, so ChromeDriver knew their PIDs. I didn't. And because every session pointed at the same --user-data-dir, a leftover process from one session could lock the profile and lock out the next one. I was downstream of a daemon that wasn't going to tell me anything useful.

But Chrome doesn't actually need ChromeDriver. ChromeDriver is a WebDriver-to-CDP translator, and CDP, the Chrome DevTools Protocol, is the thing I wanted to talk to anyway. It's the same protocol the devtools panel in your browser uses. You launch Chrome with --remote-debugging-port, it prints a WebSocket URL, and over that socket you can navigate, evaluate JavaScript, take screenshots, watch the network, all of it.

The important part, for me, was the launch. If I spawn Chrome, I hold the OS process. I connect to it over the WebSocket. When I'm done, I kill the process I started. No daemon in between deciding things on my behalf. No shared profile, because each launch gets its own throwaway one. The "which PID is mine" question stops being a question, because all of them are mine and I have a handle to the thing that owns them.

Why I wrote a library instead of using one

This is the part you should be most suspicious of, so I'll lead with the least flattering reason: I wanted to. On a side project, "wrap someone else's library" and "write the library" are not equally fun, and that's a real input even if it never makes it onto an architecture diagram.

But there was a gap underneath the fun, and it's the actual argument. The existing options split two ways. Wallaby is built for feature tests, and I'd already forked it once (last post). Playwright, Puppeteer, and chrome-remote-interface are mature and they're Node or JavaScript. A couple of older Elixir CDP experiments exist, but nothing I'd hang production scraping on.

Here's the part that decided it. The CDP transport was never the hard bit. Talking the protocol is JSON over a WebSocket with request IDs and an event stream: fiddly, but a solved, well-documented problem, and if a solid Elixir client had existed I'd have used it. The thing I actually needed sat one layer up: a browser as a supervised resource, a process that owns the OS Chrome and is guaranteed to reap it when that process or its owner dies. That layer (the terminate/2 guarantee and the with_page wrapper below) was the real work, and I'd have had to build it on top of any of those options anyway. The transport is the easy 20% I'd have gotten for free. The lifecycle was the 80% that was the whole point.

So: part preference, part a real gap that the preference carried me through quickly. I'd make the same call again. It's called cdp_ex, it's on Hex, and it's CDP over Mint.WebSocket with no ChromeDriver and no Node anywhere in the picture.

The core idea is one guarantee

A cdp_ex browser is a GenServer that owns the Chrome OS process and its connections. Its terminate/2 always runs Chrome.stop/1. That's the no-orphan guarantee, and almost everything else is built on it: if the browser process dies, for any reason, the OS process dies with it.

In practice you rarely touch the GenServer directly. The common shape is a throwaway browser per unit of work, which is exactly one function:

CDPEx.with_page([], fn page ->
  {:ok, _} = CDPEx.Page.navigate(page, "https://example.com")
  CDPEx.Page.text(page, "h1")
end)

with_page/3 launches Chrome, hands you a page, runs your function, and tears the whole thing down afterwards, even if your function raises.

The teardown is the part worth slowing down on, because it is the no-orphan claim, and it depends on two things being true at once. First, a browser that crashes mid-call must not take your process down with it. So with_page traps exits for the duration and turns a browser crash into an ordinary {:error, _} you can match on, instead of a link exit that kills your caller. Second, if your process is the one that dies, the browser still has to be reaped. So with_page keeps the link to the browser rather than downgrading to a monitor, which means the browser's own terminate/2 fires and takes Chrome with it. Trap the exit so a crash can't propagate up, keep the link so a caller's death still cleans up. You get resource safety in both directions, and you get it without thinking about it.

My scraper's fetch path is essentially that, with a navigation and a wait wrapped around it:

defp run_in_page(url, opts) do
  CDPEx.with_page(
    CdpConfig.launch_opts(opts),
    fn page -> fetch_page(page, url, opts) end,
    prevent_alerts: true
  )
end

Chrome is launched, driven, and reaped per fetch. There is no pool to babysit, no session registry, no idea of a "leaked" browser, because a browser that isn't currently inside a with_page call doesn't exist.
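The two-directional teardown is a general OTP pattern, and it can be sketched without Chrome at all. What follows is not cdp_ex's implementation, just the same shape in plain Elixir: LinkedResourceSketch and with_resource/2 are hypothetical names, and any start_link-style function can play the browser. One caveat I'm sure of: a linked resource only gets its terminate/2 called on the owner's death if it traps exits itself; a process that doesn't trap simply dies with the signal.

```elixir
defmodule LinkedResourceSketch do
  # Runs fun against a linked resource process and always reaps it afterwards.
  # A resource crash becomes {:error, reason} instead of killing the caller.
  def with_resource(start_link_fun, fun) do
    # Trap exits so a linked crash arrives as a message, not as our own death.
    was_trapping = Process.flag(:trap_exit, true)

    try do
      {:ok, pid} = start_link_fun.()
      run_and_reap(pid, fun)
    after
      Process.flag(:trap_exit, was_trapping)
    end
  end

  defp run_and_reap(pid, fun) do
    {:ok, fun.(pid)}
  catch
    # A call into a dead resource exits; surface that as an ordinary error.
    :exit, reason -> {:error, reason}
  after
    # The link is still in place, so if the caller is killed mid-call the
    # resource receives the exit signal. On the normal path we stop it here.
    Process.exit(pid, :shutdown)

    receive do
      {:EXIT, ^pid, _reason} -> :ok
    after
      5_000 -> :ok
    end
  end
end
```

Using an Agent as the stand-in resource, LinkedResourceSketch.with_resource(fn -> Agent.start_link(fn -> 0 end) end, &Agent.get(&1, fn n -> n end)) returns {:ok, 0} and leaves no Agent behind. Note the sketch catches every exit raised inside fun, not only ones caused by the resource; a real wrapper might want to be pickier.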

What this deletes

Here's the part that made the whole detour worth it. Go back and look at last post's ChromeManager: the two timers, the (sessions × 5) + 8 threshold, the zero-session special case, the snapshot-before-mutate dance, the nuclear cleanup that killed every Chrome because it couldn't tell mine from leaked. All of it.

Gone. The entire module, plus the Oban sweep, plus the WallabyRestarter, plus the ChromeDriver supervisor. Deleted.

Not "improved." There is nothing left to count, because there's no shared pool of ambiguous processes. There's no shared --user-data-dir to lock, because every fetch gets its own temp profile. The check I used to SSH in and run looks different now:

$ pgrep -f chrome | wc -l
0

Zero between fetches, and a small handful during one. Not because a sweep ran on a timer, but because there's nothing to sweep. The class of bug from the last post didn't get tuned down. It stopped being possible.

How the cutover actually went

I didn't rip and replace, and "I rewrote the browser layer and everything was great" would not be a true sentence. The bumps are the useful part.

cdp_ex ran alongside Wallaby behind a per-host environment toggle, so I could send one ticketing site through the new engine while everything else stayed on the old one, with Wallaby as the fallback the moment anything looked wrong. I cut over one host at a time and watched. Three things were worth the scars.

Production is not my laptop. A single Chrome launch that took about a second locally took six or more on the cold Fly machine under load, and sometimes blew past cdp_ex's launch timeout entirely. The failure surfaced as :debug_url_not_found: Chrome was still coming up when I gave up waiting for its debug URL. The fix was unglamorous, a 45-second launch ceiling instead of the optimistic default, but the lesson is the usual one. Your timeouts are calibrated for the machine you wrote them on.

The reaper from the last post tried to murder its own replacement. This is my favorite bug of the whole project. The new CDP browsers registered no Wallaby session, so as far as ChromeManager was concerned, session_count was 0. And zombie_threshold(0) is 0. So the instant a sweep fired, the nuclear cleanup looked at the brand-new CDP Chrome doing real work, decided that any Chrome with zero sessions was by definition a zombie, and SIGTERM'd it mid-fetch. The WebSocket dropped, the fetch failed with a connection-closed error, and it took me longer than I'd like to admit to realize the call was coming from inside the house.

The stopgap was to teach the old reaper to leave the new browsers alone. cdp_ex launches with its own temp profile, so it's identifiable by its --user-data-dir, and I added it to the exclude list with a comment that is basically this post in miniature:

 @excluded_process_patterns [
   "claude", "electron", "cursor.app", "vscode", "code helper",
   "chrome_crashpad_handler",
+  # cdp_ex Chrome reaps itself (CDPEx.with_page on teardown), so the sweep must
+  # skip it: a cdp fetch registers no session, so session_count is 0, so
+  # zombie_threshold/1 is 0, and the nuclear cleanup would SIGTERM every Chrome.
+  "cdp_ex"
 ]

Once every host was cut over, the exclude line went away with the rest of the module.

Cloudflare was an anticlimax. The site I was most nervous about sits behind Cloudflare's bot checks, and I'd budgeted time for the usual cat-and-mouse: stealth plugins, fingerprint patching, the works. The first real CDP fetch rendered the page in full, no challenge. The durable lesson here isn't "CDP beats Cloudflare," because it doesn't, and Cloudflare's posture shifts month to month. It's that a lot of basic bot checks are really checks for whether you're a real browser running the page's JavaScript, and CDP drives exactly that: real Chrome, real page. For my sites that was enough on its own. For a hardened target it won't be, and cdp_ex is deliberately not a stealth toolkit. If you try this and hit a challenge wall, that's the expected outcome, not a regression.

Where it landed

The browser layer is roughly half the moving parts it used to be. The manager and everything around it went with the cutover, and the failure mode that started this whole thing, zombie Chrome eating a 2 GB box, can't recur, because nobody is accumulating browsers anymore.

cdp_ex is open source and on Hex. It does the things I needed: launch and a warm Pool, navigation with real readiness waits, JavaScript evaluation, screenshots and PDFs, network observation and request interception, HTTP and proxy auth (including auto-answering an authenticated proxy), and :telemetry so you can see what it's doing in production.

It is also young, and I'd rather tell you what it isn't. It's Chrome and Chromium only, because it speaks CDP. It is not a stealth or anti-detection framework, and I have no plans to make it one. If you need cross-browser support or a mature feature-test DSL, Wallaby and Playwright are still the right tools and I'm not trying to talk you out of them. But if you want to drive Chrome from Elixir as a supervised, self-reaping resource, with no ChromeDriver daemon and no Node runtime hanging off the side of your release, that's the entire reason it exists.

The lesson from the last post was "respect the process tree you live in." This was the logical end of that. Once the browser is just another process you own, all the counting and guessing and nuclear cleanup quietly stops being your problem. You delete it, and nothing misses it.

cdp_ex on GitHub · docs

Thirty-seven Chromes: taming zombie Chrome on the BEAM

Sat, 06 Jun 2026 16:07:06 +0700

I run livein.city, a concert-listings site for Bangkok. A chunk of its plumbing is a scraper that drives headless Chrome to render JavaScript-heavy ticketing pages, after which an LLM pulls out the event data. The whole thing lives on a single 2 GB Fly.io machine.

For a while, scraping would just… stop. New concerts dried up. The logs were a wall of invalid session id. And when I SSH'd in:

$ ps aux | grep -c chrome
37

Thirty-seven Chrome processes (give or take the grep itself) on a box that should have had a handful. They'd eaten the memory, the kernel had started OOM-killing things, and every new browser session died on arrival.

There were a few root causes: shared-memory limits, Chrome's memory appetite, a startup race. But the most embarrassing one was self-inflicted. The code I'd written to clean up zombie Chrome processes was killing my live sessions.

This is the story of getting that code right, and of everything I had to do before I even earned the right to have that problem. So it goes in that order: first the toll I paid just to keep Chrome alive at all, then the bug I caused trying to keep it tidy.

Why Wallaby, anyway?

Honest answer: I don't really remember deciding.

Wallaby is the browser-automation library in Elixir, the one built for feature tests, the one every forum thread points at. I wasn't writing feature tests. I needed to load a ticketing page, let its JavaScript run, and read the rendered DOM. But Wallaby already knew how to talk to ChromeDriver, and I had zero interest in hand-rolling that part. So I reached for the thing that existed, bent a testing tool into a scraper, and got back to the actual problem.

That mismatch, a tool designed for short-lived test sessions doing long-running production scraping, is the seam everything else in this post tore along. I don't think it was the wrong call at the time. It just came with a bill, and this is me paying it.

The one fork I couldn't avoid

The ticketing sites I scrape are not fast. Some pages take the better part of a minute to settle: slow backends, a lot of client-side rendering, the occasional anti-bot interstitial.

Wallaby talks to ChromeDriver over HTTP, and somewhere along the way it moved to Erlang's built-in :httpc. The default timeouts there are tighter than the HTTPoison ones it had before, and on a slow render the request to ChromeDriver would give up before the page had finished loading. From the outside it looked like a scrape failure. Underneath it was an HTTP client timing out on its own driver.

I couldn't configure my way out of it, so I forked Wallaby. The change is eleven lines in one file:

  defp httpc_http_options(url) do
-    [
-      autoredirect: false,
-      ssl: ssl_options(url)
-    ]
+    user_opts = Application.get_env(:wallaby, :httpc_options, [])
+
+    Keyword.merge(
+      [
+        autoredirect: false,
+        timeout: 240_000,
+        connect_timeout: 30_000,
+        ssl: ssl_options(url)
+      ],
+      user_opts
+    )
  end

240 seconds for the request, 30 for the connect, generous enough that a slow page is just slow rather than a failure. I made it overridable with config :wallaby, :httpc_options so I wasn't maintaining a hard fork for the sake of one knob, and so the change had a shot at being useful to someone else.

In mix.exs it's just a pinned git dep:

{:wallaby, github: "patrols/wallaby", ref: "ed04b6e"},

Forking a dependency always feels heavier than it is. In practice it's a ref in a file and a note-to-self to watch upstream.
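The override knob in the fork comes down to Keyword.merge/2 semantics: keys in the second list win, everything else keeps its default. A standalone sketch of just that behaviour follows. HttpcOptionsSketch is a made-up module, and the ssl option is left out because it depends on the URL.

```elixir
defmodule HttpcOptionsSketch do
  # Defaults mirror the values in the diff above, minus the URL-dependent ssl option.
  @defaults [autoredirect: false, timeout: 240_000, connect_timeout: 30_000]

  # User options come from app config unless passed explicitly.
  def http_options(user_opts \\ Application.get_env(:wallaby, :httpc_options, [])) do
    # Keys present in user_opts replace the defaults; the rest pass through.
    Keyword.merge(@defaults, user_opts)
  end
end

opts = HttpcOptionsSketch.http_options(timeout: 60_000)
Keyword.fetch!(opts, :timeout)          # => 60000
Keyword.fetch!(opts, :connect_timeout)  # => 30000
```

So a config line that sets only timeout shortens the request budget and leaves the 30-second connect timeout alone.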

Getting Chrome to survive on a 2 GB box

Before any of the zombie drama, I had to get Chrome to stay alive in a container at all. Several separate fights, all of which ended up as flags or config. You can skim this part if you came for the concurrency bug. None of it is clever. It's just the toll.

Shared memory and OOM. Chrome uses /dev/shm for a lot of its scratch space, containers hand it a tiny 64 MB default, Chrome blows past it, and the renderer crashes (which surfaces to you as, yes, invalid session id). On a heavy page the renderer will also happily try to eat the whole machine until the kernel ends the argument. The fix for both was a pile of flags: send shared memory to ordinary temp storage, cap V8's heap, limit renderers, shrink the caches to almost nothing.

--disable-dev-shm-usage
--renderer-process-limit=1
--js-flags=--max-old-space-size=512
--disk-cache-size=1
--media-cache-size=1

I deliberately skipped --single-process and disabling site isolation. Tempting on a small box, but they trade away Chrome's security boundaries, and I'd rather buy the memory back some other way.

A second batch of flags exists purely to stop Chrome spawning processes I don't need. This is also what later keeps the "how many Chromes is too many" math sane:

--no-zygote
--disable-breakpad
--disable-component-update
--disable-background-networking
--disable-extensions

A writable HOME. This one cost me an afternoon. Chrome wants a writable home directory for ~/.pki/nssdb, and the nobody user it runs as has HOME=/nonexistent, so it falls over before doing anything interesting. One line, in two places:

# Dockerfile
ENV HOME=/tmp

# fly.toml
[env]
HOME = '/tmp'

The 4:00:00 PM race. Some scrapes run on Oban cron at exact times, and the ones scheduled on the dot, like 16:00:00, would fail with invalid session id. The cause: my "is Wallaby up?" check only verified that Wallaby.Driver.Supervisor existed, not that ChromeDriver was actually initialized and ready. Start a session in that window and you lose before you've done anything wrong. I added an explicit readiness wait, and because Wallaby-as-a-long-running-service isn't really what Wallaby is for, I also wrapped it in a small supervisor that health-checks it every couple of minutes and restarts it if it wedges, capped at five restarts an hour so a broken state can't turn into a loop.

# Before: a supervisor existing is not the same as it being ready
_pid -> :ok

# After: actually wait for ChromeDriver to come up
_pid ->
  case BrowserPool.wait_until_ready() do
    :ok -> :ok
    error -> error
  end
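The readiness wait itself is nothing more than polling against a deadline. This is a generic sketch, not the real BrowserPool.wait_until_ready/0: ReadinessSketch, the ready? callback, and the default timings are all placeholders, and the callback would be whatever cheap check proves ChromeDriver is answering.

```elixir
defmodule ReadinessSketch do
  # Polls ready? until it returns true or the deadline passes.
  # Returns :ok or {:error, :timeout}.
  def wait_until_ready(ready?, timeout_ms \\ 30_000, interval_ms \\ 250) do
    # Monotonic time so a wall-clock adjustment can't stretch or shrink the wait.
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(ready?, deadline, interval_ms)
  end

  defp poll(ready?, deadline, interval_ms) do
    cond do
      ready?.() ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(interval_ms)
        poll(ready?, deadline, interval_ms)
    end
  end
end
```

The check runs at least once even with a zero timeout, so an already-ready driver never pays for a sleep.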

And then the change that mattered more than all the flags combined. I'd started on a single shared CPU to save a few dollars. Bumping it to two was one line:

# fly.toml
[[vm]]
memory = '2gb'
cpus = 2

Going from one CPU to two took scroll batches from about 72 seconds down to about 5.

I'd been blaming the sites.

One "Chrome" is not one process

Now the actual subject.

The first thing that breaks your intuition: a single headless Chrome isn't a process, it's a swarm. Main process, renderer, GPU process, a couple of utility processes, crashpad handler, zygote, network service. One browser legitimately spawns 8–12 OS processes. Fewer once you start cutting them with the flags above, but never one.

So the obvious zombie check, "if there are more than N Chrome processes, something leaked," has no fixed N. The healthy number scales with how many sessions are actually live. Set the threshold too high and zombies pile up until you OOM. Set it too low and you execute your own working browser mid-scrape.

And before you can threshold the count, you have to take it, which is its own small adventure. pgrep -f matches against the whole command line, so anything with "chrome" anywhere in its path or arguments turns up as a candidate. The manager therefore runs pgrep for candidates, then ps to read each full command line, then filters down to actual browser processes. (That filter is load-bearing. Hold that thought.)

# Simplified; the real one has the error handling you'd expect.
# count_chrome_processes/0 is just length(chrome_browser_pids()).
defp chrome_browser_pids do
  {output, 0} = System.cmd("pgrep", ["-f", "chrome"])
  pids = String.split(output, "\n", trim: true)

  # pgrep only gives PIDs, so re-query ps for each full command line
  {ps_output, 0} =
    System.cmd("ps", ["-p", Enum.join(pids, ","), "-o", "pid=,command="])

  ps_output
  |> String.split("\n", trim: true)
  |> Enum.reject(&should_exclude_process?/1)        # drop the impostors
  |> Enum.map(fn line ->                            # the PID is the first column
    line |> String.trim() |> String.split(~r/\s+/, parts: 2) |> List.first()
  end)
end
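The filter can be as plain as a substring check against a deny list. This sketch is a guess at the shape of should_exclude_process?/1, not the real function: it assumes a case-insensitive match of each ps line against the patterns, using the pattern list that appears elsewhere on this blog.

```elixir
defmodule ProcessFilterSketch do
  # Things that mention "chrome" on their command line but are not my browser.
  @excluded_process_patterns [
    "claude",
    "electron",
    "cursor.app",
    "vscode",
    "code helper",
    "chrome_crashpad_handler"
  ]

  # Takes one "pid command..." line from ps and says whether to drop it.
  # Assumption: a case-insensitive substring match is good enough.
  def should_exclude_process?(ps_line) do
    line = String.downcase(ps_line)
    Enum.any?(@excluded_process_patterns, &String.contains?(line, &1))
  end
end

ProcessFilterSketch.should_exclude_process?("303 /opt/google/chrome/chrome_crashpad_handler --monitor-self")
# => true
ProcessFilterSketch.should_exclude_process?("202 /usr/bin/google-chrome --headless --remote-debugging-port=0")
# => false
```

A deny list like this fails open: anything new that merely mentions chrome gets counted as a browser until someone adds a pattern for it.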

Attempt #1: clean up after each session (this made it worse)

My first instinct was the tidy one: when a scraping session ends, check for leftover Chrome and reap it. Clean up your own mess.

It was a disaster, and the reason is pure concurrency. The scraping queue runs jobs back-to-back, so the sequence was:

1. Session A finishes. A cleanup task spawns to check that A's processes are gone.
2. Session B starts immediately, and its Chrome processes begin spawning.
3. A's cleanup task looks at the process table and sees a pile of Chrome: A's still dying, plus B's just being born.
4. "That's way too many." It kills all of them, including B, which was perfectly healthy.
5. Session B: invalid session id.

I eventually wrote the epitaph straight into the module doc, so I'd never be tempted again:

Post-session verification was removed because it caused race conditions when scraping jobs ran back-to-back. The verification task would see overlapping Chrome processes (old ones still dying + new ones starting) and kill ALL processes, including the active session.

The lesson that stuck: event-triggered cleanup races with whatever happens next. What I actually wanted was a periodic loop that reconciles reality against known state on its own clock, closer to how a Kubernetes controller thinks than to a finally {} block.

Attempt #2: a threshold that scales with sessions

So cleanup moved into a ChromeManager GenServer that tracks how many sessions are registered and runs two timers: a stale-session sweep every minute and a health check every ten. The one-minute sweep does the real work (it also ages out stale sessions); the ten-minute check is just a coarser backstop. That does mean a leak could sit for up to ten minutes before the slow timer notices, but the fast sweep almost always gets there first, and for a background scraper I could live with the worst case. Both timers ask the same question: given how many sessions are live, are there too many Chrome processes? The answer is a formula.

The GenServer arms both timers at startup and re-arms each as it fires. A plain loop that never blocks:

def init(_opts) do
  schedule_cleanup()       # stale-session sweep, every 1 min
  schedule_health_check()  # zombie check, every 10 min
  {:ok, %__MODULE__{}}
end

def handle_info(:health_check, state) do
  perform_health_check(state)
  schedule_health_check()   # re-arm
  {:noreply, state}
end

defp perform_health_check(state) do
  chrome_count  = count_chrome_processes()
  session_count = map_size(state.sessions)

  if chrome_count > zombie_threshold(session_count) do
    spawn_cleanup_task(state.sessions)
  end
end

defp schedule_health_check do
  Process.send_after(self(), :health_check, to_timeout(minute: 10))
end
# schedule_cleanup/0 is the 1-minute twin (it also ages out stale sessions).

That zombie_threshold/1 is the whole ballgame, and I got it wrong before I got it right. My first version was (sessions × 3) + 3. Still too aggressive: a single healthy session can transiently sit at 10–12 processes, which sails past a budget of 6 and triggers a nuke. Back to invalid session id, just less often.

The version that finally held:

# Chrome legitimately spawns 8-12 processes per instance
# (main, renderer, GPU, utility×2, crashpad, zygote, network service, …).
# Previous 3x was too aggressive and killed active sessions.
@health_check_zombie_multiplier 5
@health_check_min_processes 8

def zombie_threshold(session_count) when session_count > 0 do
  session_count * @health_check_zombie_multiplier + @health_check_min_processes
end

Which works out to:

Active sessions	Cleanup fires above
1	13 processes
2	18 processes
3	23 processes

I should be straight about where those numbers come from: they aren't derived from the 8–12 figure. With the reduction flags from earlier, a browser actually sits closer to five processes in steady state, but it spikes higher during launch and teardown. So I tuned the 8 and the 5× to the worst transient I actually watched happen, not to a clean model of the swarm. With at least one session live, that gives a browser its full footprint plus generous headroom. It detects zombies a little more slowly, but it never guillotines live work. For a background scraper, "slightly slow" beats "destroys the thing it's protecting."

And it held. The invalid session id storms stopped, and I never had to touch the numbers again. (More on exactly how long "held" turned out to be at the end.)

The zero-session case is the whole game

There's one clause that matters more than the rest:

def zombie_threshold(0), do: 0

When no sessions are registered, the tolerance is zero. Any lingering Chrome is, by definition, a zombie. Let the general formula handle zero instead and the math quietly betrays you: the threshold would still be 8, so up to eight orphaned processes can sit there forever, never crossing the line. Worse, they hold the lock on the shared --user-data-dir, so the next session can't even start. The accumulation is invisible right up until everything is wedged.
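A quick way to sanity-check the arithmetic is to mirror both clauses in a shell function. This is a sketch, not the app's code; old_threshold is a hypothetical name for the earlier (sessions × 3) + 3 attempt.

```shell
# Mirror of zombie_threshold/1: zero tolerance with no sessions,
# otherwise sessions * 5 + 8.
zombie_threshold() {
  if [ "$1" -eq 0 ]; then
    echo 0
  else
    echo $(( $1 * 5 + 8 ))
  fi
}

# The earlier, too-aggressive formula, kept for comparison (hypothetical name).
old_threshold() {
  echo $(( $1 * 3 + 3 ))
}

for sessions in 0 1 2 3; do
  echo "sessions=$sessions old=$(old_threshold "$sessions") new=$(zombie_threshold "$sessions")"
done
```

A single healthy session spiking to 12 processes trips the old budget of 6 but stays under the new 13, which is the whole difference between the two attempts.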

The nuclear option, owned honestly

Here's the ugly part I had to make peace with: Wallaby and ChromeDriver don't expose a session-to-PID mapping. I know I have two live sessions. I cannot tell you which of the 37 Chrome PIDs belong to them.

So when the threshold trips, cleanup kills every Chrome browser process, not just the leaked ones:

# DESIGN LIMITATION: We cannot identify which specific Chrome PIDs belong to
# which sessions (Wallaby/ChromeDriver don't expose session-to-PID mappings).
# Therefore this uses a "nuclear option" approach — if the threshold is
# exceeded, ALL Chrome processes are killed, including potentially active ones.
# zombie_threshold/1 is the trade-off between false-positive kills and leaving
# zombies behind. Future enhancement: track PIDs per session.

The generous threshold is what buys the right to be this blunt. By the time you're over budget, the odds you're looking at zombies rather than legitimate load are high. I'd rather have one honest, well-commented blunt instrument than a clever PID-tracking scheme that's subtly wrong. (Per-session PID tracking is the obvious better answer, but it's a real project, not a patch. More on that at the end.)

The BEAM-specific bits

This is where running Chrome from Elixir stops being incidental and starts mattering.

Don't kill ChromeDriver. Cleanup reaps Chrome browser processes and deliberately leaves ChromeDriver alone. ChromeDriver is owned by a Wallaby Port, which is owned by a GenServer in my supervision tree. SIGTERM it and that GenServer crashes with {:exit_status, 143}, which trips the restarter into a full Application.stop/start cycle. I'd be nuking my own app to tidy up a browser. The processes that actually block new sessions are the Chrome ones squatting on the --user-data-dir lock, so those are all I touch.
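The 143 is not arbitrary: a process killed by a signal is conventionally reported with status 128 plus the signal number, and SIGTERM is 15. A throwaway shell check, unrelated to the app's code, shows the same number:

```shell
# A child shell sends itself SIGTERM; the parent sees 128 + 15.
status=0
sh -c 'kill -TERM $$' || status=$?
echo "exit status: $status"
```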

Don't block the GenServer. Killing means SIGTERM, a 3-second grace wait, then SIGKILL on the survivors:

Enum.each(pids, &System.cmd("kill", ["-15", &1]))   # SIGTERM all
Process.sleep(3000)                                  # one grace period

survivors =
  Enum.filter(pids, fn pid ->
    match?({_, 0}, System.cmd("ps", ["-p", pid]))    # exit 0 = still running
  end)

Enum.each(survivors, &System.cmd("kill", ["-9", &1]))

That sleep(3000) is poison inside a GenServer. Every session_count call would queue up behind it and time out. So the ChromeManager only ever decides. The actual killing happens in a supervised Task:

defp spawn_cleanup_task(tracked_sessions) do
  task_fn = fn ->
    before = count_chrome_processes()
    killed = kill_orphaned_processes(tracked_sessions)
    if killed > 0 do
      Logger.info("cleanup: #{before} -> #{count_chrome_processes()} Chrome procs")
    end
  end

  # On a supervised Task so the 3s+ kill path never blocks the GenServer
  # (and a crash in the kill path can't take the manager down with it).
  Task.Supervisor.start_child(Pulse.TaskSupervisor, task_fn)
end

The manager stays responsive; the slow, dirty syscalls happen off to the side. (kill_orphaned_processes/1, the threshold re-check plus the bulk kill above, is elided here for length, along with its force-cleanup twin.)
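For comparison, here is roughly what that kill path amounts to as a standalone shell function. It is a sketch of the same SIGTERM, grace period, SIGKILL sequence, with term_then_kill as a made-up name and the grace period passed in as the first argument:

```shell
# term_then_kill GRACE_SECONDS PID...
# Ask politely, wait once for everybody, then force-kill whatever is left.
term_then_kill() {
  ttk_grace=$1
  shift

  for ttk_pid in "$@"; do
    kill -TERM "$ttk_pid" 2>/dev/null || true   # already gone is fine
  done

  sleep "$ttk_grace"                            # one shared grace period

  for ttk_pid in "$@"; do
    # kill -0 sends no signal; it only reports whether the PID still exists.
    if kill -0 "$ttk_pid" 2>/dev/null; then
      kill -KILL "$ttk_pid" 2>/dev/null || true
    fi
  done
}

# Demo on a throwaway process, using a 1-second grace period.
sleep 30 &
demo_pid=$!
term_then_kill 1 "$demo_pid"
wait "$demo_pid" 2>/dev/null || true            # reap it so the PID is really gone
if kill -0 "$demo_pid" 2>/dev/null; then
  echo "still running"
else
  echo "gone"
fi
```

The blocking sleep is just as present here as in the Elixir version, which is why it belongs in a Task rather than in the process that answers calls.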

Snapshot before you mutate. The manual force_cleanup_all grabs the PIDs to kill before clearing its session map. Otherwise a session that registers during the async cleanup would have its brand-new processes swept by a task that started life believing the world was empty. Attempt #1's race, in miniature:

def handle_call(:force_cleanup_all, _from, state) do
  # Snapshot PIDs *before* clearing sessions. If we cleared first and fetched
  # PIDs inside the async task, any session that registers mid-cleanup would
  # get its fresh Chrome processes killed.
  chrome_pids = chrome_browser_pids()
  spawn_force_cleanup_task(chrome_pids)
  {:reply, :ok, %{state | sessions: %{}}}
end

Read the world, then forget it. In that order.

The footgun that cost me a few editor windows

One last thing, because it bit me on my own laptop. The naive way to kill Chrome is pkill -f chrome. The -f flag matches against the full command line, and on a dev machine a startling number of things have "chrome" or "electron" in theirs. VS Code. Cursor. Claude. "Code Helper." I closed my own editor more than once before adding an exclude list:

@excluded_process_patterns [
  "claude", "electron", "cursor.app",
  "vscode", "code helper", "chrome_crashpad_handler"
]

That's the filter chrome_browser_pids reached for earlier. It's also where ChromeDriver gets spared. Both the editor apps and the driver I must-not-touch are excluded by one predicate:

def should_exclude_process?(command_line) do
  downcased = String.downcase(command_line)

  String.contains?(downcased, "chromedriver") ||
    Enum.any?(@excluded_process_patterns, &String.contains?(downcased, &1))
end

It's a denylist, with the fragility that implies. If you're going to pkill by pattern, enumerate what you must never match first, and accept that you'll forget one. (There's also a separate periodic Oban sweep as a cruder backstop, on the theory that two dumb safety nets beat one clever one.)
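If the sweep were a shell script instead of Elixir, the same denylist could sit in front of the kill. A sketch, with should_exclude and chrome_browser_pids as illustrative shell stand-ins for the Elixir functions, the same patterns as above, and made-up sample command lines:

```shell
# Succeeds (exit 0) when a command line must never be killed.
should_exclude() {
  se_lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$se_lowered" in
    *chromedriver*) return 0 ;;
    *claude*|*electron*|*cursor.app*|*vscode*|*"code helper"*|*chrome_crashpad_handler*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the PIDs whose command line mentions chrome and is not excluded.
# Listing only: nothing is killed here.
chrome_browser_pids() {
  ps -e -o pid= -o args= | while read -r cbp_pid cbp_args; do
    case "$(printf '%s' "$cbp_args" | tr '[:upper:]' '[:lower:]')" in
      *chrome*)
        should_exclude "$cbp_args" || printf '%s\n' "$cbp_pid"
        ;;
    esac
  done
}

# Try the predicate on a few sample command lines.
for cmd in \
  "/opt/google/chrome/chrome --headless" \
  "/usr/local/bin/chromedriver --port=9515" \
  "/Applications/Cursor.app/Contents/MacOS/Cursor"
do
  if should_exclude "$cmd"; then
    echo "spare:  $cmd"
  else
    echo "target: $cmd"
  fi
done
```

Running chrome_browser_pids on its own is a safe dry run: it shows what a sweep would hit before anything is wired to kill.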

What I'd take to the next system

When you can't attribute a resource precisely, budget generously, and police the empty case ruthlessly. (sessions × 5) + 8 for the normal case, hard zero when nothing should be running.
Prefer periodic reconciliation over event-triggered cleanup. Reacting to "a session ended" races with "a session started." A loop that reconciles against known state doesn't.
Keep slow, blocking, failure-prone syscalls out of your GenServers. Decide in the process, act in a Task.
Respect the process tree you live in. On the BEAM, the thing you're tempted to kill might be held by a Port that's held by a supervisor that's holding up your whole app.

The constants are calibrated for my workload, one session at a time on a small box, so measure your own. But the shape of the answer (budget for the swarm, zero-tolerance the void, reconcile on a timer, never block the process keeping score) held up.

None of this was wasted, even though I later deleted every line of it. The formula shipped in February and ran in production for about four months without me touching it again. It was a patch, not a rewrite, and that's the point: it kept the scraper alive long enough that I could fix the actual problem calmly instead of at 3 AM.

Because the real problem was never the threshold. It was that ChromeDriver wouldn't tell me which PIDs were mine, so I was stuck counting processes and guessing. The actual fix was to stop guessing: drive Chrome over the DevTools Protocol directly, where each browser is something I launch, own, and reap myself, with no shared profile lock and nothing to sweep. That meant leaving Wallaby behind entirely, fork and all.

That's the next post.

Running Paperclip on a Hetzner VPS: A Proper Setup Guide

Mon, 30 Mar 2026 19:35:38 +0700

This post covers how I set up Paperclip, an open-source AI agent orchestration platform, on a dedicated Hetzner VPS. It picks up where my previous Hetzner guide left off, so if you haven't read that one, start there. By the end, you'll have Paperclip running persistently behind Caddy with HTTPS, locked down with basic auth, and accessible from anywhere via a subdomain.

Why a Dedicated VPS

You can run Paperclip locally, and for experimenting that's fine. But agents running autonomously on a schedule — waking up every 30 seconds, checking their inbox, executing tasks — don't belong on your laptop. A few reasons to give Paperclip its own server:

Isolation. Agents can spike CPU and memory unpredictably. You don't want that affecting other services you're running.

Persistence. Paperclip tracks agent sessions, conversation history, and task state. None of that survives a reboot or a closed terminal without a proper process manager.

Always-on. The whole point of autonomous agents is that they work while you're not watching. That requires a server.

I put it on a separate VPS rather than sharing with my existing setup. A Hetzner CX22 (2 vCPU, 4 GB RAM, Singapore region for latency) costs around €6/month.

Provisioning the Server

The full provisioning and hardening steps — SSH lockdown, fail2ban, Tailscale, UFW — are covered in my Hetzner VPS setup guide. Follow that first and come back here once you have a hardened server with a deploy user and Tailscale running.

A few things specific to this setup worth noting:

Use a dedicated VPS, not a shared one. Paperclip agents can be resource-hungry. Keep it isolated from other services.

Region: Pick whatever's geographically closest to you. I chose Singapore for lower latency from Bangkok.

Size: Go with Regular Performance CX22 (2 vCPU, 4 GB RAM). Claude Code is memory-hungry — the 2 GB CX11 would be tight. Skip Cost-Optimized (older hardware) and Dedicated Resources (overkill for this use case).

Create a deploy user before proceeding. Don't run Paperclip as root. The previous guide covers this — make sure you have a deploy user with sudo access and your SSH key copied across before continuing.

Installing Node.js and Claude Code

Switch to the deploy user and create a workspace:

su - deploy
mkdir ~/paperclip
cd ~/paperclip

Paperclip runs on Node.js. Install it from the NodeSource repository:

curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs

Install Claude Code globally:

sudo npm install -g @anthropic-ai/claude-code

Authenticate Claude Code. This opens a browser OAuth flow. Complete it on your local machine:

claude

Follow the prompts to trust the directory and complete login.

Configure MCP Servers (Optional)

If you want your agents to have access to tools like Linear or other services, add them now as the deploy user. The --scope user flag makes the config available to all Claude Code sessions running as this user, including Paperclip's agents:

claude mcp add linear --transport sse https://mcp.linear.app/sse \
  --header "Authorization: Bearer YOUR_LINEAR_API_KEY" \
  --scope user

Installing Paperclip

Run the interactive setup:

npx paperclipai onboard --yes

This creates an embedded PostgreSQL database, generates secrets, and writes a config to ~/.paperclip/. When it's done, you'll see a summary confirming everything passed.

Start the server once to verify it works:

npx paperclipai run

You should see the Paperclip ASCII banner and Server listening on 127.0.0.1:3100. Kill it with Ctrl+C once confirmed.

Paperclip starting up — the ASCII banner confirms the server is running

Setting Up Caddy

Install Caddy:

sudo apt install -y caddy

Add a DNS A record for your subdomain pointing to the server's public IP. I used paperclip.rendal.me. Set it to DNS-only (not proxied) in Cloudflare.

Generate a hashed password for basic auth:

caddy hash-password

Enter your password when prompted and copy the hash it outputs. Edit the Caddyfile:

sudo nano /etc/caddy/Caddyfile

Replace the entire contents with:

paperclip.your-domain.com {
    basicauth {
        yourname HASHED_PASSWORD_HERE
    }
    reverse_proxy 127.0.0.1:3100
}

Restart Caddy:

sudo systemctl restart caddy

Caddy will automatically provision a TLS certificate.
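A quick way to confirm both the certificate and the auth gate is to compare status codes with and without credentials. The sketch below assumes curl is installed; http_status is a made-up helper name, and the domain and credentials are placeholders:

```shell
# Print only the HTTP status code for a URL ("000" means no HTTP response at all,
# for example DNS not propagated yet or the certificate not issued).
# Optional second argument: user:password for basic auth.
http_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 -u "$2" "$1" || true
  else
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
  fi
}

# Without credentials Caddy should answer 401:
#   http_status https://paperclip.your-domain.com
# With the right credentials it should be anything but 401:
#   http_status https://paperclip.your-domain.com 'yourname:your-password'
#
# If Caddy refuses to start at all, checking that the file parses is a good first step:
#   caddy validate --config /etc/caddy/Caddyfile --adapter caddyfile
```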

Running Paperclip as a Service

Right now Paperclip dies when you close the SSH session. Fix that with systemd:

sudo nano /etc/systemd/system/paperclip.service

[Unit]
Description=Paperclip AI Orchestration
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy/paperclip
ExecStart=/usr/bin/npx paperclipai run
Restart=always
RestartSec=10
Environment=HOME=/home/deploy

[Install]
WantedBy=multi-user.target

Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable paperclip
sudo systemctl start paperclip

Check it's running:

sudo systemctl status paperclip

Paperclip now starts automatically on boot and restarts if it crashes.

Visit https://paperclip.your-domain.com. You should see a login prompt and, after authenticating, the Paperclip onboarding screen.

The Paperclip dashboard after a successful setup

What You Now Have

After all of this:

Paperclip is running persistently as a systemd service under the deploy user
HTTPS is handled by Caddy with automatic certificate renewal
Basic auth protects the UI from the public internet
SSH is invisible to the internet — accessible only through Tailscale
Claude Code is authenticated and optionally connected to MCP servers
Agents can run autonomously 24/7 without your laptop being open

The next step is configuring your LLM provider in the Paperclip UI, creating your first company, and giving your agents something real to do. That's a separate post.

Setting Up a Hetzner VPS: Provisioning, Firewalls, and Locking It Down

Mon, 23 Mar 2026 18:53:58 +0700

This post covers how I set up and hardened the VPS that runs rendal.me. By the end, you'll have a provisioned Hetzner server with SSH locked down through Tailscale, fail2ban running as a safety net, and a clear mental model of why each layer exists.

Why Hetzner

I wanted a simple VPS for a personal Rails site — nothing managed, nothing abstracted away, just a server I control. Hetzner is the obvious choice if you're in Europe or fine with European data centers: the price-to-hardware ratio is genuinely hard to beat. A CX22 (2 vCPUs, 4 GB RAM, 40 GB NVMe) costs around €4/month. The equivalent on DigitalOcean or AWS is two to three times that.

The Cloud Console is clean and fast, the network is reliable, and they have a straightforward firewall product at the network level. No complaints after running it for months.

Provisioning the Server

Log into the Hetzner Cloud Console and create a new project. Projects are just organizational containers — you can put all your servers, firewalls, and SSH keys for a given thing in one place.

Inside the project, click Add Server. The choices that matter:

Location: Pick whatever's geographically closest to your users. For a personal site with no strong preference, Helsinki or Falkenstein are both fine.

Image: Ubuntu 24.04. Stable, well-documented, apt just works, and most tutorials you'll find for server tooling assume Debian/Ubuntu.

Type: For a personal Rails app, the CX22 is plenty. You can always resize later if you need to.

SSH keys: This is important to get right upfront. Add your public key here before the server is created. Hetzner will inject it into the authorized_keys for the root user, so your first login is already key-authenticated. If you don't have an Ed25519 key pair yet:

ssh-keygen -t ed25519 -C "your@email.com"

Paste the contents of ~/.ssh/id_ed25519.pub into the Hetzner SSH key field.

Networking: Leave the public IPv4 enabled for now. You'll tighten firewall rules shortly.

Create the server. Hetzner provisions it in about 30 seconds.

The Hetzner Cloud Firewall

Before you SSH in and touch anything, set up a firewall at the network level. This is separate from any firewall running on the server itself — it operates in Hetzner's infrastructure and blocks traffic before it even reaches your instance.

Go to Firewalls in the left sidebar and create a new one. The default inbound rules allow everything. Replace them with:

Direction	Protocol	Port	Source
Inbound	TCP	22	Any
Inbound	TCP	80	Any
Inbound	TCP	443	Any

Port 22 stays open for now — you need SSH access to configure everything else. You'll close it later once Tailscale is running. Outbound rules can stay as "allow all."

Apply the firewall to your server under the Resources tab.

Hetzner firewall with port 22

The principle here is default-deny at the network perimeter. Nothing reaches your server except the specific ports you've explicitly allowed. Every other port is just gone from the internet's perspective — not rejected, not filtered, simply unreachable.

First Login and Baseline Hardening

SSH in as root using the server's public IP:

ssh root@YOUR_SERVER_IP

First, update everything:

apt update && apt upgrade -y

Lock Down SSH

Edit /etc/ssh/sshd_config:

PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

Then restart:

systemctl restart ssh

Key-only authentication means brute-force password attacks are pointless. Bots don't know that though — they'll keep hammering port 22 regardless. You'll see this if you check the auth log after even a few hours on a public IP:

journalctl -u ssh --since "24 hours ago" | grep -cE "Failed password|Invalid user"

The number is always higher than you expect. Which brings us to the next layer.
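To see who is doing the knocking rather than just how often, the same log can be grouped by source address. A small sketch: top_ssh_offenders is a made-up name, and it assumes the usual sshd wording ("Failed password ... from IP" or "Invalid user ... from IP") with IPv4 sources:

```shell
# Read sshd log lines on stdin, print "count ip" for the noisiest IPv4 sources.
# Optional argument: how many rows to show (default 10).
top_ssh_offenders() {
  grep -E 'Failed password|Invalid user' \
    | grep -oE 'from [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
    | awk '{ print $2 }' \
    | sort | uniq -c | sort -rn \
    | head -n "${1:-10}"
}

# On the server:
#   journalctl -u ssh --since "24 hours ago" | top_ssh_offenders 5
```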

Create a Deploy User

Running everything as root works, but it's unnecessarily risky. Create a dedicated user for day-to-day operations and deployments:

useradd -m -s /bin/bash deploy
usermod -aG sudo deploy
mkdir -p /home/deploy/.ssh
cp ~/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys
echo "deploy ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/deploy

Open a new terminal and verify SSH works as the deploy user before continuing:

ssh deploy@YOUR_SERVER_IP

Once the deploy user is confirmed working, you can tighten further by setting PermitRootLogin no in sshd_config to disable root SSH entirely. For most setups, prohibit-password is sufficient.
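If that login is refused even though the key is correct, permissions are a common culprit: with sshd's default StrictModes setting, keys under a directory or file that other users can write to are ignored. A small sketch for checking them, where check_mode is a made-up helper that compares the mode string printed by ls:

```shell
# check_mode PATH EXPECTED   (EXPECTED looks like "drwx------" or "-rw-------")
check_mode() {
  cm_actual=$(ls -ld "$1" | cut -c1-10)
  if [ "$cm_actual" = "$2" ]; then
    echo "ok   $1 ($cm_actual)"
  else
    echo "FIX  $1 is $cm_actual, expected $2"
    return 1
  fi
}

# On the server, as root:
#   check_mode /home/deploy/.ssh drwx------
#   check_mode /home/deploy/.ssh/authorized_keys -rw-------
```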

fail2ban

fail2ban watches your log files for repeated failed authentication attempts and temporarily bans offending IPs using firewall rules. It's not your primary defense — key-only SSH is — but it's a useful safety net for the cases where you temporarily slip (debugging with password auth enabled, a misconfigured sshd_config, etc.).

Install it:

apt install fail2ban -y

Create a local config (the default jail.conf gets overwritten on updates — always work in jail.local):

cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local

The key settings in /etc/fail2ban/jail.local:

[DEFAULT]
bantime = 1h
findtime = 10m
maxretry = 3

[sshd]
enabled = true
port = ssh
logpath = %(sshd_log)s
backend = %(sshd_backend)s
maxretry = 3

Three failed attempts within 10 minutes get an IP banned for an hour. You can be more aggressive: some people set bantime to 24 hours or use bantime = -1 for permanent bans. I find an hour is enough to make automated attacks non-viable without the operational overhead of managing permanent bans.

Start and enable it:

systemctl enable fail2ban
systemctl start fail2ban

Check what it's doing:

fail2ban-client status sshd

Within a few hours, you'll see IPs showing up in the banned list. It's grimly satisfying.

Tailscale

This is where the setup gets genuinely interesting. Tailscale creates a private WireGuard-based mesh network between your devices. Your server gets a stable private IP in the 100.x.y.z range that's only reachable from your other Tailscale-connected machines.

The goal: stop SSH-ing to the server's public IP entirely. Once Tailscale is running, you SSH to the private Tailscale IP, then close port 22 on the public firewall completely. The server stays reachable for web traffic on 80 and 443, but SSH is invisible to the internet. Bots can't brute-force a port that doesn't exist as far as they're concerned.

Install Tailscale on the Server

curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

This gives you a URL to authenticate the server with your Tailscale account. Once authenticated, it joins your tailnet and gets a private IP.

Check it:

tailscale ip -4

Note that IP — you'll use it for everything going forward.

Install Tailscale on Your Machine

Install the Tailscale client on your laptop too. It's available for macOS, Linux, Windows, iOS, and Android. Once both devices are authenticated to the same tailnet, they can reach each other at their Tailscale IPs regardless of what network either is on.

Test SSH over Tailscale:

ssh root@100.x.y.z

If that works, you're ready to close the public port.

Close Port 22 on the Hetzner Firewall

Go back to the Hetzner Cloud Console and remove the inbound rule for port 22.

Hetzner firewall without port 22

That's it. SSH is now only accessible through the Tailscale network. The server's public IP still answers on 80 and 443, but port 22 doesn't exist from the internet's perspective.

SSH Config for Convenience

Add the server to your local SSH config so you don't have to remember the Tailscale IP:

# ~/.ssh/config
Host rendal
  HostName 100.x.y.z
  User deploy
  IdentityFile ~/.ssh/id_ed25519

Now ssh rendal just works, routing through Tailscale automatically.

Kamal and Tailscale

Kamal deploys over SSH, so it also needs to connect through Tailscale now that port 22 is closed publicly. Update config/deploy.yml to use the Tailscale IP:

servers:
  web:
    - 100.x.y.z

This means you can only deploy from a machine that's on your tailnet, which for a personal project is a non-issue — you're always deploying from your own laptop. If you needed CI/CD deployments, you'd run Tailscale on the CI runner too, or scope a separate firewall rule to the CI provider's IP range.

UFW: Belt and Suspenders

The Hetzner firewall operates at the network level, outside your server. I also set up UFW on the server itself as a second layer — defense in depth, not redundancy.

ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp
ufw allow 443/tcp
ufw allow in on tailscale0 to any port 22
ufw enable

The key line is ufw allow in on tailscale0 to any port 22. This allows SSH only on the Tailscale network interface. Even if someone bypassed the Hetzner network firewall somehow, the server itself would reject SSH connections arriving on the public interface.

Check the rules are applied correctly:

ufw status verbose

What You Now Have

After all of this:

SSH is invisible to the public internet. Port 22 is closed at both the network level (Hetzner firewall) and the host level (UFW). The only way in is through Tailscale.
fail2ban is a safety net for the rare case you need to temporarily expose SSH publicly.
Web traffic flows normally on ports 80 and 443.
Deployments run over the Tailscale network from your laptop.

The bot traffic didn't stop — it just can't reach anything anymore. In practice, the auth log went from hundreds of daily failed attempts to nothing.

Each of these layers took about 10 minutes to set up. None of them are exotic. The value is in understanding how they compose: network-level firewall for the perimeter, key-only SSH to make password attacks irrelevant, fail2ban as a behavioral safety net, Tailscale to eliminate the public attack surface entirely, UFW as a host-level backstop. Remove any one layer and the others cover it. That's the point.