<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Patrick Rendal's Blog</title>
    <description>Thoughts on software, engineering, and experiments</description>
    <link>https://rendal.me/blog/</link>
    <atom:link href="https://rendal.me/blog/feed" rel="self" type="application/rss+xml"/>
    <language>en-us</language>
    <lastBuildDate>Tue, 16 Jun 2026 13:29:42 +0700</lastBuildDate>
    <item>
      <title>Hot-deploying Phoenix on Fly.io without restarting the world</title>
      <description>
        <![CDATA[<div data-controller="syntax-highlight" class="lexxy-content">
  <p>A while back I watched Chris McCord's demo where he pushes a change to a running Phoenix app in production and it just... takes effect. No restart, no reconnect, the LiveView session sitting right there keeps its state. I filed it under "neat, someday" and moved on.</p><p>Then I actually set it up on the app I run, and it's quietly become how I ship small changes. So here's the whole thing: install, config, daily use, and the couple of footguns that cost me an afternoon so they don't cost you one.</p><p>Everything below is checked against <a href="https://github.com/chrismccord/fly_deploy"><code>fly_deploy</code></a> <strong>0.4.2</strong> (the version I'm running). One heads-up worth its own mention: at the time of writing the library's README lags its own code in a few places — it still documents an older setup function and an old version pin. I'll flag those as I go, but in general trust the moduledoc / <code>@deprecated</code> annotations over the README, and double-check against whatever version you install.</p><h2>What this actually buys you</h2><p>A normal deploy on Fly builds a new image and swaps your machines out. The BEAM restarts, in-flight requests get cut, and every connected LiveView reconnects from zero. For most apps that's fine. For a LiveView-heavy one it's a visible little hiccup every single time you ship.</p><p>A hot deploy skips all of that. It builds a release, ships the compiled <code>.beam</code> files to the VM that's already running, and the machine loads the modules that changed in place — briefly suspending only the processes that use them (typically under a second). The OS process never dies. Open LiveViews keep their state. From the outside nothing happened, except your fix is live.</p><p>So it's a great fit for the boring 95%: a template tweak, some LiveView logic, a context function, a bug fix in a module that already exists. 
It is not for structural stuff, but I'll get to that.</p><h2>Getting it running</h2><p>You'll need an app already on Fly (with <code>fly deploy</code> working) and a bucket to hold the compiled releases. On Fly that's Tigris, and it's one command.</p><p>Add the dep:</p><pre data-language="elixir"># mix.exs<br>{:fly_deploy, "~&gt; 0.4"}</pre><p>(The README example still shows <code>~&gt; 0.1.0</code> — that's one of the stale spots; <code>0.4.x</code> is current.)</p><p>Provision the bucket and set the secrets:</p><pre data-language="bash"># Creates a Tigris bucket and sets AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,<br># BUCKET_NAME, AWS_ENDPOINT_URL_S3 and AWS_REGION as app secrets.<br>fly storage create<br><br># The orchestrator needs a token to list machines and trigger the upgrade RPC.<br># This one is easy to miss and the hot deploy fails without it.<br>fly secrets set FLY_API_TOKEN="$(fly tokens create machine-exec)"</pre><p>Then turn it on in <code>config/runtime.exs</code>. I gate it behind the bucket being present so it stays completely dormant in dev and anywhere the storage isn't wired up:</p><pre data-language="elixir">if System.get_env("BUCKET_NAME") do<br>  config :fly_deploy,<br>    bucket: System.get_env("BUCKET_NAME"),<br>    # Make sure this matches YOUR bucket's endpoint. The library defaults to<br>    # https://t3.storage.dev; my Tigris bucket lives at fly.storage.tigris.dev,<br>    # and leaving the default gave me 403s on the orchestrator's deploy lock.<br>    # `fly storage create` sets AWS_ENDPOINT_URL_S3 — pin it here so the two<br>    # never drift.<br>    aws_endpoint_url_s3: System.get_env("AWS_ENDPOINT_URL_S3")<br><br>  config :my_app, fly_deploy: true<br>end</pre><p>Last piece — start the poller under your supervision tree. 
Put it at the <strong>top</strong> of your children:</p><pre data-language="elixir"># lib/my_app/application.ex<br>defp fly_deploy_children do<br>  if Application.get_env(:my_app, :fly_deploy, false) do<br>    [{FlyDeploy, otp_app: :my_app}]<br>  else<br>    []<br>  end<br>end</pre><p>and prepend <code>fly_deploy_children()</code> to your children list. The "top" matters: the poller blocks in its <code>init/1</code> to apply any pending upgrade <em>before</em> the rest of the tree starts, so a machine that restarts (after a crash, scaling, or a deploy) loads the current code before any of your processes come up.</p><blockquote><p>The README tells you to call <code>FlyDeploy.startup_reapply_current/1</code> from <code>Application.start/2</code> instead. That function still exists but is <code><strong>@deprecated</strong></code> in 0.4.2 in favor of the <code>{FlyDeploy, otp_app: :my_app}</code> child spec above — a good example of trusting the code over the README.</p></blockquote><p>That's the install. Because it touches <code>mix.exs</code> and the supervision tree, you need one more regular <code>fly deploy</code> to get it onto your machines. After that, you're hot.</p><h2>Using it day to day</h2><p>From your laptop:</p><pre data-language="bash">mix fly_deploy.hot</pre><p>It builds a release, uploads the tarball, flips the "current" pointer in the bucket, and tells each running machine to load the new code. No restart. 
If you run more than one environment, point it at the right config:</p><pre data-language="bash">mix fly_deploy.hot --config fly.staging.toml<br>mix fly_deploy.hot --dry-run   # see what it'd do first</pre><p>And to see what's actually running where:</p><pre data-language="bash">mix fly_deploy.status</pre><p>I don't really trust a deploy until I've looked, so from a remote console I check:</p><pre data-language="elixir">FlyDeploy.get_current_hot_upgrade_info(:my_app)   # %{hot_upgrade_applied:, deployed_at:, ...}<br>FlyDeploy.current_vsn()                           # %{base_image_ref:, hot_ref:, version:, fingerprint:}</pre><p><code>deployed_at</code> should be a few seconds ago, and the image ref should be the one you expect. You can also subscribe to the lifecycle (<code>FlyDeploy.subscribe/0</code>, then handle <code>{:fly_deploy, :hot_upgrade_complete, meta}</code>) and forward that to Slack or a #deploys channel, which is nicer than remembering to check.</p><h2>When to reach for it, and when not</h2><p>The rule of thumb is simple: if your change can be expressed as "load a newer version of this module," hot-deploy it. If it can't, do a normal deploy.</p><p>Hot is great for templates and HEEx, LiveView and component code, context and business logic, and bug fixes in existing modules. Skip it and just <code>fly deploy</code> for anything that changes the shape of the system: new or removed deps, runtime config, the supervision tree (the library can't add/remove supervised children at runtime), background-job queues, database migrations, an Elixir or OTP bump, anything with NIFs/ports. And static assets, which I'll come back to.</p><p>When I'm not sure, I cold deploy. It's never the wrong answer, just the slower one.</p><h2>The stuff that actually bit me</h2><p>The setup is easy. 
These are the things I learned the annoying way.</p><p><strong>The big one: your normal deploys need unique image tags.</strong> This cost me most of an afternoon and a genuinely confused debugging session, so let me spell it out.</p><p>On boot, <code>fly_deploy</code> re-applies the "current" hot upgrade recorded in the bucket — that's the restart-resilience feature, and it's good. To decide whether a booting machine is the same generation (re-apply the upgrade) or a fresh cold deploy (forget it and run the new image), it compares the machine's image ref (<code>FLY_IMAGE_REF</code>) against the one stored in the bucket metadata. Match → apply the upgrade; mismatch → reset.</p><p>If all your cold deploys ship the same tag, the classic <code>:latest</code>, those refs always match. So a brand-new image looks exactly like a restart, and on boot the machine cheerfully re-applies an old hot tarball on top of your fresh deploy, silently reverting your code. The worst part is that your release version still reports the new build while the code actually running is old. I stared at "but I literally just deployed this" for far too long.</p><p>The fix is boring: tag every deploy image uniquely (the git SHA works fine) so each cold deploy has a distinct ref and <code>fly_deploy</code> can tell a new generation from a restart. If you only ever deploy with <code>fly deploy</code> from the CLI you're probably already getting a unique image. This mostly bites hand-rolled CI that pushes <code>:latest</code>.</p><p>Two smaller ones:</p><p><strong>Hot deploys ship </strong><code><strong>.beam</strong></code><strong> files, not your compiled assets.</strong> The tarball is built from your release's <code>.beam</code> files; your <code>priv/static</code> bundle (the fingerprinted CSS/JS, images) isn't in it, so changed asset <em>bytes</em> don't ride along. 
The library does ship a <code>FlyDeploy.Components.hot_reload_css</code> component that nudges connected clients to re-fetch the stylesheet when the asset manifest changes — handy for the CSS case — but for an actual asset rebuild a cold deploy is the reliable path. Markup in your HEEx updates fine.</p><p><strong>Your error tracker's release version goes stale.</strong> A hot deploy doesn't rebuild the image, so anything stamped in at build time, like a Sentry release or an OTel tag, keeps pointing at the last cold deploy. Errors after a hot deploy get filed under the old version. Not a big deal once you know, but it'll confuse you in the dashboard otherwise.</p><h2>Worth it?</h2><p>For me, easily. Shipping a small fix went from a restart-and-watch ritual to something instant and invisible, and not blowing away every connected LiveView on each deploy is genuinely nice. The setup is a dependency, a bucket, a couple of secrets, and one child at the top of your tree. Keep the hot/cold split in your head, give your deploys unique tags, and it mostly disappears into the background, which is exactly what you want from infrastructure.</p>
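  <p>A postscript on the unique-tag footgun, since it's the one most worth internalizing. Here's a toy model of the boot-time decision as I described it above. To be clear, this is my own sketch and not <code>fly_deploy</code>'s code: the <code>BootDecisionSketch</code> module, the <code>on_boot/2</code> function, and the image refs are all made up for illustration, and the real check reads its stored ref from the bucket metadata rather than taking it as an argument.</p><pre data-language="elixir">defmodule BootDecisionSketch do<br>  # Hypothetical stand-in for the boot-time comparison, not the library's code.<br>  # machine_image_ref: what the booting machine reports (FLY_IMAGE_REF).<br>  # stored_image_ref: the ref recorded alongside the "current" hot upgrade.<br>  def on_boot(machine_image_ref, stored_image_ref) do<br>    if machine_image_ref == stored_image_ref do<br>      # Same generation: treat it as a restart and re-apply the hot tarball.<br>      :reapply_hot_upgrade<br>    else<br>      # New generation: forget the old tarball and run the fresh image.<br>      :reset_to_new_image<br>    end<br>  end<br>end<br><br># Unique tags: a cold deploy changes the ref, so the stale tarball is dropped.<br>BootDecisionSketch.on_boot("registry.fly.io/my_app:3f2a9c1", "registry.fly.io/my_app:9b41d07")<br># =&gt; :reset_to_new_image<br><br># Reused tag: a brand-new image is indistinguishable from a restart.<br>BootDecisionSketch.on_boot("registry.fly.io/my_app:latest", "registry.fly.io/my_app:latest")<br># =&gt; :reapply_hot_upgrade</pre><p>The second call is the whole bug in one line: with a reused tag, nothing in the comparison can tell a fresh cold deploy from a restart, so the old hot tarball wins.</p>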
</div>
]]>
      </description>
      <link>https://rendal.me/blog/hot-deploying-phoenix-on-fly</link>
      <guid isPermaLink="true">https://rendal.me/blog/hot-deploying-phoenix-on-fly</guid>
      <pubDate>Tue, 16 Jun 2026 13:29:42 +0700</pubDate>
    </item>
    <item>
      <title>I stopped counting Chrome processes and started owning them</title>
      <description>
        <![CDATA[<div data-controller="syntax-highlight" class="lexxy-content">
  <p>In the <a href="https://rendal.me/blog/taming-zombie-chrome">last post</a> I spent an embarrassing amount of time getting one formula right: <code>(sessions × 5) + 8</code>, the number of Chrome processes my scraper would tolerate before deciding something had leaked and killing them. It worked. It ran in production for about four months. And it was the wrong thing to be doing.</p><p>The formula was a workaround for a single missing capability. ChromeDriver would not tell me which Chrome processes belonged to which session. So I counted them, I guessed, and when the guess said "too many" I killed all of them and hoped the live ones weren't in the pile. Every problem in that post, the swarm-counting, the shared-profile lock, the nuclear cleanup, grew out of that one blind spot.</p><p>At some point the obvious question finally landed: what if I just owned the browser?</p><h2>The thing ChromeDriver sits in the middle of</h2><p>When you drive Chrome through Wallaby, the stack is taller than it looks. Your code talks to Wallaby, Wallaby talks HTTP to ChromeDriver, ChromeDriver is a separate daemon that launches and supervises Chrome, and Chrome is the swarm of processes from last time. ChromeDriver owns the browser. You get a session id and some polite requests.</p><p>That middle layer is where my blind spot lived. ChromeDriver launched the Chrome processes, so ChromeDriver knew their PIDs. I didn't. And because every session pointed at the same <code>--user-data-dir</code>, a leftover process from one session could lock the profile and lock out the next one. I was downstream of a daemon that wasn't going to tell me anything useful.</p><p>But Chrome doesn't actually need ChromeDriver. ChromeDriver is a WebDriver-to-CDP translator, and CDP, the Chrome DevTools Protocol, is the thing I wanted to talk to anyway. It's the same protocol the devtools panel in your browser uses. 
You launch Chrome with <code>--remote-debugging-port</code>, it prints a WebSocket URL, and over that socket you can navigate, evaluate JavaScript, take screenshots, watch the network, all of it.</p><p>The important part, for me, was the launch. If <em>I</em> spawn Chrome, I hold the OS process. I connect to it over the WebSocket. When I'm done, I kill the process I started. No daemon in between deciding things on my behalf. No shared profile, because each launch gets its own throwaway one. The "which PID is mine" question stops being a question, because all of them are mine and I have a handle to the thing that owns them.</p><h2>Why I wrote a library instead of using one</h2><p>This is the part you should be most suspicious of, so I'll lead with the least flattering reason: I wanted to. On a side project, "wrap someone else's library" and "write the library" are not equally fun, and that's a real input even if it never makes it onto an architecture diagram.</p><p>But there was a gap underneath the fun, and it's the actual argument. The existing options split two ways. Wallaby is built for feature tests, and I'd already forked it once (last post). Playwright, Puppeteer, and <code>chrome-remote-interface</code> are mature and they're Node or JavaScript. A couple of older Elixir CDP experiments exist, but nothing I'd hang production scraping on.</p><p>Here's the part that decided it. The CDP transport was never the hard bit. Talking the protocol is JSON over a WebSocket with request IDs and an event stream: fiddly, but a solved, well-documented problem, and if a solid Elixir client had existed I'd have used it. The thing I actually needed sat one layer up: a browser as a <em>supervised resource</em>, a process that owns the OS Chrome and is guaranteed to reap it when that process or its owner dies. 
That layer (the <code>terminate/2</code> guarantee and the <code>with_page</code> wrapper below) was the real work, and I'd have had to build it on top of any of those options anyway. The transport is the easy 20% I'd have gotten for free. The lifecycle was the 80% that was the whole point.</p><p>So: part preference, part a real gap that the preference carried me through quickly. I'd make the same call again. It's called <a href="https://github.com/patrols/cdp_ex">cdp_ex</a>, it's on Hex, and it's CDP over <code>Mint.WebSocket</code> with no ChromeDriver and no Node anywhere in the picture.</p><h2>The core idea is one guarantee</h2><p>A cdp_ex browser is a GenServer that owns the Chrome OS process and its connections. Its <code>terminate/2</code> always runs <code>Chrome.stop/1</code>. That's the no-orphan guarantee, and almost everything else is built on it: if the browser process dies, for any reason, the OS process dies with it.</p><p>In practice you rarely touch the GenServer directly. The common shape is a throwaway browser per unit of work, which is exactly one function:</p><pre data-language="elixir">CDPEx.with_page([], fn page -&gt;<br>  {:ok, _} = CDPEx.Page.navigate(page, "https://example.com")<br>  CDPEx.Page.text(page, "h1")<br>end)</pre><p><code>with_page/3</code> launches Chrome, hands you a page, runs your function, and tears the whole thing down afterwards, even if your function raises.</p><p>The teardown is the part worth slowing down on, because it <em>is</em> the no-orphan claim, and it depends on two things being true at once. First, a browser that crashes mid-call must not take your process down with it. So <code>with_page</code> traps exits for the duration and turns a browser crash into an ordinary <code>{:error, _}</code> you can match on, instead of a link exit that kills your caller. Second, if <em>your</em> process is the one that dies, the browser still has to be reaped. 
So <code>with_page</code> keeps the link to the browser rather than downgrading to a monitor, which means the browser's own <code>terminate/2</code> fires and takes Chrome with it. Trap the exit so a crash can't propagate up, keep the link so a caller's death still cleans up. You get resource safety in both directions, and you get it without thinking about it.</p><p>My scraper's fetch path is essentially that, with a navigation and a wait wrapped around it:</p><pre data-language="elixir">defp run_in_page(url, opts) do<br>  CDPEx.with_page(<br>    CdpConfig.launch_opts(opts),<br>    fn page -&gt; fetch_page(page, url, opts) end,<br>    prevent_alerts: true<br>  )<br>end</pre><p>Chrome is launched, driven, and reaped per fetch. There is no pool to babysit, no session registry, no idea of a "leaked" browser, because a browser that isn't currently inside a <code>with_page</code> call doesn't exist.</p><h2>What this deletes</h2><p>Here's the part that made the whole detour worth it. Go back and look at last post's <code>ChromeManager</code>: the two timers, the <code>(sessions × 5) + 8</code> threshold, the zero-session special case, the snapshot-before-mutate dance, the nuclear cleanup that killed every Chrome because it couldn't tell mine from leaked. All of it.</p><p>Gone. The entire module, plus the Oban sweep, plus the <code>WallabyRestarter</code>, plus the ChromeDriver supervisor. Deleted.</p><p>Not "improved." There is nothing left to count, because there's no shared pool of ambiguous processes. There's no shared <code>--user-data-dir</code> to lock, because every fetch gets its own temp profile. The check I used to SSH in and run looks different now:</p><pre data-language="bash">$ pgrep -f chrome | wc -l<br>0</pre><p>Zero between fetches, and a small handful during one. Not because a sweep ran on a timer, but because there's nothing to sweep. The class of bug from the last post didn't get tuned down. 
It stopped being possible.</p><h2>How the cutover actually went</h2><p>I didn't rip and replace, and "I rewrote the browser layer and everything was great" would not be a true sentence. The bumps are the useful part.</p><p>cdp_ex ran alongside Wallaby behind a per-host environment toggle, so I could send one ticketing site through the new engine while everything else stayed on the old one, with Wallaby as the fallback the moment anything looked wrong. I cut over one host at a time and watched. Three things were worth the scars.</p><p><strong>Production is not my laptop.</strong> A single Chrome launch that took about a second locally took six or more on the cold Fly machine under load, and sometimes blew past cdp_ex's launch timeout entirely. The failure surfaced as <code>:debug_url_not_found</code>: Chrome was still coming up when I gave up waiting for its debug URL. The fix was unglamorous, a 45-second launch ceiling instead of the optimistic default, but the lesson is the usual one. Your timeouts are calibrated for the machine you wrote them on.</p><p><strong>The reaper from the last post tried to murder its own replacement.</strong> This is my favorite bug of the whole project. The new CDP browsers registered no Wallaby session, so as far as <code>ChromeManager</code> was concerned, <code>session_count</code> was 0. And <code>zombie_threshold(0)</code> is 0. So the instant a sweep fired, the nuclear cleanup looked at the brand-new CDP Chrome doing real work, decided that any Chrome with zero sessions was by definition a zombie, and SIGTERM'd it mid-fetch. The WebSocket dropped, the fetch failed with a connection-closed error, and it took me longer than I'd like to admit to realize the call was coming from inside the house.</p><p>The stopgap was to teach the old reaper to leave the new browsers alone. 
cdp_ex launches with its own temp profile, so it's identifiable by its <code>--user-data-dir</code>, and I added it to the exclude list with a comment that is basically this post in miniature:</p><pre data-language="elixir"> @excluded_process_patterns [<br>   "claude", "electron", "cursor.app", "vscode", "code helper",<br>   "chrome_crashpad_handler",<br>+  # cdp_ex Chrome reaps itself (CDPEx.with_page on teardown), so the sweep must<br>+  # skip it: a cdp fetch registers no session, so session_count is 0, so<br>+  # zombie_threshold/1 is 0, and the nuclear cleanup would SIGTERM every Chrome.<br>+  "cdp_ex"<br> ]</pre><p>Once every host was cut over, the exclude line went away with the rest of the module.</p><p><strong>Cloudflare was an anticlimax.</strong> The site I was most nervous about sits behind Cloudflare's bot checks, and I'd budgeted time for the usual cat-and-mouse: stealth plugins, fingerprint patching, the works. The first real CDP fetch rendered the page in full, no challenge. The durable lesson here isn't "CDP beats Cloudflare," because it doesn't, and Cloudflare's posture shifts month to month. It's that a lot of basic bot checks are really checks for whether you're a real browser running the page's JavaScript, and CDP drives exactly that: real Chrome, real page. For my sites that was enough on its own. For a hardened target it won't be, and cdp_ex is deliberately not a stealth toolkit. If you try this and hit a challenge wall, that's the expected outcome, not a regression.</p><h2>Where it landed</h2><p>The browser layer is roughly half the moving parts it used to be. The manager and everything around it went with the cutover, and the failure mode that started this whole thing, zombie Chrome eating a 2 GB box, can't recur, because nobody is accumulating browsers anymore.</p><p>cdp_ex is open source and on Hex. 
It does the things I needed: launch and a warm <code>Pool</code>, navigation with real readiness waits, JavaScript evaluation, screenshots and PDFs, network observation and request interception, HTTP and proxy auth (including auto-answering an authenticated proxy), and <code>:telemetry</code> so you can see what it's doing in production.</p><p>It is also young, and I'd rather tell you what it isn't. It's Chrome and Chromium only, because it speaks CDP. It is not a stealth or anti-detection framework, and I have no plans to make it one. If you need cross-browser support or a mature feature-test DSL, Wallaby and Playwright are still the right tools and I'm not trying to talk you out of them. But if you want to drive Chrome from Elixir as a supervised, self-reaping resource, with no ChromeDriver daemon and no Node runtime hanging off the side of your release, that's the entire reason it exists.</p><p>The lesson from the last post was "respect the process tree you live in." This was the logical end of that. Once the browser is just another process you own, all the counting and guessing and nuclear cleanup quietly stops being your problem. You delete it, and nothing misses it.</p><p><a href="https://github.com/patrols/cdp_ex">cdp_ex on GitHub</a> · <a href="https://hexdocs.pm/cdp_ex">docs</a></p>
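  <p>One last sketch, for anyone who wants to see the "keep the link" half of the teardown in isolation. This is not cdp_ex's code. <code>TeardownSketch</code> and <code>FakeBrowser</code> are hypothetical names, the fake browser just sends a message where the real one would stop Chrome, and I've left out the caller-side exit trapping that <code>with_page</code> also does. It only shows that a linked GenServer which traps exits still gets its <code>terminate/2</code> when the owner is killed outright.</p><pre data-language="elixir">defmodule TeardownSketch.FakeBrowser do<br>  use GenServer<br><br>  def start_link(report_to), do: GenServer.start_link(__MODULE__, report_to)<br><br>  @impl true<br>  def init(report_to) do<br>    # Trap exits so a dying owner arrives as a message and terminate/2 still runs.<br>    Process.flag(:trap_exit, true)<br>    {:ok, report_to}<br>  end<br><br>  @impl true<br>  def terminate(_reason, report_to) do<br>    # Stand-in for stopping the real OS Chrome process.<br>    send(report_to, :chrome_reaped)<br>    :ok<br>  end<br>end<br><br>defmodule TeardownSketch do<br>  # Start a linked fake browser, run fun, and stop the browser afterwards,<br>  # whether fun returns or raises.<br>  def with_browser(report_to, fun) do<br>    {:ok, browser} = TeardownSketch.FakeBrowser.start_link(report_to)<br><br>    try do<br>      fun.(browser)<br>    after<br>      GenServer.stop(browser)<br>    end<br>  end<br>end<br><br># The owner is killed mid-call, so its `after` block never gets to run.<br># The link is what delivers the exit to the browser and triggers terminate/2.<br>observer = self()<br><br>owner =<br>  spawn(fn -&gt;<br>    TeardownSketch.with_browser(observer, fn _browser -&gt; Process.sleep(:infinity) end)<br>  end)<br><br>Process.sleep(100)<br>Process.exit(owner, :kill)<br><br>receive do<br>  :chrome_reaped -&gt; IO.puts("browser reaped after owner died")<br>after<br>  1_000 -&gt; IO.puts("browser leaked")<br>end</pre><p>Swap the link for a monitor and the fake browser would simply outlive its owner, which is the orphan case the real library exists to rule out.</p>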
</div>
]]>
      </description>
      <link>https://rendal.me/blog/owning-chrome-from-elixir</link>
      <guid isPermaLink="true">https://rendal.me/blog/owning-chrome-from-elixir</guid>
      <pubDate>Sun, 07 Jun 2026 11:28:30 +0700</pubDate>
    </item>
    <item>
      <title>Thirty-seven Chromes: taming zombie Chrome on the BEAM</title>
      <description>
        <![CDATA[<div data-controller="syntax-highlight" class="lexxy-content">
  <p>I run <a href="https://livein.city">livein.city</a>, a concert-listings site for Bangkok. A chunk of its plumbing is a scraper that drives headless Chrome to render JavaScript-heavy ticketing pages, after which an LLM pulls out the event data. The whole thing lives on a single 2 GB Fly.io machine.</p><p>For a while, scraping would just… stop. New concerts dried up. The logs were a wall of <code>invalid session id</code>. And when I SSH'd in:</p><pre data-language="bash">$ ps aux | grep -c chrome<br>37</pre><p>Thirty-seven Chrome processes (give or take the <code>grep</code> itself) on a box that should have had a handful. They'd eaten the memory, the kernel had started OOM-killing things, and every new browser session died on arrival.</p><p>There were a few root causes: shared-memory limits, Chrome's memory appetite, a startup race. But the most embarrassing one was self-inflicted. The code I'd written to <em>clean up</em> zombie Chrome processes was killing my live sessions.</p><p>This is the story of getting that code right, and of everything I had to do before I even earned the right to have that problem. So it goes in that order: first the toll I paid just to keep Chrome alive at all, then the bug I caused trying to keep it tidy.</p><h2>Why Wallaby, anyway?</h2><p>Honest answer: I don't really remember deciding.</p><p>Wallaby is <em>the</em> browser-automation library in Elixir, the one built for feature tests, the one every forum thread points at. I wasn't writing feature tests. I needed to load a ticketing page, let its JavaScript run, and read the rendered DOM. But Wallaby already knew how to talk to ChromeDriver, and I had zero interest in hand-rolling that part. So I reached for the thing that existed, bent a testing tool into a scraper, and got back to the actual problem.</p><p>That mismatch, a tool designed for short-lived test sessions doing long-running production scraping, is the seam everything else in this post tore along. 
I don't think it was the wrong call at the time. It just came with a bill, and this is me paying it.</p><h2>The one fork I couldn't avoid</h2><p>The ticketing sites I scrape are not fast. Some pages take the better part of a minute to settle: slow backends, a lot of client-side rendering, the occasional anti-bot interstitial.</p><p>Wallaby talks to ChromeDriver over HTTP, and somewhere along the way it moved to Erlang's built-in <code>:httpc</code>. The default timeouts there are tighter than the old HTTPoison ones it used to use, and on a slow render the request <em>to ChromeDriver</em> would give up before the page had finished loading. From the outside it looked like a scrape failure. Underneath it was an HTTP client timing out on its own driver.</p><p>I couldn't configure my way out of it, so I forked Wallaby. The change is eleven lines in one file:</p><pre data-language="elixir">  defp httpc_http_options(url) do<br>-    [<br>-      autoredirect: false,<br>-      ssl: ssl_options(url)<br>-    ]<br>+    user_opts = Application.get_env(:wallaby, :httpc_options, [])<br>+<br>+    Keyword.merge(<br>+      [<br>+        autoredirect: false,<br>+        timeout: 240_000,<br>+        connect_timeout: 30_000,<br>+        ssl: ssl_options(url)<br>+      ],<br>+      user_opts<br>+    )<br>  end</pre><p>240 seconds for the request, 30 for the connect, generous enough that a slow page is just slow rather than a failure. I made it overridable with <code>config :wallaby, :httpc_options</code> so I wasn't maintaining a hard fork for the sake of one knob, and so the change had a shot at being useful to someone else.</p><p>In <code>mix.exs</code> it's just a pinned git dep:</p><pre data-language="elixir">{:wallaby, github: "patrols/wallaby", ref: "ed04b6e"},</pre><p>Forking a dependency always feels heavier than it is. 
In practice it's a ref in a file and a note-to-self to watch upstream.</p><h2>Getting Chrome to survive on a 2 GB box</h2><p>Before any of the zombie drama, I had to get Chrome to <em>stay alive</em> in a container at all. Several separate fights, all of which ended up as flags or config. You can skim this part if you came for the concurrency bug. None of it is clever. It's just the toll.</p><p><strong>Shared memory and OOM.</strong> Chrome uses <code>/dev/shm</code> for a lot of its scratch space, containers hand it a tiny 64 MB default, Chrome blows past it, and the renderer crashes (which surfaces to you as, yes, <code>invalid session id</code>). On a heavy page the renderer will also happily try to eat the whole machine until the kernel ends the argument. The fix for both was a pile of flags: send shared memory to ordinary temp storage, cap V8's heap, limit renderers, shrink the caches to almost nothing.</p><pre data-language="bash">--disable-dev-shm-usage<br>--renderer-process-limit=1<br>--js-flags=--max-old-space-size=512<br>--disk-cache-size=1<br>--media-cache-size=1</pre><p>I deliberately skipped <code>--single-process</code> and disabling site isolation. Tempting on a small box, but they trade away Chrome's security boundaries, and I'd rather buy the memory back some other way.</p><p>A second batch of flags exists purely to stop Chrome spawning processes I don't need. This is also what later keeps the "how many Chromes is too many" math sane:</p><pre data-language="bash">--no-zygote<br>--disable-breakpad<br>--disable-component-update<br>--disable-background-networking<br>--disable-extensions</pre><p><strong>A writable HOME.</strong> This one cost me an afternoon. Chrome wants a writable home directory for <code>~/.pki/nssdb</code>, and the <code>nobody</code> user it runs as has <code>HOME=/nonexistent</code>, so it falls over before doing anything interesting. 
One line, in two places:</p><pre data-language="plain"># Dockerfile<br>ENV HOME=/tmp</pre><pre data-language="plain"># fly.toml<br>[env]<br>HOME = '/tmp'</pre><p><strong>The 4:00:00 PM race.</strong> Some scrapes run on Oban cron at exact times, and the ones scheduled on the dot, like <code>16:00:00</code>, would fail with <code>invalid session id</code>. The cause: my "is Wallaby up?" check only verified that <code>Wallaby.Driver.Supervisor</code> <em>existed</em>, not that ChromeDriver was actually initialized and ready. Start a session in that window and you lose before you've done anything wrong. I added an explicit readiness wait, and because Wallaby-as-a-long-running-service isn't really what Wallaby is for, I also wrapped it in a small supervisor that health-checks it every couple of minutes and restarts it if it wedges, capped at five restarts an hour so a broken state can't turn into a loop.</p><pre data-language="elixir"># Before: a supervisor existing is not the same as it being ready<br>_pid -&gt; :ok<br><br># After: actually wait for ChromeDriver to come up<br>_pid -&gt;<br>  case BrowserPool.wait_until_ready() do<br>    :ok -&gt; :ok<br>    error -&gt; error<br>  end</pre><p>And then the change that mattered more than all the flags combined. I'd started on a single shared CPU to save a few dollars:</p><pre data-language="plain"># fly.toml<br>[[vm]]<br>memory = '2gb'<br>cpus = 2</pre><p>Going from one CPU to two took scroll batches from about 72 seconds down to about 5.</p><p>I'd been blaming the sites.</p><h2>One "Chrome" is not one process</h2><p>Now the actual subject.</p><p>The first thing that breaks your intuition: a single headless Chrome isn't a process, it's a swarm. Main process, renderer, GPU process, a couple of utility processes, crashpad handler, zygote, network service. One browser legitimately spawns 8–12 OS processes. 
Fewer once you start cutting them with the flags above, but never one.</p><p>So the obvious zombie check, "if there are more than N Chrome processes, something leaked," has no fixed N. The healthy number scales with how many sessions are actually live. Set the threshold too high and zombies pile up until you OOM. Set it too low and you execute your own working browser mid-scrape.</p><p>And before you can threshold the count, you have to take it, which is its own small adventure, because <code>pgrep</code> matches on the command line. So the manager runs <code>pgrep</code> for candidates, then <code>ps</code> to read each full command line, then filters down to actual browser processes. (That filter is load-bearing. Hold that thought.)</p><pre data-language="elixir"># Simplified; the real one has the error handling you'd expect.<br># count_chrome_processes/0 is just length(chrome_browser_pids()).<br>defp chrome_browser_pids do<br>  {output, 0} = System.cmd("pgrep", ["-f", "chrome"])<br>  pids = String.split(output, "\n", trim: true)<br><br>  # pgrep only gives PIDs, so re-query ps for each full command line<br>  {ps_output, 0} =<br>    System.cmd("ps", ["-p", Enum.join(pids, ","), "-o", "pid=,command="])<br><br>  ps_output<br>  |&gt; String.split("\n", trim: true)<br>  |&gt; Enum.reject(&amp;should_exclude_process?/1)        # drop the impostors<br>  |&gt; Enum.map(fn line -&gt;                            # the PID is the first column<br>    line |&gt; String.trim() |&gt; String.split(~r/\s+/, parts: 2) |&gt; List.first()<br>  end)<br>end</pre><h2>Attempt #1: clean up after each session (this made it worse)</h2><p>My first instinct was the tidy one: when a scraping session ends, check for leftover Chrome and reap it. Clean up your own mess.</p><p>It was a disaster, and the reason is pure concurrency. The scraping queue runs jobs back-to-back, so the sequence was:</p><ol><li value="1">Session A finishes. 
A cleanup task spawns to check that A's processes are gone.</li><li value="2">Session B starts immediately, and its Chrome processes begin spawning.</li><li value="3">A's cleanup task looks at the process table and sees a pile of Chrome: A's still dying, plus B's just being born.</li><li value="4">"That's way too many." It kills all of them, including B, which was perfectly healthy.</li><li value="5">Session B: <code>invalid session id</code>.</li></ol><p>I eventually wrote the epitaph straight into the module doc, so I'd never be tempted again:</p><blockquote><p>Post-session verification was removed because it caused race conditions when scraping jobs ran back-to-back. The verification task would see overlapping Chrome processes (old ones still dying + new ones starting) and kill ALL processes, including the active session.</p></blockquote><p>The lesson that stuck: event-triggered cleanup races with whatever happens next. What I actually wanted was a periodic loop that reconciles reality against known state on its own clock, closer to how a Kubernetes controller thinks than to a <code>finally {}</code> block.</p><h2>Attempt #2: a threshold that scales with sessions</h2><p>So cleanup moved into a <code>ChromeManager</code> GenServer that tracks how many sessions are registered and runs two timers: a stale-session sweep every minute and a health check every ten. The one-minute sweep does the real work (it also ages out stale sessions); the ten-minute check is just a coarser backstop. That does mean a leak could sit for up to ten minutes before the slow timer notices, but the fast sweep almost always gets there first, and for a background scraper I could live with the worst case. Both timers ask the same question, <em>given how many sessions are live, are there too many Chrome processes</em>, and the answer is a formula.</p><p>The GenServer arms both timers at startup and re-arms each as it fires. 
A plain loop that never blocks:</p><pre data-language="elixir">def init(_opts) do<br>  schedule_cleanup()       # stale-session sweep, every 1 min<br>  schedule_health_check()  # zombie check, every 10 min<br>  {:ok, %__MODULE__{}}<br>end<br><br>def handle_info(:health_check, state) do<br>  perform_health_check(state)<br>  schedule_health_check()   # re-arm<br>  {:noreply, state}<br>end<br><br>defp perform_health_check(state) do<br>  chrome_count  = count_chrome_processes()<br>  session_count = map_size(state.sessions)<br><br>  if chrome_count &gt; zombie_threshold(session_count) do<br>    spawn_cleanup_task(state.sessions)<br>  end<br>end<br><br>defp schedule_health_check do<br>  Process.send_after(self(), :health_check, to_timeout(minute: 10))<br>end<br># schedule_cleanup/0 is the 1-minute twin (it also ages out stale sessions).</pre><p>That <code>zombie_threshold/1</code> is the whole ballgame, and I got it wrong before I got it right. My first version was <code>(sessions × 3) + 3</code>. Still too aggressive: a single healthy session can transiently sit at 10–12 processes, which sails past a budget of 6 and triggers a nuke. 
Back to <code>invalid session id</code>, just less often.</p><p>The version that finally held:</p><pre data-language="elixir"># Chrome legitimately spawns 8-12 processes per instance<br># (main, renderer, GPU, utility×2, crashpad, zygote, network service, …).<br># Previous 3x was too aggressive and killed active sessions.<br>@health_check_zombie_multiplier 5<br>@health_check_min_processes 8<br><br>def zombie_threshold(session_count) when session_count &gt; 0 do<br>  session_count * @health_check_zombie_multiplier + @health_check_min_processes<br>end</pre><p>Which works out to:</p><figure class="lexxy-content__table-wrapper"><table><tbody><tr><th class="lexxy-content__table-cell--header"><p>Active sessions</p></th><th class="lexxy-content__table-cell--header"><p>Cleanup fires above</p></th></tr><tr><td><p>1</p></td><td><p>13 processes</p></td></tr><tr><td><p>2</p></td><td><p>18 processes</p></td></tr><tr><td><p>3</p></td><td><p>23 processes</p></td></tr></tbody></table></figure><p>I should be straight about where those numbers come from: they aren't derived from the 8–12 figure. With the reduction flags from earlier, a browser actually sits closer to five processes in steady state, but it spikes higher during launch and teardown. So I tuned the 8 and the 5× to the worst transient I actually watched happen, not to a clean model of the swarm. With at least one session live, that gives a browser its full footprint plus generous headroom. It detects zombies a little more slowly, but it never guillotines live work. For a background scraper, "slightly slow" beats "destroys the thing it's protecting."</p><p>And it held. The <code>invalid session id</code> storms stopped, and I never had to touch the numbers again. 
(More on exactly how long "held" turned out to be at the end.)</p><h2>The zero-session case is the whole game</h2><p>There's one clause that matters more than the rest:</p><pre data-language="elixir">def zombie_threshold(0), do: 0</pre><p>When no sessions are registered, the tolerance is zero. Any lingering Chrome is, by definition, a zombie. Skip this and the math quietly betrays you: with zero sessions the threshold would still be 8, so up to eight orphaned processes can sit there forever, never crossing the line. Worse, they hold the lock on the shared <code>--user-data-dir</code>, so the <em>next</em> session can't even start. The accumulation is invisible right up until everything is wedged.</p><h2>The nuclear option, owned honestly</h2><p>Here's the ugly part I had to make peace with: Wallaby and ChromeDriver don't expose a session-to-PID mapping. I know I have two live sessions. I cannot tell you which of the 37 Chrome PIDs belong to them.</p><p>So when the threshold trips, cleanup kills <em>every</em> Chrome browser process, not just the leaked ones:</p><pre data-language="elixir"># DESIGN LIMITATION: We cannot identify which specific Chrome PIDs belong to<br># which sessions (Wallaby/ChromeDriver don't expose session-to-PID mappings).<br># Therefore this uses a "nuclear option" approach — if the threshold is<br># exceeded, ALL Chrome processes are killed, including potentially active ones.<br># zombie_threshold/1 is the trade-off between false-positive kills and leaving<br># zombies behind. Future enhancement: track PIDs per session.</pre><p>The generous threshold is what buys the right to be this blunt. By the time you're over budget, the odds you're looking at zombies rather than legitimate load are high. I'd rather have one honest, well-commented blunt instrument than a clever PID-tracking scheme that's subtly wrong. (Per-session PID tracking is the obvious better answer, but it's a real project, not a patch. 
More on that at the end.)</p><h2>The BEAM-specific bits</h2><p>This is where running Chrome <em>from Elixir</em> stops being incidental and starts mattering.</p><p><strong>Don't kill ChromeDriver.</strong> Cleanup reaps Chrome browser processes and deliberately leaves ChromeDriver alone. ChromeDriver is owned by a Wallaby Port, which is owned by a GenServer in my supervision tree. SIGTERM it and that GenServer crashes with <code>{:exit_status, 143}</code>, which trips the restarter into a full <code>Application.stop/start</code> cycle. I'd be nuking my own app to tidy up a browser. The processes that actually block new sessions are the Chrome ones squatting on the <code>--user-data-dir</code> lock, so those are all I touch.</p><p><strong>Don't block the GenServer.</strong> Killing means SIGTERM, a 3-second grace wait, then SIGKILL on the survivors:</p><pre data-language="elixir">Enum.each(pids, &amp;System.cmd("kill", ["-15", &amp;1]))   # SIGTERM all<br>Process.sleep(3000)                                  # one grace period<br><br>survivors =<br>  Enum.filter(pids, fn pid -&gt;<br>    match?({_, 0}, System.cmd("ps", ["-p", pid]))    # exit 0 = still running<br>  end)<br><br>Enum.each(survivors, &amp;System.cmd("kill", ["-9", &amp;1]))</pre><p>That <code>sleep(3000)</code> is poison inside a GenServer. Every <code>session_count</code> call would queue up behind it and time out. So the <code>ChromeManager</code> only ever <em>decides</em>. 
The actual killing happens in a supervised Task:</p><pre data-language="elixir">defp spawn_cleanup_task(tracked_sessions) do<br>  task_fn = fn -&gt;<br>    before = count_chrome_processes()<br>    killed = kill_orphaned_processes(tracked_sessions)<br>    if killed &gt; 0 do<br>      Logger.info("cleanup: #{before} -&gt; #{count_chrome_processes()} Chrome procs")<br>    end<br>  end<br><br>  # On a supervised Task so the 3s+ kill path never blocks the GenServer<br>  # (and a crash in the kill path can't take the manager down with it).<br>  Task.Supervisor.start_child(Pulse.TaskSupervisor, task_fn)<br>end</pre><p>The manager stays responsive; the slow, dirty syscalls happen off to the side. (<code>kill_orphaned_processes/1</code>, the threshold re-check plus the bulk kill above, is elided here for length, along with its force-cleanup twin.)</p><p><strong>Snapshot before you mutate.</strong> The manual <code>force_cleanup_all</code> grabs the PIDs to kill <em>before</em> clearing its session map. Otherwise a session that registers during the async cleanup would have its brand-new processes swept by a task that started life believing the world was empty. Attempt #1's race, in miniature:</p><pre data-language="elixir">def handle_call(:force_cleanup_all, _from, state) do<br>  # Snapshot PIDs *before* clearing sessions. If we cleared first and fetched<br>  # PIDs inside the async task, any session that registers mid-cleanup would<br>  # get its fresh Chrome processes killed.<br>  chrome_pids = chrome_browser_pids()<br>  spawn_force_cleanup_task(chrome_pids)<br>  {:reply, :ok, %{state | sessions: %{}}}<br>end</pre><p>Read the world, then forget it. In that order.</p><h2>The footgun that cost me a few editor windows</h2><p>One last thing, because it bit me on my own laptop. The naive way to find Chrome is <code>pkill -f chrome</code>. On a dev machine, <code>-f</code> matches the full command line, and a startling number of things have "chrome" or "electron" in theirs. 
VS Code. Cursor. Claude. "Code Helper." I closed my own editor more than once before adding an exclude list:</p><pre data-language="elixir">@excluded_process_patterns [<br>  "claude", "electron", "cursor.app",<br>  "vscode", "code helper", "chrome_crashpad_handler"<br>]</pre><p>That's the filter <code>chrome_browser_pids</code> reached for earlier. It's also where ChromeDriver gets spared. Both the editor apps and the driver I must-not-touch are excluded by one predicate:</p><pre data-language="elixir">def should_exclude_process?(command_line) do<br>  downcased = String.downcase(command_line)<br><br>  String.contains?(downcased, "chromedriver") ||<br>    Enum.any?(@excluded_process_patterns, &amp;String.contains?(downcased, &amp;1))<br>end</pre><p>It's a denylist, with the fragility that implies. If you're going to <code>pkill</code> by pattern, enumerate what you must never match first, and accept that you'll forget one. (There's also a separate periodic Oban sweep as a cruder backstop, on the theory that two dumb safety nets beat one clever one.)</p><h2>What I'd take to the next system</h2><ul><li value="1">When you can't attribute a resource precisely, budget generously, and police the empty case ruthlessly. <code>(sessions × 5) + 8</code> for the normal case, hard zero when nothing should be running.</li><li value="2">Prefer periodic reconciliation over event-triggered cleanup. Reacting to "a session ended" races with "a session started." A loop that reconciles against known state doesn't.</li><li value="3">Keep slow, blocking, failure-prone syscalls out of your GenServers. Decide in the process, act in a Task.</li><li value="4">Respect the process tree you live in. On the BEAM, the thing you're tempted to kill might be held by a Port that's held by a supervisor that's holding up your whole app.</li></ul><p>The constants are calibrated for my workload, one session at a time on a small box, so measure your own. 
But the shape of the answer (budget for the swarm, zero-tolerance the void, reconcile on a timer, never block the process keeping score) held up.</p><p>None of this was wasted, even though I later deleted every line of it. The formula shipped in February and ran in production for about four months without me touching it again. It was a patch, not a rewrite, and that's the point: it kept the scraper alive long enough that I could fix the actual problem calmly instead of at 3 AM.</p><p>Because the real problem was never the threshold. It was that ChromeDriver wouldn't tell me which PIDs were mine, so I was stuck counting processes and guessing. The actual fix was to stop guessing: drive Chrome over the DevTools Protocol directly, where each browser is something I launch, own, and reap myself, with no shared profile lock and nothing to sweep. That meant leaving Wallaby behind entirely, fork and all.</p><p>That's the next post.</p>
</div>
]]>
      </description>
      <link>https://rendal.me/blog/taming-zombie-chrome</link>
      <guid isPermaLink="true">https://rendal.me/blog/taming-zombie-chrome</guid>
      <pubDate>Sat, 06 Jun 2026 16:07:06 +0700</pubDate>
    </item>
    <item>
      <title>Running Paperclip on a Hetzner VPS: A Proper Setup Guide</title>
      <description>
        <![CDATA[<div data-controller="syntax-highlight" class="lexxy-content">
  <div class="lexxy-content">
  <p>This post covers how I set up <a href="https://paperclip.ing/">Paperclip</a>, an open-source AI agent orchestration platform, on a dedicated Hetzner VPS. It picks up where my <a href="https://www.rendal.me/blog/hetzner-vps-setup">previous Hetzner guide</a> left off, so if you haven't read that one, start there. By the end, you'll have Paperclip running persistently behind Caddy with HTTPS, locked down with basic auth, and accessible from anywhere via a subdomain.</p><h2>Why a Dedicated VPS</h2><p>You can run Paperclip locally, and for experimenting that's fine. But agents running autonomously on a schedule — waking up every 30 seconds, checking their inbox, executing tasks — don't belong on your laptop. A few reasons to give Paperclip its own server:</p><p><strong>Isolation.</strong> Agents can spike CPU and memory unpredictably. You don't want that affecting other services you're running.</p><p><strong>Persistence.</strong> Paperclip tracks agent sessions, conversation history, and task state. None of that survives a reboot or a closed terminal without a proper process manager.</p><p><strong>Always-on.</strong> The whole point of autonomous agents is that they work while you're not watching. That requires a server.</p><p>I put it on a separate VPS rather than sharing with my existing setup. A Hetzner CX22 (2 vCPU, 4 GB RAM, Singapore region for latency) costs around €6/month.</p><h2>Provisioning the Server</h2><p>The full provisioning and hardening steps — SSH lockdown, fail2ban, Tailscale, UFW — are covered in my <a href="https://www.rendal.me/blog/hetzner-vps-setup">Hetzner VPS setup guide</a>. Follow that first and come back here once you have a hardened server with a deploy user and Tailscale running.</p><p>A few things specific to this setup worth noting:</p><p><strong>Use a dedicated VPS, not a shared one.</strong> Paperclip agents can be resource-hungry. 
Keep it isolated from other services.</p><p><strong>Region:</strong> Pick whatever's geographically closest to you. I chose Singapore for lower latency from Bangkok.</p><p><strong>Size:</strong> Go with Regular Performance CX22 (2 vCPU, 4 GB RAM). Claude Code is memory-hungry — the 2 GB CX11 would be tight. Skip Cost-Optimized (older hardware) and Dedicated Resources (overkill for this use case).</p><p><strong>Create a deploy user before proceeding.</strong> Don't run Paperclip as root. The previous guide covers this — make sure you have a <code>deploy</code> user with sudo access and your SSH key copied across before continuing.</p><h2>Installing Node.js and Claude Code</h2><p>Switch to the deploy user and create a workspace:</p><pre data-language="bash">su - deploy<br>mkdir ~/paperclip<br>cd ~/paperclip</pre><p>Paperclip runs on Node.js. Install it from the NodeSource repository:</p><pre data-language="bash">curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -<br>sudo apt install -y nodejs</pre><p>Install Claude Code globally:</p><pre data-language="bash">sudo npm install -g @anthropic-ai/claude-code</pre><p>Authenticate Claude Code. This opens a browser OAuth flow. Complete it on your local machine:</p><pre data-language="bash">claude</pre><p>Follow the prompts to trust the directory and complete login.</p><h3>Configure MCP Servers (Optional)</h3><p>If you want your agents to have access to tools like Linear or other services, add them now as the deploy user. 
The <code>--scope user</code> flag makes the config available to all Claude Code sessions running as this user, including Paperclip's agents:</p><pre data-language="bash">claude mcp add linear --transport sse https://mcp.linear.app/sse \<br>  --header "Authorization: Bearer YOUR_LINEAR_API_KEY" \<br>  --scope user</pre><h2>Installing Paperclip</h2><p>Run the interactive setup:</p><pre data-language="bash">npx paperclipai onboard --yes</pre><p>This creates an embedded PostgreSQL database, generates secrets, and writes a config to <code>~/.paperclip/</code>. When it's done, you'll see a summary confirming everything passed.</p><p>Start the server once to verify it works:</p><pre data-language="bash">npx paperclipai run</pre><p>You should see the Paperclip ASCII banner and <code>Server listening on 127.0.0.1:3100</code>. Kill it with Ctrl+C once confirmed.</p><action-text-attachment sgid="eyJfcmFpbHMiOnsiZGF0YSI6ImdpZDovL3JlbmRhbC1tZS9BY3RpdmVTdG9yYWdlOjpCbG9iLzIxP2V4cGlyZXNfaW4iLCJwdXIiOiJhdHRhY2hhYmxlIn19--d462eca59ec5551d6a38f7a5a320fbf01f340b1c" content-type="image/png" url="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MjEsInB1ciI6ImJsb2JfaWQifX0=--d2213f9fd893e5185922b61c1a623c53efe757cc/image.png" filename="image.png" filesize="17931" presentation="gallery" caption="Paperclip starting up — the ASCII banner confirms the server is running"><figure class="attachment attachment--preview">
  <img src="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MjEsInB1ciI6ImJsb2JfaWQifX0=--d2213f9fd893e5185922b61c1a623c53efe757cc/image.png">
    <figcaption class="attachment__caption">
      Paperclip starting up — the ASCII banner confirms the server is running
    </figcaption>
</figure></action-text-attachment><h2>Setting Up Caddy</h2><p>Install Caddy:</p><pre data-language="bash">sudo apt install -y caddy</pre><p>Add a DNS A record for your subdomain pointing to the server's public IP. I used <code>paperclip.rendal.me</code>. Set it to DNS-only (not proxied) in Cloudflare.</p><p>Generate a hashed password for basic auth:</p><pre data-language="bash">caddy hash-password</pre><p>Enter your password when prompted and copy the hash it outputs. Edit the Caddyfile:</p><pre data-language="bash">sudo nano /etc/caddy/Caddyfile</pre><p>Replace the entire contents with:</p><pre data-language="plain">paperclip.your-domain.com {<br>    basicauth {<br>        yourname HASHED_PASSWORD_HERE<br>    }<br>    reverse_proxy 127.0.0.1:3100<br>}</pre><p>Restart Caddy:</p><pre data-language="bash">sudo systemctl restart caddy</pre><p>Caddy will automatically provision a TLS certificate.</p><h2>Running Paperclip as a Service</h2><p>Right now Paperclip dies when you close the SSH session. 
Fix that with systemd:</p><pre data-language="bash">sudo nano /etc/systemd/system/paperclip.service</pre><pre data-language="plain">[Unit]<br>Description=Paperclip AI Orchestration<br>After=network.target<br><br>[Service]<br>User=deploy<br>WorkingDirectory=/home/deploy/paperclip<br>ExecStart=/usr/bin/npx paperclipai run<br>Restart=always<br>RestartSec=10<br>Environment=HOME=/home/deploy<br><br>[Install]<br>WantedBy=multi-user.target</pre><p>Enable and start it:</p><pre data-language="bash">sudo systemctl daemon-reload<br>sudo systemctl enable paperclip<br>sudo systemctl start paperclip</pre><p>Check it's running:</p><pre data-language="bash">sudo systemctl status paperclip</pre><p>Paperclip now starts automatically on boot and restarts if it crashes.</p><p>Visit <code>https://paperclip.your-domain.com</code>. You should see a login prompt and, after authenticating, the Paperclip onboarding screen.</p><action-text-attachment sgid="eyJfcmFpbHMiOnsiZGF0YSI6ImdpZDovL3JlbmRhbC1tZS9BY3RpdmVTdG9yYWdlOjpCbG9iLzE0P2V4cGlyZXNfaW4iLCJwdXIiOiJhdHRhY2hhYmxlIn19--9c953f4cde83a4f3fe7670e2b56d1d4fd894ab48" content-type="image/png" url="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTQsInB1ciI6ImJsb2JfaWQifX0=--5e34cf41ab7032704a7554db682053f2d1d0b6eb/image.png" filename="image.png" filesize="251459" presentation="gallery" caption="The Paperclip dashboard after a successful setup"><figure class="attachment attachment--preview">
  <img src="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTQsInB1ciI6ImJsb2JfaWQifX0=--5e34cf41ab7032704a7554db682053f2d1d0b6eb/image.png">
    <figcaption class="attachment__caption">
      The Paperclip dashboard after a successful setup
    </figcaption>
</figure></action-text-attachment><h2>What You Now Have</h2><p>After all of this:</p><ul><li><strong>Paperclip</strong> is running persistently as a systemd service under the deploy user</li><li><strong>HTTPS</strong> is handled by Caddy with automatic certificate renewal</li><li><strong>Basic auth</strong> protects the UI from the public internet</li><li><strong>SSH</strong> is invisible to the internet — accessible only through Tailscale</li><li><strong>Claude Code</strong> is authenticated and optionally connected to MCP servers</li><li><strong>Agents</strong> can run autonomously 24/7 without your laptop being open</li></ul><p>The next step is configuring your LLM provider in the Paperclip UI, creating your first company, and giving your agents something real to do. That's a separate post.</p>
</div>
</div>
]]>
      </description>
      <link>https://rendal.me/blog/running-paperclip-on-a-vps</link>
      <guid isPermaLink="true">https://rendal.me/blog/running-paperclip-on-a-vps</guid>
      <pubDate>Mon, 30 Mar 2026 19:35:38 +0700</pubDate>
    </item>
    <item>
      <title>Setting Up a Hetzner VPS: Provisioning, Firewalls, and Locking It Down</title>
      <description>
        <![CDATA[<div data-controller="syntax-highlight" class="lexxy-content">
  <div class="lexxy-content">
  <p>This post covers how I set up and hardened the VPS that runs rendal.me. By the end, you'll have a provisioned Hetzner server with SSH locked down through Tailscale, fail2ban running as a safety net, and a clear mental model of why each layer exists.</p><h2>Why Hetzner</h2><p>I wanted a simple VPS for a personal Rails site — nothing managed, nothing abstracted away, just a server I control. Hetzner is the obvious choice if you're in Europe or fine with European data centers: the price-to-hardware ratio is genuinely hard to beat. A CX22 (2 vCPUs, 4 GB RAM, 40 GB NVMe) costs around €4/month. The equivalent on DigitalOcean or AWS is two to three times that.</p><p>The Cloud Console is clean and fast, the network is reliable, and they have a straightforward firewall product at the network level. No complaints after running it for months.</p><h2>Provisioning the Server</h2><p>Log into the <a href="https://console.hetzner.cloud/">Hetzner Cloud Console</a> and create a new project. Projects are just organizational containers — you can put all your servers, firewalls, and SSH keys for a given thing in one place.</p><p>Inside the project, click <strong>Add Server</strong>. The choices that matter:</p><p><strong>Location:</strong> Pick whatever's geographically closest to your users. For a personal site with no strong preference, Helsinki or Falkenstein are both fine.</p><p><strong>Image:</strong> Ubuntu 24.04. Stable, well-documented, <code>apt</code> just works, and most tutorials you'll find for server tooling assume Debian/Ubuntu.</p><p><strong>Type:</strong> For a personal Rails app, the CX22 is plenty. You can always resize later if you need to.</p><p><strong>SSH keys:</strong> This is important to get right upfront. Add your public key here before the server is created. Hetzner will inject it into the <code>authorized_keys</code> for the root user, so your first login is already key-authenticated. 
If you don't have an Ed25519 key pair yet:</p><pre data-language="bash">ssh-keygen -t ed25519 -C "your@email.com"</pre><p>Paste the contents of <code>~/.ssh/id_ed25519.pub</code> into the Hetzner SSH key field.</p><p><strong>Networking:</strong> Leave the public IPv4 enabled for now. You'll tighten firewall rules shortly.</p><p>Create the server. Hetzner provisions it in about 30 seconds.</p><h2>The Hetzner Cloud Firewall</h2><p>Before you SSH in and touch anything, set up a firewall at the network level. This is separate from any firewall running on the server itself — it operates in Hetzner's infrastructure and blocks traffic before it even reaches your instance.</p><p>Go to <strong>Firewalls</strong> in the left sidebar and create a new one. The default inbound rules allow everything. Replace them with:</p><figure class="lexxy-content__table-wrapper"><table><tbody><tr><th class="lexxy-content__table-cell--header"><p>Direction</p></th><th class="lexxy-content__table-cell--header"><p>Protocol</p></th><th class="lexxy-content__table-cell--header"><p>Port</p></th><th class="lexxy-content__table-cell--header"><p>Source</p></th></tr><tr><td><p>Inbound</p></td><td><p>TCP</p></td><td><p>22</p></td><td><p>Any</p></td></tr><tr><td><p>Inbound</p></td><td><p>TCP</p></td><td><p>80</p></td><td><p>Any</p></td></tr><tr><td><p>Inbound</p></td><td><p>TCP</p></td><td><p>443</p></td><td><p>Any</p></td></tr></tbody></table></figure><p>Port 22 stays open for now — you need SSH access to configure everything else. You'll close it later once Tailscale is running. 
Outbound rules can stay as "allow all."</p><p>Apply the firewall to your server under the <strong>Resources</strong> tab.</p><action-text-attachment sgid="eyJfcmFpbHMiOnsiZGF0YSI6ImdpZDovL3JlbmRhbC1tZS9BY3RpdmVTdG9yYWdlOjpCbG9iLzQ_ZXhwaXJlc19pbiIsInB1ciI6ImF0dGFjaGFibGUifX0=--36503e0331139767699f89889f936e406886b43e" content-type="image/png" url="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6NCwicHVyIjoiYmxvYl9pZCJ9fQ==--4e328f7f4afef2dbe62d6946122484566622d0e5/image.png" filename="image.png" filesize="88373" presentation="gallery" caption="Hetzner firewall with port 22"><figure class="attachment attachment--preview">
  <img src="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6NCwicHVyIjoiYmxvYl9pZCJ9fQ==--4e328f7f4afef2dbe62d6946122484566622d0e5/image.png">
    <figcaption class="attachment__caption">
      Hetzner firewall with port 22
    </figcaption>
</figure></action-text-attachment><p>The principle here is default-deny at the network perimeter. Nothing reaches your server except the specific ports you've explicitly allowed. Every other port is just gone from the internet's perspective — not rejected, not filtered, simply unreachable.</p><h2>First Login and Baseline Hardening</h2><p>SSH in as root using the server's public IP:</p><pre data-language="bash">ssh root@&lt;your-server-ip&gt;</pre><p>First, update everything:</p><pre data-language="bash">apt update &amp;&amp; apt upgrade -y</pre><h3>Lock Down SSH</h3><p>Edit <code>/etc/ssh/sshd_config</code>:</p><pre data-language="bash">PasswordAuthentication no<br>PubkeyAuthentication yes<br>PermitRootLogin prohibit-password</pre><p>Then restart:</p><pre data-language="bash">systemctl restart ssh</pre><p>Key-only authentication means brute-force password attacks are pointless. Bots don't know that though — they'll keep hammering port 22 regardless. You'll see this if you check the auth log after even a few hours on a public IP:</p><pre data-language="bash">journalctl -u ssh --since "24 hours ago" | grep "Failed password" | wc -l</pre><p>The number is always higher than you expect. Which brings us to the next layer.</p><h3>Create a Deploy User</h3><p>Running everything as root works, but it's unnecessarily risky. 
Create a dedicated user for day-to-day operations and deployments:</p><pre data-language="bash">useradd -m -s /bin/bash deploy<br>usermod -aG sudo deploy<br>mkdir -p /home/deploy/.ssh<br>cp ~/.ssh/authorized_keys /home/deploy/.ssh/<br>chown -R deploy:deploy /home/deploy/.ssh<br>chmod 700 /home/deploy/.ssh<br>chmod 600 /home/deploy/.ssh/authorized_keys<br>echo "deploy ALL=(ALL) NOPASSWD:ALL" &gt; /etc/sudoers.d/deploy</pre><p>Open a new terminal and verify SSH works as the deploy user before continuing. Tailscale isn't installed yet, so this still goes over the public IP:</p><pre data-language="bash">ssh deploy@&lt;your-server-ip&gt;</pre><p>Once the deploy user is confirmed working, you can tighten further by setting <code>PermitRootLogin no</code> in <code>sshd_config</code> to disable root SSH entirely. For most setups, <code>prohibit-password</code> is sufficient.</p><h3>fail2ban</h3><p>fail2ban watches your log files for repeated failed authentication attempts and temporarily bans offending IPs using firewall rules. It's not your primary defense — key-only SSH is — but it's a useful safety net for the cases where you temporarily slip (debugging with password auth enabled, a misconfigured <code>sshd_config</code>, etc.).</p><p>Install it:</p><pre data-language="bash">apt install fail2ban -y</pre><p>Create a local config (the default <code>jail.conf</code> gets overwritten on updates — always work in <code>jail.local</code>):</p><pre data-language="bash">cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local</pre><p>The key settings in <code>/etc/fail2ban/jail.local</code>:</p><pre data-language="bash">[DEFAULT]<br>bantime = 1h<br>findtime = 10m<br>maxretry = 3<br><br>[sshd]<br>enabled = true<br>port = ssh<br>logpath = %(sshd_log)s<br>backend = %(sshd_backend)s<br>maxretry = 3</pre><p>Three failed attempts within 10 minutes get an IP banned for an hour. You can be more aggressive — some people set <code>bantime</code> to 24 hours or use <code>bantime = -1</code> for permanent bans. 
I find an hour is enough to make automated attacks non-viable without the operational overhead of managing permanent bans.</p><p>Start and enable it:</p><pre data-language="bash">systemctl enable fail2ban<br>systemctl start fail2ban</pre><p>Check what it's doing:</p><pre data-language="bash">fail2ban-client status sshd</pre><p>Within a few hours, you'll see IPs showing up in the banned list. It's grimly satisfying.</p><h2>Tailscale</h2><p>This is where the setup gets genuinely interesting. Tailscale creates a private WireGuard-based mesh network between your devices. Your server gets a stable private IP in the <code>100.x.y.z</code> range that's only reachable from your other Tailscale-connected machines.</p><p>The goal: stop SSH-ing to the server's public IP entirely. Once Tailscale is running, you SSH to the private Tailscale IP, then close port 22 on the public firewall completely. The server stays reachable for web traffic on 80 and 443, but SSH is invisible to the internet. Bots can't brute-force a port that doesn't exist as far as they're concerned.</p><h3>Install Tailscale on the Server</h3><pre data-language="bash">curl -fsSL https://tailscale.com/install.sh | sh<br>tailscale up</pre><p>This gives you a URL to authenticate the server with your Tailscale account. Once authenticated, it joins your tailnet and gets a private IP.</p><p>Check it:</p><pre data-language="bash">tailscale ip -4</pre><p>Note that IP — you'll use it for everything going forward.</p><h3>Install Tailscale on Your Machine</h3><p>Install the Tailscale client on your laptop too. It's available for macOS, Linux, Windows, iOS, and Android. 
Once both devices are authenticated to the same tailnet, they can reach each other at their Tailscale IPs regardless of what network either is on.</p><p>Test SSH over Tailscale:</p><pre data-language="bash">ssh root@100.x.y.z</pre><p>If that works, you're ready to close the public port.</p><h3>Close Port 22 on the Hetzner Firewall</h3><p>Go back to the Hetzner Cloud Console and remove the inbound rule for port 22.</p><action-text-attachment sgid="eyJfcmFpbHMiOnsiZGF0YSI6ImdpZDovL3JlbmRhbC1tZS9BY3RpdmVTdG9yYWdlOjpCbG9iLzY_ZXhwaXJlc19pbiIsInB1ciI6ImF0dGFjaGFibGUifX0=--65316edb02953c2d9338f00685580764e6646546" content-type="image/png" url="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6NiwicHVyIjoiYmxvYl9pZCJ9fQ==--e3a175d1a05c0816942c8a58416aba16868e4c56/image.png" filename="image.png" filesize="75352" presentation="gallery" caption="Hetzner firewall without port 22"><figure class="attachment attachment--preview">
  <img src="https://www.rendal.me/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsiZGF0YSI6NiwicHVyIjoiYmxvYl9pZCJ9fQ==--e3a175d1a05c0816942c8a58416aba16868e4c56/image.png">
    <figcaption class="attachment__caption">
      Hetzner firewall without port 22
    </figcaption>
</figure></action-text-attachment><p>That's it. SSH is now only accessible through the Tailscale network. The server's public IP still answers on 80 and 443, but port 22 doesn't exist from the internet's perspective.</p><h3>SSH Config for Convenience</h3><p>Add the server to your local SSH config so you don't have to remember the Tailscale IP:</p><pre data-language="ruby"># ~/.ssh/config<br>Host rendal<br>  HostName 100.x.y.z<br>  User deploy<br>  IdentityFile ~/.ssh/id_ed25519</pre><p>Now <code>ssh rendal</code> just works, routing through Tailscale automatically.</p><h3>Kamal and Tailscale</h3><p>Kamal deploys over SSH, so it also needs to connect through Tailscale now that port 22 is closed publicly. Update <code>config/deploy.yml</code> to use the Tailscale IP:</p><pre data-language="bash">servers:<br>  web:<br>    - 100.x.y.z</pre><p>This means you can only deploy from a machine that's on your tailnet, which for a personal project is a non-issue — you're always deploying from your own laptop. If you needed CI/CD deployments, you'd run Tailscale on the CI runner too, or scope a separate firewall rule to the CI provider's IP range.</p><h2>UFW: Belt and Suspenders</h2><p>The Hetzner firewall operates at the network level, outside your server. I also set up UFW on the server itself as a second layer — defense in depth, not redundancy.</p><pre data-language="bash">ufw default deny incoming<br>ufw default allow outgoing<br>ufw allow 80/tcp<br>ufw allow 443/tcp<br>ufw allow in on tailscale0 to any port 22<br>ufw enable</pre><p>The key line is <code>ufw allow in on tailscale0 to any port 22</code>. This allows SSH only on the Tailscale network interface. 
Even if someone bypassed the Hetzner network firewall somehow, the server itself would reject SSH connections arriving on the public interface.</p><p>Check that the rules are applied correctly:</p><pre data-language="bash">ufw status verbose</pre><h2>What You Now Have</h2><p>After all of this:</p><ul><li><strong>SSH</strong> is invisible to the public internet. Port 22 is closed at both the network level (Hetzner firewall) and the host level (UFW). The only way in is through Tailscale.</li><li><strong>fail2ban</strong> is a safety net for the rare case you need to temporarily expose SSH publicly.</li><li><strong>Web traffic</strong> flows normally on ports 80 and 443.</li><li><strong>Deployments</strong> run over the Tailscale network from your laptop.</li></ul><p>The bot traffic didn't stop — it just can't reach SSH anymore. In practice, the auth log went from hundreds of daily failed attempts to nothing.</p><p>Each of these layers took about 10 minutes to set up. None of them are exotic. The value is in understanding how they compose: network-level firewall for the perimeter, key-only SSH to make password attacks irrelevant, fail2ban as a behavioral safety net, Tailscale to eliminate the public SSH attack surface entirely, UFW as a host-level backstop. Remove any one layer and the others cover it. That's the point.</p>
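<p>A quick way to confirm that end state, offered as a sketch rather than a definitive test: probe port 22 from your laptop on both addresses. Here <code>&lt;public-ip&gt;</code> is a placeholder for the server's public address, <code>100.x.y.z</code> is the Tailscale IP from earlier, and the commands assume <code>nc</code> (netcat) is installed locally.</p><pre data-language="bash"># Public address: should time out or be refused now that port 22 is closed<br>nc -z -v -w 5 &lt;public-ip&gt; 22<br><br># Tailscale address: should report the port as open<br>nc -z -v -w 5 100.x.y.z 22</pre><p>If the first probe connects, recheck the Hetzner firewall rules and the output of <code>ufw status verbose</code> before relying on the setup.</p>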
</div>
</div>
]]>
      </description>
      <link>https://rendal.me/blog/hetzner-vps-setup</link>
      <guid isPermaLink="true">https://rendal.me/blog/hetzner-vps-setup</guid>
      <pubDate>Mon, 23 Mar 2026 18:53:58 +0700</pubDate>
    </item>
  </channel>
</rss>
