InvestorCheck: automating SEC accredited-investor verification

May 18, 2026

If you sell securities in a private offering, the SEC often makes you verify that your investors are accredited -- wealthy or credentialed enough to fend for themselves. In practice that means a lawyer or CPA reviews someone's tax returns and brokerage statements and signs off. It's slow and manual, and it's mostly rules plus document reading, which is the kind of task software can do. So I built InvestorCheck.

The qualification rules. You qualify as accredited by net worth (over $1M, excluding your primary residence), by income (over $200k individually, $300k with a spouse, in each of the last two years and expected again this year), or by holding a Series 7, 65, or 82 license. Each path needs different evidence, and people present it inconsistently. The product has to handle all of them and end on a clean yes or no.
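
The net-worth path reduces to the same kind of comparison as the income path. Here is a minimal sketch, assuming a hypothetical shape for the extracted figures; the field names are illustrative and not InvestorCheck's actual schema. It assumes totalAssets includes the home and totalLiabilities includes the mortgage, and it simplifies the treatment of mortgage debt, so read it as an illustration of the exclusion rather than a complete statement of the rule.

```javascript
const NET_WORTH_THRESHOLD = 1_000_000 // "over $1M, excluding your primary residence"

// Hypothetical input: figures already extracted from documents or account data.
// Assumes totalAssets includes the home and totalLiabilities includes its mortgage.
function netWorthStatus(figures) {
  const {
    totalAssets,
    primaryResidenceValue,
    totalLiabilities,
    primaryResidenceMortgage,
  } = figures

  // Any missing or non-numeric figure -> never guess, send to a human.
  const inputs = [totalAssets, primaryResidenceValue, totalLiabilities, primaryResidenceMortgage]
  if (!inputs.every((n) => Number.isFinite(n))) return 'review'

  // Take the home out of the asset side.
  const assets = totalAssets - primaryResidenceValue
  // Take the mortgage out of the liability side, but only up to the home's
  // value (simplifying assumption: any excess still counts as a liability).
  const excludedDebt = Math.min(primaryResidenceMortgage, primaryResidenceValue)
  const liabilities = totalLiabilities - excludedDebt

  // Strictly over the threshold, matching "over $1M".
  return assets - liabilities > NET_WORTH_THRESHOLD ? 'approved' : 'rejected'
}
```

With $2.5M in total assets, a $1.2M home, and $900k of liabilities of which $800k is the mortgage, this computes $1.3M in assets against $100k in liabilities and returns 'approved'.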

Reading the documents. A user uploads a tax return or a brokerage statement, often as a phone photo of a printout. I send it to OpenAI's vision models and get back structured JSON: the numbers that matter, typed and checkable. Sometimes the response comes back malformed or hedged, so every analysis is typed as either a clean success or an explicit failure that carries the raw response. The pipeline doesn't assume the model succeeded. It checks, and routes the failures to a human instead of guessing.
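
To make the success-or-failure typing concrete, here is a minimal sketch of a parser that never trusts the model's reply. The field names (taxYear, totalIncome) and the result shape are assumptions for illustration; the post does not show the real extraction schema.

```javascript
// Hypothetical required fields; the real schema is not shown in this post.
const REQUIRED_FIELDS = ['taxYear', 'totalIncome']

// Returns { ok: true, value, raw } or { ok: false, reason, raw }.
// The raw response is kept on both branches so a human can inspect it later.
function parseExtraction(rawResponse) {
  let data
  try {
    data = JSON.parse(rawResponse)
  } catch {
    return { ok: false, reason: 'not valid JSON', raw: rawResponse }
  }

  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ok: false, reason: 'not a JSON object', raw: rawResponse }
  }

  // Every required field must be a real number, not a string like "250,000".
  for (const field of REQUIRED_FIELDS) {
    if (!Number.isFinite(data[field])) {
      return { ok: false, reason: `missing or non-numeric ${field}`, raw: rawResponse }
    }
  }

  return {
    ok: true,
    value: { taxYear: data.taxYear, totalIncome: data.totalIncome },
    raw: rawResponse,
  }
}
```

A caller then branches on ok: a success feeds the decision function, and a failure goes to the human review queue with its reason and raw text attached.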

The model extracts; code decides. This is the rule the whole system is built on: the LLM never does arithmetic and never makes the call. An LLM that adds up income is non-deterministic and unauditable, which is exactly wrong for a compliance decision. So the model's only job is to pull the numbers off the page. Once it has, the decision is a pure function over those numbers, comparing them to the SEC thresholds the same way every time:

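const INDIVIDUAL = 200_000   // SEC income thresholds, per year
const JOINT = 300_000

// The model already extracted the numbers; this only compares them.
// user and spouse are { y1, y2 }: income for each of the last two years.
function determineIncomeStatus(user, spouse, spouseDone) {
  // Either year missing -> never guess, send to a human.
  if (user.y1 == null || user.y2 == null) return 'review'

  // Must be over the bar in BOTH of the last two years.
  const meetsIndividual =
    user.y1 > INDIVIDUAL && user.y2 > INDIVIDUAL
  // Clearing the individual bar is enough; the spouse is never needed.
  if (meetsIndividual) return 'approved'

  // Spouse track still open: can't reject yet.
  if (!spouseDone) return 'review'
  // Spouse skipped (nothing submitted): the individual bar decides.
  if (spouse == null || (spouse.y1 == null && spouse.y2 == null)) return 'rejected'
  // Spouse only partly parsed: a null would silently add as 0, so don't guess.
  if (spouse.y1 == null || spouse.y2 == null) return 'review'

  const meetsJoint =
    user.y1 + spouse.y1 > JOINT &&
    user.y2 + spouse.y2 > JOINT
  return meetsJoint ? 'approved' : 'rejected'
}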

Same inputs, same answer, forever. And because the raw model output and the raw Plaid responses (the account-balance data behind net-worth checks) are stored alongside the parsed numbers, every decision can be reconstructed from source: this number came from that line of that document, and the thresholds were applied like so. That audit trail is the difference between a demo and something a compliance officer will actually stand behind.

The hard part is the state machine. That little function hides the real difficulty, which is that the income path isn't one flow, it's two parallel tracks that have to branch apart and then reconverge. An applicant might clear $200k alone, in which case you approve immediately and never need the spouse. Or they fall short, in which case you can't reject yet: you drop to review, let them add a spouse, and only then check the combined income against $300k -- unless they skip the spouse, at which point you fall back to the individual bar and decide. Each track carries its own status and its own Plaid tokens, the tokens survive a cancel-and-retry, and the overall result can't go final until the right combination of tracks is done. Most of the engineering went into getting those invariants exactly right, because in compliance "approved a little too early" is a real liability.
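
The reconvergence rule described above can be sketched as a function from the two track states to one overall result. This is a simplified illustration with a hypothetical state shape (status, plaidToken, meetsBar, meetsJoint), not the production state machine.

```javascript
// Hypothetical track shape: each track has its own status and Plaid token.
function newTrack() {
  return { status: 'not_started', plaidToken: null, meetsBar: null }
}

// Cancel-and-retry resets progress but deliberately keeps the token.
function cancelTrack(track) {
  return { ...track, status: 'not_started', meetsBar: null }
}

// Track status is one of: 'not_started', 'in_progress', 'skipped', 'done'.
// meetsJoint is the combined-income result, or null if not yet computed.
function overallStatus({ user, spouse, meetsJoint }) {
  // Nothing is final until the applicant's own track is done.
  if (user.status !== 'done') return 'pending'
  if (user.meetsBar == null) return 'review'   // done, but nothing usable parsed
  if (user.meetsBar) return 'approved'         // individual bar cleared: no spouse needed

  // Applicant fell short. Rejecting is only allowed once the spouse track resolves.
  if (spouse.status === 'skipped') return 'rejected' // fall back to the individual bar
  if (spouse.status !== 'done') return 'review'      // spouse can still be added
  if (meetsJoint == null) return 'review'            // spouse done, joint result unknown
  return meetsJoint ? 'approved' : 'rejected'
}
```

The ordering of the checks is the invariant: there is no path to 'approved' or 'rejected' that skips an unfinished track, so a result cannot go final early.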

Identity matching. A tax return proves someone makes $300k. You still have to prove it's this someone. So there's a name-matching layer that parses "JOHN B. DOE" and "John Doe Jr." and "Dr. John Doe" into comparable parts, strips titles and suffixes, and decides whether the person in front of you matches the person on the document. Compliance depends on these edge cases, and there are a lot of them.
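
As an illustration of the parsing step, here is a minimal sketch that handles the three examples above. The title and suffix lists and the matching rule (first and last must agree, middle initials must not conflict) are assumptions for the sketch; a real matcher needs many more cases, such as "DOE, JOHN" ordering, hyphenated surnames, and nicknames.

```javascript
// Illustrative lists only; a production matcher would need far more entries.
const TITLES = new Set(['dr', 'mr', 'mrs', 'ms', 'prof'])
const SUFFIXES = new Set(['jr', 'sr', 'ii', 'iii', 'iv'])

// Split a raw name into comparable parts, dropping titles and suffixes.
function parseName(raw) {
  const tokens = raw
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // strip combining accents
    .toLowerCase()
    .replace(/[.,]/g, ' ')           // "B." -> "b", "Jr." -> "jr"
    .split(/\s+/)
    .filter(Boolean)
    .filter((t) => !TITLES.has(t) && !SUFFIXES.has(t))

  if (tokens.length === 0) return null
  return {
    first: tokens[0],
    last: tokens[tokens.length - 1],
    middle: tokens.slice(1, -1),
  }
}

// First and last names must agree; middle names only disqualify when both
// sides have one and the initials differ.
function namesMatch(a, b) {
  const x = parseName(a)
  const y = parseName(b)
  if (!x || !y) return false
  if (x.first !== y.first || x.last !== y.last) return false
  if (x.middle.length > 0 && y.middle.length > 0) {
    return x.middle[0][0] === y.middle[0][0]
  }
  return true
}
```

Under these rules "JOHN B. DOE", "John Doe Jr." and "Dr. John Doe" all match one another, while "John B. Doe" and "John Q. Doe" do not.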

The infrastructure is all code. Net-worth checks pull live balances through Plaid. Documents land in S3. The heavy work -- reading a return, processing an accountant's letter, scoring a net-worth packet -- runs as async jobs off a queue, so a slow OpenAI call never blocks the person filling out the form. The whole stack is defined as infrastructure-as-code with SST: a shared VPC with a bastion and NAT, Aurora Serverless v2 that scales down to almost nothing when idle, S3, SES for email, and the Next.js app itself, split across separate sandbox and production stages that are protected from accidental deletion. Auth is Clerk; the schema is Postgres via Drizzle. The entire environment can be stood up, or torn down, from one command.

This one isn't a demo. It's a real regulatory process, the kind people pay lawyers by the hour to perform, encoded carefully enough to make the same call every time, auditable enough to defend it, and honest enough to hand back the cases where it shouldn't.
