Building a Message Parser from Scratch: How WhatsApp Knows This is Bold


How Rocket.Chat's GSoC project taught me to replace a parser generator with 400 lines of handwritten TypeScript — and why that matters.


You've done it a thousand times. You open WhatsApp, type *Hello*, hit send, and the word appears bold on the other side. No asterisks anywhere.

Have you ever stopped to wonder: what actually happened between "you pressed send" and "they saw bold text"?

That question is exactly what my Google Summer of Code project at Rocket.Chat forced me to answer — in TypeScript, at production scale. This post is everything I learned: the concepts, the code, and the architecture behind every chat app that renders rich text.


The magic trick, demystified

WhatsApp, Rocket.Chat, Slack, Telegram: they all play a version of the same trick. When you type *Hello*, the app doesn't work with <b>Hello</b>, and the renderer never works from the raw *Hello* either. The message is turned into a tree, a structured object that describes what your message means:

{
  "type": "Root",
  "children": [
    {
      "type": "Bold",
      "children": [
        { "type": "Plain", "value": "Hello" }
      ]
    }
  ]
}

The renderer then walks that tree and produces HTML. The * markers never reach the renderer — they're dissolved during parsing.

That tree is called an Abstract Syntax Tree, or AST. Building it is a four-stage pipeline:

Raw text → Lexer → Parser → AST → Renderer → HTML

Let's walk through each stage carefully, with real code.


The pipeline, stage by stage

Stage 1 — Grammar: writing down the rules

Before you write a single line of code, you need to define your grammar — the rules of your markup language. Think of it as a recipe:

bold   → "*" inline_content "*"
italic → "_" inline_content "_"
strike → "~" inline_content "~"

inline_content → (bold | italic | strike | plain_text)+
plain_text     → any character that is not a marker

These rules say: bold text is an asterisk, then some content, then a closing asterisk. Italic uses underscores. Strike uses tildes. Each rule will become a function in your parser.

Different organizations write different grammars. WhatsApp uses *bold*. Discord uses **bold**. Rocket.Chat has its own variant. The grammar is the contract — everything downstream must obey it.


Stage 2 — The Lexer (Scanner / Tokenizer)

The lexer is the first pass over your input. Its job is simple and deliberately limited: group characters into labeled chunks called tokens. It does not understand grammar. It does not understand meaning. It just labels.

Given the input *Hello* _world_, the lexer produces:

STAR
TEXT "Hello"
STAR
TEXT " "
UNDERSCORE
TEXT "world"
UNDERSCORE
EOF

Each * became a STAR token. Each _ became an UNDERSCORE token. The words became TEXT tokens. The lexer is completely blind to the fact that a STAR followed by a TEXT followed by another STAR means "bold" — that understanding belongs to the next stage.

Here's the full lexer in TypeScript:

type TokenType = 'STAR' | 'UNDERSCORE' | 'TILDE' | 'TEXT' | 'EOF';

interface Token {
  type: TokenType;
  value: string;
}

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  let buf = '';

  const flush = () => {
    if (buf) {
      tokens.push({ type: 'TEXT', value: buf });
      buf = '';
    }
  };

  while (i < input.length) {
    const ch = input[i];

    if (ch === '*') {
      flush();
      tokens.push({ type: 'STAR', value: '*' });
      i++;
    } else if (ch === '_') {
      flush();
      tokens.push({ type: 'UNDERSCORE', value: '_' });
      i++;
    } else if (ch === '~') {
      flush();
      tokens.push({ type: 'TILDE', value: '~' });
      i++;
    } else {
      buf += ch;
      i++;
    }
  }

  flush();
  tokens.push({ type: 'EOF', value: '' });
  return tokens;
}

Notice the buf pattern: we accumulate non-marker characters into a buffer and flush() it into a TEXT token whenever we hit a marker. This is the most natural way to lex text-heavy inputs.
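
One property worth checking on a lexer like this is that it is lossless: joining the token values back together should reproduce the input exactly. Here is a minimal sketch of that check. It restates the lexer with a lookup table (MARKERS) instead of the if/else chain, and untokenize is a hypothetical helper I'm adding for illustration, not part of the Rocket.Chat code.

```typescript
type TokenType = 'STAR' | 'UNDERSCORE' | 'TILDE' | 'TEXT' | 'EOF';

interface Token {
  type: TokenType;
  value: string;
}

// Marker characters and the token type each one produces.
const MARKERS: Record<string, TokenType> = {
  '*': 'STAR',
  '_': 'UNDERSCORE',
  '~': 'TILDE',
};

// Same buffering approach as the lexer above, with the marker checks
// folded into a lookup table.
function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let buf = '';

  const flush = () => {
    if (buf) {
      tokens.push({ type: 'TEXT', value: buf });
      buf = '';
    }
  };

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    const markerType = MARKERS[ch];
    if (markerType) {
      flush();
      tokens.push({ type: markerType, value: ch });
    } else {
      buf += ch;
    }
  }

  flush();
  tokens.push({ type: 'EOF', value: '' });
  return tokens;
}

// Hypothetical helper: rebuild the source text from the token stream.
const untokenize = (tokens: Token[]): string =>
  tokens.map((t) => t.value).join('');

const sample = '*Hello* _world_';
console.log(tokenize(sample).map((t) => t.type).join(' '));
// → STAR TEXT STAR TEXT UNDERSCORE TEXT UNDERSCORE EOF
console.log(untokenize(tokenize(sample)) === sample);
// → true
```

The round trip works because every character ends up in exactly one token value and EOF contributes an empty string.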


Stage 3 — The Parser: Recursive Descent

This is where the real magic happens. The parser reads the token stream and builds the AST according to the grammar rules.

We use a technique called recursive descent parsing: each grammar rule becomes a function, and those functions call each other — hence "recursive". It's the cleanest parsing strategy for hand-written parsers, and it's what PeggyJS (the generator Rocket.Chat uses today) internally compiles to.

First, define your AST node types:

type ASTNode =
  | { type: 'Root';   children: ASTNode[] }
  | { type: 'Bold';   children: ASTNode[] }
  | { type: 'Italic'; children: ASTNode[] }
  | { type: 'Strike'; children: ASTNode[] }
  | { type: 'Plain';  value: string       };

Now the parser:

function parse(input: string): ASTNode {
  const tokens = tokenize(input);
  let pos = 0;

  const peek = (): Token => tokens[pos];
  const eat  = (): Token => tokens[pos++];

  // parseInline processes a sequence of nodes, stopping when
  // it sees the EOF or a specific closing marker token.
  function parseInline(stopAt?: TokenType): ASTNode[] {
    const nodes: ASTNode[] = [];

    while (peek().type !== 'EOF' && peek().type !== stopAt) {
      const tok = peek();

      if (tok.type === 'STAR') {
        nodes.push(tryWrapped('STAR', 'Bold') ?? plainToken(eat()));
      } else if (tok.type === 'UNDERSCORE') {
        nodes.push(tryWrapped('UNDERSCORE', 'Italic') ?? plainToken(eat()));
      } else if (tok.type === 'TILDE') {
        nodes.push(tryWrapped('TILDE', 'Strike') ?? plainToken(eat()));
      } else {
        nodes.push(plainToken(eat()));
      }
    }

    return nodes;
  }

  // tryWrapped attempts to parse a marker-wrapped span.
  // If it can't find the closing marker, it BACKTRACKS
  // and returns null — the caller will treat the opening
  // marker as plain text instead.
  function tryWrapped(
    marker: TokenType,
    nodeType: 'Bold' | 'Italic' | 'Strike'
  ): ASTNode | null {
    const saved = pos;       // save position for backtracking
    eat();                   // consume opening marker

    const children = parseInline(marker);

    if (peek().type !== marker || children.length === 0) {
      pos = saved;           // backtrack — restore position
      return null;
    }

    eat();                   // consume closing marker
    return { type: nodeType, children };
  }

  function plainToken(tok: Token): ASTNode {
    return { type: 'Plain', value: tok.value };
  }

  return { type: 'Root', children: parseInline() };
}

The backtracking story

The most important behavior here is backtracking. Consider the input *unclosed bold.

When the parser sees STAR, it calls tryWrapped('STAR', 'Bold'). It saves the current position, consumes the STAR, then calls parseInline('STAR') to look for content followed by a closing STAR. But there is no closing STAR — the token stream ends at EOF.

At that point, tryWrapped restores pos to the saved value and returns null. The caller then calls plainToken(eat()) on the STAR — it becomes { type: 'Plain', value: '*' }. The unmatched asterisk appears literally in the output.

This matches what you see in WhatsApp: *unclosed shows up with a literal asterisk because there's no closing marker.
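
To watch the backtrack happen, here is a compact, instrumented restatement of the parser. It is a sketch for illustration only: parseWithTrace, the backtracks log, and the MARKERS and WRAPS lookup tables are names I'm introducing here, and the logic is otherwise meant to mirror parseInline and tryWrapped above.

```typescript
type TokenType = 'STAR' | 'UNDERSCORE' | 'TILDE' | 'TEXT' | 'EOF';
type Token = { type: TokenType; value: string };
type WrappedType = 'Bold' | 'Italic' | 'Strike';
type ASTNode =
  | { type: 'Root' | WrappedType; children: ASTNode[] }
  | { type: 'Plain'; value: string };

const MARKERS: Record<string, TokenType> = { '*': 'STAR', '_': 'UNDERSCORE', '~': 'TILDE' };
const WRAPS: Partial<Record<TokenType, WrappedType>> = {
  STAR: 'Bold', UNDERSCORE: 'Italic', TILDE: 'Strike',
};
const EOF_TOKEN: Token = { type: 'EOF', value: '' };

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let buf = '';
  const flush = () => {
    if (buf) tokens.push({ type: 'TEXT', value: buf });
    buf = '';
  };
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    const marker = MARKERS[ch];
    if (marker) {
      flush();
      tokens.push({ type: marker, value: ch });
    } else {
      buf += ch;
    }
  }
  flush();
  tokens.push(EOF_TOKEN);
  return tokens;
}

// Returns the AST plus a log entry for every backtrack.
function parseWithTrace(input: string): { ast: ASTNode; backtracks: string[] } {
  const tokens = tokenize(input);
  const backtracks: string[] = [];
  let pos = 0;
  const peek = (): Token => tokens[pos] ?? EOF_TOKEN;

  function parseInline(stopAt?: TokenType): ASTNode[] {
    const nodes: ASTNode[] = [];
    while (peek().type !== 'EOF' && peek().type !== stopAt) {
      const tok = peek();
      const nodeType = WRAPS[tok.type];
      const wrapped = nodeType ? tryWrapped(tok.type, nodeType) : null;
      if (wrapped) {
        nodes.push(wrapped);
      } else {
        // Either a TEXT token or a marker whose span failed: emit it literally.
        nodes.push({ type: 'Plain', value: tok.value });
        pos++;
      }
    }
    return nodes;
  }

  function tryWrapped(marker: TokenType, nodeType: WrappedType): ASTNode | null {
    const saved = pos;
    pos++; // consume the opening marker
    const children = parseInline(marker);
    if (peek().type !== marker || children.length === 0) {
      backtracks.push(`${nodeType} opened at token ${saved}: rewound from ${pos}`);
      pos = saved; // backtrack
      return null;
    }
    pos++; // consume the closing marker
    return { type: nodeType, children };
  }

  return { ast: { type: 'Root', children: parseInline() }, backtracks };
}

const traced = parseWithTrace('*unclosed bold');
console.log(JSON.stringify(traced.ast));
// → {"type":"Root","children":[{"type":"Plain","value":"*"},{"type":"Plain","value":"unclosed bold"}]}
console.log(traced.backtracks);
// → [ 'Bold opened at token 0: rewound from 2' ]
```

The log shows the parser reaching EOF (token 2) before giving up and rewinding to the asterisk at token 0.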


Stage 4 — The Renderer

The renderer is the simplest stage. It walks the AST and emits HTML. Crucially, it never sees *, _, or ~ — those were consumed by the parser. The renderer only speaks the language of AST nodes.

function render(node: ASTNode): string {
  switch (node.type) {
    case 'Root':
      return node.children.map(render).join('');

    case 'Bold':
      return '<strong>' + node.children.map(render).join('') + '</strong>';

    case 'Italic':
      return '<em>' + node.children.map(render).join('') + '</em>';

    case 'Strike':
      return '<del>' + node.children.map(render).join('') + '</del>';

    case 'Plain':
      // Escape HTML entities so plain text can't inject markup
      return node.value
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
  }
}

This is the beauty of the AST approach. If tomorrow Rocket.Chat wants to add a dark-mode renderer, or a renderer that produces Markdown, or a renderer that produces a PDF — they just write a new render function. The lexer and parser don't change at all.
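
To make that concrete, here are two extra renderers over the same node types. Both are hypothetical sketches of mine, not Rocket.Chat code: renderPlainText strips formatting (useful for something like a notification preview), and renderMarkdown emits GitHub-flavored Markdown without escaping Markdown characters in plain text.

```typescript
type ASTNode =
  | { type: 'Root';   children: ASTNode[] }
  | { type: 'Bold';   children: ASTNode[] }
  | { type: 'Italic'; children: ASTNode[] }
  | { type: 'Strike'; children: ASTNode[] }
  | { type: 'Plain';  value: string       };

// Hypothetical renderer: drop all formatting and keep only the text.
function renderPlainText(node: ASTNode): string {
  if (node.type === 'Plain') return node.value;
  return node.children.map(renderPlainText).join('');
}

// Hypothetical renderer: emit GitHub-flavored Markdown.
// Assumption: plain text is passed through without Markdown escaping.
function renderMarkdown(node: ASTNode): string {
  switch (node.type) {
    case 'Root':   return node.children.map(renderMarkdown).join('');
    case 'Bold':   return `**${node.children.map(renderMarkdown).join('')}**`;
    case 'Italic': return `_${node.children.map(renderMarkdown).join('')}_`;
    case 'Strike': return `~~${node.children.map(renderMarkdown).join('')}~~`;
    case 'Plain':  return node.value;
  }
}

// Hand-built AST for the message *bold _italic_ bold*.
const tree: ASTNode = {
  type: 'Root',
  children: [{
    type: 'Bold',
    children: [
      { type: 'Plain', value: 'bold ' },
      { type: 'Italic', children: [{ type: 'Plain', value: 'italic' }] },
      { type: 'Plain', value: ' bold' },
    ],
  }],
};

console.log(renderPlainText(tree)); // → bold italic bold
console.log(renderMarkdown(tree));  // → **bold _italic_ bold**
```

Neither function touches the lexer or the parser; they only need the node types.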


Nesting: _Hello *world!_

Here's where it gets interesting. What does _Hello *world!_ produce?

The grammar says italic wraps everything inside the underscores. The * inside is just a plain character because there's no closing * before the _ closes. So the AST looks like:

{
  "type": "Root",
  "children": [
    {
      "type": "Italic",
      "children": [
        { "type": "Plain", "value": "Hello " },
        { "type": "Plain", "value": "*" },
        { "type": "Plain", "value": "world!" }
      ]
    }
  ]
}

And renders as <em>Hello *world!</em>: italic text with a literal asterisk in the middle.

But *bold _italic_ bold* produces a proper nested tree:

{
  "type": "Bold",
  "children": [
    { "type": "Plain", "value": "bold " },
    {
      "type": "Italic",
      "children": [
        { "type": "Plain", "value": "italic" }
      ]
    },
    { "type": "Plain", "value": " bold" }
  ]
}

The recursive nature of parseInline is what makes nesting work naturally. Each call to parseInline creates its own scope, stopped by whatever closing marker was passed as stopAt.
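
When I'm debugging nesting, raw JSON gets hard to read fast. A small outline printer helps; the sketch below is a hypothetical helper of mine (outline is not part of the parser) that prints one line per node, indented by depth, using the nested tree from above as input.

```typescript
type ASTNode =
  | { type: 'Root' | 'Bold' | 'Italic' | 'Strike'; children: ASTNode[] }
  | { type: 'Plain'; value: string };

// Hypothetical debugging helper: one line per node, indented by depth.
function outline(node: ASTNode, depth = 0): string[] {
  const indent = '  '.repeat(depth);
  if (node.type === 'Plain') {
    return [`${indent}Plain ${JSON.stringify(node.value)}`];
  }
  const lines = [`${indent}${node.type}`];
  for (const child of node.children) {
    // Each level of recursion corresponds to one level of nesting.
    lines.push(...outline(child, depth + 1));
  }
  return lines;
}

// Hand-built AST for *bold _italic_ bold*, wrapped in a Root node.
const nestedTree: ASTNode = {
  type: 'Root',
  children: [{
    type: 'Bold',
    children: [
      { type: 'Plain', value: 'bold ' },
      { type: 'Italic', children: [{ type: 'Plain', value: 'italic' }] },
      { type: 'Plain', value: ' bold' },
    ],
  }],
};

console.log(outline(nestedTree).join('\n'));
// Root
//   Bold
//     Plain "bold "
//     Italic
//       Plain "italic"
//     Plain " bold"
```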


Why hand-write? The PeggyJS story

Rocket.Chat currently uses PeggyJS — a parser generator that takes a grammar specification and outputs a JavaScript parser. This is a completely valid approach for many projects. So why are we replacing it?

1. The global flag mutation bug. PeggyJS-generated parsers use global regex objects. When a parse fails and the parser backtracks, the regex flags (particularly lastIndex) are not always reset correctly. This produces silent, intermittent failures that are nearly impossible to debug from outside the generated code.

2. InlineItemSlowPath and O(n²) behavior. PeggyJS generates a catch-all alternative for inline content that tries every rule in sequence before falling back to plain text. On certain inputs — especially messages heavy with special characters that don't form valid spans — this causes quadratic behavior. A 1,000-character message with many unmatched markers can trigger 1,000 × 1,000 rule attempts.

3. Zero control over error recovery. When PeggyJS fails to parse, you get whatever it gives you. In a hand-written parser, you decide exactly what to do on every failure — backtrack here, emit plain text there, skip that character, emit a warning.

4. Unnecessary dependencies. PeggyJS is a runtime dependency for Rocket.Chat. It's included in every bundle sent to every client. Removing it reduces bundle size.

5. The GSoC goal. The project goal is 100% AST parity with the current parser, a measurable performance improvement on pathological inputs, and a pure TypeScript codebase that Rocket.Chat's maintainers can read and modify without learning PeggyJS grammar syntax.


The Strangler Fig approach

Because Rocket.Chat is a production application with millions of users, we can't flip a switch and replace the parser overnight. Instead, we're using the Strangler Fig pattern — a technique from enterprise software where you grow a replacement alongside the original until it's ready to take over completely.

The public API surface of the parser looks like this:

// The public interface — unchanged from day one
export function parse(message: string, options?: ParseOptions): RootNode {
  if (options?.useNewParser) {
    return newParse(message, options);  // our hand-written parser
  }
  return legacyParse(message, options); // original PeggyJS parser
}

The useNewParser feature flag lets the team test the new parser against real messages in production with little risk, because the legacy path stays the default. When the new parser achieves 100% test parity, the flag becomes the default and eventually the legacy path is deleted.
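
A flag like this pairs well with a shadow comparison: run both parsers, always return the legacy result, and only report when they disagree. The sketch below is my own illustration of that idea, not Rocket.Chat's code. parseWithShadow, Mismatch, and the two stub parsers are hypothetical names, and it assumes comparing JSON.stringify output is good enough (which only holds if both parsers emit keys in the same order).

```typescript
// Minimal stand-ins for the real types; names here are illustrative.
type RootNode = { type: 'Root'; children: unknown[] };
type Parser = (message: string) => RootNode;

interface Mismatch {
  message: string;
  legacy: string;
  candidate: string;
}

// Always returns the legacy result, so users see no behaviour change.
// The candidate runs in the shadow and differences are only reported.
function parseWithShadow(
  message: string,
  legacyParse: Parser,
  newParse: Parser,
  report: (m: Mismatch) => void,
): RootNode {
  const legacyAst = legacyParse(message);
  const legacy = JSON.stringify(legacyAst);
  try {
    const candidate = JSON.stringify(newParse(message));
    if (legacy !== candidate) report({ message, legacy, candidate });
  } catch (err) {
    // A crash in the new parser must never break message delivery.
    report({ message, legacy, candidate: `threw: ${String(err)}` });
  }
  return legacyAst;
}

// Stub parsers for the demo: the second one trims whitespace by mistake.
const legacyStub: Parser = (m) => ({ type: 'Root', children: [{ type: 'Plain', value: m }] });
const buggyStub: Parser = (m) => ({ type: 'Root', children: [{ type: 'Plain', value: m.trim() }] });

const mismatches: Mismatch[] = [];
const collect = (m: Mismatch): void => { mismatches.push(m); };

parseWithShadow('hello', legacyStub, buggyStub, collect);
parseWithShadow(' padded ', legacyStub, buggyStub, collect);
console.log(mismatches.length); // → 1
```

In a real deployment the report callback would need care around logging message content, since chat messages are private data.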

The same incremental approach is common in large migrations, and it's how Rocket.Chat will migrate to this parser.


Putting it all together: the full parser

Here's the complete parser for bold, italic, and strikethrough, ready to run in a TypeScript project:

// ─── token types ─────────────────────────────────────────────────────────────

type TokenType = 'STAR' | 'UNDERSCORE' | 'TILDE' | 'TEXT' | 'EOF';

interface Token {
  type: TokenType;
  value: string;
}

// ─── ast node types ───────────────────────────────────────────────────────────

export type ASTNode =
  | { type: 'Root';   children: ASTNode[] }
  | { type: 'Bold';   children: ASTNode[] }
  | { type: 'Italic'; children: ASTNode[] }
  | { type: 'Strike'; children: ASTNode[] }
  | { type: 'Plain';  value: string       };

// ─── lexer ────────────────────────────────────────────────────────────────────

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  let buf = '';

  const flush = () => {
    if (buf) {
      tokens.push({ type: 'TEXT', value: buf });
      buf = '';
    }
  };

  while (i < input.length) {
    const ch = input[i];
    switch (ch) {
      case '*': flush(); tokens.push({ type: 'STAR',       value: '*' }); break;
      case '_': flush(); tokens.push({ type: 'UNDERSCORE', value: '_' }); break;
      case '~': flush(); tokens.push({ type: 'TILDE',      value: '~' }); break;
      default:  buf += ch;
    }
    i++;
  }

  flush();
  tokens.push({ type: 'EOF', value: '' });
  return tokens;
}

// ─── parser ───────────────────────────────────────────────────────────────────

export function parse(input: string): ASTNode {
  const tokens = tokenize(input);
  let pos = 0;

  const peek = (): Token => tokens[pos];
  const eat  = (): Token => tokens[pos++];

  function parseInline(stopAt?: TokenType): ASTNode[] {
    const nodes: ASTNode[] = [];

    while (peek().type !== 'EOF' && peek().type !== stopAt) {
      const tok = peek();

      if (tok.type === 'STAR') {
        nodes.push(tryWrapped('STAR', 'Bold') ?? plainToken(eat()));
      } else if (tok.type === 'UNDERSCORE') {
        nodes.push(tryWrapped('UNDERSCORE', 'Italic') ?? plainToken(eat()));
      } else if (tok.type === 'TILDE') {
        nodes.push(tryWrapped('TILDE', 'Strike') ?? plainToken(eat()));
      } else {
        nodes.push(plainToken(eat()));
      }
    }

    return nodes;
  }

  function tryWrapped(
    marker: TokenType,
    nodeType: 'Bold' | 'Italic' | 'Strike'
  ): ASTNode | null {
    const saved = pos;
    eat(); // opening marker

    const children = parseInline(marker);

    if (peek().type !== marker || children.length === 0) {
      pos = saved; // backtrack
      return null;
    }

    eat(); // closing marker
    return { type: nodeType, children };
  }

  function plainToken(tok: Token): ASTNode {
    return { type: 'Plain', value: tok.value };
  }

  return { type: 'Root', children: parseInline() };
}

// ─── renderer ─────────────────────────────────────────────────────────────────

export function render(node: ASTNode): string {
  switch (node.type) {
    case 'Root':
      return node.children.map(render).join('');
    case 'Bold':
      return `<strong>${node.children.map(render).join('')}</strong>`;
    case 'Italic':
      return `<em>${node.children.map(render).join('')}</em>`;
    case 'Strike':
      return `<del>${node.children.map(render).join('')}</del>`;
    case 'Plain':
      return node.value
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
  }
}

// ─── usage ────────────────────────────────────────────────────────────────────

const ast    = parse('*Hello* _world_ ~struck~');
const html   = render(ast);
console.log(html);
// → <strong>Hello</strong> <em>world</em> <del>struck</del>

Testing strategy: 100% parity

The hardest part of replacing a parser isn't writing the new parser. It's proving it's correct. Our testing strategy has three tiers:

Tier 1 — Golden tests. For every quirky edge case in the existing PeggyJS parser, we capture the exact AST output and make it a test fixture. The new parser must produce identical JSON. This is how we catch every backtracking edge case, every nested-marker interaction, every emoji.

Tier 2 — Property tests. We use fast-check (a property-based testing library) to generate millions of random messages and assert that both parsers produce the same AST. This finds cases we never thought to write.

Tier 3 — Performance benchmarks. We measure against a suite of pathological inputs — messages with many unmatched markers, very long words, deep nesting — and assert that the new parser is within a constant factor of the old one on typical inputs and significantly better on pathological inputs.
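
Tier 2 is easier to picture with a toy version. The sketch below is not the project's test suite and does not use fast-check: it swaps in a small seeded random generator so the example stays self-contained, and it has no shrinking of failing inputs, which fast-check would give you. It compares two lexers. tokenizeA mirrors the if/else lexer from earlier in this post, and tokenizeB is a hypothetical regex-based rewrite standing in for "the new implementation".

```typescript
type TokenType = 'STAR' | 'UNDERSCORE' | 'TILDE' | 'TEXT' | 'EOF';
type Token = { type: TokenType; value: string };

// Implementation A: character loop with a buffer, as in the lexer above.
function tokenizeA(input: string): Token[] {
  const tokens: Token[] = [];
  let buf = '';
  const flush = () => {
    if (buf) tokens.push({ type: 'TEXT', value: buf });
    buf = '';
  };
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ch === '*') { flush(); tokens.push({ type: 'STAR', value: ch }); }
    else if (ch === '_') { flush(); tokens.push({ type: 'UNDERSCORE', value: ch }); }
    else if (ch === '~') { flush(); tokens.push({ type: 'TILDE', value: ch }); }
    else buf += ch;
  }
  flush();
  tokens.push({ type: 'EOF', value: '' });
  return tokens;
}

// Implementation B (hypothetical rewrite): split on markers with a regex.
const MARKER_TYPES = new Map<string, TokenType>([
  ['*', 'STAR'], ['_', 'UNDERSCORE'], ['~', 'TILDE'],
]);

function tokenizeB(input: string): Token[] {
  const tokens: Token[] = input
    .split(/([*_~])/) // the capturing group keeps the markers in the result
    .filter((part) => part !== '')
    .map((part): Token => ({ type: MARKER_TYPES.get(part) ?? 'TEXT', value: part }));
  tokens.push({ type: 'EOF', value: '' });
  return tokens;
}

// Small seeded generator (a linear congruential generator) so runs repeat.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Random messages biased towards marker characters.
function randomMessage(rng: () => number, maxLength: number): string {
  const alphabet = '*_~ab ';
  const length = Math.floor(rng() * (maxLength + 1));
  let out = '';
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(rng() * alphabet.length));
  }
  return out;
}

// Returns the first input on which the two lexers disagree, or null.
function findDisagreement(runs: number, seed: number): string | null {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const message = randomMessage(rng, 12);
    if (JSON.stringify(tokenizeA(message)) !== JSON.stringify(tokenizeB(message))) {
      return message;
    }
  }
  return null;
}

console.log(findDisagreement(10000, 42)); // → null
```

A non-null result is a ready-made regression test: paste the returned message into the golden fixtures from Tier 1.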


Concepts recap: the mental model

If you take only one thing from this post, let it be this:

| Stage    | What it does          | What it knows             |
|----------|-----------------------|---------------------------|
| Grammar  | Defines the rules     | Everything: it's the spec |
| Lexer    | Labels characters     | Nothing about grammar     |
| Parser   | Builds the tree       | Grammar rules only        |
| AST      | The structural object | Structure, no markers     |
| Renderer | Walks the tree        | AST only, never sees *    |

Each stage has a single responsibility. The lexer doesn't parse. The parser doesn't render. The renderer doesn't know what * means. This separation is what makes the system testable, replaceable, and maintainable.


What's next in the GSoC project

This post covers the core inline parser — bold, italic, strikethrough. The full Rocket.Chat parser also handles:

  • Emoji: :smile: shortcodes and Unicode emoji sequences (the hardest part: replacing Intl.Segmenter for ES2020 compatibility)

  • Mentions: @username and @here

  • URLs: detecting and linkifying https://...

  • Inline code: `backtick` spans

  • Block elements: ``` fenced code blocks, > blockquotes

  • KaTeX math: \( ... \) inline math

Each of these follows exactly the same pattern: a grammar rule, a lexer token type, a parser function, and an AST node. The pipeline is the same all the way down.


Let's build it — from zero, step by step

Everything I've explained so far is the theory. Now let's actually build it. Not the production Rocket.Chat version — that comes at the end. First, we're going to build a simpler version that you can understand completely before we layer complexity on top.

This is the section I wish existed when I was starting out. We'll go from an empty folder to a working parser with bold, italic, strikethrough — and a frontend that renders everything. Every step builds on the previous one.


Step 1 — Project setup

Open your terminal. We need a TypeScript project. Nothing fancy.

mkdir message-parser
cd message-parser
npm init -y
npm install typescript --save-dev
npx tsc --init

That last command generates a tsconfig.json. Open it, find these three options (some may be commented out), and set them:

{
  "compilerOptions": {
    "rootDir": "./src",
    "outDir": "./dist",
    "strict": true
  }
}

rootDir is where your TypeScript lives. outDir is where the compiled JavaScript goes. Create the source folder:

mkdir src
touch src/parser.ts
touch src/index.ts

To compile and run during development, add this to your package.json scripts:

"scripts": {
  "build": "tsc",
  "dev": "tsc --watch"
}

Your folder should look like this now:

message-parser/
  src/
    parser.ts
    index.ts
  tsconfig.json
  package.json

Step 2 — The simplest possible parser: just plain text

Before we add any markup at all, let's get the foundation right. What does a "parser" even return? It returns an array — the AST — where each item represents one chunk of text.

Open src/parser.ts and write this:

const parser = (input: string) => {
  const root = [];
  let i = 0;

  while (i < input.length) {
    const ch = input[i];
    root.push({ value: ch });
    i++;
  }

  return root;
};

console.log(parser("Hi there"));

Run it with npx ts-node src/parser.ts (install ts-node with npm i -D ts-node if you don't have it).

Output:

[
  { "value": "H" },
  { "value": "i" },
  { "value": " " },
  ...
]

Good — we have a loop that walks through every character and builds an array. Right now it's one object per character which is wasteful, but the structure is correct. This is the skeleton of every parser ever written: a loop, an index, and an array you push things into.


Step 3 — Add a type field

A single value field isn't enough. The renderer needs to know what kind of node it's looking at. Is this plain text? Is it bold? Is it italic? Without a type, every node looks the same.

Let's add type: "PLAIN_TEXT" to every node we push:

const parser = (input: string) => {
  const root = [];
  let i = 0;

  while (i < input.length) {
    const ch = input[i];
    root.push({ type: "PLAIN_TEXT", value: ch });
    i++;
  }

  return root;
};

console.log(parser("Hi there"));

Now each node says what it is:

[
  { "type": "PLAIN_TEXT", "value": "H" },
  { "type": "PLAIN_TEXT", "value": "i" },
  { "type": "PLAIN_TEXT", "value": " " },
  ...
]

Still one character per node, but now we have a type. The renderer can check node.type === "PLAIN_TEXT" and know exactly what to do.

One more improvement — instead of pushing one character at a time, let's collect consecutive plain characters into a single node. Otherwise a 100-character message produces 100 nodes, which is inefficient:

const parser = (input: string) => {
  const root = [];
  let i = 0;
  let buf = "";

  while (i < input.length) {
    const ch = input[i];
    buf += ch;
    i++;
  }

  if (buf) {
    root.push({ type: "PLAIN_TEXT", value: buf });
  }

  return root;
};

parser("Hi there") now returns [{ type: "PLAIN_TEXT", value: "Hi there" }]. One node for the whole string. We'll use this buffering pattern throughout.


Step 4 — Replace raw objects with factory functions

Here's something that starts small but matters a lot as the codebase grows: instead of writing { type: "PLAIN_TEXT", value: text } everywhere, we make a function that returns it.

const plain = (value: string) => ({
  type: "PLAIN_TEXT" as const,
  value,
});

The as const tells TypeScript that the type field is literally the string "PLAIN_TEXT" — not just any string. This makes TypeScript's type narrowing work properly later when you write if (node.type === "PLAIN_TEXT").

Now the parser looks like:

const plain = (value: string) => ({
  type: "PLAIN_TEXT" as const,
  value,
});

const parser = (input: string) => {
  const root = [];
  let i = 0;
  let buf = "";

  while (i < input.length) {
    const ch = input[i];
    buf += ch;
    i++;
  }

  if (buf) {
    root.push(plain(buf));
  }

  return root;
};

This might seem like a small change. But once you have ten node types, having plain(text), bold(children), italic(children) instead of inline object literals everywhere makes everything dramatically easier to read.
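
Here is a small sketch of what as const buys you once there are two node types. It is simplified on purpose: this bold takes a string[] rather than child nodes, and describeNode, PlainNode, and BoldNode are names I'm making up for the demo.

```typescript
// With `as const`, the type field stays the literal "PLAIN_TEXT" / "BOLD".
// Without it, TypeScript would infer `type: string` for both factories.
const plain = (value: string) => ({ type: "PLAIN_TEXT" as const, value });
const bold = (value: string[]) => ({ type: "BOLD" as const, value });

type PlainNode = ReturnType<typeof plain>; // { type: "PLAIN_TEXT"; value: string }
type BoldNode = ReturnType<typeof bold>;   // { type: "BOLD"; value: string[] }

function describeNode(node: PlainNode | BoldNode): string {
  if (node.type === "PLAIN_TEXT") {
    // Narrowed to PlainNode: `value` is a string here.
    return `plain(${node.value.toUpperCase()})`;
  }
  // Narrowed to BoldNode: `value` is a string[] here.
  return `bold(${node.value.join(", ")})`;
}

console.log(describeNode(plain("hi")));      // → plain(HI)
console.log(describeNode(bold(["a", "b"]))); // → bold(a, b)
```

If you delete the two as const annotations, both type fields widen to string, the check on node.type no longer narrows the union, and the toUpperCase and join calls stop compiling.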


Step 5 — TypeScript types for your AST nodes

Now let's give TypeScript the full picture. We'll end up with four kinds of nodes (and more coming later). The key insight is that plain text and formatted nodes have different shapes:

  • PLAIN_TEXT holds a string in value — it's a leaf, no children

  • BOLD, ITALIC, STRIKE hold an array in value — they can contain other nodes

type Plain = {
  type: "PLAIN_TEXT";
  value: string;         // a string — this is a leaf node
};

type Bold = {
  type: "BOLD";
  value: Token[];        // an array — this node has children
};

type Italic = {
  type: "ITALIC";
  value: Token[];        // same pattern
};

type Strike = {
  type: "STRIKE";
  value: Token[];
};

type Token = Plain | Bold | Italic | Strike;
type Root = Token[];

Notice that Plain.value is a string but Bold.value is Token[]. That asymmetry is intentional and mirrors the AST structure exactly — a plain node has no children, formatted nodes do. This is what lets bold contain italic, which contains plain text:

Bold
  └── Italic
        └── Plain("hello")

Now add the factory functions for the formatted types:

const plain = (value: string): Plain => ({
  type: "PLAIN_TEXT" as const,
  value,
});

const bold = (value: Token[]): Bold => ({
  type: "BOLD" as const,
  value,
});

const italic = (value: Token[]): Italic => ({
  type: "ITALIC" as const,
  value,
});

const strike = (value: Token[]): Strike => ({
  type: "STRIKE" as const,
  value,
});

You can test the factory functions immediately, before the parser handles them:

const ast = bold([plain("hello "), italic([plain("world")])]);
console.log(JSON.stringify(ast, null, 2));

Output:

{
  "type": "BOLD",
  "value": [
    { "type": "PLAIN_TEXT", "value": "hello " },
    {
      "type": "ITALIC",
      "value": [
        { "type": "PLAIN_TEXT", "value": "world" }
      ]
    }
  ]
}

The tree is working. We hand-built it. Now the parser needs to build it automatically from raw text.


Step 6 — Parse bold (*)

Now the interesting part. Let's extend the parser to handle *bold* syntax.

The logic is: when we see a *, we don't immediately push it as plain text. Instead, we look ahead and try to find a closing *. If we find one, everything between them becomes a BOLD node. If we don't find one, we treat the * as a regular character.

Let's write a helper function for this — parseBoldMarkup:

const parseBoldMarkup = (input: string, startIndex: number) => {
  let j = startIndex + 1; // start after the opening *

  while (j < input.length) {
    if (input[j] === "*") {
      // Found the closing *. An empty span like ** is not bold,
      // so treat it the same as a missing closing marker.
      if (j === startIndex + 1) return null;

      const text = input.slice(startIndex + 1, j);
      return {
        token: bold([plain(text)]),
        newIndex: j + 1,     // resume parsing after the closing *
      };
    }
    j++;
  }

  return null; // no closing * found, so fail gracefully
};

And the updated main parser:

const parser = (input: string): Root => {
  const root: Root = [];
  let i = 0;
  let buf = "";

  const flushBuf = () => {
    if (buf) {
      root.push(plain(buf));
      buf = "";
    }
  };

  while (i < input.length) {
    const ch = input[i];

    if (ch === "*") {
      flushBuf(); // push any accumulated plain text first
      const result = parseBoldMarkup(input, i);
      if (result) {
        root.push(result.token);
        i = result.newIndex;
        continue;
      } else {
        // No closing * — treat this * as plain text
        buf += ch;
      }
    } else {
      buf += ch;
    }

    i++;
  }

  flushBuf();
  return root;
};

console.log(JSON.stringify(parser("Hi *world*"), null, 2));

Output:

[
  { "type": "PLAIN_TEXT", "value": "Hi " },
  {
    "type": "BOLD",
    "value": [
      { "type": "PLAIN_TEXT", "value": "world" }
    ]
  }
]

Two things to notice. First: flushBuf() before handling * — we don't want to lose the "Hi " that came before. Second: when parseBoldMarkup returns null (no closing *), we fall back to treating * as plain text. No crash, no silent wrong output.
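
One side effect of flushing before every *: an unclosed marker splits the text in two. With this parser, "Hi *world" comes back as two PLAIN_TEXT nodes ("Hi " and "*world") rather than one. That renders the same, but it matters if you compare ASTs exactly. A possible fix is a normalization pass; mergePlain below is a hypothetical helper of mine, shown here with only the Plain and Bold types for brevity.

```typescript
type Plain = { type: "PLAIN_TEXT"; value: string };
type Bold = { type: "BOLD"; value: Token[] };
type Token = Plain | Bold;

// Hypothetical post-processing pass: merge runs of adjacent PLAIN_TEXT
// nodes, recursing into BOLD children.
function mergePlain(nodes: Token[]): Token[] {
  const out: Token[] = [];
  for (const node of nodes) {
    const last = out[out.length - 1];
    if (node.type === "PLAIN_TEXT" && last !== undefined && last.type === "PLAIN_TEXT") {
      // Replace rather than mutate, so the input nodes stay untouched.
      out[out.length - 1] = { type: "PLAIN_TEXT", value: last.value + node.value };
    } else if (node.type === "BOLD") {
      out.push({ type: "BOLD", value: mergePlain(node.value) });
    } else {
      out.push(node);
    }
  }
  return out;
}

// What the Step 6 parser returns for "Hi *world".
const split: Token[] = [
  { type: "PLAIN_TEXT", value: "Hi " },
  { type: "PLAIN_TEXT", value: "*world" },
];

console.log(JSON.stringify(mergePlain(split)));
// → [{"type":"PLAIN_TEXT","value":"Hi *world"}]
```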


Step 7 — Add italic (_) and strike (~)

The pattern for italic and strike is identical to bold. We just swap the marker character and the factory function. Let's refactor slightly to avoid repeating ourselves:

const parseWrapped = (
  input: string,
  startIndex: number,
  marker: string,
  wrap: (children: Token[]) => Token
) => {
  let j = startIndex + 1;

  while (j < input.length) {
    if (input[j] === marker) {
      // An empty span like ** or __ is not formatting; leave it as plain text.
      if (j === startIndex + 1) return null;

      const text = input.slice(startIndex + 1, j);
      return {
        token: wrap([plain(text)]),
        newIndex: j + 1,
      };
    }
    j++;
  }

  return null;
};

And the updated parser loop:

const parser = (input: string): Root => {
  const root: Root = [];
  let i = 0;
  let buf = "";

  const flushBuf = () => {
    if (buf) { root.push(plain(buf)); buf = ""; }
  };

  while (i < input.length) {
    const ch = input[i];

    if (ch === "*" || ch === "_" || ch === "~") {
      flushBuf();
      const wrap = ch === "*" ? bold : ch === "_" ? italic : strike;
      const result = parseWrapped(input, i, ch, wrap);
      if (result) {
        root.push(result.token);
        i = result.newIndex;
        continue;
      } else {
        buf += ch;
      }
    } else {
      buf += ch;
    }

    i++;
  }

  flushBuf();
  return root;
};

Test it:

console.log(JSON.stringify(parser("*bold* _italic_ ~strike~"), null, 2));

Output:

[
  { "type": "BOLD",   "value": [{ "type": "PLAIN_TEXT", "value": "bold" }] },
  { "type": "PLAIN_TEXT", "value": " " },
  { "type": "ITALIC", "value": [{ "type": "PLAIN_TEXT", "value": "italic" }] },
  { "type": "PLAIN_TEXT", "value": " " },
  { "type": "STRIKE", "value": [{ "type": "PLAIN_TEXT", "value": "strike" }] }
]
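
One limitation of this simple version: it doesn't nest. parseWrapped always wraps the inner text in a single plain node, so *bold _italic_ bold* gives you a BOLD node containing the literal underscores. A possible way to get nesting, sketched below under my own naming (parseNested and WRAPPERS are hypothetical), is to run the parser recursively on the inner text. This sketch also skips empty spans such as ** and only flushes the buffer when a span actually matches.

```typescript
type Plain  = { type: "PLAIN_TEXT"; value: string };
type Bold   = { type: "BOLD";   value: Token[] };
type Italic = { type: "ITALIC"; value: Token[] };
type Strike = { type: "STRIKE"; value: Token[] };
type Token  = Plain | Bold | Italic | Strike;

const plain = (value: string): Plain => ({ type: "PLAIN_TEXT", value });

// Marker character → factory for the node it produces.
const WRAPPERS: Record<string, (children: Token[]) => Token> = {
  "*": (value) => ({ type: "BOLD", value }),
  "_": (value) => ({ type: "ITALIC", value }),
  "~": (value) => ({ type: "STRIKE", value }),
};

function parseNested(input: string): Token[] {
  const root: Token[] = [];
  let buf = "";
  const flushBuf = () => {
    if (buf) root.push(plain(buf));
    buf = "";
  };

  let i = 0;
  while (i < input.length) {
    const ch = input.charAt(i);
    const wrap = WRAPPERS[ch];
    // Look for the next occurrence of the same marker, as parseWrapped does.
    const close = wrap ? input.indexOf(ch, i + 1) : -1;
    if (wrap && close > i + 1) {
      flushBuf();
      // The key change: parse the inner text recursively instead of
      // wrapping it in a single plain node.
      root.push(wrap(parseNested(input.slice(i + 1, close))));
      i = close + 1;
    } else {
      buf += ch;
      i++;
    }
  }

  flushBuf();
  return root;
}

console.log(JSON.stringify(parseNested("*bold _italic_ bold*")));
// → [{"type":"BOLD","value":[{"type":"PLAIN_TEXT","value":"bold "},{"type":"ITALIC","value":[{"type":"PLAIN_TEXT","value":"italic"}]},{"type":"PLAIN_TEXT","value":" bold"}]}]
```

It still pairs each marker with the first matching marker after it, so overlapping input like *a _b* c_ closes the bold first and leaves the underscores as plain text.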

Step 8 — Build the frontend renderer

The parser's job is done. Now let's actually display it. Create an index.html file in the project root:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Message Parser</title>
  <style>
    body {
      font-family: system-ui, sans-serif;
      max-width: 640px;
      margin: 60px auto;
      padding: 0 20px;
      background: #f9f9f9;
    }
    textarea {
      width: 100%;
      padding: 12px;
      font-size: 15px;
      border: 1px solid #ddd;
      border-radius: 8px;
      box-sizing: border-box;
      resize: none;
    }
    .preview {
      margin-top: 16px;
      padding: 16px;
      background: #fff;
      border: 1px solid #ddd;
      border-radius: 8px;
      font-size: 15px;
      min-height: 48px;
    }
    .ast {
      margin-top: 12px;
      background: #1e1e1e;
      color: #d4d4d4;
      padding: 16px;
      border-radius: 8px;
      font-family: monospace;
      font-size: 12px;
      white-space: pre;
      overflow-x: auto;
    }
    label {
      display: block;
      margin-top: 16px;
      font-size: 13px;
      color: #666;
      margin-bottom: 4px;
    }
  </style>
</head>
<body>
  <h2>Message Parser</h2>
  <p style="color:#666;font-size:14px">
    Try: <code>*bold*</code> &nbsp; <code>_italic_</code> &nbsp; <code>~strike~</code>
  </p>

  <textarea id="input" rows="3" placeholder="Type your message...">*Hello* _world_ ~struck~</textarea>

  <label>Rendered output</label>
  <div class="preview" id="preview"></div>

  <label>AST</label>
  <div class="ast" id="ast"></div>

  <script>
    // ── copy your compiled parser logic here ──────────────────────────────────

    const plain  = (value)    => ({ type: "PLAIN_TEXT", value });
    const bold   = (value)    => ({ type: "BOLD",       value });
    const italic = (value)    => ({ type: "ITALIC",     value });
    const strike = (value)    => ({ type: "STRIKE",     value });

    const parseWrapped = (input, startIndex, marker, wrap) => {
      let j = startIndex + 1;
      while (j < input.length) {
        if (input[j] === marker) {
          // An empty span like ** is not formatting; leave it as plain text.
          if (j === startIndex + 1) return null;
          return { token: wrap([plain(input.slice(startIndex + 1, j))]), newIndex: j + 1 };
        }
        j++;
      }
      return null;
    };

    const parser = (input) => {
      const root = [];
      let i = 0, buf = "";
      const flush = () => { if (buf) { root.push(plain(buf)); buf = ""; } };

      while (i < input.length) {
        const ch = input[i];
        if (ch === "*" || ch === "_" || ch === "~") {
          flush();
          const wrap = ch === "*" ? bold : ch === "_" ? italic : strike;
          const result = parseWrapped(input, i, ch, wrap);
          if (result) { root.push(result.token); i = result.newIndex; continue; }
          else { buf += ch; }
        } else { buf += ch; }
        i++;
      }
      flush();
      return root;
    };

    // ── renderer: walks AST → HTML ────────────────────────────────────────────

    const renderNode = (node) => {
      if (node.type === "PLAIN_TEXT") {
        // Escape HTML entities — never trust user input
        return node.value
          .replace(/&/g, "&amp;")
          .replace(/</g, "&lt;")
          .replace(/>/g, "&gt;");
      }
      const inner = node.value.map(renderNode).join("");
      if (node.type === "BOLD")   return `<strong>${inner}</strong>`;
      if (node.type === "ITALIC") return `<em>${inner}</em>`;
      if (node.type === "STRIKE") return `<del>${inner}</del>`;
      return inner;
    };

    const render = (ast) => ast.map(renderNode).join("");

    // ── live update ───────────────────────────────────────────────────────────

    const update = () => {
      const input = document.getElementById("input").value;
      const ast   = parser(input);

      document.getElementById("preview").innerHTML = render(ast);
      document.getElementById("ast").textContent   = JSON.stringify(ast, null, 2);
    };

    document.getElementById("input").addEventListener("input", update);
    update(); // run on load
  </script>
</body>
</html>

Open that file in a browser. As you type in the textarea, the rendered output and AST update live.

The renderer function renderNode is the final piece. It walks the AST:

  • PLAIN_TEXT → escape the HTML special characters (&, <, >) and return the escaped string

  • BOLD → recursively render children, wrap in <strong>

  • ITALIC → recursively render children, wrap in <em>

  • STRIKE → recursively render children, wrap in <del>

Notice it never looks at the original input. It has no idea what * or _ are. It only speaks in node types. That's the whole point.
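One way to convince yourself of that is to skip the parser entirely and hand the renderer a tree you built by hand. The snippet below is a minimal sketch: the node shapes and renderNode mirror the walkthrough code, while the handBuilt tree and the html variable are names invented here for illustration.

```typescript
// Node shapes mirror the walkthrough parser; nothing below reads marker characters.
type Plain  = { type: "PLAIN_TEXT"; value: string };
type Bold   = { type: "BOLD";       value: Token[] };
type Italic = { type: "ITALIC";     value: Token[] };
type Strike = { type: "STRIKE";     value: Token[] };
type Token  = Plain | Bold | Italic | Strike;

const renderNode = (node: Token): string => {
  if (node.type === "PLAIN_TEXT") {
    // Leaf node: escape the characters that HTML would treat as markup.
    return node.value
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;");
  }
  // Container node: render the children first, then wrap them.
  const inner = node.value.map(renderNode).join("");
  if (node.type === "BOLD")   return `<strong>${inner}</strong>`;
  if (node.type === "ITALIC") return `<em>${inner}</em>`;
  return `<del>${inner}</del>`; // STRIKE is the only remaining case
};

// Hypothetical hand-built tree: bold containing italic, then hostile plain text.
const handBuilt: Token[] = [
  {
    type: "BOLD",
    value: [
      { type: "PLAIN_TEXT", value: "hi " },
      { type: "ITALIC", value: [{ type: "PLAIN_TEXT", value: "there" }] },
    ],
  },
  { type: "PLAIN_TEXT", value: " <script>" },
];

const html = handBuilt.map(renderNode).join("");
console.log(html);
// → <strong>hi <em>there</em></strong> &lt;script&gt;
```

The renderer already copes with an italic node inside a bold node, even though the walkthrough parser never produces one. The tree format and the renderer are ready for nesting; only the parser is not.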


The complete src/parser.ts

Here's the whole thing in one place, fully typed:

// ── node types ────────────────────────────────────────────────────────────────

type Plain  = { type: "PLAIN_TEXT"; value: string   };
type Bold   = { type: "BOLD";       value: Token[]  };
type Italic = { type: "ITALIC";     value: Token[]  };
type Strike = { type: "STRIKE";     value: Token[]  };

type Token = Plain | Bold | Italic | Strike;
type Root  = Token[];

// ── factory functions ─────────────────────────────────────────────────────────

const plain  = (value: string):  Plain  => ({ type: "PLAIN_TEXT" as const, value });
const bold   = (value: Token[]): Bold   => ({ type: "BOLD"       as const, value });
const italic = (value: Token[]): Italic => ({ type: "ITALIC"     as const, value });
const strike = (value: Token[]): Strike => ({ type: "STRIKE"     as const, value });

// ── helper: try to parse a marker-wrapped span ────────────────────────────────

const parseWrapped = (
  input: string,
  startIndex: number,
  marker: string,
  wrap: (children: Token[]) => Token
): { token: Token; newIndex: number } | null => {
  let j = startIndex + 1;

  while (j < input.length) {
    if (input[j] === marker) {
      const text = input.slice(startIndex + 1, j);
      return { token: wrap([plain(text)]), newIndex: j + 1 };
    }
    j++;
  }

  return null; // no closing marker found
};

// ── main parser ───────────────────────────────────────────────────────────────

export const parser = (input: string): Root => {
  const root: Root = [];
  let i = 0;
  let buf = "";

  const flushBuf = () => {
    if (buf) { root.push(plain(buf)); buf = ""; }
  };

  while (i < input.length) {
    const ch = input[i];

    if (ch === "*" || ch === "_" || ch === "~") {
      flushBuf();
      const wrap = ch === "*" ? bold : ch === "_" ? italic : strike;
      const result = parseWrapped(input, i, ch, wrap);
      if (result) {
        root.push(result.token);
        i = result.newIndex;
        continue;
      } else {
        buf += ch; // no closing marker — treat as plain text
      }
    } else {
      buf += ch;
    }

    i++;
  }

  flushBuf();
  return root;
};

// ── renderer ──────────────────────────────────────────────────────────────────

export const renderNode = (node: Token): string => {
  if (node.type === "PLAIN_TEXT") {
    return node.value
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;");
  }
  const inner = node.value.map(renderNode).join("");
  if (node.type === "BOLD")   return `<strong>${inner}</strong>`;
  if (node.type === "ITALIC") return `<em>${inner}</em>`;
  if (node.type === "STRIKE") return `<del>${inner}</del>`;
  return inner;
};

export const render = (ast: Root): string => ast.map(renderNode).join("");

// ── quick test ────────────────────────────────────────────────────────────────

const ast = parser("*bold* _italic_ ~strike~ plain");
console.log(render(ast));
// → <strong>bold</strong> <em>italic</em> <del>strike</del> plain

Compile and run it:

npm run build
node dist/parser.js

What this simpler parser doesn't handle (and why the production version does)

This walkthrough parser is honest about its limitations. It handles flat spans only, with no nesting: *bold* works, but in *bold _italic inside bold_* the inner underscores come out as literal text, because parseWrapped wraps the raw text between the markers in a single PLAIN_TEXT node instead of parsing it into sub-nodes. The production recursive descent parser at the top of this article handles that through the parseInline(stopAt) pattern, where each nested span gets its own parsing context.
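To see the nesting limitation concretely, here is a condensed restatement of the walkthrough parser fed the nested example. The names flatParser, MARKERS, and nested are made up for this snippet, and indexOf stands in for the parseWrapped loop; the behaviour is intended to match the listing above, not to replace it.

```typescript
type Kind = "BOLD" | "ITALIC" | "STRIKE";
type Token =
  | { type: "PLAIN_TEXT"; value: string }
  | { type: Kind; value: Token[] };

// Hypothetical lookup table replacing the chained ternary in the walkthrough.
const MARKERS: Record<string, Kind> = { "*": "BOLD", "_": "ITALIC", "~": "STRIKE" };

const plain = (value: string): Token => ({ type: "PLAIN_TEXT", value });

// Condensed copy of the walkthrough parser: flat spans only.
const flatParser = (input: string): Token[] => {
  const root: Token[] = [];
  let i = 0;
  let buf = "";
  const flush = () => { if (buf) { root.push(plain(buf)); buf = ""; } };

  while (i < input.length) {
    const ch = input[i];
    const kind = MARKERS[ch];
    // Look for the matching closing marker, as parseWrapped does.
    const close = kind ? input.indexOf(ch, i + 1) : -1;
    if (kind) flush();
    if (close !== -1) {
      // The text between the markers is stored raw, never re-parsed.
      root.push({ type: kind, value: [plain(input.slice(i + 1, close))] });
      i = close + 1;
    } else {
      buf += ch; // ordinary character, or a marker with no closing partner
      i++;
    }
  }
  flush();
  return root;
};

const nested = flatParser("*bold _italic inside bold_*");
console.log(JSON.stringify(nested));
// → [{"type":"BOLD","value":[{"type":"PLAIN_TEXT","value":"bold _italic inside bold_"}]}]
```

The bold span is found, but its only child is one PLAIN_TEXT node that still contains the underscores, so the renderer would print them literally.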

The goal here wasn't to ship production code. It was to make the idea concrete before you read the real thing. Once you understand parseWrapped returning a node and an index, you understand recursive descent — it's the same idea with more depth.
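The step from parseWrapped to recursive descent can be sketched in a few dozen lines. What follows is an illustration written for this walkthrough, not the production code: it borrows the parseInline and stopAt names from the article, but the InlineResult shape, the closed flag, the parseNested wrapper, and the choice to leave empty spans such as ** as literal text are all assumptions. It also re-scans the input whenever a marker never closes, so treat it as a teaching aid, not something tuned for performance.

```typescript
type Kind = "BOLD" | "ITALIC" | "STRIKE";
type Token =
  | { type: "PLAIN_TEXT"; value: string }
  | { type: Kind; value: Token[] };

const MARKERS: Record<string, Kind> = { "*": "BOLD", "_": "ITALIC", "~": "STRIKE" };

// Hypothetical result shape: the parsed nodes, the index to resume from,
// and whether the stop marker was actually found.
type InlineResult = { nodes: Token[]; next: number; closed: boolean };

// Parse from `start` until the marker `stopAt` (or the end of the input).
const parseInline = (input: string, start: number, stopAt: string | null): InlineResult => {
  const nodes: Token[] = [];
  let buf = "";
  let i = start;
  const flush = () => {
    if (buf) { nodes.push({ type: "PLAIN_TEXT", value: buf }); buf = ""; }
  };

  while (i < input.length) {
    const ch = input[i];
    if (ch === stopAt) {
      // Found the marker that closes the span we were called for.
      flush();
      return { nodes, next: i + 1, closed: true };
    }
    const kind = MARKERS[ch];
    if (kind) {
      // Recurse: the nested span gets its own context and its own stop marker.
      const inner = parseInline(input, i + 1, ch);
      if (inner.closed && inner.nodes.length > 0) {
        flush();
        nodes.push({ type: kind, value: inner.nodes });
        i = inner.next;
        continue;
      }
    }
    buf += ch; // ordinary character, or a marker that never closes
    i++;
  }
  flush();
  return { nodes, next: i, closed: false };
};

const parseNested = (input: string): Token[] => parseInline(input, 0, null).nodes;

console.log(JSON.stringify(parseNested("*bold _italic inside bold_*")));
// → [{"type":"BOLD","value":[{"type":"PLAIN_TEXT","value":"bold "},{"type":"ITALIC","value":[{"type":"PLAIN_TEXT","value":"italic inside bold"}]}]}]
```

The return value has the same shape as parseWrapped's, a result plus the index to resume from. The only structural change is that the span's contents come from a recursive call instead of input.slice.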


Final thought

When you send *Hello* on WhatsApp and your friend sees Hello in bold, a tiny pipeline runs in milliseconds: a scanner that labels characters, a parser that understands grammar and builds a tree, and a renderer that walks that tree to produce markup. No magic, just a clean separation of concerns.

Hand-writing that pipeline is harder than using a generator. But it gives you complete control, zero hidden dependencies, and code that any engineer can understand and modify. For a project like Rocket.Chat — serving millions of messages per day — that control is worth it.


I'm Amit, a Google Summer of Code contributor at Rocket.Chat working on the High-Performance Message Parser Rewrite. My mentors are Ahmed Nasser and Matheus Cardoso. Follow the project at github.com/RocketChat — contributions and feedback welcome.

If this helped you understand parsers better, share it. If something's wrong, open an issue. That's what open source is for.