# Reassemble multi-segment messages

This guide shows you how to reassemble a single logical message that a source split across several transport messages. You’ll learn to frame each segment, gather the segments of one message with an outer [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md) and an inner [`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md), and concatenate them back into one record you can parse.

## The problem

Many log sources cap the size of a single transport message. When an event exceeds that cap, the source splits it into several smaller messages that share an identifier and carry a segment counter. The receiver sees a burst of partial messages that mean nothing on their own—you have to put them back together before you can parse the event.

A segmented message typically frames each part with a shared id, the total number of segments, and this segment’s number, followed by the payload:

```text
f47ac10b-58cc 3 0 first part of the body
f47ac10b-58cc 3 1 second part
f47ac10b-58cc 3 2 third part
```

Two properties make this awkward:

* **Segments arrive interleaved and out of order.** The network does not guarantee ordering, so segment `2` may reach you before segment `0`, and segments of different messages may mix. You can’t assume the first segment comes first.
* **The identifier is only unique for a while.** A source may reuse ids over time, so you need a time bound that tells two unrelated bursts apart.

## Frame each segment

Start by extracting the framing fields from every message. A [`parse_grok`](https://preview.docs.tenzir.com/407/reference/functions/parse_grok.md) pattern pulls the shared id, the segment total, the segment number, and the payload that follows:

```tql
let $header = r"^(?<message_id>[\w-]+)\s+(?<segment_total>\d+)\s+(?<segment_number>\d+)\s+(?<payload>.*)$"


from {content: "f47ac10b-58cc 3 0 first part of the body"},
     {content: "f47ac10b-58cc 3 1 second part"}
segment = content.parse_grok($header)
```

```tql
{
  content: "f47ac10b-58cc 3 0 first part of the body",
  segment: {
    message_id: "f47ac10b-58cc",
    segment_total: 3,
    segment_number: 0,
    payload: "first part of the body",
  },
}
{
  content: "f47ac10b-58cc 3 1 second part",
  segment: {
    message_id: "f47ac10b-58cc",
    segment_total: 3,
    segment_number: 1,
    payload: "second part",
  },
}
```

Adjust the pattern to match your source’s framing. Grok infers the counters as integers—so the segment number sorts numerically in the next step—and leaves the text id and payload as strings.

**Keep numeric identifiers verbatim**

Type inference reads a numeric identifier as a number, which silently drops leading zeros: a zero-padded id like `0001133664` becomes `1133664`. If your id or payload carries values that must survive verbatim—Cisco ISE ids are zero-padded and its payloads hold MAC addresses and session ids—pass `raw=true` so [`parse_grok`](https://preview.docs.tenzir.com/407/reference/functions/parse_grok.md) keeps every field as text. You then cast the segment number back to a number with `int()` before the sort, so segment `10` sorts after `2`.
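As a minimal sketch of that variant, the following applies the same header pattern to a made-up segment with a zero-padded id. The sample line is illustrative only, not real ISE output:

```tql
let $header = r"^(?<message_id>[\w-]+)\s+(?<segment_total>\d+)\s+(?<segment_number>\d+)\s+(?<payload>.*)$"

// Hypothetical segment with a zero-padded id and a MAC address in the payload.
from {content: "0001133664 12 10 mac=00:1a:2b:3c:4d:5e"}
// raw=true keeps every captured field as a string, so the id keeps its zeros.
segment = content.parse_grok($header, raw=true)
// Cast only the counter back to a number so the later sort is numeric.
segment.segment_number = segment.segment_number.int()
```

Only the segment number is cast here. If you also rely on `segment_total` as a number, cast it the same way.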

## Gather the segments

To reassemble a message, you need every segment that shares an id, but only the ones from the same burst. Combine two operators:

* [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md) bounds reassembly in event time, so an id reused later starts a fresh message instead of merging with the old one. Keeping it on the outside caps in-flight state at the few windows that are open at once.
* [`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md), inside the window, routes the segments that share an id through the same subpipeline so each message reassembles on its own.

Inside the group, [`sort`](https://preview.docs.tenzir.com/407/reference/operators/sort.md) the segments by number and collect their payloads with [`collect`](https://preview.docs.tenzir.com/407/reference/functions/collect.md), which preserves the sorted order. Carry the message-level fields through with [`first`](https://preview.docs.tenzir.com/407/reference/functions/first.md):

```tql
from {message_id: "f47ac10b", segment_number: 2, ts: 2024-01-01T10:00:00, payload: "c=3"},
     {message_id: "f47ac10b", segment_number: 0, ts: 2024-01-01T10:00:00, payload: "a=1"},
     {message_id: "f47ac10b", segment_number: 1, ts: 2024-01-01T10:00:00, payload: "b=2"}
window size=30s, on=ts, idle_timeout=30s {
  group message_id {
    sort segment_number
    summarize \
      message_id=first(message_id),
      ts=first(ts),
      payloads=collect(payload)
  }
}
message = payloads.join(", ")
drop payloads
```

```tql
{
  message_id: "f47ac10b",
  ts: 2024-01-01T10:00:00Z,
  message: "a=1, b=2, c=3",
}
```

The three interleaved fragments become one record whose `message` field holds the complete body, ready to parse like any other event.
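The framing and gathering steps chain directly. As a rough sketch that only combines the operators shown above, the following pipeline frames raw lines and reassembles them in one pass. It assumes each event already carries an event-time field, here a hypothetical `ts` added to the fixture, and it lifts the parsed fields to the top level so the inner pipeline matches the example above:

```tql
let $header = r"^(?<message_id>[\w-]+)\s+(?<segment_total>\d+)\s+(?<segment_number>\d+)\s+(?<payload>.*)$"

// Fixture: raw lines out of order; `ts` is an assumed event-time field.
from {ts: 2024-01-01T10:00:00, content: "f47ac10b-58cc 3 2 third part"},
     {ts: 2024-01-01T10:00:00, content: "f47ac10b-58cc 3 0 first part of the body"},
     {ts: 2024-01-01T10:00:00, content: "f47ac10b-58cc 3 1 second part"}
// Frame each segment, then keep only the timestamp and the parsed fields.
segment = content.parse_grok($header)
this = {ts: ts, ...segment}
// Gather the segments of one message within an event-time window.
window size=30s, on=ts, idle_timeout=30s {
  group message_id {
    sort segment_number
    summarize \
      message_id=first(message_id),
      ts=first(ts),
      payloads=collect(payload)
  }
}
message = payloads.join(" ")
drop payloads
```

The separator passed to `join` depends on how your source splits the body. Use an empty string if segments break mid-token.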

**Why window goes on the outside**

An id is often effectively unique per message, so an outer [`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md) over it would spawn one subpipeline per message and hold that state for the life of the pipeline. Keeping [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md) outermost instead bounds state to the few open windows and tears each one down when it closes; the inner [`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md) only separates segments within a window. Both orderings reassemble identically here, because [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md) assigns events to fixed event-time buckets regardless of nesting, and every segment of a message shares one timestamp. Reserve the outer-[`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md) form for low-cardinality entities that each need an independent event-time clock—see [Learn idiomatic TQL](https://preview.docs.tenzir.com/407/tutorials/learn-idiomatic-tql.md).

### Choose the window size

The window size bounds two things at once: how far apart a message’s segments may arrive, and how long the pipeline waits before emitting a reassembled message on a live source. Many sources send a message’s segments back-to-back, so a window of `30s` leaves a wide margin while keeping latency low. Raise it if segments are delayed in transit; lower it for faster output. Set `idle_timeout` to the same value so a message flushes shortly after its last segment instead of waiting for the full window.

## Supply an event-time field

Reassembly windows on the `on` field, so each segment needs a usable event-time value before this step. If your source already carries a parsed timestamp, point `on` at it. If the timestamp still needs parsing—or omits the year, as BSD syslog does—derive a full timestamp first. See [Work with time](https://preview.docs.tenzir.com/407/guides/transformation/work-with-time.md) for parsing and reconstructing time values.
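For the simple case of a timestamp that arrives as text, a sketch might look like the following. The field name `raw_ts` and the format string are assumptions for illustration, and the example assumes `parse_time` accepts a strptime-style format; adjust both to your source:

```tql
// Hypothetical segment whose timestamp is still a string.
from {message_id: "f47ac10b", segment_number: 0, raw_ts: "2024-01-01 10:00:00", payload: "a=1"}
// Derive a proper time value that `window` can use via on=ts.
ts = raw_ts.parse_time("%Y-%m-%d %H:%M:%S")
drop raw_ts
```

Timestamps that omit the year need more work than this, which is what the linked guide covers.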

## Test with simulated timing

To exercise reassembly like a live feed, replay a fixture with [`delay`](https://preview.docs.tenzir.com/407/reference/operators/delay.md), which sleeps between events in proportion to a timestamp field. Give each segment a `send` time so the segments arrive out of order and spread over real wall-clock time. Here two messages stream in over eight seconds, yet a five-second window reassembles each one:

```tql
from {message_id: "f47ac10b", segment_number: 1, ts: 2024-01-01T10:00:00, send: 2024-01-01T10:00:00.0, payload: "world"},
     {message_id: "f47ac10b", segment_number: 0, ts: 2024-01-01T10:00:00, send: 2024-01-01T10:00:01.0, payload: "hello"},
     {message_id: "9c5b4d2e", segment_number: 0, ts: 2024-01-01T10:00:10, send: 2024-01-01T10:00:07.0, payload: "foo"},
     {message_id: "9c5b4d2e", segment_number: 1, ts: 2024-01-01T10:00:10, send: 2024-01-01T10:00:08.0, payload: "bar"}
delay send
window size=5s, on=ts, idle_timeout=5s {
  group message_id {
    sort segment_number
    summarize message_id=first(message_id), payloads=collect(payload)
  }
}
message = payloads.join(" ")
drop payloads
```

```tql
{
  message_id: "f47ac10b",
  message: "hello world",
}
{
  message_id: "9c5b4d2e",
  message: "foo bar",
}
```

`delay` anchors the `send` times to the current wall clock, so both segments of the first message arrive within about a second, and the second message starts arriving roughly seven seconds in. Each message still comes out whole and is emitted as soon as its window closes: `idle_timeout` flushes the first message about five seconds after its last segment, before the second message begins to arrive. What matters is the spread of one message’s own segments, not how long the whole stream runs. Remove `delay` to replay the fixture instantly as a deterministic test.

**Size the window to the segment spread**

The window must be wider than the gap between a single message’s segments. If one message’s segments arrive farther apart than `size` and `idle_timeout`, its window closes before the stragglers arrive, and [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md) drops them:

```text
warning: `window` dropped 1 late event(s) that arrived after their window had closed
```

The reassembled message is then incomplete. Raise `size` and `idle_timeout` to cover the worst-case delay between a message’s segments.

## In practice

The Cisco package applies exactly this pattern to Cisco ISE syslog, which splits long messages at the source. It ships the reassembly and parsing steps as the reusable operators `cisco::ise::reassemble` and `cisco::ise::parse`, so a pipeline only has to read the data and supply an event-time field before calling them.

## See also

* [`group`](https://preview.docs.tenzir.com/407/reference/operators/group.md)
* [`window`](https://preview.docs.tenzir.com/407/reference/operators/window.md)
* [`sort`](https://preview.docs.tenzir.com/407/reference/operators/sort.md)
* [`delay`](https://preview.docs.tenzir.com/407/reference/operators/delay.md)
* [`collect`](https://preview.docs.tenzir.com/407/reference/functions/collect.md)
* [`parse_grok`](https://preview.docs.tenzir.com/407/reference/functions/parse_grok.md)
* [Parse string fields](https://preview.docs.tenzir.com/407/guides/parsing/parse-string-fields.md)
* [Work with time](https://preview.docs.tenzir.com/407/guides/transformation/work-with-time.md)
* [Aggregate event streams](https://preview.docs.tenzir.com/407/guides/analytics/aggregate-event-streams.md)
* [Learn idiomatic TQL](https://preview.docs.tenzir.com/407/tutorials/learn-idiomatic-tql.md)
* [Syslog](https://preview.docs.tenzir.com/407/integrations/syslog.md)