This guide shows you how to reassemble a single logical message that a source
split across several transport messages. You’ll learn to frame each segment,
gather the segments of one message with an outer window and an inner
group, and concatenate them back into one record you can parse.
The problem
Section titled “The problem”Many log sources cap the size of a single transport message. When an event exceeds that cap, the source splits it into several smaller messages that share an identifier and carry a segment counter. The receiver sees a burst of partial messages that mean nothing on their own—you have to put them back together before you can parse the event.
A segmented message typically frames each part with a shared id, the total number of segments, and this segment’s number, followed by the payload:
f47ac10b-58cc 3 0 first part of the bodyf47ac10b-58cc 3 1 second partf47ac10b-58cc 3 2 third partTwo properties make this awkward:
- Segments arrive interleaved. The network does not guarantee order, so
segment
2may reach you before segment0. You can’t assume the first segment comes first. - The identifier is only unique for a while. A source reuses ids over time, so you need a time bound that tells two unrelated bursts apart.
Frame each segment
Section titled “Frame each segment”Start by extracting the framing fields from every message. A parse_grok
pattern pulls the shared id, the segment total, the segment number, and the
payload that follows:
let $header = r"^(?<message_id>[\w-]+)\s+(?<segment_total>\d+)\s+(?<segment_number>\d+)\s+(?<payload>.*)$"
from {content: "f47ac10b-58cc 3 0 first part of the body"}, {content: "f47ac10b-58cc 3 1 second part"}segment = content.parse_grok($header){ content: "f47ac10b-58cc 3 0 first part of the body", segment: { message_id: "f47ac10b-58cc", segment_total: 3, segment_number: 0, payload: "first part of the body", },}{ content: "f47ac10b-58cc 3 1 second part", segment: { message_id: "f47ac10b-58cc", segment_total: 3, segment_number: 1, payload: "second part", },}Adjust the pattern to match your source’s framing. Grok infers the counters as integers—so the segment number sorts numerically in the next step—and leaves the text id and payload as strings.
Gather the segments
Section titled “Gather the segments”To reassemble a message, you need every segment that shares an id, but only the ones from the same burst. Combine two operators:
windowbounds reassembly in event time, so an id reused later starts a fresh message instead of merging with the old one. Keeping it on the outside caps in-flight state at the few windows that are open at once.group, inside the window, routes the segments that share an id through the same subpipeline so each message reassembles on its own.
Inside the group, sort the segments by number and collect their payloads
with collect, which preserves the sorted order. Carry the message-level
fields through with first:
from {message_id: "f47ac10b", segment_number: 2, ts: 2024-01-01T10:00:00, payload: "c=3"}, {message_id: "f47ac10b", segment_number: 0, ts: 2024-01-01T10:00:00, payload: "a=1"}, {message_id: "f47ac10b", segment_number: 1, ts: 2024-01-01T10:00:00, payload: "b=2"}window size=30s, on=ts, idle_timeout=30s { group message_id { sort segment_number summarize \ message_id=first(message_id), ts=first(ts), payloads=collect(payload) }}message = payloads.join(", ")drop payloads{ message_id: "f47ac10b", ts: 2024-01-01T10:00:00Z, message: "a=1, b=2, c=3",}The three interleaved fragments become one record whose message field holds the
complete body, ready to parse like any other event.
Choose the window size
Section titled “Choose the window size”The window size bounds two things at once: how far apart a message’s segments may
arrive, and how long the pipeline waits before emitting a reassembled message on a
live source. Many sources send a message’s segments back-to-back, so a window of
30s leaves a wide margin while keeping latency low. Raise it if segments are
delayed in transit; lower it for faster output. Set idle_timeout to the same
value so a message flushes shortly after its last segment instead of waiting for
the full window.
Supply an event-time field
Section titled “Supply an event-time field”Reassembly windows on the on field, so each segment needs a usable event-time
value before this step. If your source already carries a parsed timestamp, point
on at it. If the timestamp still needs parsing—or omits the year, as BSD syslog
does—derive a full timestamp first. See Work with time
for parsing and reconstructing time values.
Test with simulated timing
Section titled “Test with simulated timing”To exercise reassembly like a live feed, replay a fixture with delay,
which sleeps between events in proportion to a timestamp field. Give each segment
a send time so the segments arrive out of order and spread over real wall-clock
time. Here two messages stream in over more than five seconds, yet a five-second
window reassembles each one:
from {message_id: "f47ac10b", segment_number: 1, ts: 2024-01-01T10:00:00, send: 2024-01-01T10:00:00.0, payload: "world"}, {message_id: "f47ac10b", segment_number: 0, ts: 2024-01-01T10:00:00, send: 2024-01-01T10:00:01.0, payload: "hello"}, {message_id: "9c5b4d2e", segment_number: 0, ts: 2024-01-01T10:00:10, send: 2024-01-01T10:00:07.0, payload: "foo"}, {message_id: "9c5b4d2e", segment_number: 1, ts: 2024-01-01T10:00:10, send: 2024-01-01T10:00:08.0, payload: "bar"}delay sendwindow size=5s, on=ts, idle_timeout=5s { group message_id { sort segment_number summarize message_id=first(message_id), payloads=collect(payload) }}message = payloads.join(" ")drop payloads{ message_id: "f47ac10b", message: "hello world",}{ message_id: "9c5b4d2e", message: "foo bar",}delay anchors the send times to the current wall clock, so the first message
arrives within a second and the second only seven seconds later. Each message
still comes out whole and streams out as its window closes: idle_timeout flushes
the first message about five seconds after its last segment, before the second
even arrives. What matters is the spread of one message’s own segments, not how
long the whole stream runs. Remove delay to replay the fixture instantly as a
deterministic test.
In practice
Section titled “In practice”The Cisco package applies exactly this pattern to Cisco ISE syslog, which splits
long messages at the source. It ships the reassembly and parsing steps as the
reusable operators cisco::ise::reassemble and cisco::ise::parse, so a pipeline
only has to read the data and supply an event-time field before calling them.