read_auto

Detects the input format of a byte stream and selects a matching reader.

read_auto [fallback=string, max_probe_bytes=uint]

Description

The read_auto operator buffers the first bytes of its input as a probe and asks every reader whether it can parse them. Use it when the input format is unknown at authoring time but is still expected to be one of Tenzir’s structured formats.

The operator proceeds in three steps:

1. Probe the first bytes of the input, up to max_probe_bytes.
2. Dry-run every reader’s parser on the probe to find capable readers.
3. Start the most specific capable reader. If no reader is capable, use the fallback reader or fail. If two formats are equally specific, fail with an ambiguity error.

Detection works in two layers:

Capability: Every reader dry-runs its actual parser on the probe. For example, YAML detection runs the YAML parser and requires a map document that read_yaml would turn into an event. CSV detection tokenizes complete lines with the reader’s quoting rules and requires a stable number of fields. A reader only becomes a candidate when it would accept the probed input.
Specificity: When several readers are capable of parsing the same bytes, the most specific format wins. Magic-byte formats such as PCAP or Parquet rank highest, followed by JSON dialects such as Suricata EVE or GELF, then generic NDJSON, and finally key-value, delimited, Syslog, and YAML input. For example, a GELF stream is also valid NDJSON, but the GELF reader wins because it describes the input more precisely.

Detection is strict by default. If no reader is capable, or if two formats with equal specificity match the same probe, read_auto emits an error instead of guessing. A reader that needs more evidence delays the decision until more input arrives, the input ends, or the probe reaches max_probe_bytes. Once a single best candidate exists, read_auto starts that reader, replays the buffered bytes, and forwards the rest of the stream unchanged.

The built-in detectors cover common JSON, delimited text, security log, and magic-byte formats, including NDJSON, JSON objects, JSON arrays of objects, CSV, TSV, key-value text, YAML, Syslog, CEF, LEEF, Zeek TSV, Suricata EVE JSON, Zeek JSON, GELF, PCAP, Feather, BITZ, and Parquet. Formats that accept nearly arbitrary text never participate in detection. Space-separated values look like prose, so select read_ssv explicitly. Syslog messages without a <PRI> prefix look like free-form text, so read_auto only handles them through a fallback.
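Because space-separated input is excluded from detection, a pipeline that expects it has to name the reader itself. A minimal sketch, using a hypothetical file name:

```tql
// Space-separated values are never auto-detected, so name the reader.
// "table.ssv" is a placeholder path for illustration.
from_file "table.ssv" {
  read_ssv
}
```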

The output uses the schema name that the selected reader would normally assign. For example, detected CSV input produces the same schema name as read_csv, and detected NDJSON input produces the same schema name as read_ndjson. Inspect @name to see the schema name. read_auto does not add a separate field with the detected format.
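One way to see which reader was selected is to copy the schema name into a regular field. The sketch below assumes that @name can be read on the right-hand side of an assignment; the file name and the field name `schema` are placeholders:

```tql
// "sample.data" is a placeholder for a file of unknown format.
from_file "sample.data" {
  read_auto
}
// Copy the schema name assigned by the selected reader into each event.
schema = @name
```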

Use read_auto for exploratory pipelines where you want to try sample data quickly, for file drops where names don’t reliably encode the format, and for multi-format ingestion endpoints. For example, accept_tcp can run read_auto per connection, so one client can send NDJSON while another sends CSV, Syslog, or another supported format.

Prefer a concrete reader when you already know the format or need reader-specific options such as unflatten_separator for read_ndjson. read_auto selects the reader once for each byte stream and expects the remaining bytes in that stream to use the same format.
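For comparison, a pipeline with a known format can skip detection entirely. This sketch uses the unflatten_separator option named above; the separator value "." and the file name are illustrative assumptions:

```tql
// Known format: use the concrete reader and its reader-specific option.
// The "." separator is an example value, not a default.
from_file "events.ndjson" {
  read_ndjson unflatten_separator="."
}
```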

`fallback = string (optional)`

Controls what happens when no detector matches.

Valid values are:

"none": Emit an error. This is the default.
"lines": Use read_lines. The input must be valid UTF-8.
"all": Use read_all. read_auto uses the current probe to choose between text and binary output: a probe that is valid UTF-8 selects read_all, while a probe that is not valid UTF-8 selects read_all binary=true. If binary input can start with a valid UTF-8 prefix longer than max_probe_bytes, use a larger probe limit or use read_all with binary=true directly.

read_auto uses a fallback only after the probe is final, either because the input ended or because the probe reached max_probe_bytes. For long-lived streams with unknown plain-text input, lower max_probe_bytes to reduce startup latency or use read_lines directly.
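The two options for long-lived plain-text streams might look as follows. Both are sketches: the listen address is reused from the examples on this page, and 4096 is an arbitrary illustrative value rather than a recommendation.

```tql
// Option 1: keep detection, but finalize the probe after 4096 bytes
// so the line fallback starts sooner on slow streams.
accept_tcp "0.0.0.0:9000" {
  read_auto fallback="lines", max_probe_bytes=4096
}
```

```tql
// Option 2: skip detection and read lines directly.
accept_tcp "0.0.0.0:9000" {
  read_lines
}
```

A smaller probe limit forces the decision earlier, so readers that would have waited for more evidence get less input to work with.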

`max_probe_bytes = uint (optional)`

The maximum number of bytes to inspect before forcing a detection decision.

Defaults to 1Mi (1,048,576) bytes.
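Raising the limit can help when binary input starts with a long run of valid UTF-8, as described for fallback="all". A sketch with an arbitrary value of 4,194,304 bytes and a placeholder file name:

```tql
// Probe up to 4194304 bytes (four times the default) before deciding.
// The value is illustrative, not a recommendation.
from_file "payload.bin" {
  read_auto fallback="all", max_probe_bytes=4194304
}
```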

Examples

Detect JSON lines

Given this input:

{"x":1}
{"x":2}

Use read_auto where you would normally use a concrete reader:

from_file "events.ndjson" {
  read_auto
}

This produces the following events:

{x: 1}
{x: 2}

Fall back to lines

Given arbitrary UTF-8 text as input:

hello
world

Opt into line-based parsing explicitly:

from_file "messages.txt" {
  read_auto fallback="lines"
}

This produces one event per line:

{line: "hello"}
{line: "world"}

Fall back to a single event

Use fallback="all" when unknown input should become one event instead of one event per line:

from_file "payload.bin" {
  read_auto fallback="all"
}

If the input is binary, the resulting event contains a blob value in the data field.

Accept multiple formats over TCP

Use read_auto in a network listener when the endpoint accepts producers with different formats:

accept_tcp "0.0.0.0:9000" {
  read_auto fallback="lines"
}

The detector runs separately for each connection. This makes the pattern useful for rapid prototyping, intake endpoints shared by several teams, and package pipelines that normalize data only after read_auto has selected the input format.