Skip to content
Legacy docs for Tenzir v5.x. For the latest Tenzir v6 series, visit docs.tenzir.com. Migrating from v5? Read the Tenzir v6 migration guide.

This guide shows you how to clean and normalize values in your data before mapping to a schema. You’ll learn to handle null placeholders, normalize sentinel values, fix types, and provide defaults.

Many data sources use string placeholders instead of actual null values. Common patterns include "None", "N/A", "-", and empty strings.

Use replace to convert placeholders to null across all string fields:

from {status: "active", error: "None"},
{status: "N/A", error: "timeout"},
{status: "-", error: ""}
replace what="None", with=null
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null
{status: "active", error: null}
{status: null, error: "timeout"}
{status: null, error: null}

When you only want to normalize certain fields, specify them explicitly:

from {enabled: "YES", disabled: "NO", status: "YES"}
replace enabled, what="YES", with=true
replace enabled, what="NO", with=false
replace disabled, what="YES", with=true
replace disabled, what="NO", with=false
// status remains unchanged
{enabled: true, disabled: false, status: "YES"}

Combine multiple replacements for thorough cleanup:

from {value: "N/A"},
{value: "null"},
{value: "undefined"},
{value: ""}
replace what="N/A", with=null
replace what="null", with=null
replace what="undefined", with=null
replace what="", with=null

Some systems use specific values to indicate special states. Normalize these to consistent representations.

Convert string booleans to actual boolean values:

from {active: "true", verified: "1", enabled: "yes"},
{active: "false", verified: "0", enabled: "no"}
active = active == "true"
verified = verified == "1"
enabled = enabled == "yes"
{active: true, verified: true, enabled: true}
{active: false, verified: false, enabled: false}

Convert special numeric values:

from {port: -1, count: 0xFFFFFFFF, size: -999}
// -1 often means "any port" or "not applicable"
port = null if port == -1
// 0xFFFFFFFF often means "unknown" in network data
count = null if count == 4294967295
// -999 is sometimes used as a missing value indicator
size = null if size == -999

Raw data often arrives with incorrect types. Convert strings to appropriate native types.

Convert string timestamps to proper time values:

from {ts: "2024-01-15T10:30:45Z", epoch: "1705316445"}
ts = ts.time()
epoch = epoch.int().seconds().from_epoch()
{ts: 2024-01-15T10:30:45Z, epoch: 2024-01-15T11:00:45Z}

Convert string IPs to native IP types:

from {src: "192.168.1.100", dst: "10.0.0.1"}
src = src.ip()
dst = dst.ip()
{src: 192.168.1.100, dst: 10.0.0.1}

Convert CIDR notation strings to subnet types:

from {network: "10.0.0.0/8", allowed: "192.168.0.0/16"}
network = network.subnet()
allowed = allowed.subnet()
{network: 10.0.0.0/8, allowed: 192.168.0.0/16}

Convert numeric strings to integers or floats:

from {port: "443", ratio: "0.95", count: "1000"}
port = port.int()
ratio = ratio.float()
count = count.int()
{port: 443, ratio: 0.95, count: 1000}

Convert duration strings to duration types:

from {timeout: "30s", interval: "5min", ttl: "24h"}
timeout = timeout.duration()
interval = interval.duration()
ttl = ttl.duration()
{timeout: 30s, interval: 5min, ttl: 24h}

Use the else keyword to fill in missing values:

from {name: "alice", score: 85},
{name: "bob"},
{name: "charlie", score: null}
score = score? else 0
status = status? else "unknown"
{name: "alice", score: 85, status: "unknown"}
{name: "bob", score: 0, status: "unknown"}
{name: "charlie", score: 0, status: "unknown"}

The ? operator suppresses warnings when accessing fields that don’t exist.

Clean up whitespace and case inconsistencies:

from {name: " ALICE ", email: " Bob@Example.COM "}
name = name.trim().to_title()
email = email.trim().to_lower()
{name: "Alice", email: "bob@example.com"}

Remove null fields from nested records to reduce event size:

from {user: {name: "alice", email: null, phone: null}}
drop_null_fields user
{user: {name: "alice"}}

Avoid using drop_null_fields without arguments on the top-level record. Each unique combination of present fields creates a distinct schema, leading to schema fragmentation that defeats the purpose of normalization. It also removes fields you just normalized.

Here’s a complete cleanup pipeline for raw firewall logs:

from_kafka "firewall-raw"
this = message.parse_json()
// Replace string nulls
replace what="N/A", with=null
replace what="-", with=null
replace what="", with=null
// Convert types
src_ip = src_ip?.ip()
dst_ip = dst_ip?.ip()
src_port = src_port?.int()
dst_port = dst_port?.int()
timestamp = timestamp?.time()
bytes = bytes?.int()
// Normalize booleans
blocked = action? == "block"
// Provide defaults
action = action? else "unknown"
protocol = protocol? else "unknown"
// Drop nulls from user context (e.g., domain, group may be missing)
drop_null_fields user
  1. Clean early: Apply cleanup transformations at the start of your pipeline before any business logic.

  2. Handle missing fields: Use ? when accessing fields that might not exist, and provide sensible defaults with else.

  3. Document your assumptions: Comment which sentinel values you’re normalizing and why.

  4. Test edge cases: Verify cleanup handles unusual values like empty strings, whitespace-only strings, and mixed case.

  5. Consider performance: The replace operator is efficient for bulk replacements, but be mindful of chaining many replacements.

Last updated: