
OVERVIEW

The @dpkit/table package provides high-performance data validation and processing for tabular data. Built on top of nodejs-polars (Node.js bindings to the Rust-based Polars DataFrame library), it offers robust schema validation, type inference, and error reporting for CSV, Excel, and other tabular data formats.

import { DataFrame } from "nodejs-polars"
import { validateTable } from "@dpkit/table"
import type { Schema } from "@dpkit/core"

// Create a table from data
const table = DataFrame({
  id: [1, 2, 3],
  name: ["John", "Jane", "Bob"],
}).lazy()

// Define schema with constraints
const schema: Schema = {
  fields: [
    { name: "id", type: "integer", constraints: { required: true, unique: true } },
    { name: "name", type: "string", constraints: { required: true } },
    { name: "email", type: "string", constraints: { pattern: "^[^@]+@[^@]+\\.[^@]+$" } }
  ]
}

// Validate the table
const result = await validateTable(table, { schema })
console.log(result.valid) // true/false
console.log(result.errors) // Array of validation errors
import { inferSchema } from "@dpkit/table"

// Automatically infer schema from data patterns
const table = DataFrame({
  id: ["1", "2", "3"],
  price: ["10.50", "25.00", "15.75"],
  date: ["2023-01-15", "2023-02-20", "2023-03-25"],
  active: ["true", "false", "true"]
}).lazy()

const inferredSchema = await inferSchema(table, {
  sampleSize: 100, // Sample size for inference
  confidence: 0.9, // Confidence threshold
  monthFirst: false, // Date format preference
  commaDecimal: false // Decimal separator preference
})
// Result: automatically detected integer, number, date, and boolean types
// Subset matching - data can have extra fields
const subsetSchema: Schema = {
  fieldsMatch: "subset",
  fields: [
    { name: "id", type: "number" },
    { name: "name", type: "string" }
  ]
}

// Equal matching - field names must match regardless of order
const equalSchema: Schema = {
  fieldsMatch: "equal",
  fields: [
    { name: "id", type: "number" },
    { name: "name", type: "string" }
  ]
}
import { processTable } from "@dpkit/table"

// Process table with schema (converts string columns to proper types)
const table = DataFrame({
  id: ["1", "2", "3"], // String data
  price: ["10.50", "25.00", "15.75"],
  active: ["true", "false", "true"],
  date: ["2023-01-15", "2023-02-20", "2023-03-25"]
}).lazy()

const schema: Schema = {
  fields: [
    { name: "id", type: "integer" },
    { name: "price", type: "number" },
    { name: "active", type: "boolean" },
    { name: "date", type: "date" }
  ]
}

const processedTable = await processTable(table, { schema })
const result = await processedTable.collect()
// Result will have properly typed columns:
// { id: 1, price: 10.50, active: true, date: Date('2023-01-15') }
// { id: 2, price: 25.00, active: false, date: Date('2023-02-20') }
// { id: 3, price: 15.75, active: true, date: Date('2023-03-25') }
const result = await validateTable(table, { schema })

result.errors.forEach(error => {
  switch (error.type) {
    case "cell/required":
      console.log(`Required field missing in row ${error.rowNumber}: '${error.fieldName}'`)
      break
    case "cell/unique":
      console.log(`Duplicate value in row ${error.rowNumber}: '${error.cell}'`)
      break
    case "cell/pattern":
      console.log(`Pattern mismatch: '${error.cell}' doesn't match ${error.constraint}`)
      break
  }
})

The package uses LazyDataFrame from nodejs-polars as its core table representation, enabling lazy evaluation and efficient processing of large datasets through vectorized operations.
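
As a sketch of how this plays out in practice, a large CSV file can be scanned lazily with nodejs-polars and passed straight to validateTable; scanCSV is a standard nodejs-polars call rather than part of @dpkit/table, and the file path is a hypothetical placeholder.

import pl from "nodejs-polars"
import { validateTable } from "@dpkit/table"
import type { Schema } from "@dpkit/core"

// Hypothetical path; scanCSV returns a LazyDataFrame, so no rows are
// materialized until the validator actually needs them
const table = pl.scanCSV("measurements.csv")

const schema: Schema = {
  fields: [
    { name: "station", type: "string" },
    { name: "value", type: "number" }
  ]
}

const report = await validateTable(table, { schema })
console.log(report.valid)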

The package integrates seamlessly with @dpkit/core schemas, bridging Data Package field definitions with Polars data types for comprehensive validation workflows.

Field-Level Validation:

  • Type Validation: Converts and validates data types (string → integer, etc.)
  • Name Validation: Ensures field names match schema requirements
  • Constraint Validation: Enforces required, unique, enum, pattern, min/max values, and length constraints (sketched below)
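
A minimal sketch of constraint validation, assuming the standard Table Schema constraint names (enum, minimum, maximum, minLength, maxLength) are accepted alongside the required, unique, and pattern constraints used in the examples above:

const schema: Schema = {
  fields: [
    // Assumed: enum restricts a cell to a fixed set of values
    { name: "status", type: "string", constraints: { enum: ["new", "open", "closed"] } },
    // Assumed: minimum/maximum bound numeric cells
    { name: "score", type: "integer", constraints: { minimum: 0, maximum: 100 } },
    // Assumed: minLength/maxLength bound string length
    { name: "code", type: "string", constraints: { minLength: 3, maxLength: 8 } }
  ]
}

const result = await validateTable(table, { schema })
// Violations surface as cell-level errors (out-of-range, non-enum, too long, ...)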

Table-Level Validation:

  • Field Presence: Validates missing/extra fields based on flexible matching strategies
  • Schema Compatibility: Ensures data structure aligns with schema definitions
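
For example, validating a table that lacks a field declared in the schema fails under the default matching behavior; only the overall valid flag is inspected here, since the exact error type name is not documented above.

const table = DataFrame({
  id: [1, 2, 3]
}).lazy()

const schema: Schema = {
  fields: [
    { name: "id", type: "integer" },
    { name: "email", type: "string" } // declared in the schema but absent from the data
  ]
}

const result = await validateTable(table, { schema })
console.log(result.valid) // false: the data is missing a declared field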

Row-Level Validation:

  • Primary Key Uniqueness: Validates unique identifiers
  • Composite Keys: Supports multi-column unique constraints (see the sketch below)
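
A sketch of row-level validation, assuming the schema-level primaryKey property from the Table Schema specification is supported as-is; a composite key spans several columns.

const schema: Schema = {
  // Assumed: primaryKey follows the Table Schema specification
  primaryKey: ["region", "year"], // each (region, year) pair must be unique
  fields: [
    { name: "region", type: "string" },
    { name: "year", type: "year" },
    { name: "population", type: "integer" }
  ]
}

const result = await validateTable(table, { schema })
// Duplicate (region, year) pairs are reported as row-level errors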

Primitive Types:

  • string, integer, number, boolean

Temporal Types:

  • date, datetime, time, year, yearmonth, duration

Spatial Types:

  • geopoint, geojson

Complex Types:

  • array, list, object
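
As an illustration, a schema can mix these types; the sample values in the comments follow Data Package / Table Schema conventions and are only indicative.

const schema: Schema = {
  fields: [
    { name: "recorded_at", type: "datetime" }, // e.g. "2023-01-15T10:30:00Z"
    { name: "duration", type: "duration" },    // e.g. "P1DT2H" (ISO 8601 duration)
    { name: "location", type: "geopoint" },    // e.g. "90.50, 45.50" (lon, lat)
    { name: "tags", type: "array" },           // e.g. '["a", "b"]' (JSON array)
    { name: "metadata", type: "object" }       // e.g. '{"key": "value"}' (JSON object)
  ]
}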

Automatically infers field types and formats from data using:

  • Regex pattern matching with configurable confidence thresholds
  • Locale-specific format detection (comma decimals, date formats)
  • Complex type recognition (objects, arrays, temporal data)

Field Matching Strategies:

  • exact: Fields must match exactly in order and count
  • equal: Same fields in any order
  • subset: Data must contain all schema fields (extras allowed)
  • superset: Schema must contain all data fields
  • partial: At least one field must match

Format-Aware Parsing:

  • Handles missing values at schema and field levels
  • Supports group/decimal character customization
  • Processes currency symbols and whitespace
  • Parses complex formats (ISO dates, scientific notation)
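
A hedged sketch of these parsing options, assuming the Table Schema properties missingValues, decimalChar, groupChar, and bareNumber are passed through as defined by the Data Package standard:

const schema: Schema = {
  missingValues: ["", "NA", "n/a"], // schema-level missing value markers
  fields: [
    {
      name: "price",
      type: "number",
      decimalChar: ",", // e.g. "1.234,56" uses a comma decimal
      groupChar: ".",   // and a dot as the thousands separator
      bareNumber: false // strip currency symbols and whitespace, e.g. "€ 1.234,56"
    },
    {
      name: "updated",
      type: "date",
      missingValues: ["unknown"] // field-level override of missing values
    }
  ]
}

const processed = await processTable(table, { schema })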

Performance Optimizations:

  • Sample-based validation for large datasets
  • Lazy evaluation for memory efficiency
  • Vectorized constraint checking
  • Configurable error limits and batch processing
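
Sample-based validation of a very large file can also be approximated on the caller side by limiting the lazy frame before validating; scanCSV and limit are standard nodejs-polars calls, and the file path is a hypothetical placeholder.

import pl from "nodejs-polars"
import { validateTable } from "@dpkit/table"
import type { Schema } from "@dpkit/core"

const schema: Schema = {
  fields: [
    { name: "event", type: "string" },
    { name: "count", type: "integer" }
  ]
}

// Hypothetical path; limit() keeps the scan lazy and bounds how many rows
// flow through the validation pipeline
const sample = pl.scanCSV("events.csv").limit(10000)

const report = await validateTable(sample, { schema })
console.log(report.valid, report.errors.length)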

Field Errors:

  • Name mismatches between schema and data
  • Type conversion failures
  • Missing or extra field violations

Cell Errors:

  • Type validation failures with specific conversion details
  • Constraint violations (required, unique, enum, pattern, range, length)
  • Format parsing errors with problematic values

Row Errors:

  • Unique key constraint violations
  • Composite constraint failures

Each error includes:

  • Precise row and column locations
  • Actual vs expected values
  • Specific error type classification
  • Actionable error messages for debugging
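
These pieces can be assembled into a simple report; the fields used below (type, rowNumber, fieldName, cell) appear in the error-handling example above, while any additional fields vary by error type.

function formatError(error: { type: string; rowNumber?: number; fieldName?: string; cell?: string }) {
  const location = [
    error.rowNumber !== undefined ? `row ${error.rowNumber}` : null,
    error.fieldName ? `field '${error.fieldName}'` : null,
  ]
    .filter(Boolean)
    .join(", ")

  const value = error.cell !== undefined ? ` (value: '${error.cell}')` : ""
  return `[${error.type}] ${location}${value}`
}

const result = await validateTable(table, { schema })
for (const error of result.errors) {
  console.log(formatError(error))
}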