@dpkit/table
The @dpkit/table package provides high-performance data validation and processing capabilities for tabular data. Built on top of nodejs-polars (a Rust-based DataFrame library), it offers robust schema validation, type inference, and error handling for CSV, Excel, and other tabular data formats.
Examples
Basic Table Validation
```typescript
import { DataFrame } from "nodejs-polars"
import { validateTable } from "@dpkit/table"
import type { Schema } from "@dpkit/core"

// Create a table from data
const table = DataFrame({
  id: [1, 2, 3],
  name: ["John", "Jane", "Bob"],
}).lazy()

// Define schema with constraints
const schema: Schema = {
  fields: [
    { name: "id", type: "integer", constraints: { required: true, unique: true } },
    { name: "name", type: "string", constraints: { required: true } },
    { name: "email", type: "string", constraints: { pattern: "^[^@]+@[^@]+\\.[^@]+$" } },
  ],
}

// Validate the table
const result = await validateTable(table, { schema })
console.log(result.valid)  // true/false
console.log(result.errors) // Array of validation errors
```
Schema Inference
```typescript
import { inferSchema } from "@dpkit/table"

// Automatically infer schema from data patterns
const table = DataFrame({
  id: ["1", "2", "3"],
  price: ["10.50", "25.00", "15.75"],
  date: ["2023-01-15", "2023-02-20", "2023-03-25"],
  active: ["true", "false", "true"],
}).lazy()

const inferredSchema = await inferSchema(table, {
  sampleSize: 100,     // Sample size for inference
  confidence: 0.9,     // Confidence threshold
  monthFirst: false,   // Date format preference
  commaDecimal: false, // Decimal separator preference
})

// Result: automatically detected integer, number, date, and boolean types
```
Field Matching Strategies
```typescript
// Subset matching - data can have extra fields
const schema: Schema = {
  fieldsMatch: "subset",
  fields: [
    { name: "id", type: "number" },
    { name: "name", type: "string" },
  ],
}

// Equal matching - field names must match regardless of order
const equalSchema: Schema = {
  fieldsMatch: "equal",
  fields: [
    { name: "id", type: "number" },
    { name: "name", type: "string" },
  ],
}
```
Table Processing
```typescript
import { processTable } from "@dpkit/table"

// Process table with schema (converts string columns to proper types)
const table = DataFrame({
  id: ["1", "2", "3"], // String data
  price: ["10.50", "25.00", "15.75"],
  active: ["true", "false", "true"],
  date: ["2023-01-15", "2023-02-20", "2023-03-25"],
}).lazy()

const schema: Schema = {
  fields: [
    { name: "id", type: "integer" },
    { name: "price", type: "number" },
    { name: "active", type: "boolean" },
    { name: "date", type: "date" },
  ],
}

const processedTable = await processTable(table, { schema })
const result = await processedTable.collect()

// Result will have properly typed columns:
// { id: 1, price: 10.50, active: true, date: Date('2023-01-15') }
// { id: 2, price: 25.00, active: false, date: Date('2023-02-20') }
// { id: 3, price: 15.75, active: true, date: Date('2023-03-25') }
```
Error Handling
```typescript
const result = await validateTable(table, { schema })

result.errors.forEach(error => {
  switch (error.type) {
    case "cell/required":
      console.log(`Required field missing in row ${error.rowNumber}: '${error.fieldName}'`)
      break
    case "cell/unique":
      console.log(`Duplicate value in row ${error.rowNumber}: '${error.cell}'`)
      break
    case "cell/pattern":
      console.log(`Pattern mismatch: '${error.cell}' doesn't match ${error.constraint}`)
      break
  }
})
```
Core Architecture
Table Type

The package uses LazyDataFrame from nodejs-polars as its core table representation, enabling lazy evaluation and efficient processing of large datasets through vectorized operations.
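A minimal sketch of what lazy evaluation buys you (scanCSV, filter, and collect are standard nodejs-polars APIs; data.csv is a placeholder path):

```typescript
import pl from "nodejs-polars"

// scanCSV builds a lazy query plan; the file is not read yet
const lazy = pl.scanCSV("data.csv")

// Transformations only extend the plan, nothing executes here
const filtered = lazy.filter(pl.col("price").gt(10.0))

// collect() optimizes the full plan and runs it in one vectorized pass
const df = await filtered.collect()
console.log(df.shape)
```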
Schema Integration
Integrates seamlessly with @dpkit/core schemas, bridging Data Package field definitions with Polars data types for comprehensive validation workflows.
Key Features
1. Multi-Level Validation System

Field-Level Validation:
- Type Validation: Converts and validates data types (string → integer, etc.)
- Name Validation: Ensures field names match schema requirements
- Constraint Validation: Enforces required, unique, enum, pattern, min/max values, and length constraints (see the sketch after this list)
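The basic validation example above already covers required, unique, and pattern; a sketch of the remaining constraint kinds, using the standard Table Schema constraint names (the field names here are invented):

```typescript
const schema: Schema = {
  fields: [
    // enum: the cell must be one of the listed values
    { name: "status", type: "string", constraints: { enum: ["active", "inactive"] } },
    // minimum/maximum: numeric range check
    { name: "age", type: "integer", constraints: { minimum: 0, maximum: 150 } },
    // minLength/maxLength: string length bounds
    { name: "code", type: "string", constraints: { minLength: 2, maxLength: 10 } },
  ],
}
```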
Table-Level Validation:
- Field Presence: Validates missing/extra fields based on flexible matching strategies
- Schema Compatibility: Ensures data structure aligns with schema definitions
Row-Level Validation:
- Primary Key Uniqueness: Validates unique identifiers
- Composite Keys: Supports multi-column unique constraints (sketched below)
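A sketch of a composite key, assuming dpkit honors the standard Table Schema primaryKey property (table is any LazyDataFrame, as in the examples above):

```typescript
const schema: Schema = {
  // The (region, date) pair must be unique across rows
  primaryKey: ["region", "date"],
  fields: [
    { name: "region", type: "string" },
    { name: "date", type: "date" },
    { name: "sales", type: "number" },
  ],
}

// A repeated (region, date) pair would surface as a row-level error
const result = await validateTable(table, { schema })
```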
2. Comprehensive Field Types
Primitive Types: `string`, `integer`, `number`, `boolean`
Temporal Types: `date`, `datetime`, `time`, `year`, `yearmonth`, `duration`
Spatial Types: `geopoint`, `geojson`
Complex Types: `array`, `list`, `object`
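For illustration, one schema touching several of these categories (the field names are invented; the type keywords are the ones listed above):

```typescript
const schema: Schema = {
  fields: [
    { name: "label", type: "string" },      // primitive
    { name: "count", type: "integer" },     // primitive
    { name: "created", type: "datetime" },  // temporal
    { name: "period", type: "yearmonth" },  // temporal
    { name: "location", type: "geopoint" }, // spatial
    { name: "tags", type: "array" },        // complex
  ],
}
```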
3. Smart Schema Inference
Automatically infers field types and formats from data using:
- Regex pattern matching with configurable confidence thresholds
- Locale-specific format detection (comma decimals, date formats; example after this list)
- Complex type recognition (objects, arrays, temporal data)
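For example, European-formatted input can be steered with the inference options shown earlier (the sample values are invented):

```typescript
const table = DataFrame({
  amount: ["1.234,56", "2.000,00"],      // comma decimal, dot grouping
  shipped: ["15-01-2023", "20-02-2023"], // day-first dates
}).lazy()

const schema = await inferSchema(table, {
  commaDecimal: true, // treat "," as the decimal separator
  monthFirst: false,  // read 15-01-2023 as 15 January 2023
})
```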
4. Flexible Field Matching Strategies
- exact: Fields must match exactly in order and count
- equal: Same fields in any order
- subset: Data must contain all schema fields (extras allowed)
- superset: Schema must contain all data fields
- partial: At least one field must match (example below)
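The strategy only changes the field-presence check, not cell validation. A sketch comparing two strategies on data with a column the schema does not declare (reusing the Schema type from @dpkit/core; the expected outcomes follow from the definitions above):

```typescript
// The data has an "extra" column that the schema does not declare
const table = DataFrame({
  id: [1, 2],
  name: ["a", "b"],
  extra: ["x", "y"],
}).lazy()

const fields: Schema["fields"] = [
  { name: "id", type: "integer" },
  { name: "name", type: "string" },
]

// subset: extra data fields are allowed -> presence check should pass
const ok = await validateTable(table, { schema: { fieldsMatch: "subset", fields } })

// exact: extra data fields violate the strategy -> presence check should fail
const notOk = await validateTable(table, { schema: { fieldsMatch: "exact", fields } })
```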
5. Advanced Data Processing
Format-Aware Parsing:
- Handles missing values at schema and field levels
- Supports group/decimal character customization (see the sketch after this list)
- Processes currency symbols and whitespace
- Parses complex formats (ISO dates, scientific notation)
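In Table Schema terms these knobs are missingValues (schema- or field-level) and decimalChar/groupChar on numeric fields; a sketch assuming dpkit follows the standard property names:

```typescript
const schema: Schema = {
  // Values treated as null anywhere in the table
  missingValues: ["", "N/A"],
  fields: [
    {
      name: "price",
      type: "number",
      decimalChar: ",", // "1.234,56" parses as 1234.56
      groupChar: ".",
    },
    {
      name: "note",
      type: "string",
      missingValues: ["-"], // field-level list overrides the schema-level one
    },
  ],
}
```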
Performance Optimizations:
- Sample-based validation for large datasets (see the sketch below)
- Lazy evaluation for memory efficiency
- Vectorized constraint checking
- Configurable error limits and batch processing
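Because tables are lazy, one cheap way to pre-check a very large input is to validate a slice first (head is a standard LazyDataFrame method; table and schema are from the processing example above, and any dpkit-native sampling options are not shown here):

```typescript
// Validate only the first 10,000 rows as a quick pre-check
const sample = table.head(10_000)
const quickResult = await validateTable(sample, { schema })
```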
Error Handling
Comprehensive Error Taxonomy
Field Errors:
- Name mismatches between schema and data
- Type conversion failures
- Missing or extra field violations
Cell Errors:
- Type validation failures with specific conversion details
- Constraint violations (required, unique, enum, pattern, range, length)
- Format parsing errors with problematic values
Row Errors:
- Unique key constraint violations
- Composite constraint failures
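Error types are namespaced by level (the cell/... names appear in the error-handling example above; field- and row-level errors are assumed here to follow the same field/ and row/ prefix convention), which makes bucketing a validation report straightforward:

```typescript
const buckets: Record<string, number> = {}

for (const error of result.errors) {
  // "cell/required" -> "cell", "row/unique" -> "row", and so on
  const level = error.type.split("/")[0]
  buckets[level] = (buckets[level] ?? 0) + 1
}

console.log(buckets) // e.g. { cell: 12, row: 1 }
```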
Error Details
Each error includes:
- Precise row and column locations
- Actual vs expected values
- Specific error type classification
- Actionable error messages for debugging
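A compact reporter built on the detail fields from the earlier error-handling example (rowNumber, fieldName, cell; it assumes all error variants carry comparable fields):

```typescript
for (const error of result.errors) {
  // Location plus the offending value, as surfaced in the validation examples above
  console.log(
    `[${error.type}] row ${error.rowNumber}, field '${error.fieldName}': got '${error.cell}'`
  )
}
```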