Architecture
This page provides a technical explanation of Pseudata’s architecture, core components, and design decisions. For a high-level overview, see the Concept page. For implementation details, see the Contributing guides.
System Architecture
Pseudata operates across three distinct phases:
Build Time (Code Generation)
TypeSpec definitions serve as the single source of truth, feeding multiple code generation pipelines:
- TypeSpec Definitions (*.tsp files) define models, primitives, and arrays
- JSON Schema Emitter generates schemas for model generation
- Custom Emitters generate language-specific code:
  - interface-emitter.js → Primitives interface
  - array-emitter.js → Array classes
  - resource-emitter.js → Embedded locale data
- Quicktype transforms JSON Schema into language-specific model classes
Runtime (Data Generation)
At runtime, components work together to generate deterministic data:
The flow shows:
- UserArray (or AddressArray, etc.) extends PseudoArray base class
- PseudoArray.at(index) creates Generator with (worldSeed, typeSeq)
- PseudoArray.at(index) instantiates Primitives with (Generator, index)
- Primitives uses Generator and accesses Resources to generate field values
- Model (User, Address, etc.) is composed from primitive values and returned
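The flow above can be sketched in Go. This is only an illustration under stated assumptions: the way worldSeed, typeSeq, and index are mixed into the generator state is invented for the sketch, `givenNames` stands in for Resources, and the PseudoArray base class is folded into `UserArray` because Go has no inheritance.

```go
package main

import "fmt"

// Generator is a stand-in for the PCG32 generator described on this page.
type Generator struct{ state uint64 }

// NewGenerator combines worldSeed and typeSeq. The mixing constant is an
// assumption made for this sketch, not Pseudata's actual scheme.
func NewGenerator(worldSeed, typeSeq uint64) *Generator {
	return &Generator{state: worldSeed ^ (typeSeq * 0x9E3779B97F4A7C15)}
}

// Next advances the state and returns a 32-bit value.
func (g *Generator) Next() uint32 {
	g.state = g.state*6364136223846793005 + 1442695040888963407
	x := uint32(((g.state >> 18) ^ g.state) >> 27)
	rot := uint32(g.state >> 59)
	return (x >> rot) | (x << ((-rot) & 31))
}

// Primitives produces field values from a Generator and an element index.
type Primitives struct {
	gen   *Generator
	index uint64
}

// NewPrimitives folds the index into the state (assumed) so that every
// element gets its own deterministic stream.
func NewPrimitives(gen *Generator, index uint64) *Primitives {
	gen.state += index * 0xD6E8FEB86659FD93
	return &Primitives{gen: gen, index: index}
}

// givenNames is a placeholder for embedded locale resources.
var givenNames = []string{"Ada", "Grace", "Linus"}

// Name picks a resource entry using the generator.
func (p *Primitives) Name() string {
	return givenNames[p.gen.Next()%uint32(len(givenNames))]
}

// User is the model composed from primitive values.
type User struct {
	Index uint64
	Name  string
}

// UserArray stores only the seeds, never the elements.
type UserArray struct {
	worldSeed uint64
	typeSeq   uint64
}

// At regenerates the element for index on every call.
func (a UserArray) At(index uint64) User {
	gen := NewGenerator(a.worldSeed, a.typeSeq)
	p := NewPrimitives(gen, index)
	return User{Index: index, Name: p.Name()}
}

func main() {
	users := UserArray{worldSeed: 42, typeSeq: 1}
	fmt.Println(users.At(500) == users.At(500)) // true: regenerated identically
}
```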
Testing (Verification)
Fixture-based tests ensure cross-language consistency:
- Generate fixtures from Go (golden reference)
- Verify all other languages match exactly
- Detect subtle bugs that unit tests miss
Core Components
Generator (PCG32)
Purpose: Deterministic pseudo-random number generation.
Algorithm: PCG32 (Permuted Congruential Generator) with 64-bit state.
Why PCG32?
- Portable: Simple to implement consistently across languages
- Fast: Linear congruential base with efficient permutation
- High-quality: Passes statistical randomness tests
- Deterministic: Same seed always produces identical sequence
- Small state: 64 bits vs. Mersenne Twister’s 2.5KB
Key Operations:
// Initialize with seed
state := seed

// Generate next uint32
state = state * 6364136223846793005 + 1442695040888963407
xorshifted := uint32(((state >> 18) ^ state) >> 27)
rot := uint32(state >> 59)
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31))

The state advances identically in Go, Java, Python, and TypeScript (using BigInt), ensuring perfect cross-language reproducibility.
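As a rough illustration, the key operations above can be wrapped in a small runnable Go type. The `Generator` struct, `NewGenerator`, and `Next` names are chosen for this sketch and may not match the SDK's actual API; the constants and bit operations are copied from the steps above.

```go
package main

import "fmt"

// Constants taken from the key operations above.
const (
	multiplier uint64 = 6364136223846793005
	increment  uint64 = 1442695040888963407
)

// Generator is a hypothetical minimal wrapper around the 64-bit state.
type Generator struct {
	state uint64
}

// NewGenerator seeds the state directly, as in the "Initialize with seed" step.
func NewGenerator(seed uint64) *Generator {
	return &Generator{state: seed}
}

// Next advances the state and returns the next 32-bit output.
func (g *Generator) Next() uint32 {
	g.state = g.state*multiplier + increment // wraps modulo 2^64
	xorshifted := uint32(((g.state >> 18) ^ g.state) >> 27)
	rot := uint32(g.state >> 59)
	return (xorshifted >> rot) | (xorshifted << ((-rot) & 31))
}

func main() {
	a, b := NewGenerator(42), NewGenerator(42)
	for i := 0; i < 3; i++ {
		x, y := a.Next(), b.Next()
		fmt.Println(x, x == y) // the second column is always true
	}
}
```

Ports to languages without native 64-bit unsigned integers need to reproduce the modulo 2^64 wraparound explicitly, which is why the TypeScript version relies on BigInt.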
SeedFrom
Purpose: Convert human-readable strings to uint64 seeds.
Algorithm: FNV-1a hash with Murmur3 Avalanche finalization.
Why this combination?
- FNV-1a: Fast initial hash, simple to implement
- Murmur3 Avalanche: Improves bit distribution, reduces collision clusters
- Portable: Consistent across all languages
- No dependencies: Pure algorithm, no external libraries
Implementation:
func SeedFrom(s string) uint64 {
    // FNV-1a hash
    hash := uint64(14695981039346656037) // FNV offset basis
    for _, b := range []byte(s) {
        hash ^= uint64(b)
        hash *= 1099511628211 // FNV prime
    }

    // Murmur3 Avalanche finalization
    hash ^= hash >> 33
    hash *= 0xff51afd7ed558ccd
    hash ^= hash >> 33
    hash *= 0xc4ceb9fe1a85ec53
    hash ^= hash >> 33

    return hash
}
This allows readable test scenarios: seed_from("test-scenario-1") produces the same seed everywhere.
PseudoID
Purpose: Encode seed and index into reversible UUID v8 identifiers.
Format: UUID v8 with custom layout
SSSSSSSS-SSSS-8SSS-vSTT-TTIIIIIIIII
S = WorldSeed bits
T = TypeSeq bits
I = Index bits
v = Variant nibble + Skip bits (reserved)
8 = Version (UUID v8)
Why UUID v8?
- Standard: RFC 9562 allows custom data
- Reversible: Can extract seed and index
- Unique: Seed + index pair ensures uniqueness
- Compatible: Works with existing UUID infrastructure
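To make the "Reversible" property concrete, here is a minimal round-trip sketch in Go. It uses a simplified, hypothetical bit layout (64 seed bits and 58 index bits packed around the version and variant fields, with no TypeSeq field), not Pseudata's actual layout; `Encode` and `Decode` are names invented for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexBits is what remains of the 122 custom bits after a 64-bit seed
// (an assumption of this sketch).
const indexBits = 58

// Encode packs seed and index into a UUID v8 string using a simplified,
// hypothetical layout.
func Encode(seed, index uint64) string {
	index &= 1<<indexBits - 1
	// High half: seed bits 63..16, version nibble 8, seed bits 15..4.
	hi := (seed>>16)<<16 | 0x8<<12 | (seed>>4)&0xFFF
	// Low half: variant bits 10, seed bits 3..0, then the index.
	lo := uint64(0b10)<<62 | (seed&0xF)<<indexBits | index
	return fmt.Sprintf("%08x-%04x-%04x-%04x-%012x",
		hi>>32, (hi>>16)&0xFFFF, hi&0xFFFF, lo>>48, lo&0xFFFFFFFFFFFF)
}

// Decode reverses Encode and recovers the seed and index.
func Decode(id string) (seed, index uint64, err error) {
	hex := strings.ReplaceAll(id, "-", "")
	if len(hex) != 32 {
		return 0, 0, fmt.Errorf("unexpected length %d", len(hex))
	}
	hi, err := strconv.ParseUint(hex[:16], 16, 64)
	if err != nil {
		return 0, 0, err
	}
	lo, err := strconv.ParseUint(hex[16:], 16, 64)
	if err != nil {
		return 0, 0, err
	}
	seed = (hi>>16)<<16 | (hi&0xFFF)<<4 | (lo>>indexBits)&0xF
	index = lo & (1<<indexBits - 1)
	return seed, index, nil
}

func main() {
	id := Encode(42, 1000)
	seed, index, err := Decode(id)
	fmt.Println(id, seed, index, err)
}
```

Because the version and variant bits sit at their standard positions, identifiers built this way are still accepted by tools that only check UUID shape.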
Encoding:
func (g *Generator) ID() string {
    // Extract seed and index bits
    seedHigh := (g.seed >> 32) & 0xFFFFFFFF
    seedLow := (g.seed >> 16) & 0xFFFF
    indexHigh := (g.index >> 48) & 0x0FFF
    indexLow := g.index & 0xFFFFFFFFFFFF

    // Format as UUID v8
    return fmt.Sprintf("%08x-%04x-8%03x-%04x-%012x",
        seedHigh, seedLow, indexHigh,
        0x8000 | ((indexLow >> 48) & 0x0FFF),
        indexLow & 0xFFFFFFFFFFFF)
}
Pseudo Arrays
Purpose: Array-like indexed access with near-zero memory overhead.
Concept: Arrays don’t store data; they regenerate it on demand using deterministic seeding.
How It Works:
type UserArray struct {
    seed uint64
}

func (a *UserArray) At(index int64) User {
    // Create generator with base seed + index
    gen := NewGenerator(a.seed, uint64(index))
    primitives := NewPrimitives(gen)

    // Generate user deterministically
    return User{
        ID:    primitives.ID(),
        Name:  primitives.Name(),
        Email: primitives.Email(),
        // ... other fields
    }
}
Benefits:
- Memory efficient: User[1_000_000] uses ~16 bytes
- Instant access: No pre-generation or loading time
- Reproducible: Same index always returns same object
- Language agnostic: Works identically everywhere
Trade-off: CPU time for memory. Accessing user[500] recalculates it each time. For frequent access, cache the result.
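One way to add such a cache is a thin memoizing wrapper around any deterministic At function. The sketch below is an assumption about how an application might do this, not part of the SDK; `CachedArray` and `NewCachedArray` are hypothetical names, and the cache is unbounded and not safe for concurrent use.

```go
package main

import "fmt"

// CachedArray memoizes the results of a deterministic At function.
type CachedArray[T any] struct {
	at    func(index int64) T
	cache map[int64]T
}

// NewCachedArray wraps an existing At function with an in-memory cache.
func NewCachedArray[T any](at func(index int64) T) *CachedArray[T] {
	return &CachedArray[T]{at: at, cache: make(map[int64]T)}
}

// At returns the cached element if present, otherwise generates and stores it.
func (c *CachedArray[T]) At(index int64) T {
	if v, ok := c.cache[index]; ok {
		return v
	}
	v := c.at(index)
	c.cache[index] = v
	return v
}

func main() {
	calls := 0
	users := NewCachedArray(func(index int64) string {
		calls++ // stands in for the cost of regenerating a user
		return fmt.Sprintf("user-%d", index)
	})
	users.At(500)
	users.At(500)
	fmt.Println(calls) // 1: the second access is served from the cache
}
```

Since generation is deterministic, the cache never needs invalidation; the only decision is how many entries to keep.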
Resource System
Purpose: Provide locale-specific data (names, cities, etc.) for realistic generation.
Atomic Module Structure:
typespec/resources/
├── general/                  # Global resources (all locales)
│   └── email_domains.txt
├── lang/                     # Language-specific
│   ├── en/
│   │   ├── months.txt
│   │   └── weekdays.txt
│   └── fr/
│       ├── months.txt
│       └── weekdays.txt
├── country/                  # Country-specific
│   ├── us/
│   │   ├── address_format.txt
│   │   └── street_format.txt
│   └── ca/
│       ├── address_format.txt
│       └── street_format.txt
└── locale/                   # Locale-specific
    ├── en_us/
    │   ├── given_male_names.txt
    │   ├── given_female_names.txt
    │   ├── family_names.txt
    │   ├── cities.txt
    │   └── streets.txt
    └── fr_ca/
        ├── given_male_names.txt
        └── ...
Note: All directory and file names use lowercase (e.g., en_us, not en_US) for cross-platform consistency.
Loading Strategy:
- Build time: resource-emitter.js discovers and embeds resources into atomic modules
- Runtime: SDK provides resources as separate importable modules for tree-shaking
- Tree-shaking: Import only the bundles you need (e.g., US bundle = ~50KB vs World = ~750KB)
- No fallback: Each locale must have complete resource files
Generated Structure (TypeScript example):
// Atomic modules
import { data as generalData } from "./resources/general/data.js";
import { data as langenData } from "./resources/lang/en/data.js";
import { data as countryusData } from "./resources/country/us/data.js";
import { data as localeenusData } from "./resources/locale/en_us/data.js";

// Pre-composed bundles
import { ResourcesUS } from "./resources/bundles/us.js"; // ~50KB
import { ResourcesWorld } from "./resources/bundles/world.js"; // ~750KB
Access Pattern:
data := p.resources()
name := data.givenMaleNames[p.generator.Intn(len(data.givenMaleNames))]
Resources are embedded at compile time with atomic modules enabling tree-shaking, ensuring no runtime file I/O and optimal bundle sizes. Each locale is self-contained with all required resource files.
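The access pattern can be tried in isolation with a small Go sketch. The `Resources` struct, its sample values, and the `pick` helper are placeholders, and reducing the random value with a plain modulo is an assumption about how a bounded pick might be implemented.

```go
package main

import "fmt"

// Resources mirrors a tiny slice of one locale's embedded data.
// Field names and values are placeholders for this sketch.
type Resources struct {
	GivenMaleNames []string
	FamilyNames    []string
}

var enUS = Resources{
	GivenMaleNames: []string{"James", "John", "Robert"},
	FamilyNames:    []string{"Smith", "Johnson", "Williams"},
}

// pick selects one entry using the next random value. Plain modulo is an
// assumption; it has a slight bias but is trivial to port identically.
func pick(next func() uint32, list []string) string {
	return list[next()%uint32(len(list))]
}

func main() {
	// A fixed sequence stands in for the generator's output.
	values := []uint32{7, 4}
	i := 0
	next := func() uint32 { v := values[i]; i++; return v }

	fmt.Println(pick(next, enUS.GivenMaleNames), pick(next, enUS.FamilyNames))
	// prints "John Johnson" (7 % 3 = 1, 4 % 3 = 1)
}
```

Whatever reduction is used, it has to be the same in every SDK, because the selected entry depends on both the random value and the list length.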
Code Generation Pipeline
Why Code Generation?
Single Source of Truth: TypeSpec definitions prevent drift between languages.
Consistency: Generated code follows identical patterns.
Maintainability: Changes propagate automatically to all SDKs.
Type Safety: Models are strongly typed in each language.
Pipeline Stages
1. TypeSpec → JSON Schema
Standard TypeSpec compiler generates JSON Schema for models:
@model
model User {
    @generator("id") id: string;
    @generator("name") name: string;
    @generator("email") email: string;
}
Output: User.json schema for quicktype consumption.
2. Custom Emitters → Interfaces/Arrays
Three custom emitters generate SDK code:
interface-emitter.js:
- Reads Primitives interface from TypeSpec
- Generates language-specific interface declarations
- Maps TypeSpec types to native types (int64 → long/i64/number)
array-emitter.js:
- Finds models decorated with @array(typeSequence)
- Generates <Model>Array classes extending PseudoArray
- Implements at(index) using primitives + generators
resource-emitter.js:
- Scans typespec/resources/ directory
- Discovers locales automatically (no registration needed)
- Embeds resource data as language-specific constants
3. Quicktype → Models
Quicktype transforms JSON Schema into idiomatic models:
- Go: structs with json tags
- Java: classes with Jackson annotations
- Python: dataclasses
- TypeScript: interfaces
Why quicktype? TypeSpec supports only a handful of languages natively, while quicktype supports 20+ languages including C#, PHP, Rust, Swift, Dart, and many others. Using JSON Schema as an intermediate format allows Pseudata to leverage quicktype’s extensive language support for model generation.
Build Command
task generate
Runs all emitters and code generators, producing:
sdks/
├── go/
│   ├── primitives.go       # generated by interface-emitter
│   ├── arrays.go           # generated by array-emitter
│   ├── resources.go        # generated by resource-emitter
│   └── models.go           # generated by quicktype
├── python/
│   └── pseudata/
│       ├── primitives.py   # generated by interface-emitter
│       ├── arrays.py       # generated by array-emitter
│       ├── resources.py    # generated by resource-emitter
│       └── models.py       # generated by quicktype
└── ...
Testing Strategy
Fixture-Based Testing
Problem: How to verify cross-language consistency?
Solution: Pre-generated test vectors (fixtures) as single source of truth.
The Golden Reference Pattern
- Implement in Go first (chosen as reference language)
- Run tests with the -update flag to generate fixture JSON
- All other languages load the same fixtures and verify outputs match
Example Fixture:
{
  "tests": [
    {
      "seed": "42",
      "index": 0,
      "expected": {
        "id": "0000002a-0000-8000-8000-000000000000",
        "name": "John Smith",
        "email": "john.smith@example.com"
      }
    }
  ]
}
What Fixtures Catch
Real bugs caught by fixture testing:
- RNG State Bugs: Calling rng() multiple times in a loop
- UTF-8 Handling: Byte slicing corrupting multi-byte characters
- Type Mapping: Signed vs. unsigned integer handling
- Resource Access: Wrong locale or missing fallback
- Bitwise Operations: UUID encoding differences
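The UTF-8 item is easy to reproduce. The following Go sketch, with helper names invented for the example, contrasts cutting a string at a byte offset with cutting it at a character boundary.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes cuts at a byte offset and may split a multi-byte character.
// It assumes n >= 0.
func truncateBytes(s string, n int) string {
	if n > len(s) {
		n = len(s)
	}
	return s[:n]
}

// truncateRunes cuts at a character boundary by counting runes instead.
// It assumes n >= 0.
func truncateRunes(s string, n int) string {
	r := []rune(s)
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	name := "Zo\u00eb M\u00fcller" // "Zoë Müller"; ë occupies two bytes in UTF-8
	fmt.Println(utf8.ValidString(truncateBytes(name, 3))) // false: the cut splits ë
	fmt.Println(truncateRunes(name, 3))                   // Zoë
}
```

A language whose strings are indexed by code point or UTF-16 unit would produce a different prefix for the same call, which is the kind of divergence a shared fixture exposes.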
Test Organization
Section titled “Test Organization”fixtures/├── pcg32_test_vectors.json # Generator tests├── seedfrom_test_vectors.json # SeedFrom tests├── id_utils_test_vectors.json # PseudoID tests├── primitives_test_vectors.json # All primitives├── array_user_test_vectors.json # User array└── array_address_test_vectors.jsonEach SDK loads these fixtures and verifies exact matches. See the Testing Guide for implementation details.
Design Decisions
PCG32 over Other PRNGs
Alternatives considered: Mersenne Twister, xoroshiro128+, SplitMix64, Mulberry32
Why PCG32?
- Portability: Simpler to implement correctly across languages than MT19937
- Size: 64-bit state vs. MT’s 2.5KB
- Quality: Sufficient for mock data (not cryptographic needs)
- Speed: Faster than MT, comparable to xoroshiro
- Better than Mulberry32: 64-bit state vs. 32-bit, longer period, better statistical properties
- Proven: Widely used, well-tested
Pseudo-Arrays over Pre-generation
Alternatives considered: Pre-generate and serialize, lazy generation with caching
Why Pseudo Arrays?
- Memory efficiency: Testing pagination with millions of items
- Instant startup: No loading or deserialization time
- Simplicity: No cache invalidation logic
- Determinism: Recalculation guarantees consistency
Trade-off: CPU time on repeated access. Applications can add their own caching layer if needed.
Code Generation over Manual Implementation
Alternatives considered: Hand-write each SDK, shared C library with bindings, runtime code generation
Why code generation?
- Single source of truth: TypeSpec definitions prevent drift between languages
- Consistency: Generated code follows identical patterns and structure
- Maintainability: Changes propagate automatically to all SDKs
- Type safety: Strong typing in each language’s native system
- No runtime overhead: Generated code is native, not interpreted
- Language idioms: Each SDK uses language-specific patterns and conventions
- Compile-time validation: Errors caught during build, not at runtime
Trade-off: Requires build step, but ensures correctness and eliminates manual synchronization burden across 9+ languages.
TypeSpec over Other IDLs
Alternatives considered: Protocol Buffers, Thrift, OpenAPI, JSON Schema
Why TypeSpec?
- Designed for code generation: First-class emitter API
- Extensible: Custom decorators (@array, @generator)
- Modern: TypeScript-based, active development
- JSON Schema output: Can leverage quicktype
- Microsoft backing: Long-term support expected
Fixtures over Hope
Alternatives considered: Property-based testing, manual verification, parallel implementations
Why fixtures?
- Concrete verification: Exact expected outputs, not properties
- Debugging: Easy to reproduce failures with specific test cases
- Comprehensive: Cover edge cases explicitly
- Fast: No complex property generation or shrinking
- Simple: JSON files are human-readable
Process: Go generates fixtures → All languages verify → Commit fixtures to git
© 2025 Pseudata Project. Open Source under Apache License 2.0.