Data Generation#

pyrudof includes bindings for rudof_generate, a module that generates synthetic RDF data from ShEx or SHACL schemas. This is useful for testing, benchmarking, and creating sample datasets.

Overview#

The data generation module provides:

  • Schema-driven generation: Create data that conforms to your ShEx or SHACL schemas

  • Reproducible results: Use seeds for deterministic generation

  • Parallel processing: Generate large datasets efficiently

  • Quality control: Configure data quality from simple random values (Low) to complex, correlated data (High)

  • Flexible output: Support for Turtle and N-Triples formats

Basic Usage#

The simplest way to generate data:

import pyrudof

# 1. Configure
config = pyrudof.GeneratorConfig()
config.set_entity_count(100)
config.set_output_path("output.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)

# 2. Create generator
generator = pyrudof.DataGenerator(config)

# 3. Load schema and generate
generator.run("schema.shex")
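
The example above assumes that a file named schema.shex already exists. As a minimal sketch, the following writes a small ShEx schema to disk so there is something to load. The Person shape, its properties, the http://example.org/ namespace, and the write_sample_schema helper are illustrative assumptions, not part of pyrudof; this page does not state which ShEx features the generator supports, so treat the schema as a starting point only.

```python
from pathlib import Path

# Hypothetical sample schema in ShEx compact syntax; not shipped with pyrudof.
SAMPLE_SHEX = """\
PREFIX : <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

:Person {
  :name  xsd:string ;
  :age   xsd:integer ? ;
  :knows @:Person *
}
"""


def write_sample_schema(path="schema.shex"):
    """Write the sample schema to `path` and return it as a Path object."""
    target = Path(path)
    target.write_text(SAMPLE_SHEX, encoding="utf-8")
    return target
```

Calling write_sample_schema() before generator.run("schema.shex") gives the basic example a schema with one required property, one optional property, and one repeatable self-reference.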

Step-by-Step Generation#

You can also load schemas and generate data in separate steps:

import pyrudof

config = pyrudof.GeneratorConfig()
generator = pyrudof.DataGenerator(config)

# Load the schema with exactly one of the following calls
generator.load_shex_schema("schema.shex")
# generator.load_shacl_schema("shapes.ttl")  # SHACL alternative
# generator.load_schema_auto("schema_file")  # auto-detect the format

# Then generate
generator.generate()

Configuration#

Configuration from Python#

config = pyrudof.GeneratorConfig()

# Basic settings
config.set_entity_count(1000)
config.set_output_path("/tmp/generated_data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)

# Reproducibility
config.set_seed(42)

# Data quality
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en")  # Use English locale for generated text

# Cardinality handling
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)

# Performance
config.set_worker_threads(4)
config.set_batch_size(100)
config.set_parallel_writing(True)
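
When settings arrive as a dictionary (for example, parsed command-line options), the setter calls above can be driven by a loop. The sketch below relies only on the set_<name> naming pattern visible in this section; apply_settings is a hypothetical helper, not part of pyrudof, and it works on any object that exposes such setters.

```python
def apply_settings(config, settings):
    """Call config.set_<key>(value) for every entry in `settings`.

    Raises AttributeError when a key has no matching setter, so typos
    surface immediately instead of being silently ignored.
    """
    for key, value in settings.items():
        setter = getattr(config, f"set_{key}", None)
        if setter is None:
            raise AttributeError(f"no setter named set_{key}")
        setter(value)
    return config
```

With pyrudof installed, apply_settings(pyrudof.GeneratorConfig(), {"entity_count": 1000, "seed": 42, "batch_size": 100}) is intended to have the same effect as calling the three setters by hand.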

Reproducible Generation#

Use a seed for reproducible results:

config = pyrudof.GeneratorConfig()
config.set_seed(42)
config.set_entity_count(50)

generator = pyrudof.DataGenerator(config)
generator.run("schema.shex")

# Running again with the same seed produces identical output

Note

Setting a seed ensures that the same configuration always generates the same data, which is essential for reproducible testing and benchmarking.
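
One way to check reproducibility in a test suite is to generate twice with the same seed into two different output paths and compare the files. The following is a sketch using only the Python standard library; file_digest and outputs_match are hypothetical helper names. A byte-level comparison is stricter than RDF graph equality, so it would report a mismatch if two runs emitted the same triples in a different order.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large generated files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(first, second):
    """Return True when both files are byte-for-byte identical."""
    return file_digest(first) == file_digest(second)
```

A typical use would be to run the generator once with set_output_path("run1.ttl") and once with set_output_path("run2.ttl"), keeping set_seed(42) in both, and then assert outputs_match("run1.ttl", "run2.ttl").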

Cardinality Strategies#

Control how cardinalities are handled when generating relationships:

Strategy  Description
--------  ----------------------------------------------------
Minimum   Generate the minimum number of relationships allowed
Maximum   Generate the maximum number of relationships allowed
Random    Generate a random number within the valid range
Balanced  Use a balanced distribution (default, recommended)

config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)

Example with different strategies:

# Minimum relationships (faster, smaller output)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Minimum)

# Maximum relationships (slower, larger output, tests edge cases)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Maximum)
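
To make the four strategies concrete, the sketch below models how each one might choose a count for a property with cardinality {minimum, maximum}. This is an illustration only and not pyrudof's implementation: pick_count is a hypothetical function, the strategies are passed as plain strings, and the treatment of Balanced as the midpoint of the range is an assumption, since this page does not define the balanced distribution.

```python
import random


def pick_count(strategy, minimum, maximum, rng=None):
    """Illustrative model of the four strategies; NOT pyrudof's algorithm."""
    rng = rng or random.Random()
    if strategy == "Minimum":
        return minimum
    if strategy == "Maximum":
        return maximum
    if strategy == "Random":
        # randint is inclusive on both ends.
        return rng.randint(minimum, maximum)
    if strategy == "Balanced":
        # Assumption: "balanced" is modeled here as the midpoint of the range.
        return (minimum + maximum) // 2
    raise ValueError(f"unknown strategy: {strategy}")
```

The model only covers bounded ranges; how an unbounded maximum is handled is not described here and is left out of the sketch.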

Data Quality Levels#

Configure the realism and complexity of generated data:

Level   Characteristics           Use Case
------  ------------------------  --------------------------------------
Low     Fast, simple random data  Quick testing, performance benchmarks
Medium  Realistic patterns        Integration testing, demos
High    Complex, correlated data  Production-like testing, presentations

# High-quality data with correlations
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("es")  # Spanish locale

Tip

Use DataQuality.Low for quick tests and performance benchmarks, DataQuality.Medium for demos and integration testing, and DataQuality.High when you need production-like data.

Parallel Processing#

Enable parallel processing for faster generation of large datasets:

config = pyrudof.GeneratorConfig()
config.set_entity_count(10000)

# Enable parallelization
config.set_worker_threads(4)  # Use 4 worker threads
config.set_batch_size(100)    # Process 100 entities per batch
config.set_parallel_shapes(True)   # Parallel shape processing
config.set_parallel_fields(True)   # Parallel field generation
config.set_parallel_writing(True)  # Parallel output writing
config.set_parallel_file_count(4)  # Write to 4 files simultaneously

generator = pyrudof.DataGenerator(config)
generator.run("large_schema.shex")

Warning

Using parallel writing creates multiple output files. You’ll need to merge them manually if you need a single file.
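
For N-Triples output, merging can be plain concatenation, because every line is a self-contained triple. The sketch below uses only the standard library; merge_ntriples is a hypothetical helper, and the names of the part files are left to the caller because this page does not document the naming scheme. Two caveats: Turtle parts should not be merged this way, since prefix declarations may differ between files, and if the parts contain blank nodes with overlapping labels, concatenation will treat them as the same node.

```python
from pathlib import Path


def merge_ntriples(parts, merged_path):
    """Concatenate N-Triples files into one file.

    Returns the number of non-blank lines written.
    """
    written = 0
    with Path(merged_path).open("w", encoding="utf-8") as out:
        for part in parts:
            with Path(part).open("r", encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    # Guarantee a newline even if a part lacks a trailing one.
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written
```

Pass the part files in a stable order (for example, sorted by name) if the merged file needs to be reproducible.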

Output Formats#

Supported output formats:

  • Turtle (OutputFormat.Turtle) - Human-readable, compact (default)

  • N-Triples (OutputFormat.NTriples) - Line-based, simple format

# Turtle format (default)
config.set_output_format(pyrudof.OutputFormat.Turtle)

# N-Triples format (useful for streaming processing)
config.set_output_format(pyrudof.OutputFormat.NTriples)

# Enable compression
config.set_compress(True)  # Creates .ttl.gz or .nt.gz

# Generate statistics file
config.set_write_stats(True)  # Creates output.stats.json
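
Compressed output can be inspected without unpacking it first. The sketch below counts the statements in a compressed N-Triples file, assuming the .nt.gz file is a standard gzip stream (suggested by the extension, but not stated explicitly on this page). count_ntriples_lines is a hypothetical helper, and counting lines equals counting triples only because N-Triples puts one triple on each line.

```python
import gzip


def count_ntriples_lines(path):
    """Count non-empty, non-comment lines in a gzip-compressed N-Triples file."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

The same approach does not give a triple count for Turtle, where a single statement can span several lines.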

Advanced Example#

A complete example that combines most of the options above:

import pyrudof

# Create configuration
config = pyrudof.GeneratorConfig()

# Generation settings
config.set_entity_count(5000)
config.set_seed(42)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en")

# Output settings
config.set_output_path("./output/data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
config.set_compress(True)
config.set_write_stats(True)

# Performance settings
config.set_worker_threads(8)
config.set_batch_size(500)
config.set_parallel_shapes(True)
config.set_parallel_fields(True)

# Validate configuration
config.validate()

# Create generator and run
generator = pyrudof.DataGenerator(config)
generator.run_with_format("schema.shex", pyrudof.SchemaFormat.ShEx)

print("Generation complete!")
print(f"Configuration: {config.show()}")

API Reference#

Generator Configuration#

class pyrudof.GeneratorConfig#

Python wrapper for GeneratorConfig from rudof_generate.

Provides access to configuration options for synthetic data generation.

classmethod __new__(*args, **kwargs)#
static from_json_file(path)#

Load configuration from a JSON file.

Parameters:

path (str) – Path to the JSON configuration file.

Returns:

Loaded configuration object.

Return type:

GeneratorConfig

Raises:

ValueError – If the file cannot be read or parsed.

static from_toml_file(path)#

Load configuration from a TOML file.

Parameters:

path (str) – Path to the TOML configuration file.

Returns:

Loaded configuration object.

Return type:

GeneratorConfig

Raises:

ValueError – If the file cannot be read or parsed.

get_batch_size()#

Get batch size for parallel processing.

Returns:

Batch size.

Return type:

int

get_compress()#

Get whether output is compressed.

Returns:

True if compression is enabled.

Return type:

bool

get_entity_count()#

Get the number of entities to generate.

Returns:

Number of entities.

Return type:

int

get_locale()#

Get locale for field generation.

Returns:

Current locale.

Return type:

str

get_output_path()#

Get the output file path.

Returns:

Output path.

Return type:

str

get_parallel_fields()#

Get whether parallel field generation is enabled.

Returns:

True if enabled.

Return type:

bool

get_parallel_file_count()#

Get number of parallel output files.

Returns:

Number of files.

Return type:

int

get_parallel_shapes()#

Get whether parallel shape processing is enabled.

Returns:

True if enabled.

Return type:

bool

get_parallel_writing()#

Get whether parallel writing is enabled.

Returns:

True if enabled.

Return type:

bool

get_seed()#

Get the random seed.

Returns:

Seed value.

Return type:

Optional[int]

get_worker_threads()#

Get number of worker threads.

Returns:

Worker threads.

Return type:

Optional[int]

get_write_stats()#

Get whether statistics will be written.

Returns:

True if statistics are written.

Return type:

bool

set_batch_size(batch_size)#

Set the batch size for parallel processing.

Parameters:

batch_size (int) – Batch size.

set_cardinality_strategy(strategy)#

Set the cardinality strategy.

Parameters:

strategy (CardinalityStrategy) – Strategy for cardinalities.

set_compress(compress)#

Enable or disable compression.

Parameters:

compress (bool) – Whether to compress the output.

set_data_quality(quality)#

Set data quality level.

Parameters:

quality (DataQuality) – Data quality (Low, Medium, High).

set_entity_count(count)#

Set the number of entities to generate.

Parameters:

count (int) – Number of entities to generate.

set_entity_distribution(distribution)#

Set entity distribution strategy.

Parameters:

distribution (EntityDistribution) – Entity distribution strategy.

set_locale(locale)#

Set locale for field generation.

Parameters:

locale (str) – Locale string (e.g., “en”, “es”).

set_output_format(format)#

Set the output format.

Parameters:

format (OutputFormat) – Desired output format.

set_output_path(path)#

Set the output file path.

Parameters:

path (str) – Path to output file.

set_parallel_fields(enabled)#

Enable or disable parallel field generation.

Parameters:

enabled (bool) – Whether parallel field generation is enabled.

set_parallel_file_count(count)#

Set the number of parallel output files.

Parameters:

count (int) – Number of files.

set_parallel_shapes(enabled)#

Enable or disable parallel shape processing.

Parameters:

enabled (bool) – Whether parallel shape processing is enabled.

set_parallel_writing(parallel_writing)#

Enable or disable parallel writing.

Parameters:

parallel_writing (bool) – Whether to write output in parallel.

set_schema_format(format)#

Set the schema format.

Parameters:

format (Optional[SchemaFormat]) – Desired schema format.

set_seed(seed)#

Set the random seed for reproducible generation.

Parameters:

seed (Optional[int]) – Seed value.

set_worker_threads(threads)#

Set the number of worker threads.

Parameters:

threads (Optional[int]) – Number of threads.

set_write_stats(write_stats)#

Enable or disable writing statistics.

Parameters:

write_stats (bool) – Whether to write statistics.

show()#

Convert configuration to string.

Returns:

Debug string of the configuration.

Return type:

str

to_toml_file(path)#

Save configuration to a TOML file.

Parameters:

path (str) – Path where the TOML file will be saved.

Raises:

ValueError – If writing to the file fails.

validate()#

Validate the configuration.

Raises:

ValueError – If configuration is invalid.
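
Because validate() signals problems by raising ValueError, callers that prefer a message over an exception can wrap it. The following is a minimal sketch; check_config is a hypothetical helper, not part of pyrudof, and it accepts any object with a validate() method.

```python
def check_config(config):
    """Return None if config.validate() passes, otherwise the error message."""
    try:
        config.validate()
    except ValueError as err:
        return str(err)
    return None
```

A script could call check_config(config) before creating the generator and exit early with the returned message when it is not None.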

Data Generator#

class pyrudof.DataGenerator#

Main data generator class.

Provides an interface to load schemas and generate synthetic RDF data.

classmethod __new__(*args, **kwargs)#
generate()#

Generate synthetic data and write it to the configured output.

Raises:
load_schema_auto(path)#

Auto-detect schema format and load it.

Parameters:

path (str) – Path to the schema file.

Raises:
  • RuntimeError – If the generator is not initialized.

  • ValueError – If the schema cannot be loaded or parsed.

load_shacl_schema(path)#

Load and process a SHACL schema file.

Parameters:

path (str) – Path to the SHACL schema file.

Raises:
  • RuntimeError – If the generator is not initialized.

  • ValueError – If the schema cannot be loaded or parsed.

load_shex_schema(path)#

Load and process a ShEx schema file.

Parameters:

path (str) – Path to the ShEx schema file.

Raises:
  • RuntimeError – If the generator is not initialized.

  • ValueError – If the schema cannot be loaded or parsed.

run(schema_path)#

Run the complete generation pipeline with automatic schema format detection.

Parameters:

schema_path (str) – Path to the schema file.

Raises:
run_with_format(schema_path, format)#

Run the complete generation pipeline with optional schema format.

Parameters:
  • schema_path (str) – Path to the schema file.

  • format (Optional[SchemaFormat]) – Schema format. If None, auto-detect.

Raises:

Formats#

Schema Format#

class pyrudof.SchemaFormat#

Schema format for the generator.

Represents the supported schema formats that can be used to drive the data generation process.

Schema formats supported by the generator:

  • SchemaFormat.ShEx - Shape Expressions schema

  • SchemaFormat.Shacl - SHACL (Shapes Constraint Language) schema

ShEx = SchemaFormat.ShEx#
Shacl = SchemaFormat.Shacl#
classmethod __new__(*args, **kwargs)#

Output Format#

class pyrudof.OutputFormat#

Output format for generated data.

Defines the RDF serialization format used for generated output.

RDF serialization formats for generated output:

  • OutputFormat.Turtle - Turtle/Terse RDF Triple Language (.ttl) - Human-readable, compact (default)

  • OutputFormat.NTriples - N-Triples (.nt) - Line-based, simple format for streaming

NTriples = OutputFormat.NTriples#
Turtle = OutputFormat.Turtle#
classmethod __new__(*args, **kwargs)#

Cardinality Strategy#

class pyrudof.CardinalityStrategy#

Strategy for handling cardinalities in relationships.

Determines how many relationships are generated when constraints define minimum and maximum cardinalities.

Strategies for handling cardinalities when generating relationships:

  • CardinalityStrategy.Minimum - Always use minimum cardinality (fastest, smallest output)

  • CardinalityStrategy.Maximum - Always use maximum cardinality (slowest, largest output, tests edge cases)

  • CardinalityStrategy.Random - Random value within valid range (unpredictable distribution)

  • CardinalityStrategy.Balanced - Balanced distribution across range (default, recommended)

Balanced = CardinalityStrategy.Balanced#
Maximum = CardinalityStrategy.Maximum#
Minimum = CardinalityStrategy.Minimum#
Random = CardinalityStrategy.Random#
classmethod __new__(*args, **kwargs)#

Data Quality#

class pyrudof.DataQuality#

Data quality level for generated data.

Controls how realistic and complex the generated data should be.

Data quality levels controlling realism and complexity:

  • DataQuality.Low - Simple random data (fastest generation, minimal realism)

  • DataQuality.Medium - Realistic patterns (moderate speed, good for demos)

  • DataQuality.High - Complex realistic data with correlations (slower, production-like)

High = DataQuality.High#
Low = DataQuality.Low#
Medium = DataQuality.Medium#
classmethod __new__(*args, **kwargs)#

Entity Distribution#

class pyrudof.EntityDistribution#

Entity distribution strategy.

Defines how entities are distributed across shapes during generation.

Entity distribution strategies across shapes:

  • EntityDistribution.Equal - Equal distribution of entities across all shapes

Note

Currently only Equal distribution is supported.

Equal = EntityDistribution.Equal#
classmethod __new__(*args, **kwargs)#
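
As an illustration of what an equal distribution means, the sketch below divides an entity count across a list of shape names. It is not pyrudof's implementation: split_equally is a hypothetical function, the shape names are placeholders, and giving any remainder to the first shapes is an assumption, since this page does not say how a count that does not divide evenly is handled.

```python
def split_equally(entity_count, shape_names):
    """Divide entity_count across the given shapes as evenly as possible."""
    if not shape_names:
        raise ValueError("at least one shape is required")
    base, extra = divmod(entity_count, len(shape_names))
    # Assumption: any remainder is handed out one entity at a time,
    # starting with the first shape in the list.
    return {
        name: base + (1 if index < extra else 0)
        for index, name in enumerate(shape_names)
    }
```

Under these assumptions, ten entities over three shapes come out as 4, 3, and 3.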