Data Generation#

pyrudof includes bindings for rudof_generate, a module that generates synthetic RDF data from ShEx or SHACL schemas. This is useful for testing, benchmarking, and creating sample datasets.

Overview#

The data generation module provides:

Schema-driven generation: Create data that conforms to your ShEx or SHACL schemas
Reproducible results: Use seeds for deterministic generation
Parallel processing: Generate large datasets efficiently
Quality control: Choose a data quality level, from simple random values (Low) to complex, correlated data (High)
Flexible output: Support for Turtle and N-Triples formats

Basic Usage#

The simplest way to generate data:

import pyrudof

# 1. Configure
config = pyrudof.GeneratorConfig()
config.set_entity_count(100)
config.set_output_path("output.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)

# 2. Create generator
generator = pyrudof.DataGenerator(config)

# 3. Load schema and generate
generator.run("schema.shex")

Step-by-Step Generation#

You can also load schemas and generate data in separate steps:

generator = pyrudof.DataGenerator(config)

# Load schema (choose one method; the alternatives are commented out)
generator.load_shex_schema("schema.shex")
# OR
# generator.load_shacl_schema("shapes.ttl")
# OR auto-detect format
# generator.load_schema_auto("schema_file")

# Then generate
generator.generate()
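
If you prefer to pick the loader explicitly instead of relying on auto-detection, the choice can be driven by the file extension. The following is a minimal sketch: the helper name `load_schema_by_extension` and the extension mapping (`.shex` for ShEx, `.ttl` and `.shacl` for SHACL) are assumptions made for illustration, not part of pyrudof. It only calls the three loader methods shown above.

```python
from pathlib import Path


def load_schema_by_extension(generator, schema_path):
    """Call the loader matching the file extension; return the method name used.

    `generator` is expected to expose the three loader methods shown above.
    The extension mapping is an assumption made for this sketch.
    """
    suffix = Path(schema_path).suffix.lower()
    if suffix == ".shex":
        generator.load_shex_schema(schema_path)
        return "load_shex_schema"
    if suffix in {".ttl", ".shacl"}:
        generator.load_shacl_schema(schema_path)
        return "load_shacl_schema"
    # Unknown or missing extension: fall back to format auto-detection.
    generator.load_schema_auto(schema_path)
    return "load_schema_auto"
```

After the helper returns, call generator.generate() as in the step-by-step example.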

Configuration#

Configuration from Python#

config = pyrudof.GeneratorConfig()

# Basic settings
config.set_entity_count(1000)
config.set_output_path("/tmp/generated_data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)

# Reproducibility
config.set_seed(42)

# Data quality
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en")  # Use English locale for generated text

# Cardinality handling
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)

# Performance
config.set_worker_threads(4)
config.set_batch_size(100)
config.set_parallel_writing(True)

Reproducible Generation#

Use a seed for reproducible results:

config = pyrudof.GeneratorConfig()
config.set_seed(42)
config.set_entity_count(50)

generator = pyrudof.DataGenerator(config)
generator.run("schema.shex")

# Running again with the same seed produces identical output

Note

Setting a seed ensures that the same configuration always generates the same data, which is essential for reproducible testing and benchmarking.
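
One way to check reproducibility in a test suite is to generate twice with the same seed into two different output paths and compare the files byte for byte. The sketch below uses only the Python standard library; the helper names `file_sha256` and `outputs_identical` are hypothetical. If compression is enabled, compare the uncompressed content instead, because gzip headers can carry a timestamp.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(first_path, second_path):
    """Return True if both files have exactly the same bytes."""
    return file_sha256(first_path) == file_sha256(second_path)
```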

Cardinality Strategies#

Control how cardinalities are handled when generating relationships:

Strategy	Description
`Minimum`	Generate the minimum number of relationships allowed
`Maximum`	Generate the maximum number of relationships allowed
`Random`	Generate a random number within the valid range
`Balanced`	Use a balanced distribution (default, recommended)

config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)

Example with different strategies:

# Minimum relationships (faster, smaller output)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Minimum)

# Maximum relationships (slower, larger output, tests edge cases)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Maximum)

Data Quality Levels#

Configure the realism and complexity of generated data:

Level	Characteristics	Use Case
`Low`	Fast, simple random data	Quick testing, performance benchmarks
`Medium`	Realistic patterns	Integration testing, demos
`High`	Complex, correlated data	Production-like testing, presentations

# High-quality data with correlations
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("es")  # Spanish locale

Tip

Use DataQuality.Low for performance testing and DataQuality.High when you need realistic data for demonstrations or integration testing.

Parallel Processing#

Enable parallel processing for faster generation of large datasets:

config = pyrudof.GeneratorConfig()
config.set_entity_count(10000)

# Enable parallelization
config.set_worker_threads(4)  # Use 4 worker threads
config.set_batch_size(100)    # Process 100 entities per batch
config.set_parallel_shapes(True)   # Parallel shape processing
config.set_parallel_fields(True)   # Parallel field generation
config.set_parallel_writing(True)  # Parallel output writing
config.set_parallel_file_count(4)  # Write to 4 files simultaneously

generator = pyrudof.DataGenerator(config)
generator.run("large_schema.shex")
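
Rather than hard-coding the thread count, you may want to derive it from the machine running the generator. This is a minimal sketch under that assumption: `apply_parallel_settings` is a hypothetical helper, and matching the thread count to `os.cpu_count()` is a common starting point rather than a documented recommendation. It uses only the set_worker_threads and set_batch_size setters shown above.

```python
import os


def apply_parallel_settings(config, worker_threads=None, batch_size=100):
    """Set thread count and batch size on a GeneratorConfig-like object.

    Returns the number of worker threads that was applied.
    """
    if worker_threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        worker_threads = os.cpu_count() or 1
    config.set_worker_threads(worker_threads)
    config.set_batch_size(batch_size)
    return worker_threads
```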

Warning

Using parallel writing creates multiple output files. You’ll need to merge them manually if you need a single file.
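
For N-Triples output, merging can be as simple as concatenating the part files, because every line is a complete statement. The sketch below assumes you already know the paths of the part files (how they are named is not covered here) and that they are uncompressed N-Triples; `merge_ntriples` is a hypothetical helper. Note that blank node labels repeated across parts would be treated as the same node after merging.

```python
def merge_ntriples(part_paths, merged_path, chunk_size=65536):
    """Concatenate N-Triples part files into one file.

    Returns the number of part files merged. A newline is added after any
    part that does not end with one, so statements never run together.
    """
    merged = 0
    with open(merged_path, "wb") as out:
        for part in part_paths:
            last_byte = b"\n"
            with open(part, "rb") as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
            merged += 1
    return merged
```

Plain concatenation is not a safe general approach for Turtle; use an RDF library to parse and re-serialize in that case.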

Output Formats#

Supported output formats:

Turtle (OutputFormat.Turtle) - Human-readable, compact (default)
N-Triples (OutputFormat.NTriples) - Line-based, simple format

# Turtle format (default)
config.set_output_format(pyrudof.OutputFormat.Turtle)

# N-Triples format (useful for streaming processing)
config.set_output_format(pyrudof.OutputFormat.NTriples)

# Enable compression
config.set_compress(True)  # Creates .ttl.gz or .nt.gz

# Generate statistics file
config.set_write_stats(True)  # Creates output.stats.json
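
To sanity-check generated N-Triples output, including compressed output, you can count statements with the standard library alone. This is a sketch, not part of pyrudof: `count_ntriples` is a hypothetical helper, and it relies on the N-Triples rule of one statement per line, so it does not apply to Turtle.

```python
import gzip


def count_ntriples(path):
    """Count statements in an N-Triples file, transparently handling .gz files.

    Blank lines and comment lines (starting with '#') are skipped.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    count = 0
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```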

Advanced Example#

A complete example combining generation, output, and performance settings:

import pyrudof

# Create configuration
config = pyrudof.GeneratorConfig()

# Generation settings
config.set_entity_count(5000)
config.set_seed(42)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en")

# Output settings
config.set_output_path("./output/data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
config.set_compress(True)
config.set_write_stats(True)

# Performance settings
config.set_worker_threads(8)
config.set_batch_size(500)
config.set_parallel_shapes(True)
config.set_parallel_fields(True)

# Validate configuration
config.validate()

# Create generator and run
generator = pyrudof.DataGenerator(config)
generator.run_with_format("schema.shex", pyrudof.SchemaFormat.ShEx)

print("Generation complete!")
print(f"Configuration: {config.show()}")

API Reference#

Generator Configuration#

class pyrudof.GeneratorConfig#

Python wrapper for GeneratorConfig from rudof_generate.

Provides access to configuration options for synthetic data generation.

classmethod __new__(*args, **kwargs)#

static from_json_file(path)#

Load configuration from a JSON file.

Parameters: path (str) – Path to the JSON configuration file.
Returns: Loaded configuration object.
Return type: GeneratorConfig
Raises: ValueError – If the file cannot be read or parsed.

static from_toml_file(path)#

Load configuration from a TOML file.

Parameters: path (str) – Path to the TOML configuration file.
Returns: Loaded configuration object.
Return type: GeneratorConfig
Raises: ValueError – If the file cannot be read or parsed.

get_batch_size()#

Get batch size for parallel processing.

Returns: Batch size.
Return type: int

get_compress()#

Get whether output is compressed.

Returns: True if compression is enabled.
Return type: bool

get_entity_count()#

Get the number of entities to generate.

Returns: Number of entities.
Return type: int

get_locale()#

Get locale for field generation.

Returns: Current locale.
Return type: str

get_output_path()#

Get the output file path.

Returns: Output path.
Return type: str

get_parallel_fields()#

Get whether parallel field generation is enabled.

Returns: True if enabled.
Return type: bool

get_parallel_file_count()#

Get number of parallel output files.

Returns: Number of files.
Return type: int

get_parallel_shapes()#

Get whether parallel shape processing is enabled.

Returns: True if enabled.
Return type: bool

get_parallel_writing()#

Get whether parallel writing is enabled.

Returns: True if enabled.
Return type: bool

get_seed()#

Get the random seed.

Returns: Seed value.
Return type: Optional[int]

get_worker_threads()#

Get number of worker threads.

Returns: Worker threads.
Return type: Optional[int]

get_write_stats()#

Get whether statistics will be written.

Returns: True if statistics are written.
Return type: bool

set_batch_size(batch_size)#

Set the batch size for parallel processing.

Parameters: batch_size (int) – Batch size.

set_cardinality_strategy(strategy)#

Set the cardinality strategy.

Parameters: strategy (CardinalityStrategy) – Strategy for cardinalities.

set_compress(compress)#

Enable or disable compression.

Parameters: compress (bool) – Whether to compress the output.

set_data_quality(quality)#

Set data quality level.

Parameters: quality (DataQuality) – Data quality (Low, Medium, High).

set_entity_count(count)#

Set the number of entities to generate.

Parameters: count (int) – Number of entities to generate.

set_entity_distribution(distribution)#

Set entity distribution strategy.

Parameters: distribution (EntityDistribution) – Entity distribution strategy.

set_locale(locale)#

Set locale for field generation.

Parameters: locale (str) – Locale string (e.g., "en", "es").

set_output_format(format)#

Set the output format.

Parameters: format (OutputFormat) – Desired output format.

set_output_path(path)#

Set the output file path.

Parameters: path (str) – Path to output file.

set_parallel_fields(enabled)#

Enable or disable parallel field generation.

Parameters: enabled (bool) – Whether parallel field generation is enabled.

set_parallel_file_count(count)#

Set the number of parallel output files.

Parameters: count (int) – Number of files.

set_parallel_shapes(enabled)#

Enable or disable parallel shape processing.

Parameters: enabled (bool) – Whether parallel shape processing is enabled.

set_parallel_writing(parallel_writing)#

Enable or disable parallel writing.

Parameters: parallel_writing (bool) – Whether to write output in parallel.

set_schema_format(format)#

Set the schema format.

Parameters: format (Optional[SchemaFormat]) – Desired schema format.

set_seed(seed)#

Set the random seed for reproducible generation.

Parameters: seed (Optional[int]) – Seed value.

set_worker_threads(threads)#

Set the number of worker threads.

Parameters: threads (Optional[int]) – Number of threads.

set_write_stats(write_stats)#

Enable or disable writing statistics.

Parameters: write_stats (bool) – Whether to write statistics.

show()#

Convert configuration to string.

Returns: Debug string of the configuration.
Return type: str

to_toml_file(path)#

Save configuration to a TOML file.

Parameters: path (str) – Path where the TOML file will be saved.
Raises: ValueError – If writing to the file fails.

validate()#

Validate the configuration.

Raises: ValueError – If configuration is invalid.

Data Generator#

class pyrudof.DataGenerator#

Main data generator class.

Provides an interface to load schemas and generate synthetic RDF data.

classmethod __new__(*args, **kwargs)#

generate()#

Generate synthetic data and write it to the configured output.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If data generation fails.

load_schema_auto(path)#

Auto-detect schema format and load it.

Parameters:

path (str) – Path to the schema file.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.

load_shacl_schema(path)#

Load and process a SHACL schema file.

Parameters:

path (str) – Path to the SHACL schema file.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.

load_shex_schema(path)#

Load and process a ShEx schema file.

Parameters:

path (str) – Path to the ShEx schema file.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.

run(schema_path)#

Run the complete generation pipeline with automatic schema format detection.

Parameters:

schema_path (str) – Path to the schema file.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If schema loading or generation fails.

run_with_format(schema_path, format)#

Run the complete generation pipeline with an optional explicit schema format.

Parameters:

schema_path (str) – Path to the schema file.
format (Optional[SchemaFormat]) – Schema format. If None, auto-detect.

Raises:

RuntimeError – If the generator is not initialized.
ValueError – If schema loading or generation fails.
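
The exceptions listed above can be handled around a call to run(). The following is a minimal sketch: `run_generation` is a hypothetical wrapper, and the messages are illustrative. It relies only on the documented behavior that run(schema_path) raises RuntimeError when the generator is not initialized and ValueError when schema loading or generation fails.

```python
def run_generation(generator, schema_path):
    """Run the pipeline and return (ok, message) instead of raising."""
    try:
        generator.run(schema_path)
    except RuntimeError as err:
        return False, f"generator not initialized: {err}"
    except ValueError as err:
        return False, f"schema loading or generation failed: {err}"
    return True, f"generated data from {schema_path}"
```

Returning a status tuple keeps batch scripts running when one schema fails; in a test suite you would more likely let the exception propagate.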

Formats#

Schema Format#

class pyrudof.SchemaFormat#

Schema format for the generator.

Represents the supported schema formats that can be used to drive the data generation process.

Schema formats supported by the generator:

SchemaFormat.ShEx - Shape Expressions schema
SchemaFormat.Shacl - SHACL (Shapes Constraint Language) schema

ShEx = SchemaFormat.ShEx#

Shacl = SchemaFormat.Shacl#

classmethod __new__(*args, **kwargs)#

Output Format#

class pyrudof.OutputFormat#

Output format for generated data.

Defines the RDF serialization format used for generated output.

RDF serialization formats for generated output:

OutputFormat.Turtle - Turtle/Terse RDF Triple Language (.ttl) - Human-readable, compact (default)
OutputFormat.NTriples - N-Triples (.nt) - Line-based, simple format for streaming

NTriples = OutputFormat.NTriples#

Turtle = OutputFormat.Turtle#

classmethod __new__(*args, **kwargs)#

Cardinality Strategy#

class pyrudof.CardinalityStrategy#

Strategy for handling cardinalities in relationships.

Determines how many relationships are generated when constraints define minimum and maximum cardinalities.

Strategies for handling cardinalities when generating relationships:

CardinalityStrategy.Minimum - Always use minimum cardinality (fastest, smallest output)
CardinalityStrategy.Maximum - Always use maximum cardinality (slowest, largest output, tests edge cases)
CardinalityStrategy.Random - Random value within valid range (unpredictable distribution)
CardinalityStrategy.Balanced - Balanced distribution across range (default, recommended)

Balanced = CardinalityStrategy.Balanced#

Maximum = CardinalityStrategy.Maximum#

Minimum = CardinalityStrategy.Minimum#

Random = CardinalityStrategy.Random#

classmethod __new__(*args, **kwargs)#

Data Quality#

class pyrudof.DataQuality#

Data quality level for generated data.

Controls how realistic and complex the generated data should be.

Data quality levels controlling realism and complexity:

DataQuality.Low - Simple random data (fastest generation, minimal realism)
DataQuality.Medium - Realistic patterns (moderate speed, good for demos)
DataQuality.High - Complex realistic data with correlations (slower, production-like)

High = DataQuality.High#

Low = DataQuality.Low#

Medium = DataQuality.Medium#

classmethod __new__(*args, **kwargs)#

Entity Distribution#

class pyrudof.EntityDistribution#

Entity distribution strategy.

Defines how entities are distributed across shapes during generation.

Entity distribution strategies across shapes:

EntityDistribution.Equal - Equal distribution of entities across all shapes

Note

Currently only Equal distribution is supported.

Equal = EntityDistribution.Equal#

classmethod __new__(*args, **kwargs)#
