Data Generation#
pyrudof includes bindings for rudof_generate, a module that generates
synthetic RDF data from ShEx or SHACL schemas. This is useful for testing,
benchmarking, and creating sample datasets.
Overview#
The data generation module provides:
Schema-driven generation: Create data that conforms to your ShEx or SHACL schemas
Reproducible results: Use seeds for deterministic generation
Parallel processing: Generate large datasets efficiently
Quality control: Configure data quality from simple to complex
Flexible output: Support for Turtle and N-Triples formats
Basic Usage#
The simplest way to generate data:
import pyrudof
# 1. Configure
config = pyrudof.GeneratorConfig()
config.set_entity_count(100)
config.set_output_path("output.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
# 2. Create generator
generator = pyrudof.DataGenerator(config)
# 3. Load schema and generate
generator.run("schema.shex")
Step-by-Step Generation#
You can also load schemas and generate data in separate steps:
generator = pyrudof.DataGenerator(config)
# Load schema (choose exactly one of the three methods below)
generator.load_shex_schema("schema.shex")
# OR load a SHACL shapes graph instead
# generator.load_shacl_schema("shapes.ttl")
# OR auto-detect the format
# generator.load_schema_auto("schema_file")
# Then generate
generator.generate()
Configuration#
Configuration from Python#
config = pyrudof.GeneratorConfig()
# Basic settings
config.set_entity_count(1000)
config.set_output_path("/tmp/generated_data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
# Reproducibility
config.set_seed(42)
# Data quality
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en") # Use English locale for generated text
# Cardinality handling
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
# Performance
config.set_worker_threads(4)
config.set_batch_size(100)
config.set_parallel_writing(True)
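When many settings come from a dictionary (for example, parsed command-line options), the setter naming pattern shown above can be driven generically. The following is a minimal sketch, not part of pyrudof: `apply_settings` is a hypothetical helper that assumes every key maps to a `set_<key>` method on the configuration object.

```python
from typing import Any, Mapping


def apply_settings(config: Any, settings: Mapping[str, Any]) -> None:
    """Call config.set_<name>(value) for every entry in settings.

    Hypothetical convenience helper; it relies only on the
    set_<name> naming convention of the configuration object.
    """
    for name, value in settings.items():
        setter = getattr(config, f"set_{name}", None)
        if not callable(setter):
            raise AttributeError(f"config has no setter 'set_{name}'")
        setter(value)


# Intended usage with pyrudof (not executed here):
#   config = pyrudof.GeneratorConfig()
#   apply_settings(config, {"entity_count": 1000, "seed": 42, "batch_size": 100})
```

Unknown keys raise `AttributeError` instead of being silently ignored, so a typo in a setting name is caught before generation starts.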
Reproducible Generation#
Use a seed for reproducible results:
config = pyrudof.GeneratorConfig()
config.set_seed(42)
config.set_entity_count(50)
generator = pyrudof.DataGenerator(config)
generator.run("schema.shex")
# Running again with the same seed produces identical output
Note
Setting a seed ensures that the same configuration always generates the same data, which is essential for reproducible testing and benchmarking.
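One way to check reproducibility in a test suite is to run the generator twice with the same seed, writing to two different output paths, and compare file digests. The sketch below uses only the standard library; the helper names are illustrative. It also assumes the serializer writes triples in a deterministic order, since a byte-for-byte comparison is stricter than RDF graph equality.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_output(first: str, second: str) -> bool:
    """True if both files have identical bytes."""
    return file_sha256(first) == file_sha256(second)


# Intended usage after two seeded runs (file names are hypothetical):
#   assert same_output("run1.ttl", "run2.ttl")
```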
Cardinality Strategies#
Control how cardinalities are handled when generating relationships:
| Strategy | Description |
|---|---|
| Minimum | Generate the minimum number of relationships allowed |
| Maximum | Generate the maximum number of relationships allowed |
| Random | Generate a random number within the valid range |
| Balanced | Use a balanced distribution (default, recommended) |
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
Example with different strategies:
# Minimum relationships (faster, smaller output)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Minimum)
# Maximum relationships (slower, larger output, tests edge cases)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Maximum)
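To make the strategies concrete, here is a conceptual illustration of how a count could be chosen for a property with cardinality `{min, max}`. This is not the rudof_generate implementation: `pick_count` and its string strategy names are hypothetical, and Balanced is left out because its exact distribution is not described here.

```python
import random


def pick_count(strategy: str, minimum: int, maximum: int, rng: random.Random) -> int:
    """Choose how many values to generate for one property (illustrative only)."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    if strategy == "minimum":
        return minimum
    if strategy == "maximum":
        return maximum
    if strategy == "random":
        # randint is inclusive on both ends
        return rng.randint(minimum, maximum)
    raise ValueError(f"unknown strategy: {strategy}")


rng = random.Random(42)
print(pick_count("minimum", 1, 5, rng))  # always the lower bound
print(pick_count("maximum", 1, 5, rng))  # always the upper bound
```

With a cardinality of `{1,5}` and 1000 entities, the minimum strategy would yield 1000 values for that property and the maximum strategy 5000, which is why the two differ so much in output size.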
Data Quality Levels#
Configure the realism and complexity of generated data:
| Level | Characteristics | Use Case |
|---|---|---|
| Low | Fast, simple random data | Quick testing, performance benchmarks |
| Medium | Realistic patterns | Integration testing, demos |
| High | Complex, correlated data | Production-like testing, presentations |
# High-quality data with correlations
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("es") # Spanish locale
Tip
Use DataQuality.Low for performance testing and DataQuality.High
when you need realistic data for demonstrations or integration testing.
Parallel Processing#
Enable parallel processing for faster generation of large datasets:
config = pyrudof.GeneratorConfig()
config.set_entity_count(10000)
# Enable parallelization
config.set_worker_threads(4) # Use 4 worker threads
config.set_batch_size(100) # Process 100 entities per batch
config.set_parallel_shapes(True) # Parallel shape processing
config.set_parallel_fields(True) # Parallel field generation
config.set_parallel_writing(True) # Parallel output writing
config.set_parallel_file_count(4) # Write to 4 files simultaneously
generator = pyrudof.DataGenerator(config)
generator.run("large_schema.shex")
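When picking values for these settings, it can help to estimate how many batches a run produces and how many threads the machine can reasonably support. The sketch below is a rough planning aid using only the standard library. The helper names are hypothetical, and it assumes batches are filled sequentially up to the batch size, which is an assumption about the generator rather than documented behavior.

```python
import math
import os


def batch_count(entity_count: int, batch_size: int) -> int:
    """Number of batches needed if each batch holds at most batch_size entities."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(entity_count / batch_size)


def suggested_worker_threads(reserve: int = 1) -> int:
    """Leave `reserve` cores free for the rest of the system, but use at least one."""
    cores = os.cpu_count() or 1
    return max(1, cores - reserve)


print(batch_count(10000, 100))  # 100 batches for the configuration above
```

If the batch count is smaller than the thread count, some threads have nothing to do, so a smaller batch size may distribute work better.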
Warning
Using parallel writing creates multiple output files. You’ll need to merge them manually if you need a single file.
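For N-Triples output, merging can be as simple as concatenating the part files, because every line is a complete triple. The following is a minimal sketch under stated assumptions: `merge_ntriples` is a hypothetical helper, the part file names are passed explicitly because their naming scheme is not described here, and blank node labels are assumed not to clash across parts. Turtle parts are better merged with an RDF library.

```python
from pathlib import Path
from typing import Iterable


def merge_ntriples(parts: Iterable[str], destination: str) -> int:
    """Concatenate N-Triples files into destination; return the number of lines written."""
    written = 0
    with Path(destination).open("w", encoding="utf-8") as out:
        for part in parts:
            with Path(part).open("r", encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    # make sure the last line of each part ends with a newline
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written


# Intended usage (part file names are hypothetical):
#   merge_ntriples(["part_0.nt", "part_1.nt"], "merged.nt")
```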
Output Formats#
Supported output formats:
Turtle (OutputFormat.Turtle) - Human-readable, compact (default)
N-Triples (OutputFormat.NTriples) - Line-based, simple format
# Turtle format (default)
config.set_output_format(pyrudof.OutputFormat.Turtle)
# N-Triples format (useful for streaming processing)
config.set_output_format(pyrudof.OutputFormat.NTriples)
# Enable compression
config.set_compress(True) # Creates .ttl.gz or .nt.gz
# Generate statistics file
config.set_write_stats(True) # Creates output.stats.json
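A quick sanity check on generated N-Triples output, compressed or not, is to count its triples. The sketch below uses only the standard library; `count_ntriples` is a hypothetical helper that assumes a `.gz` suffix means gzip compression, following the file names in the comment above. The layout of the statistics JSON file is not described here, so it is not parsed.

```python
import gzip


def count_ntriples(path: str) -> int:
    """Count triples in a plain or gzip-compressed N-Triples file."""
    opener = gzip.open if path.endswith(".gz") else open
    count = 0
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # N-Triples allows blank lines and '#' comment lines
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


# Intended usage (file name is hypothetical):
#   print(count_ntriples("output.nt.gz"))
```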
Advanced Example#
A complete example combining most of the options described above:
import pyrudof
# Create configuration
config = pyrudof.GeneratorConfig()
# Generation settings
config.set_entity_count(5000)
config.set_seed(42)
config.set_cardinality_strategy(pyrudof.CardinalityStrategy.Balanced)
config.set_data_quality(pyrudof.DataQuality.High)
config.set_locale("en")
# Output settings
config.set_output_path("./output/data.ttl")
config.set_output_format(pyrudof.OutputFormat.Turtle)
config.set_compress(True)
config.set_write_stats(True)
# Performance settings
config.set_worker_threads(8)
config.set_batch_size(500)
config.set_parallel_shapes(True)
config.set_parallel_fields(True)
# Validate configuration
config.validate()
# Create generator and run
generator = pyrudof.DataGenerator(config)
generator.run_with_format("schema.shex", pyrudof.SchemaFormat.ShEx)
print("Generation complete!")
print(f"Configuration: {config.show()}")
API Reference#
Generator Configuration#
- class pyrudof.GeneratorConfig#
Python wrapper for GeneratorConfig from rudof_generate.
Provides access to configuration options for synthetic data generation.
- classmethod __new__(*args, **kwargs)#
- static from_json_file(path)#
Load configuration from a JSON file.
- Parameters:
path (str) – Path to the JSON configuration file.
- Returns:
Loaded configuration object.
- Return type:
GeneratorConfig
- Raises:
ValueError – If the file cannot be read or parsed.
- static from_toml_file(path)#
Load configuration from a TOML file.
- Parameters:
path (str) – Path to the TOML configuration file.
- Returns:
Loaded configuration object.
- Return type:
GeneratorConfig
- Raises:
ValueError – If the file cannot be read or parsed.
- get_compress()#
Get whether output is compressed.
- Returns:
True if compression is enabled.
- Return type:
bool
- get_entity_count()#
Get the number of entities to generate.
- Returns:
Number of entities.
- Return type:
int
- get_parallel_fields()#
Get whether parallel field generation is enabled.
- Returns:
True if enabled.
- Return type:
bool
- get_parallel_file_count()#
Get number of parallel output files.
- Returns:
Number of files.
- Return type:
int
- get_parallel_shapes()#
Get whether parallel shape processing is enabled.
- Returns:
True if enabled.
- Return type:
bool
- get_parallel_writing()#
Get whether parallel writing is enabled.
- Returns:
True if enabled.
- Return type:
bool
- get_worker_threads()#
Get number of worker threads.
- Returns:
Worker threads.
- Return type:
Optional[int]
- get_write_stats()#
Get whether statistics will be written.
- Returns:
True if statistics are written.
- Return type:
bool
- set_batch_size(batch_size)#
Set the batch size for parallel processing.
- Parameters:
batch_size (int) – Batch size.
- set_cardinality_strategy(strategy)#
Set the cardinality strategy.
- Parameters:
strategy (CardinalityStrategy) – Strategy for cardinalities.
- set_compress(compress)#
Enable or disable compression.
- Parameters:
compress (bool) – Whether to compress the output.
- set_data_quality(quality)#
Set data quality level.
- Parameters:
quality (DataQuality) – Data quality (Low, Medium, High).
- set_entity_count(count)#
Set the number of entities to generate.
- Parameters:
count (int) – Number of entities to generate.
- set_entity_distribution(distribution)#
Set entity distribution strategy.
- Parameters:
distribution (EntityDistribution) – Entity distribution strategy.
- set_locale(locale)#
Set locale for field generation.
- Parameters:
locale (str) – Locale string (e.g., "en", "es").
- set_output_format(format)#
Set the output format.
- Parameters:
format (OutputFormat) – Desired output format.
- set_parallel_fields(enabled)#
Enable or disable parallel field generation.
- Parameters:
enabled (bool) – Whether parallel field generation is enabled.
- set_parallel_file_count(count)#
Set the number of parallel output files.
- Parameters:
count (int) – Number of files.
- set_parallel_shapes(enabled)#
Enable or disable parallel shape processing.
- Parameters:
enabled (bool) – Whether parallel shape processing is enabled.
- set_parallel_writing(parallel_writing)#
Enable or disable parallel writing.
- Parameters:
parallel_writing (bool) – Whether to write output in parallel.
- set_schema_format(format)#
Set the schema format.
- Parameters:
format (Optional[SchemaFormat]) – Desired schema format.
- set_seed(seed)#
Set the random seed for reproducible generation.
- Parameters:
seed (Optional[int]) – Seed value.
- set_worker_threads(threads)#
Set the number of worker threads.
- Parameters:
threads (Optional[int]) – Number of threads.
- set_write_stats(write_stats)#
Enable or disable writing statistics.
- Parameters:
write_stats (bool) – Whether to write statistics.
- show()#
Convert configuration to string.
- Returns:
Debug string of the configuration.
- Return type:
str
- to_toml_file(path)#
Save configuration to a TOML file.
- Parameters:
path (str) – Path where the TOML file will be saved.
- Raises:
ValueError – If writing to the file fails.
- validate()#
Validate the configuration.
- Raises:
ValueError – If configuration is invalid.
Data Generator#
- class pyrudof.DataGenerator#
Main data generator class.
Provides an interface to load schemas and generate synthetic RDF data.
- classmethod __new__(*args, **kwargs)#
- generate()#
Generate synthetic data and write it to the configured output.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If data generation fails.
- load_schema_auto(path)#
Auto-detect schema format and load it.
- Parameters:
path (str) – Path to the schema file.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.
- load_shacl_schema(path)#
Load and process a SHACL schema file.
- Parameters:
path (str) – Path to the SHACL schema file.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.
- load_shex_schema(path)#
Load and process a ShEx schema file.
- Parameters:
path (str) – Path to the ShEx schema file.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If the schema cannot be loaded or parsed.
- run(schema_path)#
Run the complete generation pipeline with automatic schema format detection.
- Parameters:
schema_path (str) – Path to the schema file.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If schema loading or generation fails.
- run_with_format(schema_path, format)#
Run the complete generation pipeline with an optional, explicitly specified schema format.
- Parameters:
schema_path (str) – Path to the schema file.
format (Optional[SchemaFormat]) – Schema format. If None, auto-detect.
- Raises:
RuntimeError – If the generator is not initialized.
ValueError – If schema loading or generation fails.
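The two exception types documented for `run` and `run_with_format` can be handled separately, for example to report a clearer message in a script. This is a minimal sketch: `run_safely` is a hypothetical wrapper that accepts any object with a `run(schema_path)` method, so it makes no assumptions beyond the exception contract listed above.

```python
from typing import Any, Optional


def run_safely(generator: Any, schema_path: str) -> Optional[str]:
    """Run the pipeline; return None on success or an error message on failure."""
    try:
        generator.run(schema_path)
    except RuntimeError as err:
        # documented as: the generator is not initialized
        return f"generator not initialized: {err}"
    except ValueError as err:
        # documented as: schema loading or generation fails
        return f"schema loading or generation failed: {err}"
    return None


# Intended usage with pyrudof (not executed here):
#   error = run_safely(pyrudof.DataGenerator(config), "schema.shex")
#   if error:
#       print(error)
```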
Formats#
Schema Format#
- class pyrudof.SchemaFormat#
Schema format for the generator.
Represents the supported schema formats that can be used to drive the data generation process.
Schema formats supported by the generator:
SchemaFormat.ShEx - Shape Expressions schema
SchemaFormat.Shacl - SHACL (Shapes Constraint Language) schema
- ShEx = SchemaFormat.ShEx#
- Shacl = SchemaFormat.Shacl#
- classmethod __new__(*args, **kwargs)#
Output Format#
- class pyrudof.OutputFormat#
Output format for generated data.
Defines the RDF serialization format used for generated output.
RDF serialization formats for generated output:
OutputFormat.Turtle - Turtle/Terse RDF Triple Language (.ttl) - Human-readable, compact (default)
OutputFormat.NTriples - N-Triples (.nt) - Line-based, simple format for streaming
- NTriples = OutputFormat.NTriples#
- Turtle = OutputFormat.Turtle#
- classmethod __new__(*args, **kwargs)#
Cardinality Strategy#
- class pyrudof.CardinalityStrategy#
Strategy for handling cardinalities in relationships.
Determines how many relationships are generated when constraints define minimum and maximum cardinalities.
Strategies for handling cardinalities when generating relationships:
CardinalityStrategy.Minimum - Always use minimum cardinality (fastest, smallest output)
CardinalityStrategy.Maximum - Always use maximum cardinality (slowest, largest output, tests edge cases)
CardinalityStrategy.Random - Random value within valid range (unpredictable distribution)
CardinalityStrategy.Balanced - Balanced distribution across range (default, recommended)
- Balanced = CardinalityStrategy.Balanced#
- Maximum = CardinalityStrategy.Maximum#
- Minimum = CardinalityStrategy.Minimum#
- Random = CardinalityStrategy.Random#
- classmethod __new__(*args, **kwargs)#
Data Quality#
- class pyrudof.DataQuality#
Data quality level for generated data.
Controls how realistic and complex the generated data should be.
Data quality levels controlling realism and complexity:
DataQuality.Low - Simple random data (fastest generation, minimal realism)
DataQuality.Medium - Realistic patterns (moderate speed, good for demos)
DataQuality.High - Complex realistic data with correlations (slower, production-like)
- High = DataQuality.High#
- Low = DataQuality.Low#
- Medium = DataQuality.Medium#
- classmethod __new__(*args, **kwargs)#
Entity Distribution#
- class pyrudof.EntityDistribution#
Entity distribution strategy.
Defines how entities are distributed across shapes during generation.
Entity distribution strategies across shapes:
EntityDistribution.Equal - Equal distribution of entities across all shapes
Note
Currently only Equal distribution is supported.
- Equal = EntityDistribution.Equal#
- classmethod __new__(*args, **kwargs)#