Generating Synthetic Data
Learn how to generate high-quality synthetic data with the synthesis methods available in Synthetic Data Studio.
Synthesis Methods Overview
Available Methods
| Method | Best For | Speed | Quality | Privacy |
|---|---|---|---|---|
| CTGAN | Complex tabular data | Medium | Excellent | None |
| TVAE | Mixed data types | Fast | Good | None |
| GaussianCopula | Simple distributions | Very Fast | Fair | None |
| DP-CTGAN | Privacy-preserving synthesis | Slow | Good | Excellent |
| DP-TVAE | Faster privacy-preserving synthesis | Medium | Good | Excellent |
Method Selection Guide
Choose CTGAN when:
- Your data has complex correlations
- You need high-fidelity synthetic data
- Training time is not a major constraint
Choose TVAE when:
- You have mixed data types (numeric + categorical)
- You need faster training than CTGAN
- More stable training is preferred (TVAE has no adversarial training loop; sampling is still random)
Choose GaussianCopula when:
- You need very fast generation
- Data follows simple statistical distributions
- You're prototyping or need baseline comparisons
Choose DP-CTGAN/DP-TVAE when:
- Privacy guarantees are required
- Data contains sensitive information
- Regulatory compliance is needed
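The selection guide above can be read as a small decision procedure. The following is a minimal sketch of that reading, not part of Synthetic Data Studio: the function name `choose_method` and its boolean inputs are hypothetical, and the returned strings are the method names from the overview table rather than API identifiers.

```python
def choose_method(needs_privacy: bool,
                  complex_correlations: bool,
                  time_constrained: bool,
                  prototyping: bool = False) -> str:
    """Map coarse requirements to a method name from the overview table."""
    if needs_privacy:
        # Per the table, DP-TVAE trains faster than DP-CTGAN.
        return "DP-TVAE" if time_constrained else "DP-CTGAN"
    if prototyping:
        # Very fast baseline, with the lowest quality rating in the table.
        return "GaussianCopula"
    if complex_correlations and not time_constrained:
        # Highest quality rating, at the cost of longer training.
        return "CTGAN"
    # Default: faster than CTGAN and suited to mixed data types.
    return "TVAE"
```

Privacy requirements are checked first because they rule out the three non-DP methods regardless of the other answers.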
Basic Synthesis Workflow
Step 1: Prepare Your Dataset
First, ensure you have uploaded and profiled your dataset:
# Upload dataset
curl -X POST "http://localhost:8000/datasets/upload" \
-F "file=@your-data.csv"
# Profile it
curl -X POST "http://localhost:8000/datasets/{dataset_id}/profile"
Step 2: Choose Synthesis Method
Select the appropriate method based on your needs:
CTGAN Generation (Recommended for Quality)
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "ctgan",
"num_rows": 1000,
"epochs": 50,
"batch_size": 500
}'
Parameters:
- num_rows: Number of synthetic rows to generate
- epochs: Training iterations (50-300, higher = better quality)
- batch_size: Training batch size (100-1000)
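For scripted use, the same CTGAN request can be assembled in Python before it is sent. This is a minimal sketch using only the standard library; `build_ctgan_request` is a hypothetical helper, and the range checks simply mirror the parameter guidance above (the server may accept values outside those ranges).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def build_ctgan_request(dataset_id: str, num_rows: int,
                        epochs: int = 50,
                        batch_size: int = 500) -> urllib.request.Request:
    """Build (but do not send) the CTGAN generate request."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    # Ranges below follow the documented guidance, not server-side rules.
    if not 50 <= epochs <= 300:
        raise ValueError("epochs should be between 50 and 300")
    if not 100 <= batch_size <= 1000:
        raise ValueError("batch_size should be between 100 and 1000")
    payload = {
        "generator_type": "ctgan",
        "num_rows": num_rows,
        "epochs": epochs,
        "batch_size": batch_size,
    }
    return urllib.request.Request(
        f"{BASE_URL}/generators/dataset/{dataset_id}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; keeping construction separate from sending makes the payload easy to inspect or log first.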
TVAE Generation (Faster Alternative)
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "tvae",
"num_rows": 1000,
"epochs": 30,
"batch_size": 200
}'
Step 3: Monitor Generation Progress
# Check generator status
curl http://localhost:8000/generators/{generator_id}
# Example response ("status" is "running", "completed", or "failed")
{
"id": "gen-123",
"status": "running",
"progress": 75,
"estimated_time_remaining": "2 minutes"
}
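Checking status by hand works for one-off runs; a script would normally poll until the job settles. The sketch below assumes only the `status` field and the three values shown in the response above. The names `fetch_generator_status` and `wait_for_generator`, the polling interval, and the timeout are all illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"


def fetch_generator_status(generator_id: str) -> dict:
    """GET /generators/{generator_id} and decode the JSON body."""
    url = f"{BASE_URL}/generators/{generator_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_generator(fetch_status: Callable[[], dict],
                       poll_seconds: float = 5.0,
                       timeout_seconds: float = 3600.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until status is "completed"; raise on "failed" or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"generation failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("generator did not finish in time")
        sleep(poll_seconds)
```

A typical call would be `wait_for_generator(lambda: fetch_generator_status("gen-123"))`. Taking the fetch function as a parameter keeps the loop testable without a running server.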
Step 4: Download Results
# Download the synthetic dataset; -o sets the local file name (with -O, curl would save it as "download")
curl -o synthetic-data.csv "http://localhost:8000/datasets/{output_dataset_id}/download"
Differential Privacy Synthesis
DP-CTGAN Generation
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "dp-ctgan",
"num_rows": 1000,
"target_epsilon": 10.0,
"epochs": 50,
"batch_size": 200
}'
Key Parameters:
- target_epsilon: Privacy budget (lower = more private)
- target_delta: Failure probability (auto-calculated)
- max_grad_norm: Gradient clipping (default: 1.0)
DP-TVAE Generation (Faster)
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "dp-tvae",
"num_rows": 1000,
"target_epsilon": 5.0,
"epochs": 30
}'
Privacy Parameter Selection
| Epsilon (ε) | Privacy Level | Use Case |
|---|---|---|
| 0.1 - 1.0 | Very Strong | Clinical trials, genomic data |
| 1.0 - 5.0 | Strong | Healthcare, financial records |
| 5.0 - 10.0 | Moderate | Customer data, HR records |
| 10.0 - 20.0 | Weak | Aggregated analytics |
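The table can be encoded as a lookup so that scripts label a chosen epsilon consistently. This is a sketch under one stated assumption: the table's ranges share endpoints, and here each upper bound is treated as inclusive, which agrees with the sample validation response in this guide where 10.0 is reported as "Moderate". The name `privacy_level` is hypothetical.

```python
# (inclusive upper bound, label) pairs taken from the table above.
PRIVACY_BANDS = [
    (1.0, "Very Strong"),
    (5.0, "Strong"),
    (10.0, "Moderate"),
    (20.0, "Weak"),
]


def privacy_level(epsilon: float) -> str:
    """Return the table's privacy level for an epsilon in [0.1, 20.0]."""
    if not 0.1 <= epsilon <= 20.0:
        raise ValueError("epsilon is outside the range covered by the table")
    for upper, label in PRIVACY_BANDS:
        if epsilon <= upper:
            return label
    raise AssertionError("unreachable: range was checked above")
```

Values outside the table raise an error instead of being extrapolated, since the guide gives no label for them.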
Pre-Training Validation
Always validate DP parameters before training:
curl -X POST "http://localhost:8000/generators/dp/validate-config" \
-H "Content-Type: application/json" \
-d '{
"dataset_id": "your-dataset-id",
"generator_type": "dp-ctgan",
"epochs": 50,
"batch_size": 200,
"target_epsilon": 10.0
}'
Response:
{
"is_valid": true,
"errors": [],
"warnings": ["Batch size is 10% of dataset"],
"recommended_config": {
"epochs": 50,
"batch_size": 200,
"target_epsilon": 10.0,
"expected_privacy_level": "Moderate"
}
}
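A pipeline would typically stop on an invalid configuration and surface warnings before training starts. The following sketch relies only on the fields shown in the response above (`is_valid`, `errors`, `warnings`, `recommended_config`); the helper name `check_dp_validation` is hypothetical.

```python
from typing import List, Tuple


def check_dp_validation(response: dict) -> Tuple[dict, List[str]]:
    """Return (recommended_config, warnings) or raise if the config is invalid."""
    if not response.get("is_valid", False):
        errors = response.get("errors", [])
        raise ValueError("DP config rejected: " + "; ".join(errors))
    # Warnings do not block training, so they are returned for the caller to log.
    warnings = list(response.get("warnings", []))
    return dict(response.get("recommended_config", {})), warnings
```

Returning the warnings instead of printing them leaves the decision about logging or aborting to the caller.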
Get Recommended Parameters
curl "http://localhost:8000/generators/dp/recommended-config?dataset_id={id}&desired_quality=balanced"
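When the dataset ID comes from a variable, building the query string with `urllib.parse.urlencode` avoids quoting mistakes. A minimal sketch; `recommended_config_url` is a hypothetical helper, and "balanced" is the only `desired_quality` value this guide shows.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"


def recommended_config_url(dataset_id: str,
                           desired_quality: str = "balanced") -> str:
    """Build the recommended-config URL with properly escaped parameters."""
    query = urlencode({"dataset_id": dataset_id,
                       "desired_quality": desired_quality})
    return f"{BASE_URL}/generators/dp/recommended-config?{query}"
```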
Advanced Synthesis Options
Conditional Generation
Generate data with specific conditions:
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "ctgan",
"num_rows": 500,
"conditions": {
"age": {"min": 25, "max": 35},
"income": {"min": 50000}
}
}'
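After downloading conditionally generated rows, it can be worth confirming that they respect the requested bounds. The sketch below checks rows against a `conditions` object of the shape used above. It assumes `min` and `max` are inclusive, which the guide does not state, and `satisfies_conditions` is a hypothetical name.

```python
def satisfies_conditions(row: dict, conditions: dict) -> bool:
    """True if every conditioned column in row lies within its min/max bounds."""
    for column, bounds in conditions.items():
        value = row.get(column)
        if value is None:
            return False  # a conditioned column is missing from the row
        if "min" in bounds and value < bounds["min"]:
            return False
        if "max" in bounds and value > bounds["max"]:
            return False
    return True


conditions = {"age": {"min": 25, "max": 35}, "income": {"min": 50000}}
rows = [
    {"age": 30, "income": 62000},
    {"age": 41, "income": 90000},  # age above the requested maximum
]
violations = [r for r in rows if not satisfies_conditions(r, conditions)]
```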
Schema-Based Generation
Generate from a data schema without training:
curl -X POST "http://localhost:8000/generators/schema/generate" \
-H "Content-Type: application/json" \
-d '{
"columns": {
"name": {"type": "string", "distribution": "normal"},
"age": {"type": "integer", "min": 18, "max": 80},
"income": {"type": "float", "mean": 75000, "std": 25000}
},
"num_rows": 1000
}'
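To make the idea of sampling from a schema concrete, here is a local illustration in plain Python. It is not how the service implements this endpoint: it assumes integer columns are drawn uniformly between `min` and `max` and float columns from a normal distribution with the given `mean` and `std`, and it rejects other column types. The function name `sample_from_schema` is hypothetical.

```python
import random
from typing import Optional


def sample_from_schema(columns: dict, num_rows: int,
                       seed: Optional[int] = None) -> list:
    """Draw num_rows rows from simple per-column distributions."""
    rng = random.Random(seed)  # seeded for reproducible output
    rows = []
    for _ in range(num_rows):
        row = {}
        for name, spec in columns.items():
            if spec["type"] == "integer":
                # Assumption: uniform over the inclusive [min, max] range.
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                # Assumption: normal distribution from mean and std.
                row[name] = rng.gauss(spec["mean"], spec["std"])
            else:
                raise ValueError(f"unsupported column type: {spec['type']}")
        rows.append(row)
    return rows


schema = {
    "age": {"type": "integer", "min": 18, "max": 80},
    "income": {"type": "float", "mean": 75000, "std": 25000},
}
sample = sample_from_schema(schema, num_rows=100, seed=0)
```

Because each column is sampled independently, data produced this way carries no correlations between columns, which is the main trade-off of skipping training.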
Custom Parameters
Fine-tune synthesis parameters:
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{
"generator_type": "ctgan",
"num_rows": 2000,
"epochs": 100,
"batch_size": 300,
"learning_rate": 0.0002,
"discriminator_steps": 5,
"pac": 10
}'
Quality Optimization
Parameter Tuning Guide
For High Quality (CTGAN)
{
"epochs": 100,
"batch_size": 500,
"learning_rate": 0.0002,
"discriminator_steps": 5
}
For Fast Training (TVAE)
{
"epochs": 30,
"batch_size": 200,
"compress_dims": [128, 64],
"decompress_dims": [64, 128]
}
For Privacy (DP-CTGAN)
{
"target_epsilon": 5.0,
"epochs": 50,
"batch_size": 100,
"noise_multiplier": "auto"
}
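The three parameter sets above can be kept as named presets and merged with per-run overrides. A minimal sketch: the preset names and the `config_from_preset` helper are hypothetical, while the values and `generator_type` strings are the ones used in this guide.

```python
import copy

PRESETS = {
    "high_quality_ctgan": {
        "generator_type": "ctgan",
        "epochs": 100, "batch_size": 500,
        "learning_rate": 0.0002, "discriminator_steps": 5,
    },
    "fast_tvae": {
        "generator_type": "tvae",
        "epochs": 30, "batch_size": 200,
        "compress_dims": [128, 64], "decompress_dims": [64, 128],
    },
    "private_dp_ctgan": {
        "generator_type": "dp-ctgan",
        "target_epsilon": 5.0, "epochs": 50,
        "batch_size": 100, "noise_multiplier": "auto",
    },
}


def config_from_preset(name: str, num_rows: int, **overrides) -> dict:
    """Copy a preset, set num_rows, then apply any overrides."""
    config = copy.deepcopy(PRESETS[name])  # deep copy keeps list values isolated
    config["num_rows"] = num_rows
    config.update(overrides)
    return config
```

For example, `config_from_preset("fast_tvae", 1000, epochs=40)` yields the TVAE preset with only `epochs` changed, which makes it easier to follow the advice to change one parameter at a time.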
Quality Metrics to Monitor
After generation, evaluate quality:
# Quick evaluation
curl -X POST "http://localhost:8000/evaluations/quick/{generator_id}"
# Comprehensive evaluation
curl -X POST "http://localhost:8000/evaluations/run" \
-H "Content-Type: application/json" \
-d '{
"generator_id": "your-generator-id",
"dataset_id": "original-dataset-id",
"include_statistical": true,
"include_ml_utility": true,
"include_privacy": false
}'
Background Processing
Asynchronous Generation
Large datasets are processed asynchronously:
# Start generation (returns immediately)
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
-H "Content-Type: application/json" \
-d '{"generator_type": "ctgan", "num_rows": 10000}'
# Check status periodically
curl http://localhost:8000/generators/{generator_id}
Monitoring Jobs
# List all jobs
curl http://localhost:8000/jobs/
# Get specific job details
curl http://localhost:8000/jobs/{job_id}
Troubleshooting
Common Issues
Generation Fails
Error: Training failed
Solution: Reduce epochs/batch_size, check data quality
Poor Quality Results
Issue: Synthetic data doesn't match real data
Solution: Increase epochs, use CTGAN instead of GaussianCopula
Memory Issues
Error: Out of memory
Solution: Reduce batch_size, use smaller dataset subset
Privacy Validation Fails
Error: Epsilon too low for parameters
Solution: Increase epsilon or reduce epochs/batch_size
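For the out-of-memory case, one way to apply "reduce batch_size" systematically is to retry with a halving schedule. The sketch below only produces the sequence of batch sizes to try. The helper name `batch_size_backoff` is hypothetical, and the floor of 100 is taken from the batch size range documented for CTGAN earlier in this guide.

```python
from typing import Iterator


def batch_size_backoff(start: int, minimum: int = 100) -> Iterator[int]:
    """Yield start, then successive halvings, stopping below minimum."""
    size = start
    while size >= minimum:
        yield size
        size //= 2


# Candidate batch sizes to try in order after an out-of-memory failure.
candidates = list(batch_size_backoff(500))
```

Each candidate would be submitted as a new generation request, moving to the next one only if the previous attempt fails with an out-of-memory error.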
Performance Optimization
Speed Up Training:
- Use TVAE instead of CTGAN
- Reduce epochs initially
- Increase batch_size
- Use GPU if available
Improve Quality:
- Increase epochs gradually
- Use CTGAN for complex data
- Fine-tune learning rates
- Add more training data
Reduce Memory Usage:
- Smaller batch sizes
- Process in chunks
- Use TVAE over CTGAN
- Clear cache between runs
Best Practices
Data Preparation
- Profile First: Always run profiling before synthesis
- Clean Data: Remove outliers and inconsistencies
- Check PII: Run PII detection for sensitive data
- Scale Appropriately: Start with smaller datasets
Method Selection
- CTGAN for Quality: Best for complex, high-fidelity data
- TVAE for Speed: Good balance of quality and performance
- DP Variants for Privacy: When regulatory compliance required
- GaussianCopula for Prototyping: Fast baseline comparisons
Parameter Tuning
- Start Conservative: Use recommended defaults first
- Iterate Gradually: Increase complexity step by step
- Monitor Quality: Evaluate after each parameter change
- Balance Trade-offs: Quality vs speed vs privacy
Production Considerations
- Validate Parameters: Always test DP configs first
- Monitor Resources: Watch memory and compute usage
- Version Control: Track parameter sets and results
- Audit Trail: Maintain records for compliance
Exporting Reports
Export to S3
Save evaluation reports (PDF/DOCX) directly to S3 for archival and compliance:
# Export evaluation as PDF to S3
curl -X POST "http://localhost:8000/llm/evaluations/{evaluation_id}/export-pdf?save_to_s3=true"
# Export as Word document to S3
curl -X POST "http://localhost:8000/llm/evaluations/{evaluation_id}/export-docx?save_to_s3=true"
Response:
{
"message": "Report exported successfully",
"download_url": "https://your-bucket.s3.amazonaws.com/exports/report_abc123.pdf",
"export_id": "export-uuid"
}
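The two export endpoints differ only in the format suffix, so a script can build either URL from one helper. A minimal sketch: `export_url` is a hypothetical name, and passing `save_to_s3=false` is an assumption, since this guide only shows the `true` form.

```python
BASE_URL = "http://localhost:8000"


def export_url(evaluation_id: str, fmt: str = "pdf",
               save_to_s3: bool = True) -> str:
    """Build the export endpoint URL for a PDF or DOCX report."""
    if fmt not in ("pdf", "docx"):
        raise ValueError("fmt must be 'pdf' or 'docx'")
    flag = "true" if save_to_s3 else "false"
    return (f"{BASE_URL}/llm/evaluations/{evaluation_id}"
            f"/export-{fmt}?save_to_s3={flag}")
```

After a POST to that URL, the `download_url` and `export_id` fields of the response shown above identify the stored report.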
Managing Exports
# List all exports
curl http://localhost:8000/exports/
# Get exports for a specific generator
curl http://localhost:8000/exports/generator/{generator_id}
# Get exports for a dataset
curl http://localhost:8000/exports/dataset/{dataset_id}
# Download a specific export (use a file name that matches the export format)
curl -o report.pdf "http://localhost:8000/exports/{export_id}/download"
Export Types
| Type | Format | Use Case |
|---|---|---|
| evaluation_report | PDF/DOCX | Quality assessment documentation |
| compliance_report | PDF/DOCX | Regulatory audit trail |
| model_card | PDF/DOCX | Model documentation |
Next Steps
After generating synthetic data:
- Evaluate Quality - Assess how well your synthetic data matches the original
- Use Privacy Features - Learn about differential privacy and compliance
- Explore AI Features - Use AI-powered tools for insights and automation
Need help choosing the right method? See the Method Selection Guide above or create an issue on GitHub.