# Basic Synthesis Tutorial
A hands-on tutorial for generating your first synthetic dataset using Synthetic Data Studio. Perfect for beginners who want to understand the end-to-end workflow.
## Tutorial Goals
By the end of this tutorial, you will:
- Understand the basic concepts of synthetic data generation
- Upload and analyze a real dataset
- Generate synthetic data using different methods
- Compare original and synthetic data quality
- Download and use your synthetic dataset
**Time Required**: 20-30 minutes · **Difficulty**: Beginner · **Prerequisites**: None
## What You'll Need
- **Synthetic Data Studio**: Running at https://api.synthdata.studio
- **Sample Dataset**: We'll use the included `sample_data.csv`
- **Web Browser**: For API testing
- **Terminal/Command Line**: For curl commands (optional)
## Step 1: Understanding Your Data

### The Sample Dataset
Let's start with a simple customer dataset that includes:
- Customer information: IDs, demographics
- Financial data: Income, credit scores
- Behavioral data: Purchase history, preferences
### Data Privacy Considerations
Before generating synthetic data, consider:
- PII Detection: Which columns contain personal information?
- Privacy Requirements: Do you need differential privacy?
- Use Case: What will the synthetic data be used for?
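A quick, name-based scan is a reasonable first pass for spotting likely PII columns before upload. A minimal sketch (the keyword list and the helper name are illustrative, not part of the Studio API):

```python
# Hypothetical helper: flag columns whose names suggest PII.
# The keyword list is illustrative, not exhaustive.
PII_KEYWORDS = {"name", "email", "phone", "ssn", "address", "dob", "birth"}

def flag_pii_columns(columns):
    """Return the columns whose names contain a PII-like keyword."""
    flagged = []
    for col in columns:
        lowered = col.lower()
        if any(keyword in lowered for keyword in PII_KEYWORDS):
            flagged.append(col)
    return flagged

print(flag_pii_columns(["customer_id", "email_address", "income", "phone_number"]))
# → ['email_address', 'phone_number']
```

A name-based check is only a first pass; values should be inspected too, since PII can hide in free-text columns with innocuous names.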
## Step 2: Upload Your Dataset

### Method 1: Using the API Interface (Recommended)
1. **Open API Documentation**
   - Visit: https://api.synthdata.studio/docs

2. **Find the Upload Endpoint**
   - Scroll to `POST /datasets/upload`
   - Click "Try it out"

3. **Upload the Sample File**
   - Click "Choose File"
   - Select `sample_data.csv` from your project directory
   - Click "Execute"

4. **Check the Response**

   ```json
   {
     "id": "550e8400-e29b-41d4-a716-446655440000",
     "filename": "sample_data.csv",
     "row_count": 1000,
     "column_count": 8,
     "file_size": 45632,
     "upload_timestamp": "2025-11-27T14:30:00Z"
   }
   ```

**Save the dataset ID** - you'll need it for the next steps!
### Method 2: Using curl

```bash
# Upload the dataset
curl -X POST "https://api.synthdata.studio/datasets/upload" \
  -F "file=@sample_data.csv"
# Expected response contains the dataset ID
```
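Whichever method you use, the dataset ID comes back in the JSON response. A sketch of pulling it out with the standard library (the response body is the example shown in this tutorial, abbreviated):

```python
import json

# The example upload response from this tutorial (abbreviated).
response_body = """
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "sample_data.csv",
  "row_count": 1000
}
"""

payload = json.loads(response_body)
dataset_id = payload["id"]
print(dataset_id)  # → 550e8400-e29b-41d4-a716-446655440000
```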
## Step 3: Analyze Your Data

### Generate a Data Profile
Data profiling helps you understand:
- Column data types and distributions
- Missing values and data quality issues
- Correlations between variables
- Potential privacy concerns
1. **Find the Profile Endpoint**
   - `POST /datasets/{dataset_id}/profile`
   - Replace `{dataset_id}` with your dataset ID

2. **Execute the Request**
   - Click "Try it out"
   - Enter your dataset ID
   - Click "Execute"

3. **Review the Profile**

   ```json
   {
     "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
     "row_count": 1000,
     "column_count": 8,
     "columns": [
       {
         "name": "customer_id",
         "type": "integer",
         "nullable": false,
         "unique_count": 1000,
         "min": 1,
         "max": 1000
       },
       {
         "name": "age",
         "type": "integer",
         "nullable": false,
         "min": 18,
         "max": 80,
         "mean": 42.5,
         "std": 15.2
       }
     ],
     "correlations": {
       "age-income": 0.45,
       "income-credit_score": 0.32
     }
   }
   ```
### Key Insights from Profiling
- **Data Types**: Mix of integers, floats, and strings
- **Data Quality**: No missing values in this sample
- **Correlations**: Age and income are moderately correlated
- **Distribution**: Age ranges from 18-80 with mean ~43
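The per-column fields in the profile can be reproduced locally with nothing but the standard library, which is a useful sanity check. A sketch on a toy `age` column (the values are invented for illustration):

```python
import statistics

# Toy "age" column; real profiling would read the uploaded CSV.
ages = [18, 25, 34, 42, 51, 60, 73, 80]

profile = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),   # sample standard deviation, like the profile's "std"
    "unique_count": len(set(ages)),
}
print(profile["min"], profile["max"], profile["unique_count"])  # → 18 80 8
```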
## Step 4: Generate Synthetic Data
### Choose Your Synthesis Method
Synthetic Data Studio offers multiple synthesis methods:
| Method | Best For | Speed | Quality | When to Use |
|--------|----------|-------|---------|-------------|
| **CTGAN** | Complex data | Medium | Excellent | Most cases |
| **TVAE** | Mixed types | Fast | Good | Quick results |
| **GaussianCopula** | Simple stats | Very Fast | Fair | Prototyping |
**For this tutorial**: We'll use CTGAN for the best quality results.
### Start Generation
1. **Find the Generation Endpoint**
   - `POST /generators/dataset/{dataset_id}/generate`
   - Use your dataset ID

2. **Configure Parameters**

   ```json
   {
     "generator_type": "ctgan",
     "num_rows": 500,
     "epochs": 50,
     "batch_size": 200
   }
   ```

3. **Execute Generation**
   - Click "Try it out"
   - Enter your dataset ID
   - Set the parameters above
   - Click "Execute"

4. **Monitor Progress**

   ```json
   {
     "message": "Generation started",
     "generator_id": "660e8400-e29b-41d4-a716-446655440001",
     "estimated_time": "2-3 minutes"
   }
   ```
### Understanding the Parameters
- **generator_type**: "ctgan" (best quality)
- **num_rows**: 500 (half the original size)
- **epochs**: 50 (training iterations)
- **batch_size**: 200 (training batch size)
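It can help to build and sanity-check the request body before sending it. A sketch using the field names from the example above (the allowed-method set and positivity checks are illustrative defaults, not documented API limits):

```python
def build_generation_request(generator_type="ctgan", num_rows=500,
                             epochs=50, batch_size=200):
    """Build the JSON body for POST /generators/dataset/{dataset_id}/generate.

    Field names match the tutorial example; the validation rules are
    illustrative, not documented API limits.
    """
    if generator_type not in {"ctgan", "tvae", "gaussian_copula"}:
        raise ValueError(f"unknown generator_type: {generator_type}")
    if num_rows <= 0 or epochs <= 0 or batch_size <= 0:
        raise ValueError("num_rows, epochs, and batch_size must be positive")
    return {
        "generator_type": generator_type,
        "num_rows": num_rows,
        "epochs": epochs,
        "batch_size": batch_size,
    }

print(build_generation_request()["generator_type"])  # → ctgan
```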
## Step 5: Monitor Generation
### Check Generation Status
Since generation runs asynchronously, you need to check its progress:
1. **Find the Generator Status Endpoint**
- `GET /generators/{generator_id}`
- Use the generator ID from step 4
2. **Check Status Regularly**
- Click "Try it out"
- Enter generator ID
- Click "Execute"
3. **Status Response**

   ```json
   {
     "id": "660e8400-e29b-41d4-a716-446655440001",
     "status": "running",
     "progress": 75,
     "created_at": "2025-11-27T14:35:00Z",
     "updated_at": "2025-11-27T14:37:30Z"
   }
   ```

4. **Wait for Completion**
   - Status will change to "completed"
   - This takes 2-3 minutes for CTGAN
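The status checks in this step are easy to automate. A sketch of a generic polling loop; `fetch_status` stands in for an HTTP GET to `/generators/{generator_id}` (the wiring is hypothetical):

```python
import time

def wait_for_completion(fetch_status, poll_seconds=5, timeout_seconds=300):
    """Poll until the generator reports a terminal status.

    `fetch_status` is any callable returning the status dict; in real use
    it would wrap a GET request to /generators/{generator_id}.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("generation did not finish in time")

# Demo with a stub that completes on the third poll:
responses = iter([{"status": "running"}, {"status": "running"},
                  {"status": "completed", "progress": 100}])
result = wait_for_completion(lambda: next(responses), poll_seconds=0)
print(result["status"])  # → completed
```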
### What Happens During Generation?

- **Model Training**: CTGAN learns patterns in your data
- **Privacy Accounting**: Tracks privacy budget (if using DP)
- **Sample Generation**: Creates new synthetic records
- **Quality Validation**: Basic checks on generated data
## Step 6: Evaluate Quality

### Quick Statistical Evaluation
Once generation completes, evaluate how well the synthetic data matches the original:
1. **Find the Quick Evaluation Endpoint**
   - `POST /evaluations/quick/{generator_id}`

2. **Run Evaluation**
   - Use your generator ID
   - Click "Execute"

3. **Review Results**

   ```json
   {
     "generator_id": "660e8400-e29b-41d4-a716-446655440001",
     "quality_level": "Good",
     "overall_score": 0.87,
     "statistical_similarity": {
       "ks_test": 0.91,
       "chi_square": 0.89,
       "wasserstein_distance": 0.12
     },
     "recommendations": [
       "Data quality looks excellent for most applications",
       "Statistical distributions are well-preserved"
     ]
   }
   ```
### Understanding the Scores
- **Overall Score**: 0.87 (87% quality retention)
- **KS Test**: 0.91 (excellent distribution matching)
- **Chi-Square**: 0.89 (good categorical data matching)
- **Wasserstein Distance**: 0.12 (acceptable distribution difference)
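For intuition: the raw two-sample KS statistic is the largest gap between two empirical CDFs, where 0 means identical distributions and 1 means fully disjoint ones. The report above presents a similarity score where higher is better, so it is not the raw statistic itself. A pure-Python sketch of the statistic:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # → 0.0
print(ks_statistic([0, 0, 0], [5, 5, 5]))        # → 1.0
```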
## Step 7: Download Results
### Get Your Synthetic Dataset
1. **Find the Download Endpoint**
- First, get the output dataset ID from the generator:
- `GET /generators/{generator_id}`
2. **Locate the Output Dataset**

   ```json
   {
     "id": "660e8400-e29b-41d4-a716-446655440001",
     "output_dataset_id": "770e8400-e29b-41d4-a716-446655440002",
     "status": "completed"
   }
   ```

3. **Download the Data**
   - `GET /datasets/{output_dataset_id}/download`
   - Use the `output_dataset_id`
   - Click "Execute" to download

4. **Save the File**
   - Your browser will download `dataset_{id}.csv`
   - This contains your 500 synthetic customer records
## Step 8: Compare Original vs Synthetic

### Basic Comparison

Let's examine both datasets to see the differences:

**Original Data Sample:**

```csv
customer_id,age,income,credit_score,purchases,category,region,signup_date
1,35,65000,720,12,A,East,2023-01-15
2,42,78000,680,8,B,West,2023-02-20
3,28,45000,750,15,A,North,2023-01-08
```

**Synthetic Data Sample:**

```csv
customer_id,age,income,credit_score,purchases,category,region,signup_date
1,34,64800,718,11,A,East,2023-01-14
2,41,77500,685,9,B,West,2023-02-18
3,29,45200,748,14,A,North,2023-01-09
```
**Key Observations:**

- **Similar distributions**: Ages and incomes look realistic
- **Preserved correlations**: High earners still tend to have better credit scores
- **Realistic values**: No impossible combinations
- **Category balance**: A/B categories maintained
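The "preserved correlations" claim can be spot-checked with a Pearson coefficient. A sketch using only the three sample rows above, which is far too few for a reliable estimate; run it on the full datasets in practice:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Age/income columns from the three sample rows above:
orig_age, orig_income = [35, 42, 28], [65000, 78000, 45000]
synth_age, synth_income = [34, 41, 29], [64800, 77500, 45200]

print(round(pearson(orig_age, orig_income), 2))
print(round(pearson(synth_age, synth_income), 2))
```

If the two coefficients are close, the age-income relationship survived synthesis.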
## Step 9: Experiment with Different Methods

### Try TVAE for Speed

1. **Generate with TVAE**

   ```json
   {
     "generator_type": "tvae",
     "num_rows": 500,
     "epochs": 30,
     "batch_size": 100
   }
   ```

2. **Compare Results**
   - TVAE is usually 2-3x faster than CTGAN
   - Quality might be slightly lower for complex data
   - Better for mixed data types
### Try GaussianCopula for Simplicity
1. **Generate with GaussianCopula**

   ```json
   {
     "generator_type": "gaussian_copula",
     "num_rows": 500
   }
   ```

2. **Compare Results**
   - Very fast (seconds)
   - Good for simple statistical properties
   - May not capture complex correlations
## Tutorial Complete!
### What You Accomplished

- Uploaded a real dataset to Synthetic Data Studio
- Analyzed data structure and quality
- Generated synthetic data using CTGAN
- Evaluated quality with statistical tests
- Downloaded results for use in your applications
- Compared original vs synthetic data characteristics
### Your Synthetic Dataset is Ready!
You now have:
- 500 synthetic customer records
- Statistical properties similar to original data
- Privacy-safe data for development/testing
- Quality-validated results
## Next Steps

### Advanced Tutorials
- Privacy Synthesis Tutorial: Learn differential privacy
- Quality Evaluation Tutorial: Deep dive into evaluation metrics
- Compliance Reporting Tutorial: Generate audit documentation
### Practical Applications
- Use synthetic data in development environments
- Test ML models with diverse, realistic data
- Share data safely with partners (no real PII)
- Scale up to larger datasets and more complex scenarios
### Further Learning
- API Reference: Explore all available endpoints
- User Guides: Learn about advanced features
- Developer Docs: Build integrations and custom workflows
## Troubleshooting

### Common Issues
**Generation Takes Too Long**

- Reduce `epochs` to 20-30 for testing
- Use TVAE instead of CTGAN
- Try a smaller `batch_size`

**Poor Quality Results**

- Increase `epochs` for better training
- Switch from GaussianCopula to CTGAN
- Check whether your data has complex patterns

**API Errors**

- Verify that dataset/generator IDs are correct
- Check server logs for error details
- Ensure the server is running

**Download Fails**

- Wait for generation to complete (status = "completed")
- Check that the `output_dataset_id` exists
- Verify file permissions
### Getting Help

- **API Docs**: https://api.synthdata.studio/docs (try endpoints directly)
- **Logs**: Check the server console for error messages
- **GitHub**: Create issues for bugs or questions
Congratulations! You've successfully generated your first synthetic dataset. Ready to explore more advanced features?