AI-Powered Features
Discover how Synthetic Data Studio leverages AI to enhance your synthetic data workflow with intelligent chat, automated insights, and smart recommendations.
AI Feature Overview​
Synthetic Data Studio integrates multiple AI capabilities to make synthetic data generation more intelligent and user-friendly:
Core AI Capabilities​
- **Interactive Chat**: Natural language conversations about your data quality
- **Smart Suggestions**: AI-powered recommendations for improvement
- **Metric Explanations**: Plain English interpretations of technical metrics
- **Auto-Documentation**: Automated generation of compliance documents
- **Enhanced Detection**: Context-aware PII identification
- **Guided Workflows**: Step-by-step AI assistance
Interactive Chat​
Chat Interface​
Engage in natural conversations about your synthetic data:
curl -X POST "http://localhost:8000/llm/chat" \
-H "Content-Type: application/json" \
-d '{
"message": "How good is my synthetic data quality?",
"evaluation_id": "eval-123",
"history": []
}'
Sample Conversation:
You: How good is my synthetic data quality?
AI: Based on your evaluation results, your synthetic data shows good overall quality with an 85% similarity score. The statistical tests passed for 8 out of 10 columns, with particularly strong performance in continuous variables. However, there's room for improvement in categorical data representation.
You: What can I do to improve the categorical columns?
AI: To improve categorical data quality, consider:
1. Increasing training epochs from 50 to 100
2. Using CTGAN instead of TVAE for better mode capture
3. Adding more categorical-specific preprocessing
4. Fine-tuning the embedding dimensions
Context-Aware Responses​
The chat system understands your specific context:
- Evaluation Results: References your quality metrics
- Generator Configuration: Knows your synthesis parameters
- Data Characteristics: Understands your dataset structure
- Historical Context: Remembers previous conversations
Advanced Chat Features​
Multi-turn Conversations​
{
  "message": "What about privacy?",
  "evaluation_id": "eval-123",
  "history": [
    {
      "role": "user",
      "content": "How good is my synthetic data quality?"
    },
    {
      "role": "assistant",
      "content": "Your data shows 85% similarity..."
    }
  ]
}
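The sketch below shows one way to keep the `history` array up to date across turns from a client script. It assumes only the `/llm/chat` endpoint and request fields shown above; reading the reply from a top-level `response` field mirrors the guided-workflow example later on this page and may need adjusting for your deployment.

```python
# Minimal sketch: maintaining multi-turn chat history on the client.
# The "response" field name is an assumption based on the guided-workflow
# example further down this page.
import requests

BASE_URL = "http://localhost:8000"

def chat_turn(message, evaluation_id, history):
    payload = {"message": message, "evaluation_id": evaluation_id, "history": history}
    reply = requests.post(f"{BASE_URL}/llm/chat", json=payload).json()
    answer = reply.get("response", "")
    # Append both sides of the exchange so the next turn has full context
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": answer})
    return answer

history = []
print(chat_turn("How good is my synthetic data quality?", "eval-123", history))
print(chat_turn("What about privacy?", "eval-123", history))
```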
Generator-Specific Chat​
{
  "message": "Why is training taking so long?",
  "generator_id": "gen-123",
  "context": {
    "current_status": "running",
    "progress": 75
  }
}
Smart Suggestions​
Improvement Recommendations​
Get AI-powered suggestions for enhancing your synthetic data:
curl -X POST "http://localhost:8000/llm/suggest-improvements/eval-123" \
-H "Content-Type: application/json"
Response:
{
  "evaluation_id": "eval-123",
  "suggestions": [
    {
      "category": "quality_improvement",
      "priority": "high",
      "title": "Increase Training Epochs",
      "description": "Your current 50 epochs may be insufficient for optimal quality",
      "recommended_action": "Increase to 100-150 epochs",
      "expected_impact": "15-20% improvement in statistical similarity",
      "implementation_effort": "low"
    },
    {
      "category": "method_optimization",
      "priority": "medium",
      "title": "Switch to CTGAN",
      "description": "TVAE shows weaker performance on mixed data types",
      "recommended_action": "Use CTGAN for better categorical handling",
      "expected_impact": "10-15% improvement in ML utility",
      "implementation_effort": "medium"
    }
  ],
  "implementation_plan": {
    "immediate_actions": ["Increase epochs to 100"],
    "short_term": ["Switch to CTGAN", "Fine-tune batch size"],
    "long_term": ["Implement hyperparameter optimization"]
  }
}
Suggestion Categories​
- Quality Improvement: Statistical similarity enhancements
- Performance Optimization: Speed and resource improvements
- Privacy Enhancement: Better privacy-utility balance
- Method Selection: Algorithm recommendations
- Parameter Tuning: Hyperparameter optimization
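If you consume suggestions programmatically, a small client-side filter can surface high-priority items first. The sketch below relies only on the response fields shown above (`suggestions`, `priority`, `title`, `recommended_action`); the sort order is an illustrative choice, not part of the API.

```python
# Sketch: fetch suggestions and list high-priority items first.
# Uses only the response fields shown in the example above.
import requests

BASE_URL = "http://localhost:8000"
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}  # illustrative ranking

def prioritized_suggestions(evaluation_id):
    data = requests.post(f"{BASE_URL}/llm/suggest-improvements/{evaluation_id}").json()
    suggestions = data.get("suggestions", [])
    return sorted(suggestions, key=lambda s: PRIORITY_ORDER.get(s.get("priority"), 99))

for s in prioritized_suggestions("eval-123"):
    print(f"[{s['priority']}] {s['title']}: {s['recommended_action']}")
```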
Metric Explanations​
Understanding Technical Metrics​
Get plain English explanations of complex metrics:
curl "http://localhost:8000/llm/explain-metric?metric_name=wasserstein_distance&metric_value=0.15"
Response:
{
  "metric_name": "wasserstein_distance",
  "metric_value": 0.15,
  "explanation": {
    "simple_description": "This measures how different two distributions are, like how much 'work' it would take to transform one pile of dirt into another.",
    "interpretation": "A value of 0.15 means the distributions are fairly similar - about 15% away from a perfect match.",
    "quality_assessment": "This is a good result. Values under 0.20 are generally acceptable for most applications.",
    "business_impact": "Your synthetic data captures the essential patterns of the real data well enough for most analytical purposes."
  },
  "analogies": [
    "Like two bell curves that overlap 85% - they're not identical but very similar",
    "Similar to getting 85% on a test - you understand the material well"
  ],
  "next_steps": [
    "This metric is good enough for most use cases",
    "If you need higher precision, consider increasing training epochs",
    "For critical applications, aim for values under 0.10"
  ]
}
Supported Metrics​
- wasserstein_distance: Distribution similarity
- ks_test: Statistical significance testing
- chi_square: Categorical data comparison
- js_divergence: Information theory similarity
- epsilon: Privacy budget
- delta: Failure probability
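To annotate an evaluation report, you can request an explanation for every metric you computed. A minimal sketch, assuming the `/llm/explain-metric` endpoint and query parameters shown above; the metric values below are placeholders.

```python
# Sketch: explain several evaluation metrics in one pass.
# Metric values below are placeholders; substitute your own results.
import requests

BASE_URL = "http://localhost:8000"
metrics = {"wasserstein_distance": 0.15, "ks_test": 0.04, "js_divergence": 0.08}

for name, value in metrics.items():
    data = requests.get(
        f"{BASE_URL}/llm/explain-metric",
        params={"metric_name": name, "metric_value": value},
    ).json()
    explanation = data.get("explanation", {})
    print(f"{name}={value}: {explanation.get('simple_description', '')}")
```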
Auto-Documentation​
Model Cards​
Generate comprehensive model documentation:
curl -X POST "http://localhost:8000/generators/gen-123/model-card" \
-H "Content-Type: application/json" \
-d '{"include_privacy": true}'
Generated Model Card:
{
  "model_card": {
    "model_details": {
      "name": "Customer Data Generator",
      "type": "DP-CTGAN",
      "version": "1.0",
      "creation_date": "2025-11-27",
      "framework": "Synthetic Data Studio"
    },
    "intended_use": {
      "primary_uses": ["Development testing", "Analytics training"],
      "out_of_scope": ["Production deployment without validation"]
    },
    "data_profile": {
      "training_data_size": 10000,
      "features": ["age", "income", "category", "location"],
      "data_characteristics": "Mixed numerical and categorical"
    },
    "privacy_considerations": {
      "differential_privacy": true,
      "epsilon": 10.0,
      "privacy_guarantees": "ε=10.0, δ=1e-5",
      "compliance_frameworks": ["GDPR", "CCPA"]
    },
    "performance_metrics": {
      "statistical_similarity": 0.87,
      "ml_utility": 0.92,
      "privacy_score": 0.89
    }
  },
  "format": "json",
  "disclaimer": "AI-generated content. Requires review before distribution."
}
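Model cards are easiest to review when stored alongside the generator artifacts. The sketch below requests the card shown above and writes it to disk; the output filename is an arbitrary choice.

```python
# Sketch: request a model card and save it next to your generator artifacts.
# The filename is an arbitrary example.
import json
import requests

BASE_URL = "http://localhost:8000"
generator_id = "gen-123"

card = requests.post(
    f"{BASE_URL}/generators/{generator_id}/model-card",
    json={"include_privacy": True},
).json()

with open(f"model_card_{generator_id}.json", "w", encoding="utf-8") as f:
    json.dump(card, f, ensure_ascii=False, indent=2)
print(f"Saved model card for {generator_id}")
```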
Audit Narratives​
Create human-readable audit trails:
curl http://localhost:8000/generators/gen-123/audit-narrative
Generated Narrative:
{
  "generator_id": "gen-123",
  "narrative": {
    "executive_summary": "This synthetic data generator was created on November 27, 2025, using differential privacy techniques to ensure regulatory compliance.",
    "technical_details": {
      "method": "DP-CTGAN with RDP accounting",
      "privacy_parameters": "ε=10.0, δ=1e-5",
      "training_duration": "45 minutes",
      "quality_metrics": "87% statistical similarity"
    },
    "compliance_statement": "The generator implements mathematical privacy guarantees suitable for HIPAA, GDPR, and CCPA compliance.",
    "recommendations": [
      "Validate output quality before production use",
      "Document privacy parameters in compliance records",
      "Regularly review and update privacy budgets"
    ]
  }
}
Compliance Reports​
Generate framework-specific compliance documentation:
curl -X POST "http://localhost:8000/generators/gen-123/compliance-report" \
-H "Content-Type: application/json" \
-d '{"framework": "GDPR"}'
Enhanced PII Detection​
AI-Powered Detection​
Use advanced AI for better sensitive data identification:
curl -X POST "http://localhost:8000/datasets/dataset-123/pii-detection-enhanced" \
-H "Content-Type: application/json"
Enhanced Detection Features:
- Context Analysis: Understands data relationships
- Pattern Recognition: Identifies complex PII patterns
- Risk Assessment: Provides confidence scores and risk levels
- Recommendations: Suggests handling strategies
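A minimal client-side sketch for running enhanced detection is shown below. It assumes only the endpoint above; because the response schema is not documented here, the result is printed as returned rather than parsed into specific fields.

```python
# Sketch: trigger AI-enhanced PII detection and inspect the raw result.
# The response schema is not assumed; the payload is printed as returned.
import json
import requests

BASE_URL = "http://localhost:8000"
dataset_id = "dataset-123"

result = requests.post(
    f"{BASE_URL}/datasets/{dataset_id}/pii-detection-enhanced",
    headers={"Content-Type": "application/json"},
).json()
print(json.dumps(result, indent=2))
```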
Detection Comparison​
| Feature | Standard Detection | AI-Enhanced Detection |
|---|---|---|
| Accuracy | 85% | 95% |
| Context Awareness | Limited | Advanced |
| False Positives | Higher | Lower |
| Complex Patterns | Basic | Advanced |
| Recommendations | Generic | Specific |
Guided Workflows​
Step-by-Step Assistance​
Get AI-guided help through complex workflows:
Privacy Implementation Guide​
curl -X POST "http://localhost:8000/llm/chat" \
-H "Content-Type: application/json" \
-d '{
"message": "Help me implement differential privacy for my healthcare data",
"context": {
"data_type": "healthcare",
"compliance_needs": ["HIPAA"],
"current_method": "CTGAN"
}
}'
AI Response:
{
  "response": "I'll guide you through implementing HIPAA-compliant differential privacy for healthcare data. Here's your step-by-step plan:",
  "workflow_steps": [
    {
      "step": 1,
      "title": "Validate Current Setup",
      "action": "Run /generators/dp/validate-config with ε=1.0",
      "description": "Ensure your configuration meets HIPAA requirements"
    },
    {
      "step": 2,
      "title": "Choose Appropriate Method",
      "recommendation": "DP-CTGAN for complex healthcare correlations",
      "rationale": "Better at preserving medical data relationships"
    },
    {
      "step": 3,
      "title": "Set Privacy Parameters",
      "parameters": {
        "epsilon": 1.0,
        "delta": "auto",
        "max_grad_norm": 1.0
      },
      "explanation": "ε=1.0 provides strong HIPAA-compliant privacy"
    }
  ],
  "estimated_time": "2-3 hours",
  "success_criteria": [
    "Privacy report shows ε ≤ 1.0",
    "HIPAA compliance check passes",
    "Data utility remains > 80%"
  ]
}
Feature Configuration​
AI Model Selection​
Configure which AI models to use:
# Primary AI provider
USE_GEMINI=true
GEMINI_API_KEY=your-key
GEMINI_MODEL=gemini-1.5-flash
# Fallback provider
USE_GROQ=true
GROQ_API_KEY=your-key
GROQ_MODEL=llama-3.1-70b-versatile
# Response settings
AI_TEMPERATURE=0.7
AI_MAX_TOKENS=2048
Feature Toggles​
Enable/disable specific AI features:
# Chat features
ENABLE_AI_CHAT=true
ENABLE_CONTEXT_AWARE_RESPONSES=true
# Suggestions
ENABLE_SMART_SUGGESTIONS=true
ENABLE_AUTOMATED_IMPROVEMENTS=true
# Documentation
ENABLE_AUTO_MODEL_CARDS=true
ENABLE_COMPLIANCE_REPORTS=true
# Detection
ENABLE_ENHANCED_PII_DETECTION=true
Usage Analytics​
AI Feature Usage Tracking​
Monitor AI feature adoption and effectiveness:
# Get AI usage statistics
curl http://localhost:8000/admin/ai-usage-stats
Tracking Metrics:
- Chat conversation volume and satisfaction
- Suggestion acceptance rates
- Documentation generation frequency
- PII detection accuracy improvements
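For dashboards or periodic reviews, the stats endpoint can be polled from a script. The sketch below snapshots whatever the endpoint returns; the response field names are not assumed.

```python
# Sketch: snapshot AI usage statistics for periodic review.
# Response field names are not assumed; the payload is stored as-is.
import json
import time
import requests

BASE_URL = "http://localhost:8000"

stats = requests.get(f"{BASE_URL}/admin/ai-usage-stats").json()
snapshot = {"captured_at": time.strftime("%Y-%m-%dT%H:%M:%S"), "stats": stats}
with open("ai_usage_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```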
Privacy & Security​
AI Data Handling​
- No Training Data Storage: Your data is not retained by AI providers or used to train their models
- Secure API Communication: Encrypted connections to AI providers
- Compliance-Aware: AI respects privacy constraints
- Audit Logging: All AI interactions are logged for compliance
Data Minimization​
- Context-Only Sharing: Only necessary metadata sent to AI
- No Raw Data: Original datasets never leave your infrastructure
- Aggregated Insights: AI works with statistical summaries
Troubleshooting
Common AI Issues​
Slow Responses
Cause: High server load or network latency
Solution: Use Groq for faster responses, implement caching
Inaccurate Suggestions
Cause: Limited context or outdated training data
Solution: Provide more evaluation details, use latest AI models
API Rate Limits
Cause: Exceeded provider limits
Solution: Implement request queuing, use multiple providers
Context Loss
Cause: Long conversations exceed token limits
Solution: Start new conversations for different topics
Performance Optimization​
Response Speed
- Use Groq for faster responses
- Implement response caching
- Optimize context window usage
Accuracy Improvement
- Provide detailed evaluation results
- Include relevant metadata
- Use specific, clear questions
Cost Management
- Monitor API usage
- Implement response caching
- Use appropriate model sizes
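Both the speed and cost tips above mention response caching. Here is a minimal client-side sketch, assuming that repeated lookups of the same metric and value return reusable answers; server-side caching, if available in your deployment, is a separate concern.

```python
# Sketch: client-side caching of metric explanations to cut latency and API cost.
# Assumes repeated (metric, value) lookups return reusable answers.
from functools import lru_cache

import requests

BASE_URL = "http://localhost:8000"

@lru_cache(maxsize=256)
def explain_metric_cached(metric_name, metric_value):
    return requests.get(
        f"{BASE_URL}/llm/explain-metric",
        params={"metric_name": metric_name, "metric_value": metric_value},
    ).json()["explanation"]["simple_description"]

# The second call with the same arguments is served from the cache
print(explain_metric_cached("wasserstein_distance", 0.15))
print(explain_metric_cached("wasserstein_distance", 0.15))
```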
Best Practices​
Effective AI Usage​
- Be Specific: Ask detailed, specific questions
- Provide Context: Include evaluation IDs and relevant metadata
- Iterate: Use follow-up questions to refine understanding
- Validate: Cross-check AI suggestions with your domain knowledge
Workflow Integration​
- Quality Assessment: Use AI explanations to understand evaluation results
- Improvement Planning: Leverage smart suggestions for optimization
- Documentation: Generate compliance documents automatically
- Knowledge Sharing: Use chat to explain concepts to team members
Compliance Considerations​
- Audit Trail: All AI interactions are logged
- Data Handling: Understand what data is shared with AI providers
- Regulatory Approval: Get approval for AI-assisted workflows
- Fallback Procedures: Have manual processes for critical decisions
Integration Examples​
Development Workflow​
# Python integration example
import requests


class SynthStudioAI:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url

    def explain_metric(self, metric_name, value):
        # Ask the AI to explain a metric value in plain English
        response = requests.get(
            f"{self.base_url}/llm/explain-metric",
            params={"metric_name": metric_name, "metric_value": value},
        )
        return response.json()

    def get_suggestions(self, evaluation_id):
        # Request AI-powered improvement suggestions for an evaluation
        response = requests.post(
            f"{self.base_url}/llm/suggest-improvements/{evaluation_id}"
        )
        return response.json()
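A usage sketch for the client above (the metric value and evaluation ID are placeholders):

```python
ai = SynthStudioAI()

# Explain one metric in plain English
print(ai.explain_metric("wasserstein_distance", 0.15)["explanation"]["simple_description"])

# List AI suggestions for an existing evaluation
for s in ai.get_suggestions("eval-123").get("suggestions", []):
    print(f"- [{s['priority']}] {s['title']}")
```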
CI/CD Integration​
# GitHub Actions example
- name: Evaluate Synthetic Data Quality
  run: |
    # Generate evaluation
    EVAL_ID=$(curl -s -X POST $API_URL/evaluations/run \
      -H "Content-Type: application/json" \
      -d "{\"generator_id\": \"$GENERATOR_ID\", \"dataset_id\": \"$DATASET_ID\"}" | jq -r .evaluation_id)

    # Get AI insights
    curl -X POST $API_URL/llm/suggest-improvements/$EVAL_ID

    # Generate compliance report
    curl -X POST $API_URL/generators/$GENERATOR_ID/compliance-report \
      -H "Content-Type: application/json" \
      -d '{"framework": "GDPR"}'
Next Steps​
After exploring AI features:
- Generate Compliance Reports - Create audit documentation
- Check Tutorials - Learn advanced workflows
- Review API Examples - Explore all endpoints
Want to learn more? Try asking the AI chat: "What are the best practices for synthetic data generation?" or check our API Examples for AI endpoint usage.