Supported Formats Reference

Complete reference of data formats, file types, and data structure requirements supported by Synthetic Data Studio.

File Format Support

CSV Files

Supported Variants

  • Standard CSV: Comma-separated values with headers
  • TSV: Tab-separated values
  • Custom Delimiters: Pipe (|), semicolon (;), etc.
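
To confirm a non-comma delimiter parses as expected before upload, the file can be checked locally with pandas (an illustrative sketch; the file names are hypothetical):

import pandas as pd

# sep switches the delimiter; these file names are placeholders
df_pipe = pd.read_csv("customers_pipe.csv", sep="|")
df_tsv = pd.read_csv("customers.tsv", sep="\t")
print(df_pipe.columns.tolist())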

Requirements

# Required: Header row
name,age,income,city
John,25,50000,New York
Jane,30,60000,London
Bob,35,70000,Paris

Encoding Support

  • UTF-8 (recommended)
  • UTF-16
  • Latin-1
  • Windows-1252

Size Limits

  • Maximum Rows: 1,000,000
  • Maximum Columns: 500
  • Maximum File Size: 100MB (configurable)

JSON Files

Supported Structures

Array of Objects (Recommended):

[
  {"name": "John", "age": 25, "income": 50000},
  {"name": "Jane", "age": 30, "income": 60000},
  {"name": "Bob", "age": 35, "income": 70000}
]

Newline-Delimited JSON:

{"name": "John", "age": 25, "income": 50000}
{"name": "Jane", "age": 30, "income": 60000}
{"name": "Bob", "age": 35, "income": 70000}

Nested Structures

{
  "customer": {
    "name": "John Doe",
    "age": 25,
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "zip": "10001"
    }
  },
  "transactions": [
    {"amount": 100.50, "date": "2023-01-15"},
    {"amount": 250.00, "date": "2023-02-20"}
  ]
}

Note: Nested structures are flattened during processing.
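
For example, the nested customer object above can be flattened locally with pandas.json_normalize, which approximates this behavior (a sketch, not the Studio's exact algorithm):

import pandas as pd

record = {
    "customer": {
        "name": "John Doe",
        "age": 25,
        "address": {"street": "123 Main St", "city": "New York", "zip": "10001"}
    }
}

# Nested keys become dot-separated column names
flat = pd.json_normalize(record)
print(flat.columns.tolist())
# ['customer.name', 'customer.age', 'customer.address.street',
#  'customer.address.city', 'customer.address.zip']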

Excel Files

Supported Formats

  • .xlsx (Excel 2007+)
  • .xls (Excel 97-2003)

Sheet Selection

  • Default: First sheet
  • Named Sheet: Specify sheet name in upload options (see the sketch below)

Data Requirements

  • Headers: First row must contain column names
  • Data Start: Data starts from row 2
  • Empty Rows: Automatically skipped
  • Merged Cells: Not supported; merged cells are unmerged automatically during import
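
Before uploading, a named sheet can be inspected locally to confirm the header row and layout (a sketch using pandas; the file and sheet names are hypothetical):

import pandas as pd

# header=0 reads the first row as column names; data starts from row 2
df = pd.read_excel("customers.xlsx", sheet_name="Q1_Customers", header=0)

# Drop fully empty rows, mirroring the automatic skipping described above
df = df.dropna(how="all")
print(df.dtypes)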

Parquet Files

Support Level

  • Basic Support: Simple tabular structures
  • Complex Types: Limited support for nested structures

Advantages

  • Compressed: Efficient storage
  • Columnar: Fast column access
  • Typed: Preserves data types
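
A large CSV can be converted to Parquet locally to get these benefits before upload (a sketch; pyarrow or fastparquet must be installed, and the file names are placeholders):

import pandas as pd

df = pd.read_csv("large_dataset.csv")

# Columnar, compressed, and type-preserving on disk
df.to_parquet("large_dataset.parquet", index=False)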

Data Type Support

Automatic Type Inference

Synthetic Data Studio automatically detects and converts data types:

Numeric Types

| Detected As | Examples | SQL Type | Notes |
|-------------|----------|----------|-------|
| integer | 25, 1000, -5 | INTEGER | Whole numbers |
| float | 25.5, 1000.99, 3.14 | FLOAT | Decimal numbers |
| boolean | true, false, 1, 0 | BOOLEAN | True/false values |

Text Types

| Detected As | Examples | SQL Type | Notes |
|-------------|----------|----------|-------|
| string | "John", "New York" | VARCHAR | Any text |
| categorical | "red", "blue", "green" | VARCHAR | Limited unique values |
| email | "user@example.com" | VARCHAR | Email pattern |
| phone | "+1-555-123-4567" | VARCHAR | Phone pattern |
| url | "https://example.com" | VARCHAR | URL pattern |

Date/Time Types

| Detected As | Examples | SQL Type | Notes |
|-------------|----------|----------|-------|
| date | "2023-01-15", "01/15/2023" | DATE | Date only |
| datetime | "2023-01-15 14:30:00" | DATETIME | Date and time |
| timestamp | 1642152600 | TIMESTAMP | Unix timestamp |

Type Conversion Rules

String to Numeric

# Automatic conversion
"25" → 25 (integer)
"25.5" → 25.5 (float)
"true" → True (boolean)
"false" → False (boolean)

Date Parsing

# Supported formats
"2023-01-15" → date(2023, 1, 15)
"01/15/2023" → date(2023, 1, 15) # US format
"15/01/2023" → date(2023, 1, 15) # European format
"2023-01-15 14:30:00" → datetime(2023, 1, 15, 14, 30, 0)

Categorical Detection

# Automatically detected if:
# - Unique values < 10% of total rows
# - String values repeat frequently
# - Explicit category indicators are present
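
The first heuristic can be sketched in pandas (an approximation of the rule above, not the exact implementation):

import pandas as pd

def looks_categorical(series: pd.Series, max_ratio: float = 0.10) -> bool:
    """Flag a column whose unique values are under 10% of its rows."""
    return series.nunique() / len(series) < max_ratio

colors = pd.Series(["red", "blue", "green"] * 100)
print(looks_categorical(colors))  # True: 3 unique values in 300 rows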

Data Structure Requirements

Tabular Format

All data must be representable as a table:

Required Structure

Column 1 | Column 2 | Column 3 | ...
---------|----------|----------|---
Value 1 | Value 2 | Value 3 | ...
Value 1 | Value 2 | Value 3 | ...

Column Requirements

  • Unique Names: Each column must have a unique name
  • No Empty Names: Column names cannot be empty
  • Valid Characters: Letters, numbers, underscores, spaces
  • Case Sensitive: Column names are case-sensitive

Data Quality Standards

Completeness

  • Missing Values: Supported (represented as NULL)
  • Empty Strings: Converted to NULL for non-string columns
  • Sparse Data: Acceptable up to 50% missing values

Consistency

  • Type Consistency: Values in a column should all be of the same type
  • Format Consistency: Dates and numbers should follow consistent formats
  • Encoding Consistency: All text should use the same encoding

Validity

  • Range Checks: Numeric values within reasonable ranges
  • Format Validation: Emails, phones, URLs follow valid patterns
  • Logical Consistency: Related columns should have consistent values

Data Preprocessing

Automatic Processing

Type Conversion

# Input data
{"age": "25", "income": "50000.50", "active": "true"}

# Processed data
{"age": 25, "income": 50000.50, "active": True}

Missing Value Handling

# Input with missing values
{"name": "John", "age": "", "income": null}

# Processed data
{"name": "John", "age": null, "income": null}

String Normalization

# Input data
{"name": " john doe ", "email": "JOHN@EXAMPLE.COM"}

# Processed data
{"name": "john doe", "email": "john@example.com"}

Manual Preprocessing Options

Data Cleaning Scripts

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Remove duplicates
    df = df.drop_duplicates()

    # Handle missing values
    df = df.fillna({
        'age': df['age'].median(),
        'income': df['income'].mean()
    })

    # Normalize text
    df['name'] = df['name'].str.strip().str.lower()

    return df

Schema Definition

{
  "columns": {
    "customer_id": {"type": "integer", "primary_key": true},
    "name": {"type": "string", "max_length": 100},
    "age": {"type": "integer", "min": 0, "max": 120},
    "email": {"type": "string", "format": "email"},
    "income": {"type": "float", "min": 0}
  },
  "constraints": {
    "unique": ["email"],
    "not_null": ["customer_id", "name"]
  }
}

Dataset Size Guidelines

Small Datasets (< 10,000 rows)

  • Best for: Prototyping, testing, learning
  • Processing: Instant (< 1 second)
  • Quality: May have higher variance
  • Use Cases: Examples, tutorials, validation

Medium Datasets (10,000 - 100,000 rows)

  • Best for: Development, moderate analysis
  • Processing: Fast (1-10 seconds)
  • Quality: Good balance of speed and accuracy
  • Use Cases: Application development, research

Large Datasets (100,000 - 1,000,000 rows)

  • Best for: Production, comprehensive analysis
  • Processing: Moderate (10-60 seconds)
  • Quality: High accuracy, stable results
  • Use Cases: Enterprise applications, large-scale research

Very Large Datasets (> 1,000,000 rows)

  • Best for: Big data applications
  • Processing: Extended (1-10 minutes)
  • Quality: Maximum accuracy
  • Considerations: Memory usage, processing time
  • Alternatives: Sampling, distributed processing
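
For the sampling alternative listed above, a simple random sample can be drawn before upload (a sketch; the sample size is arbitrary):

import pandas as pd

df = pd.read_csv("very_large_dataset.csv")

# A fixed seed keeps the sample reproducible
sample = df.sample(n=500_000, random_state=42)
sample.to_csv("sampled_dataset.csv", index=False)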

Unsupported Formats

File Types

  • Images: PNG, JPG, GIF (use metadata extraction)
  • Videos: MP4, AVI (use metadata extraction)
  • Audio: MP3, WAV (use metadata extraction)
  • Documents: PDF, DOCX (use text extraction)
  • Archives: ZIP, TAR (extract first)

Data Structures

  • Graphs: Node/edge data (convert to tabular)
  • Time Series: Irregular intervals (resample first)
  • Geospatial: Complex geometries (use coordinates)
  • Hierarchical: Deep nesting (flatten first)

Special Cases

  • Encrypted Data: Must be decrypted before upload
  • Compressed Data: Decompress before upload
  • Binary Data: Convert to text representation
  • Real-time Streams: Batch into files first

Data Transformation

Schema Mapping

{
  "source_schema": {
    "old_column_name": "new_column_name",
    "customer_name": "name",
    "customer_age": "age"
  },
  "type_conversions": {
    "age": "integer",
    "income": "float"
  },
  "value_mappings": {
    "gender": {
      "M": "Male",
      "F": "Female",
      "O": "Other"
    }
  }
}
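
Applied with pandas, a mapping like this amounts to a rename plus type and value conversions (an illustrative sketch, not a Studio API):

import pandas as pd

df = pd.read_csv("customers.csv")

# source_schema: rename columns
df = df.rename(columns={"customer_name": "name", "customer_age": "age"})

# type_conversions
df["age"] = df["age"].astype("int64")
df["income"] = df["income"].astype("float64")

# value_mappings
df["gender"] = df["gender"].map({"M": "Male", "F": "Female", "O": "Other"})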

Data Validation Rules

{
  "validation_rules": {
    "age": {
      "type": "integer",
      "min": 0,
      "max": 120
    },
    "email": {
      "type": "string",
      "pattern": "^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$"
    },
    "income": {
      "type": "float",
      "min": 0,
      "max": 10000000
    }
  }
}
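
The same rules can be checked locally before upload (a sketch of one way to enforce them, not a built-in validator):

import pandas as pd

def find_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that break the example rules above."""
    bad_age = ~df["age"].between(0, 120)
    bad_income = ~df["income"].between(0, 10_000_000)
    bad_email = ~df["email"].astype(str).str.match(r"^[\w\.-]+@[\w\.-]+\.\w+$")
    return df[bad_age | bad_income | bad_email]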

Best Practices

Data Preparation

  1. Clean Headers: Use descriptive, consistent column names
  2. Consistent Formats: Standardize dates, numbers, and text
  3. Remove Unnecessary Data: Delete unused columns
  4. Validate Data Types: Ensure columns have appropriate types
  5. Check for Outliers: Review extreme values
  6. Handle Missing Data: Decide on imputation strategy

File Organization

  1. One Dataset Per File: Avoid multiple datasets in one file
  2. Consistent Naming: Use descriptive file names
  3. Version Control: Include version numbers in filenames
  4. Documentation: Include data dictionary or README

Quality Assurance

  1. Sample First: Test with small subset before full upload
  2. Validate Types: Check automatic type inference
  3. Review Statistics: Examine generated profiles
  4. Test Synthesis: Run small generation test
  5. Scale Up: Gradually increase dataset size

Troubleshooting

Common Format Issues

"Unable to parse file"

Causes: Corrupted file, unsupported encoding, invalid format
Solutions: Check file integrity, convert encoding, validate format

"Column names missing"

Causes: No header row, empty first row
Solutions: Add header row, remove empty rows

"Type conversion failed"

Causes: Inconsistent data types, invalid values
Solutions: Clean data, standardize formats, handle exceptions

"File too large"

Causes: Exceeds size limit
Solutions: Split file, increase limit, use compression

Data Quality Issues

"High missing values"

Causes: Sparse data, collection issues
Solutions: Imputation, filtering, data collection improvement

"Inconsistent types"

Causes: Mixed data types in columns
Solutions: Separate columns, data cleaning, type standardization

"Outlier values"

Causes: Data entry errors, legitimate extreme values
Solutions: Review validity, apply filtering, use robust statistics

Performance Considerations

File Size Impact

| File Size | Rows | Processing Time | Memory Usage |
|-----------|------|-----------------|--------------|
| < 1MB | < 10K | < 1 second | < 50MB |
| 1-10MB | 10K-100K | 1-5 seconds | 50-200MB |
| 10-100MB | 100K-1M | 5-30 seconds | 200MB-1GB |
| > 100MB | > 1M | 30+ seconds | 1GB+ |

Optimization Tips

  1. Compress Files: Use gzip for large CSV files (see the sketch after this list)
  2. Use Efficient Formats: Parquet for large datasets
  3. Pre-clean Data: Remove unnecessary columns
  4. Batch Processing: Split very large files
  5. Monitor Resources: Watch memory and CPU usage
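
For the first two tips, pandas can write gzip-compressed CSV directly, and Parquet usually shrinks large files further (a sketch; file names are placeholders):

import pandas as pd

df = pd.read_csv("large_dataset.csv")

# Tip 1: gzip-compressed CSV (compression inferred from the .gz suffix)
df.to_csv("large_dataset.csv.gz", index=False)

# Tip 2: Parquet for large datasets (requires pyarrow or fastparquet)
df.to_parquet("large_dataset.parquet", index=False)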

Integration Examples

Python Data Preparation

import pandas as pd

def prepare_dataset(input_file: str, output_file: str) -> pd.DataFrame:
    """Prepare a dataset for upload to Synthetic Data Studio."""

    # Load data
    df = pd.read_csv(input_file)

    # Clean column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

    # Drop rows missing required fields
    df = df.dropna(subset=['customer_id'])

    # Coerce types before imputing, so median/mean work on numeric data
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    df['income'] = pd.to_numeric(df['income'], errors='coerce')

    # Handle missing values
    df = df.fillna({
        'age': df['age'].median(),
        'income': df['income'].mean()
    })

    # Remove outliers (optional)
    df = df[df['age'].between(18, 100)]
    df = df[df['income'].between(0, 1_000_000)]

    # Save cleaned data
    df.to_csv(output_file, index=False)

    return df

API Upload with Validation

import pandas as pd
import requests
from pathlib import Path

def upload_with_validation(file_path: str, api_url: str, token: str) -> str:
    """Upload a dataset after validating it locally."""

    # Step 1: Validate file size locally
    file_size = Path(file_path).stat().st_size
    if file_size > 100 * 1024 * 1024:  # 100MB default limit
        raise ValueError("File too large")

    # Step 2: Check CSV structure on a small sample
    df = pd.read_csv(file_path, nrows=5)
    if df.empty or len(df.columns) == 0:
        raise ValueError("Invalid CSV structure")

    # Step 3: Upload
    with open(file_path, 'rb') as f:
        response = requests.post(
            f"{api_url}/datasets/upload",
            files={'file': f},
            headers={'Authorization': f'Bearer {token}'}
        )

    if response.status_code == 200:
        dataset_id = response.json()['id']
        print(f"Upload successful: {dataset_id}")

        # Step 4: Profile (optional)
        profile_response = requests.post(
            f"{api_url}/datasets/{dataset_id}/profile",
            headers={'Authorization': f'Bearer {token}'}
        )
        if profile_response.status_code == 200:
            print("Profiling completed")

        return dataset_id
    else:
        raise Exception(f"Upload failed: {response.text}")

Need help with data formats? Check our Data Upload Guide or create an issue on GitHub.