# Intelligent Schema Detector

## Overview

The **Intelligent Schema Detector** uses **name-based pattern matching** to infer column data types before falling back to value-based analysis. This ensures that columns like `interchange` (exchange rate) are correctly detected as `DECIMAL(10,4)` instead of `DECIMAL(10,2)`, and ID fields remain `VARCHAR` instead of being misidentified as `INT`.

## Detection Priority

```
┌─────────────────────────────────────┐
│  1. NAME-BASED PATTERN MATCHING     │  ← PRIMARY (NEW!)
│     (inferTypeFromName)             │
└─────────────────────────────────────┘
              ↓ (if no match)
┌─────────────────────────────────────┐
│  2. VALUE-BASED ANALYSIS            │  ← FALLBACK
│     (inferType, inferLength)        │
└─────────────────────────────────────┘
```
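
The priority order above can be sketched in a few lines. This is a minimal illustration, not the code in `SchemaDetector.php`: the `sketch*` functions are hypothetical stand-ins for `inferTypeFromName` and `inferType`, and only a single name rule is included.

```php
<?php
// Hypothetical stand-in for stage 1: return a type when the name matches
// a known pattern, or null when it does not. Only the "id" rule is shown.
function sketchTypeFromName(string $name): ?string
{
    return preg_match('/(^|_)id$/i', $name) === 1 ? 'VARCHAR(100)' : null;
}

// Hypothetical stand-in for stage 2: decide from the values alone.
// Simplified to "all integers" versus "anything else".
function sketchTypeFromValues(array $values): string
{
    foreach ($values as $value) {
        if (preg_match('/^-?\d+$/', (string) $value) !== 1) {
            return 'VARCHAR(255)';
        }
    }
    return 'INT';
}

function sketchDetect(string $name, array $values): string
{
    // The name-based result wins; values are only inspected on null.
    return sketchTypeFromName($name) ?? sketchTypeFromValues($values);
}

echo sketchDetect('property_id', ['101', '102']), "\n"; // VARCHAR(100): name rule wins
echo sketchDetect('nights', ['2', '3']), "\n";          // INT: decided from values
```

Note that `property_id` stays `VARCHAR(100)` even though every sampled value is numeric, which is the point of checking the name first.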

## Pattern Detection Rules

### RULE 1: ID Fields → VARCHAR(100)
**Patterns:** `*_id`, `id`, `codigo`, `*_code`, `code`, `*_key`, `key`, `folio`

**Examples:**
- `property_id` → `VARCHAR(100)`
- `transaction_code` → `VARCHAR(100)`
- `folio` → `VARCHAR(100)`

**Rationale:** ID fields often contain alphanumeric values (e.g., "PROP-12345-ABC") and should never be stored as integers.
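
To make the rationale concrete, the sketch below uses plain PHP integer casts to show what is lost when an identifier is forced into a number. It is only an illustration: how MySQL itself reacts to such values in an `INT` column depends on the server's SQL mode.

```php
<?php
// Sample identifiers: one alphanumeric, one with leading zeros, one plain.
$ids = ['PROP-12345-ABC', '00123', '12345'];

foreach ($ids as $id) {
    $asInt = (int) $id;
    // The value survives only if casting back gives the original string.
    $roundTrip = ((string) $asInt === $id) ? 'preserved' : 'changed';
    printf("%-16s -> %-6d (%s)\n", $id, $asInt, $roundTrip);
}
```

Only `12345` survives the round trip. The alphanumeric ID collapses to `0` and `00123` loses its leading zeros, so a sample that happens to contain only numeric IDs is not a safe basis for choosing `INT`.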

---

### RULE 2: High-Precision Money (Exchange Rates) → DECIMAL(10,4)
**Patterns:** `exchange_rate`, `interchange`, `tipo_cambio`, `tasa`, `*rate*`

**Examples:**
- `interchange` → `DECIMAL(10,4)`
- `exchange_rate` → `DECIMAL(10,4)`
- `tipo_cambio` → `DECIMAL(10,4)`

**Rationale:** Exchange rates require 4 decimal places to preserve precision (e.g., 17.0123 MXN/USD).
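
A short worked example of the precision argument, using the rate from the rationale. The 1,000 USD amount is an arbitrary figure chosen for illustration, and PHP's `round()` stands in for the rounding a `DECIMAL` column would apply.

```php
<?php
$rate = 17.0123; // MXN per USD

// What each column type would keep after rounding to its scale.
$asDecimal2 = round($rate, 2); // 17.01
$asDecimal4 = round($rate, 4); // 17.0123

// Converting 1,000 USD with each stored rate.
$usd = 1000;
printf("DECIMAL(10,2): %.2f MXN\n", $usd * $asDecimal2);
printf("DECIMAL(10,4): %.2f MXN\n", $usd * $asDecimal4);
printf("Difference:    %.2f MXN\n", $usd * ($asDecimal4 - $asDecimal2));
```

The two stored rates differ by 2.30 MXN per 1,000 USD converted, an error that grows linearly with the amount.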

---

### RULE 3: Standard Money Fields → DECIMAL(10,2)
**Patterns:** `*_amount`, `*_fee`, `*_payment`, `*_commission`, `*_percentage`, `*_convert`, `*_mxn`, `*_usd`, `*_eur`, `*_price`, `*_cost`, `*_total`, `*_subtotal`, `precio`, `pago`, `comision`

**Examples:**
- `total_amount` → `DECIMAL(10,2)`
- `guest_payment` → `DECIMAL(10,2)`
- `commission_fee` → `DECIMAL(10,2)`
- `mxn_convert` → `DECIMAL(10,2)`

**Rationale:** Standard currency values use 2 decimal places (e.g., $1,234.56).

---

### RULE 4: Date Fields → DATE
**Patterns:** `*_date`, `date`, `check_in`, `check_out`, `checkin`, `checkout`, `fecha`, `created_at`, `updated_at`

**Examples:**
- `check_in` → `DATE`
- `created_at` → `DATE`
- `fecha` → `DATE`

**Rationale:** Date fields should use MySQL's `DATE` type for proper date operations.

---

### RULE 5: Email Fields → VARCHAR(255)
**Patterns:** `*email*`, `*correo*`

**Examples:**
- `guest_email` → `VARCHAR(255)`
- `correo` → `VARCHAR(255)`

**Rationale:** Email addresses can be up to 254 characters (RFC 5321).

---

### RULE 6: Phone Fields → VARCHAR(20)
**Patterns:** `*phone*`, `*telefono*`, `tel`, `*_tel`, `*celular*`

**Examples:**
- `guest_phone` → `VARCHAR(20)`
- `telefono` → `VARCHAR(20)`
- `celular` → `VARCHAR(20)`

**Rationale:** Phone numbers can contain +, -, spaces, and international prefixes.

---

### RULE 7: Boolean/Status Fields → VARCHAR(50)
**Patterns:** `is_*`, `has_*`, `*_status`, `*_estado`, `activo`, `active`, `enabled`

**Examples:**
- `is_active` → `VARCHAR(50)`
- `booking_status` → `VARCHAR(50)`
- `activo` → `VARCHAR(50)`

**Rationale:** Status fields often contain text values like "Active", "Pending", "Cancelled".

---

### RULE 8: URL Fields → VARCHAR(500)
**Patterns:** `*_url`, `*_link`, `website`, `sitio_web`

**Examples:**
- `listing_url` → `VARCHAR(500)`
- `website` → `VARCHAR(500)`

**Rationale:** URLs can be very long, especially with query parameters.

---

### RULE 9: Description/Notes Fields → TEXT
**Patterns:** `*_description`, `*_notes`, `*_comentarios`, `*_observaciones`, `descripcion`, `notas`, `comentarios`, `observaciones`

**Examples:**
- `guest_notes` → `TEXT`
- `property_description` → `TEXT`
- `comentarios` → `TEXT`

**Rationale:** Description and notes fields can contain large amounts of text.

---

### RULE 10: Quantity/Count Fields → INT
**Patterns:** `*_count`, `*_quantity`, `*_qty`, `cantidad`, `numero`

**Examples:**
- `guest_count` → `INT`
- `nights_qty` → `INT`
- `cantidad` → `INT`

**Rationale:** Count and quantity fields are expected to hold whole numbers.
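
The rules above can be read as an ordered table of glob patterns. The sketch below shows one way such a table could be evaluated. It assumes rules are tried in the order they are numbered and that `*` means "any run of characters"; the real `inferTypeFromName()` may be organized differently. Only four of the ten rules are included, each with a subset of its patterns, and all names prefixed with `sketch` or `SKETCH_` are hypothetical.

```php
<?php
// Illustrative rule table: [glob patterns, resulting type], in rule order.
const SKETCH_RULES = [
    [['*_id', 'id', '*_code', 'code', 'folio'],                 'VARCHAR(100)'],
    [['exchange_rate', 'interchange', 'tipo_cambio', '*rate*'], 'DECIMAL(10,4)'],
    [['*_amount', '*_fee', '*_payment', '*_convert'],           'DECIMAL(10,2)'],
    [['*_count', '*_quantity', '*_qty', 'cantidad'],            'INT'],
];

// Turn a glob such as "*_id" into an anchored, case-insensitive regex.
function sketchGlobToRegex(string $glob): string
{
    // preg_quote() escapes "*" as "\*"; swap that for the regex ".*".
    return '/^' . str_replace('\*', '.*', preg_quote($glob, '/')) . '$/i';
}

// Return the type of the first rule with a matching pattern, or null.
function sketchInferTypeFromName(string $column): ?string
{
    foreach (SKETCH_RULES as [$patterns, $type]) {
        foreach ($patterns as $glob) {
            if (preg_match(sketchGlobToRegex($glob), $column) === 1) {
                return $type;
            }
        }
    }
    return null; // no match: the caller falls back to value-based analysis
}

var_dump(sketchInferTypeFromName('interchange'));  // DECIMAL(10,4)
var_dump(sketchInferTypeFromName('total_amount')); // DECIMAL(10,2)
var_dump(sketchInferTypeFromName('guest_name'));   // NULL
```

Under these assumptions, order and pattern breadth both matter. In this sketch a column named `corporate_fee` resolves to `DECIMAL(10,4)`, because "corporate" contains "rate" and the broad `*rate*` pattern is tried before `*_fee`. Anchoring `*_id` with an underscore likewise keeps a name such as `valid` from being treated as an ID.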

---

## Real-World Impact

### Before (Value-Based Only)
```
casitamx_transaction:
  - interchange: DECIMAL(10,2) ❌ → Precision lost! (17.0123 rounded to 17.01)
  - property_id: INT ❌ → Would fail on alphanumeric IDs like "PROP-ABC123"
  - check_in: VARCHAR(255) ❌ → Can't use date functions in SQL
```

### After (Name-Based + Value-Based)
```
casitamx_transaction:
  - interchange: DECIMAL(10,4) ✅ → Correct precision for exchange rates
  - property_id: VARCHAR(100) ✅ → Handles alphanumeric IDs correctly
  - check_in: DATE ✅ → Proper date type for SQL operations
```

## Implementation Details

### Modified Files
- **`/lamp/www/importer/lib/SchemaDetector.php`**
  - Added `inferTypeFromName()` method (10 pattern rules)
  - Modified `analyzeColumn()` to prioritize name-based detection

### Test Coverage
- **47 unit tests** covering all 10 rules + edge cases
- **16 real-world tests** using actual casitamx_transaction column names
- **100% pass rate** on both test suites

## Usage

The intelligent schema detector runs automatically during the upload process:

```php
// In upload.php (Gate 2: Arrival)
$schema = SchemaDetector::detectSchema($rows, $headers);

// Resulting definition for the `interchange` column:
// [
//     'name' => 'interchange',
//     'type' => 'DECIMAL',
//     'length' => '10,4',  // ← Detected from name pattern!
//     'nullable' => true,
//     'indexed' => false
// ]
```
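
As a rough illustration of how a definition shaped like the one in the usage example might be consumed, the sketch below turns it into a column clause for a `CREATE TABLE` statement. `sketchColumnSql()` is a hypothetical helper, not part of `SchemaDetector`, and treating an empty `length` as "no length" (for types such as `DATE` or `TEXT`) is an assumption.

```php
<?php
// Build one column clause from a schema entry with the keys
// 'name', 'type', 'length' and 'nullable'.
function sketchColumnSql(array $col): string
{
    // Quote the identifier with backticks, doubling any embedded backtick.
    $sql = '`' . str_replace('`', '``', $col['name']) . '` ' . $col['type'];

    // Append "(length)" only when a length is present (assumption).
    if (!empty($col['length'])) {
        $sql .= '(' . $col['length'] . ')';
    }

    return $sql . ($col['nullable'] ? ' NULL' : ' NOT NULL');
}

echo sketchColumnSql([
    'name' => 'interchange',
    'type' => 'DECIMAL',
    'length' => '10,4',
    'nullable' => true,
    'indexed' => false,
]), "\n";
```

For the `interchange` entry this prints a clause with `DECIMAL(10,4)` and `NULL`; the `indexed` flag is ignored here because index creation would be a separate statement.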

## Fallback Behavior

If a column name doesn't match any pattern (e.g., `random_field`, `guest_name`), the system automatically falls back to **value-based detection**:

1. Analyze actual data values
2. Determine type based on content (INT, DECIMAL, DATE, TEXT, VARCHAR)
3. Calculate appropriate length based on value sizes

This ensures **100% coverage**: every column is assigned a type, even when its name matches no pattern.
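
The three fallback steps can be sketched as follows. This is a simplified illustration rather than the actual `inferType` / `inferLength` logic: the regular expressions, the `DECIMAL(10,2)` default, the `VARCHAR(255)` default for empty columns, and the 255-byte cutoff for `TEXT` are all assumptions, and the `sketch*` function names are hypothetical.

```php
<?php
// True when every value matches the given regex.
function sketchAllMatch(array $values, string $regex): bool
{
    foreach ($values as $value) {
        if (preg_match($regex, (string) $value) !== 1) {
            return false;
        }
    }
    return true;
}

function sketchInferFromValues(array $values): string
{
    // Step 1: analyze only the non-empty values.
    $values = array_values(array_filter($values, static function ($v) {
        return $v !== null && $v !== '';
    }));
    if ($values === []) {
        return 'VARCHAR(255)'; // nothing to analyze: assumed default
    }

    // Step 2: pick the narrowest type that every value fits.
    if (sketchAllMatch($values, '/^-?\d+$/')) {
        return 'INT';
    }
    if (sketchAllMatch($values, '/^-?\d+(\.\d+)?$/')) {
        return 'DECIMAL(10,2)';
    }
    if (sketchAllMatch($values, '/^\d{4}-\d{2}-\d{2}$/')) {
        return 'DATE';
    }

    // Step 3: size text columns from the longest value (in bytes).
    $max = 0;
    foreach ($values as $value) {
        $max = max($max, strlen((string) $value));
    }
    return $max > 255 ? 'TEXT' : "VARCHAR($max)";
}

echo sketchInferFromValues(['Ana', 'Roberto']), "\n"; // VARCHAR(7)
echo sketchInferFromValues(['1.5', '2', '']), "\n";   // DECIMAL(10,2)
```

A real implementation would likely pad the computed length and accept more date formats; the sketch keeps only the overall shape of the fallback.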

## Future Enhancements

Potential improvements:
1. **User-customizable patterns** - Allow users to add their own domain-specific rules
2. **Machine learning** - Learn patterns from historical imports
3. **Multi-language support** - Extend the patterns beyond English and Spanish (French, Portuguese, etc.)
4. **Confidence scoring** - Show detection confidence in UI

## Testing

Run the test suites to verify functionality:

```bash
# Comprehensive unit tests (47 tests)
php /lamp/www/importer/tests/test_intelligent_schema_detector.php

# Real-world tests using casitamx_transaction columns (16 tests)
php /lamp/www/importer/tests/test_real_world_casitamx.php
```

Expected output: **🎉 ALL TESTS PASSED! 🎉**
