# Tabler Amount Parsing Enhancement Plan

## Problem Analysis

The current output shows four critical issues:

### Issue 1: Reference Numbers Parsed as Amounts

**Example:**
```
12-01-2024  PAGO TRANSFERENCIA SPEI HORA 13:59:30  1,014,195.00  5.17  2.81
```

The `1,014,195` is a **reference number** (SPEI tracking number), NOT an amount. The actual amounts are:
- Cargo (debit): 5.17
- Saldo (balance): 2.81

**Root Cause:** The current heuristic treats any numeric string as a potential amount. Reference numbers like `1014195` are being parsed as `1,014,195.00`.

### Issue 2: Huge Phantom Numbers

**Example:**
```
12-01-9944  ELEYEME ASOCIADOS SA DE CV...  0.00  124,044,891,249,179,951,104.00  124,044,891,249,179,951,104.00
```

The value `124,044,891,249,179,951,104.00` cannot be a real balance. It is a parsing artifact, most likely produced by concatenated numbers or page-header text.

**Root Cause:** Header/footer text is being parsed as transactions.

### Issue 3: Wrong Date Parsing

**Example:**
```
12-01-9944  CODIGO DE CLIENTE NO...
12-01-2026  CUENTA SANTANDER PYME...
```

The years `9944` and `2026` are wrong. They most likely come from non-date numbers being read as the year.

**Root Cause:** Numbers like `9944` in text are being interpreted as years.

### Issue 4: Column Assignment Wrong

The actual Santander statement layout is:
```
FECHA       CONCEPTO                           RETIROS    DEPOSITOS    SALDO
10-ENE-2024 ABONO TRANSFERENCIA SPEI...                   277.82       282.01
12-ENE-2024 PAGO TRANSFERENCIA SPEI...         5.17                    2.81
```

The columns are:
1. **FECHA** (Date)
2. **CONCEPTO** (Description) - includes reference numbers
3. **RETIROS** (Withdrawals/Debits) - may be empty
4. **DEPOSITOS** (Deposits/Credits) - may be empty
5. **SALDO** (Balance)

**Root Cause:** Reference numbers in the description are being extracted as amounts, which pushes the real amounts into the wrong columns.

---

## Solution Strategy

### Strategy 1: Amount Magnitude Filtering

Bank transactions typically have amounts in a reasonable range. Filter out:
- Amounts > 100,000,000 (100 million) - likely parsing errors
- Reference numbers (6-10 digit integers without decimals)

```php
function is_valid_amount($value, $text) {
    // Reject amounts over 100 million
    if (abs($value) > 100000000) {
        return false;
    }
    
    // Reject if original text looks like a reference number
    // Reference numbers: 6-10 digits, no decimal point
    if (preg_match('/^\d{6,10}$/', preg_replace('/[,\s]/', '', $text))) {
        return false;
    }
    
    return true;
}
```

### Strategy 2: Decimal Point Requirement

Amounts on the statement are printed with two decimal places (e.g., `5.17`, `277.82`). Reference numbers are plain integers with no decimal part.

```php
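function looks_like_amount($text) {
    // The whole token must be digits (optionally comma-grouped) followed by
    // a decimal separator and exactly 2 digits, e.g. "5.17" or "1,234.56"
    return preg_match('/^\d[\d,]*[.,]\d{2}$/', trim($text)) === 1;
}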
```

### Strategy 3: Column Position Detection

In `-layout` output, columns are separated by whitespace. The rightmost 3 numeric columns are:
1. Retiros (Debit) - may be empty
2. Depósitos (Credit) - may be empty  
3. Saldo (Balance) - always present

```php
function parse_santander_line($line) {
    // Split by 2+ spaces
    $columns = preg_split('/\s{2,}/', trim($line));
    
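    // Collect amount columns from the right, stopping at the first
    // non-amount column so numbers inside the description are ignored
    $numeric_cols = [];
    for ($i = count($columns) - 1; $i >= 0; $i--) {
        if (!looks_like_amount($columns[$i])) break;
        $numeric_cols[] = $columns[$i];
        if (count($numeric_cols) >= 3) break;
    }

    // The rightmost column is always the balance. With all three present,
    // the order from the right is balance, credit, debit. With only two,
    // this split cannot tell whether debit or credit is the empty column:
    // the amount lands in credit provisionally and the balance continuity
    // check has to confirm or swap it.
    $balance = $numeric_cols[0] ?? null;
    $credit = $numeric_cols[1] ?? null;
    $debit = $numeric_cols[2] ?? null;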
    
    return compact('debit', 'credit', 'balance');
}
```
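
Splitting on whitespace discards the information about which column is empty. One possible alternative, sketched below, is to use the character offsets of the `RETIROS`, `DEPOSITOS` and `SALDO` labels in the header line and assign each amount to the nearest label. The function name `assign_amounts_by_header` is hypothetical, and the sketch assumes the header line is available and that byte offsets approximate character columns (accented characters before the amounts would shift them slightly).

```php
// Hypothetical helper: assign amounts to debit/credit/balance by comparing
// each amount's horizontal position with the header labels' positions.
function assign_amounts_by_header(string $header, string $line): array {
    $labels = ['debit' => 'RETIROS', 'credit' => 'DEPOSITOS', 'balance' => 'SALDO'];
    $result = ['debit' => null, 'credit' => null, 'balance' => null];

    $centers = [];
    $left_edge = PHP_INT_MAX;
    foreach ($labels as $field => $label) {
        $pos = strpos($header, $label);
        if ($pos === false) {
            return $result; // header not recognised: assign nothing
        }
        $centers[$field] = $pos + strlen($label) / 2;
        $left_edge = min($left_edge, $pos);
    }

    // Standalone tokens with exactly two decimals, e.g. 5.17 or 1,234.56
    $pattern = '/(?<![\d.,])(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?![\d.,])/';
    preg_match_all($pattern, $line, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as [$text, $offset]) {
        $end = $offset + strlen($text);
        if ($end <= $left_edge) {
            continue; // token ends before the first amount column starts
        }
        $mid = ($offset + $end) / 2;
        $nearest = null;
        $best = INF;
        foreach ($centers as $field => $center) {
            if (abs($mid - $center) < $best) {
                $best = abs($mid - $center);
                $nearest = $field;
            }
        }
        $result[$nearest] = (float) str_replace(',', '', $text);
    }

    return $result;
}

// Fixed-width sample built with str_pad so the offsets are unambiguous
$header = str_pad('FECHA', 12) . str_pad('CONCEPTO', 35) . 'RETIROS    DEPOSITOS    SALDO';
$line = str_pad('12-ENE-2024 PAGO TRANSFERENCIA SPEI', 47) . str_pad('5.17', 24) . '2.81';
print_r(assign_amounts_by_header($header, $line));
```

Matching on the nearest label center rather than on exact start offsets is meant to tolerate both left-aligned and right-aligned amounts. Whether the real layout output keeps columns this stable would need to be checked against actual statements.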

### Strategy 4: Header/Footer Filtering

Filter out lines that match header/footer patterns:

```php
function is_header_or_footer($line) {
    $patterns = [
        '/CODIGO DE CLIENTE/',
        '/PERIODO\s*:\s*\d+\s*AL\s*\d+/',
        '/ELEYEME ASOCIADOS SA DE CV/',
        '/BANCO SANTANDER MEXICO/',
        '/UNIDAD ESPECIALIZADA/',
        '/FECHA Y HORA DE EXPEDICION/',
        '/REGIMEN FISCAL/',
    ];
    
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $line)) {
            return true;
        }
    }
    
    return false;
}
```

### Strategy 5: Date Validation

Validate that parsed dates are within a reasonable range:

```php
function is_valid_date($date_str) {
    // Parse DD-MM-YYYY
    if (preg_match('/^(\d{2})-(\d{2})-(\d{4})$/', $date_str, $m)) {
        $year = (int)$m[3];
        // Reject years outside 2000-2030
        if ($year < 2000 || $year > 2030) {
            return false;
        }
        return true;
    }
    return false;
}
```
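
A fixed 2000-2030 range still accepts a phantom year such as `2026`. If the statement period is known (for example from the `PERIODO` header), the check can be tightened to that period. The sketch below assumes the period boundaries have already been normalized to `DD-MM-YYYY`; the function name `is_date_within_period` is hypothetical.

```php
// Hypothetical helper: accept a DD-MM-YYYY date only if it is a real
// calendar date and falls inside the statement period (inclusive).
function is_date_within_period(string $date_str, string $period_start, string $period_end): bool {
    $parse = static function (string $s): ?DateTimeImmutable {
        // "!" resets the time fields so comparisons are date-only
        $d = DateTimeImmutable::createFromFormat('!d-m-Y', $s);
        // Round-trip comparison rejects overflowed dates such as 31-02-2024
        return ($d !== false && $d->format('d-m-Y') === $s) ? $d : null;
    };

    $date = $parse($date_str);
    $start = $parse($period_start);
    $end = $parse($period_end);

    if ($date === null || $start === null || $end === null) {
        return false;
    }

    return $date >= $start && $date <= $end;
}

// Assumed statement period: January 2024
var_dump(is_date_within_period('12-01-2024', '01-01-2024', '31-01-2024'));
var_dump(is_date_within_period('12-01-2026', '01-01-2024', '31-01-2024'));
```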

### Strategy 6: Balance Continuity Check

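Each row's balance should equal the previous balance plus the credit minus the debit. When it does not, check whether swapping debit and credit restores continuity: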

```php
function validate_balance_continuity($rows) {
    $prev_balance = null;
    
    foreach ($rows as $i => &$row) {
        if ($prev_balance !== null && $row['balance'] !== null) {
            $delta = $row['balance'] - $prev_balance;
            
            // If delta doesn't match debit/credit, try to fix
            $expected_delta = ($row['credit'] ?? 0) - ($row['debit'] ?? 0);
            
            if (abs($delta - $expected_delta) > 0.01) {
                // Swap debit/credit if that fixes it
                if (abs($delta - (($row['debit'] ?? 0) - ($row['credit'] ?? 0))) < 0.01) {
                    $tmp = $row['debit'] ?? null;
                    $row['debit'] = $row['credit'] ?? null;
                    $row['credit'] = $tmp;
                }
            }
        }
        
        $prev_balance = $row['balance'];
    }
    unset($row); // break the reference left behind by the foreach

    return $rows;
}
```
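
Rows that a swap cannot repair are still worth surfacing. The sketch below is a read-only companion that returns the indexes of rows whose balance does not follow from the previous one, so they can be flagged for audit. The name `find_continuity_breaks` and the half-cent default tolerance are assumptions, and the sample rows are made up for illustration.

```php
// Hypothetical helper: list the indexes of rows where
// previous balance + credit - debit does not equal the row's balance.
function find_continuity_breaks(array $rows, float $tolerance = 0.005): array {
    $breaks = [];
    $prev_balance = null;

    foreach ($rows as $i => $row) {
        $balance = $row['balance'] ?? null;

        if ($prev_balance !== null && $balance !== null) {
            $expected = $prev_balance + ($row['credit'] ?? 0) - ($row['debit'] ?? 0);
            // Half a cent absorbs float rounding but still catches 0.01 errors
            if (abs($balance - $expected) > $tolerance) {
                $breaks[] = $i;
            }
        }

        if ($balance !== null) {
            $prev_balance = $balance;
        }
    }

    return $breaks;
}

// Made-up rows: the last one does not follow from the row before it
$rows = [
    ['debit' => null, 'credit' => 277.82, 'balance' => 282.01],
    ['debit' => 5.17, 'credit' => null, 'balance' => 276.84],
    ['debit' => null, 'credit' => 100.00, 'balance' => 376.84],
    ['debit' => 10.00, 'credit' => null, 'balance' => 999.00],
];
print_r(find_continuity_breaks($rows));
```

Unlike the swap logic, this function never modifies rows, so it can run after any repair step without changing the result.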

---

## Implementation Plan

### Module 1: Amount Validation Function

**Location:** After `parse_money()` function

```php
function is_valid_transaction_amount($value, $original_text) {
    // Reject null or zero
    if ($value === null || $value == 0) {
        return false;
    }
    
    // Reject amounts over 100 million (likely parsing errors)
    if (abs($value) > 100000000) {
        return false;
    }
    
    // Reject if original text looks like a reference number
    // Reference numbers: 6-10 digits without decimal
    $clean = preg_replace('/[,\s]/', '', $original_text);
    if (preg_match('/^\d{6,10}$/', $clean)) {
        return false;
    }
    
    // Prefer amounts with decimal points
    if (!preg_match('/[.,]\d{2}$/', $original_text)) {
        // No decimal - might be reference number
        // Only accept if small (< 10000)
        if (abs($value) >= 10000) {
            return false;
        }
    }
    
    return true;
}
```

### Module 2: Enhanced Header/Footer Filter

**Location:** Update `is_non_transaction_region()` function

Add patterns:
- `CODIGO DE CLIENTE`
- `PERIODO : XX AL XX`
- Company name patterns
- Bank legal notices
- CFDI/XML metadata
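
The pattern list above could be expressed roughly as follows. This is a sketch, not the actual `is_non_transaction_region()` code: the function names are hypothetical, and the three CFDI labels (`FOLIO FISCAL`, `SELLO DIGITAL`, `CADENA ORIGINAL`) are assumptions about what the metadata block contains. The account holder's name is passed in rather than matched with a generic `SA DE CV` pattern, since a generic pattern would also drop transfers whose counterparty is a company.

```php
// Hypothetical helper: build the regex list for non-transaction lines.
function build_non_transaction_patterns(string $account_holder): array {
    $patterns = [
        '/CODIGO DE CLIENTE/i',
        '/PERIODO\s*:\s*\d+\s*AL\s*\d+/i',
        '/UNIDAD ESPECIALIZADA/i', // bank legal notice
        '/FOLIO FISCAL/i',         // assumed CFDI label
        '/SELLO DIGITAL/i',        // assumed CFDI label
        '/CADENA ORIGINAL/i',      // assumed CFDI label
    ];

    // Match the holder's name word by word so that extra spaces in the
    // layout output do not defeat the pattern; skip it if the name is empty
    $words = preg_split('/\s+/', trim($account_holder), -1, PREG_SPLIT_NO_EMPTY);
    if ($words) {
        $quoted = array_map(fn($w) => preg_quote($w, '/'), $words);
        $patterns[] = '/' . implode('\s+', $quoted) . '/i';
    }

    return $patterns;
}

// Hypothetical helper: true if the line matches any of the patterns.
function matches_any_pattern(string $line, array $patterns): bool {
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $line) === 1) {
            return true;
        }
    }
    return false;
}

$patterns = build_non_transaction_patterns('ELEYEME ASOCIADOS SA DE CV');
var_dump(matches_any_pattern('CODIGO DE CLIENTE NO...', $patterns));
```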

### Module 3: Date Validation

**Location:** After `normalize_date()` function

```php
function validate_transaction_date($date_str) {
    if (!preg_match('/^(\d{2})-(\d{2})-(\d{4})$/', $date_str, $m)) {
        return false;
    }
    
    $day = (int)$m[1];
    $month = (int)$m[2];
    $year = (int)$m[3];
    
    // Validate the year range, then let checkdate() reject impossible
    // calendar dates such as 31-02-2024
    if ($year < 2000 || $year > 2030) return false;

    return checkdate($month, $day, $year);
}
```

### Module 4: Column-Position Amount Extraction

**Location:** Update `extract_amounts_from_layout()` function

Key changes:
1. Only extract amounts from rightmost 3 numeric columns
2. Require decimal point for amounts
3. Filter out reference numbers
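
Taken together, the three changes might look like the sketch below. It is not the real `extract_amounts_from_layout()`; the name `split_trailing_amounts` is hypothetical, and the sample line is an assumed reconstruction of the raw text behind the Issue 1 example. Tokens that fail the decimal test, such as the reference number, stay in the description text.

```php
// Hypothetical helper: peel up to $max_amounts decimal amounts off the
// right end of a layout line; everything else remains description text.
function split_trailing_amounts(string $line, int $max_amounts = 3): array {
    $tokens = preg_split('/\s{2,}/', trim($line));
    $amounts = [];

    while ($tokens && count($amounts) < $max_amounts) {
        $last = end($tokens);
        // Require exactly two decimals; plain integers are not amounts
        if (preg_match('/^(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}$/', $last) !== 1) {
            break;
        }
        // Prepend so the amounts keep their left-to-right order
        array_unshift($amounts, (float) str_replace(',', '', array_pop($tokens)));
    }

    return ['text' => implode('  ', $tokens), 'amounts' => $amounts];
}

// Assumed raw line: the reference number has no decimals in the source text
$line = '12-01-2024  PAGO TRANSFERENCIA SPEI HORA 13:59:30  1014195  5.17  2.81';
print_r(split_trailing_amounts($line));
```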

### Module 5: Post-Processing Validation

**Location:** After row assembly, before reconciliation

```php
function validate_and_clean_rows($rows) {
    $cleaned = [];
    
    foreach ($rows as $row) {
        // Skip invalid dates
        if (!validate_transaction_date($row['day'])) {
            continue;
        }
        
        // Skip header/footer rows
        if (is_header_or_footer($row['description'])) {
            continue;
        }
        
        // Skip rows whose balance is implausibly large (parsing artifact)
        if ($row['balance'] !== null && abs($row['balance']) > 100000000) {
            continue;
        }
        
        $cleaned[] = $row;
    }
    
    return $cleaned;
}
```

---

## Expected Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Valid Rows | 98 | ~60-70 | Fewer false positives |
| Reconciliation Failures | 68 (69%) | <10 (<15%) | 85%+ success |
| Date Accuracy | ~80% | 99%+ | No phantom years |
| Amount Accuracy | ~30% | 90%+ | No reference numbers |

---

## Testing Strategy

1. **Unit Test Amount Validation:**
   - `1014195` → rejected (reference number)
   - `5.17` → accepted (valid amount)
   - `124044891249179951104` → rejected (too large)

2. **Unit Test Date Validation:**
   - `12-01-2024` → accepted
   - `12-01-9944` → rejected
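   - `12-01-2026` → rejected (future date; this needs an upper bound tied to the statement period, because the fixed 2000-2030 range accepts 2026)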

3. **Integration Test:**
   - Run on Santander PDF
   - Verify ~60-70 valid transactions
   - Verify 85%+ reconciliation success
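
The unit cases above can be kept as data and run through a small table-driven checker. This is a sketch with a hypothetical `run_cases` helper, not a replacement for a real test framework. It only calls the validators if they are already loaded, and the `12-01-2026` case will be reported as a failure for as long as the date check relies on the fixed 2000-2030 range.

```php
// Hypothetical helper: run a validator against [arguments, expected] pairs
// and return a description of every case that did not match.
function run_cases(callable $validator, array $cases): array {
    $failures = [];
    foreach ($cases as [$args, $expected]) {
        $actual = (bool) $validator(...$args);
        if ($actual !== $expected) {
            $failures[] = sprintf(
                '%s: expected %s, got %s',
                json_encode($args),
                var_export($expected, true),
                var_export($actual, true)
            );
        }
    }
    return $failures;
}

// Cases taken from the list above: [value, original text] => accepted?
$amount_cases = [
    [[1014195.0, '1014195'], false],
    [[5.17, '5.17'], true],
    [[1.24044891249179951104e20, '124044891249179951104'], false],
];
$date_cases = [
    [['12-01-2024'], true],
    [['12-01-9944'], false],
    [['12-01-2026'], false],
];

// Only runs when the plan's validators are already defined
if (function_exists('is_valid_transaction_amount')) {
    print_r(run_cases('is_valid_transaction_amount', $amount_cases));
}
if (function_exists('validate_transaction_date')) {
    print_r(run_cases('validate_transaction_date', $date_cases));
}
```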

---

## Mermaid Diagram: Enhanced Parsing Flow

```mermaid
flowchart TD
    A[Raw Line] --> B{Is Header/Footer?}
    B -->|Yes| Z[Skip]
    B -->|No| C[Extract Date]
    
    C --> D{Valid Date?}
    D -->|No| Z
    D -->|Yes| E[Split by Whitespace]
    
    E --> F[Find Rightmost Numeric Columns]
    F --> G{Has Decimal Point?}
    
    G -->|No| H{Small Number?}
    H -->|No| I[Treat as Reference]
    H -->|Yes| J[Accept as Amount]
    G -->|Yes| J
    
    I --> K[Add to Description]
    J --> L[Assign to Debit/Credit/Balance]
    
    L --> M{Amount < 100M?}
    M -->|No| Z
    M -->|Yes| N[Create Transaction Row]
    
    N --> O[Balance Continuity Check]
    O --> P[Output]
```

---

## Implementation Order

1. **Module 1:** Amount validation function (15 min)
2. **Module 2:** Enhanced header/footer filter (10 min)
3. **Module 3:** Date validation (10 min)
4. **Module 4:** Column-position extraction (20 min)
5. **Module 5:** Post-processing validation (15 min)

**Total:** ~70 minutes

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Over-filtering valid transactions | Use conservative thresholds |
| Different bank formats | Keep heuristics general |
| Edge cases | Flag uncertain rows in audit |
| Performance | Filter early in pipeline |
