# Tabler Improvement Plan - Make It Undeniable

**Protocol:** tabler.mother.tongue.protocol.v1  
**Alias:** The Ledger Whisperer  
**Status:** Phase 2 - Hardening

---

## Executive Summary

The initial implementation of `tabler.php` processes PDFs end to end, but the output is inaccurate:
- Only 4 rows extracted from a statement with 20-30+ transactions
- 3 of the 4 extracted rows fail reconciliation
- Dates parsed incorrectly (e.g., "29-03-5280" instead of a valid date)
- No header row detected

This plan outlines 6 surgical fixes to achieve **undeniable** PDF parsing.

---

## Root Cause Analysis

### Issue 1: Missing `-layout` Flag in pdftotext

**Current Code (lines 262-269):**
```php
function pdf_to_text($filepath, $page = null, $debug = false) {
    $cmd = "pdftotext";
    // ... page selection ...
    $cmd .= " " . escapeshellarg($filepath);
    $cmd .= " -";  // Output to stdout
```

**Problem:** Without `-layout`, pdftotext loses column structure. Bank statements are tabular - columns matter.

**Fix:**
```php
$cmd = "pdftotext -layout";
```
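
A minimal sketch of how the full command might be assembled with the flag in place. `build_pdftotext_cmd()` is a hypothetical helper, not a function in `tabler.php`, and it assumes `$page` is a 1-based page number. `-f` and `-l` are pdftotext's first-page and last-page options.

```php
<?php
// Hypothetical helper (not part of tabler.php): builds the pdftotext command line.
function build_pdftotext_cmd(string $filepath, ?int $page = null): string
{
    // -layout keeps the physical column alignment of the page.
    $cmd = 'pdftotext -layout';

    if ($page !== null) {
        // -f / -l select the first and last page; equal values extract one page.
        $cmd .= ' -f ' . $page . ' -l ' . $page;
    }

    $cmd .= ' ' . escapeshellarg($filepath);
    $cmd .= ' -'; // "-" as the output file sends the text to stdout

    return $cmd;
}

echo build_pdftotext_cmd('/tmp/santander_test.pdf', 2), "\n";
```

Keeping command construction separate from execution makes the flag easy to check in a unit test without a PDF on disk.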

### Issue 2: Spanish Month Abbreviations Not Recognized

**Current Code (lines 522-545):**
```php
function normalize_spanish_month($month_name, $day, $year) {
    $months = [
        'january' => 1, 'enero' => 1,
        // ... full month names only
    ];
```

**Problem:** Santander uses abbreviations like `10-ENE-2024`, `31-DIC-2023`. These are not recognized.

**Fix:** Add abbreviation mappings:
```php
$months = [
    // Full names
    'january' => 1, 'enero' => 1,
    // ... existing ...
    
    // Abbreviations (Spanish); keys are lowercase, so lowercase the input before lookup
    'ene' => 1, 'feb' => 2, 'mar' => 3, 'abr' => 4,
    'may' => 5, 'jun' => 6, 'jul' => 7, 'ago' => 8,
    'sep' => 9, 'oct' => 10, 'nov' => 11, 'dic' => 12,
    
    // Abbreviations (English) that differ from the Spanish ones;
    // repeating shared keys such as 'feb' would only overwrite identical entries
    'jan' => 1, 'apr' => 4, 'aug' => 8, 'dec' => 12,
];
```
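
Since the statement prints months in upper case (`ENE`, `DIC`) while the map keys are lower case, the lookup has to normalize the token first. The sketch below shows one way to do that. `month_number_from_token()` is a hypothetical name, and stripping a trailing period is an assumption about other statement formats, not something observed in the Santander sample.

```php
<?php
// Hypothetical helper: resolves a month token to its number, or null if unknown.
function month_number_from_token(string $token): ?int
{
    static $months = [
        // Spanish abbreviations
        'ene' => 1, 'feb' => 2, 'mar' => 3, 'abr' => 4,
        'may' => 5, 'jun' => 6, 'jul' => 7, 'ago' => 8,
        'sep' => 9, 'oct' => 10, 'nov' => 11, 'dic' => 12,
        // English abbreviations that differ from the Spanish ones
        'jan' => 1, 'apr' => 4, 'aug' => 8, 'dec' => 12,
    ];

    // Lowercase the token and drop an optional trailing period ("ene.").
    $key = strtolower(rtrim(trim($token), '.'));

    return $months[$key] ?? null;
}

var_dump(month_number_from_token('ENE'));  // int(1)
var_dump(month_number_from_token('Dic.')); // int(12)
var_dump(month_number_from_token('xyz'));  // NULL
```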

### Issue 3: Date Pattern Not Matching DD-MMM-YYYY

**Current Code (lines 487-495):**
```php
$patterns = [
    // DD/MM/YYYY or DD-MM-YYYY
    '/^(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})$/' => 'd-m-y',
    // ...
];
```

**Problem:** Pattern expects numeric month, not text month like `ENE`.

**Fix:** Add pattern for DD-MMM-YYYY:
```php
$patterns = [
    // DD-MMM-YYYY (e.g., 10-ENE-2024)
    '/^(\d{1,2})[-\/]([A-Za-z]{3})[-\/](\d{4})$/' => 'd-M-y',
    // ... existing patterns ...
];
```
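
To illustrate how the new pattern and the abbreviation lookup could work together, here is a self-contained sketch that turns a text-month date into DD-MM-YYYY. `parse_text_month_date()` is a hypothetical helper, not the existing `normalize_date()`. Validating with `checkdate()` is an added assumption, so that impossible dates are rejected rather than emitted.

```php
<?php
// Hypothetical helper: converts DD-MMM-YYYY (text month) to DD-MM-YYYY, or null.
function parse_text_month_date(string $raw): ?string
{
    // Index + 1 is the month number; Spanish list first, then English.
    static $es = ['ene', 'feb', 'mar', 'abr', 'may', 'jun',
                  'jul', 'ago', 'sep', 'oct', 'nov', 'dic'];
    static $en = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                  'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

    if (!preg_match('/^(\d{1,2})[-\/]([A-Za-z]{3})[-\/](\d{4})$/', trim($raw), $m)) {
        return null; // not a text-month date
    }

    $abbr = strtolower($m[2]);
    $idx = array_search($abbr, $es, true);
    if ($idx === false) {
        $idx = array_search($abbr, $en, true);
    }
    if ($idx === false) {
        return null; // three letters, but not a known month
    }

    $day = (int) $m[1];
    $month = $idx + 1;
    $year = (int) $m[3];

    // checkdate() rejects impossible dates such as 31-FEB-2024.
    if (!checkdate($month, $day, $year)) {
        return null;
    }

    return sprintf('%02d-%02d-%04d', $day, $month, $year);
}

var_dump(parse_text_month_date('10-ENE-2024')); // string(10) "10-01-2024"
var_dump(parse_text_month_date('31-FEB-2024')); // NULL
```

Returning `null` instead of a best-guess string lets the caller fall through to the remaining patterns.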

### Issue 4: Row Detection Too Strict

**Current Code (lines 811-837):**
```php
function is_transaction_row($tokens, $debug = false) {
    // Requires BOTH date AND number
    return $has_date && $has_number;
}
```

**Problem:** Multi-line transactions have date on one line, amounts on another.

**Fix:** Implement state machine for transaction assembly:
```php
// States: WAITING_FOR_DATE, READING_DESCRIPTION, READING_AMOUNTS
// Transition on: date pattern, amount pattern, empty line
```

### Issue 5: Column Detection Fails Without bbox Data

**Current Code (lines 724-777):**
```php
function detect_column_positions($tokens, $debug = false) {
    // Relies on bbox data from tokens
    foreach ($tokens as $token) {
        if (!isset($token['bbox'])) {
            continue;
        }
```

**Problem:** When the text comes from pdftotext rather than OCR, tokens carry no `bbox` data, so the function has nothing to work with (tokens = 0).

**Fix:** Use whitespace-based column detection for layout text:
```php
function parse_layout_columns($line) {
    // Split by 2+ consecutive spaces
    $columns = preg_split('/\s{2,}/', trim($line));
    return $columns;
}
```

### Issue 6: Amount Parsing Heuristic Wrong

**Current Code (lines 1039-1066):**
```php
// Heuristic: largest value is balance
usort($amounts, function($a, $b) {
    return $b['value'] - $a['value'];
});
$row['balance'] = $amounts[0]['value'];
```

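**Problem:** This assumes the balance is always the largest number on the row, which fails whenever a single debit or credit exceeds the resulting balance. The comparator also returns a float difference, which `usort` casts to int, so amounts less than 1 apart compare as equal.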

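**Fix:** Use column position to determine amount type. Position here means the column's horizontal offset in the `-layout` text, not the ordinal count of numbers on the row: a row usually carries a debit or a credit, not both, so counting numbers from the right would mislabel a lone debit as a credit.
```php
// Column-based assignment (columns located by horizontal offset):
// - Rightmost numeric column = Balance
// - Second from right = Credit (Abono)
// - Third from right = Debit (Cargo)
```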

---

## Implementation Plan

### Module 1: Fix pdftotext Command

**File:** `tabler.php`  
**Function:** `pdf_to_text()`  
**Line:** 262

**Change:**
```php
// Before
$cmd = "pdftotext";

// After
$cmd = "pdftotext -layout";
```

**Impact:** Preserves column structure, enables whitespace-based column detection.

---

### Module 2: Add Spanish Month Abbreviations

**File:** `tabler.php`  
**Function:** `normalize_spanish_month()`  
**Line:** 522

**Change:** Expand `$months` array to include 3-letter abbreviations.

**Impact:** Dates like `10-ENE-2024` will be correctly parsed.

---

### Module 3: Add DD-MMM-YYYY Date Pattern

**File:** `tabler.php`  
**Function:** `normalize_date()`  
**Line:** 487

**Change:** Add new regex pattern for text-month dates.

**Impact:** Dates with month abbreviations will be recognized.

---

### Module 4: Implement Layout-Based Column Detection

**File:** `tabler.php`  
**New Function:** `parse_layout_columns()`

**Logic:**
```php
function parse_layout_columns($line) {
    // Split by 2+ spaces (column boundaries in -layout output);
    // PREG_SPLIT_NO_EMPTY drops empty columns, e.g. from a blank line
    return preg_split('/\s{2,}/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
}
```

**Impact:** Works without bbox data, uses whitespace as column delimiter.
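
Splitting on whitespace alone discards where each column sat on the line, which the column-position assignment in Issue 6 relies on. As a sketch under that assumption, the variant below keeps the start and end offset of every column. `parse_layout_columns_with_offsets()` is a hypothetical name. `PREG_SPLIT_OFFSET_CAPTURE` reports byte offsets, so accented characters earlier on a line would shift them.

```php
<?php
// Hypothetical variant of parse_layout_columns() that keeps byte offsets.
function parse_layout_columns_with_offsets(string $line): array
{
    // Only trim the right side: leading spaces are part of the column position.
    $parts = preg_split(
        '/\s{2,}/',
        rtrim($line),
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE
    );

    $columns = [];
    foreach ($parts as [$piece, $offset]) {
        // A single leading space is not a delimiter, so strip it and shift the start.
        $text = ltrim($piece);
        $start = $offset + strlen($piece) - strlen($text);

        $columns[] = [
            'text'  => $text,
            'start' => $start,                     // offset of the first character
            'end'   => $start + strlen($text) - 1, // offset of the last character
        ];
    }

    return $columns;
}

print_r(parse_layout_columns_with_offsets('10-ENE-2024  PAGO TARJETA      1,500.00'));
```

Amount columns in statements are typically right-aligned, so the `end` offset is usually the more stable one to compare against a header.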

---

### Module 5: Implement Transaction State Machine

**File:** `tabler.php`  
**New Function:** `assemble_transactions_stateful()`

**States:**
1. `WAITING_FOR_DATE` - Looking for line starting with date
2. `READING_DESCRIPTION` - Collecting description lines
3. `READING_AMOUNTS` - Found amounts, complete transaction

**Transitions:**
- Date pattern → Start new transaction
- Amount pattern → End current transaction
- Empty line → Flush buffer

**Impact:** Handles multi-line transactions correctly.
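
The states and transitions above can be sketched as follows. This is an illustration under stated assumptions, not the final `assemble_transactions_stateful()`: the function is assumed to take an array of `-layout` text lines, the two regex constants are hypothetical, and amounts are assumed to look like `1,234.56`.

```php
<?php
const WAITING_FOR_DATE = 0;
const READING_DESCRIPTION = 1;
const READING_AMOUNTS = 2;

// Assumed patterns: a line that starts with a date, and amounts such as 1,234.56.
const DATE_AT_START = '/^\s*(\d{1,2}[-\/](?:\d{1,2}|[A-Za-z]{3})[-\/]\d{4})\b/';
const AMOUNT = '/(?<![\d.,])-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\d)/';

function assemble_transactions_stateful(array $lines): array
{
    $state = WAITING_FOR_DATE;
    $current = null;
    $done = [];

    // Emits the buffered transaction (if any) and returns to the initial state.
    $flush = function () use (&$current, &$done, &$state) {
        if ($current !== null) {
            $done[] = $current;
        }
        $current = null;
        $state = WAITING_FOR_DATE;
    };

    foreach ($lines as $line) {
        if (trim($line) === '') {
            $flush(); // empty line: flush buffer
            continue;
        }

        if (preg_match(DATE_AT_START, $line, $m)) {
            // Date pattern: start a new transaction. An unfinished one is still
            // emitted, with empty amounts, so a caller can flag it.
            $flush();
            $current = ['date' => $m[1], 'description' => [], 'amounts' => []];
            $state = READING_DESCRIPTION;
            $line = substr($line, strlen($m[0])); // keep the rest of the line
        }

        if ($state === WAITING_FOR_DATE) {
            continue; // header, footer, or other text outside a transaction
        }

        // Whatever is not an amount counts as description text.
        $text = trim(preg_replace(AMOUNT, '', $line));
        if ($text !== '') {
            $current['description'][] = preg_replace('/\s+/', ' ', $text);
        }

        if (preg_match_all(AMOUNT, $line, $found)) {
            // Amount pattern: end the current transaction.
            $current['amounts'] = $found[0];
            $state = READING_AMOUNTS;
            $flush();
        }
    }

    $flush(); // emit a transaction left open at end of input
    return $done;
}
```

In this sketch `READING_AMOUNTS` is transient because the first line containing amounts closes the transaction. A statement that spreads amounts over several lines would need that state to persist until the next date or empty line.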

---

### Module 6: Implement Column-Position Amount Assignment

**File:** `tabler.php`  
**Function:** `assemble_row()`  
**Line:** 995

**Logic:**
```php
// For layout-based parsing:
// 1. Identify numeric columns by position (horizontal offset, so an empty Debit or Credit cell still counts as a column)
// 2. Rightmost = Balance
// 3. Second from right = Credit
// 4. Third from right = Debit
// 5. Everything else = Description
```

**Impact:** Correct debit/credit/balance assignment based on column position.
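
One way to make "column position" concrete is to anchor on the header line and assign each amount to the column whose right edge is closest. The sketch below assumes the header contains the labels `CARGO`, `ABONO`, and `SALDO` (the last one is a guess at the balance label), that amounts are right-aligned under their labels, and that amounts look like `1,234.56`. Both function names are hypothetical.

```php
<?php
// Hypothetical helper: byte offset of the last character of each header label.
function header_column_ends(string $header, array $labels): array
{
    $ends = [];
    foreach ($labels as $field => $label) {
        $pos = stripos($header, $label);
        if ($pos !== false) {
            $ends[$field] = $pos + strlen($label) - 1;
        }
    }
    return $ends;
}

// Hypothetical helper: assigns each amount to the column with the nearest right edge.
// Values are returned as integer cents to avoid float rounding in later checks.
function assign_amounts_by_position(string $line, array $columnEnds): array
{
    $row = ['debit' => null, 'credit' => null, 'balance' => null];
    $pattern = '/(?<![\d.,])-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\d)/';
    preg_match_all($pattern, $line, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as [$text, $start]) {
        $end = $start + strlen($text) - 1;

        $best = null;
        $bestDistance = PHP_INT_MAX;
        foreach ($columnEnds as $field => $colEnd) {
            $distance = abs($end - $colEnd);
            if ($distance < $bestDistance) {
                $best = $field;
                $bestDistance = $distance;
            }
        }

        if ($best !== null) {
            $row[$best] = (int) round(((float) str_replace(',', '', $text)) * 100);
        }
    }

    return $row;
}

$labels = ['debit' => 'CARGO', 'credit' => 'ABONO', 'balance' => 'SALDO'];
$header = 'FECHA   DESCRIPCION      CARGO      ABONO      SALDO';
$line   = '10-ENE  PAGO TARJETA  1,500.00              8,500.00';
print_r(assign_amounts_by_position($line, header_column_ends($header, $labels)));
```

Because assignment depends on offsets rather than on how many numbers a row has, a row with only a debit and a balance leaves `credit` as `null` instead of shifting the debit into it.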

---

## Testing Strategy

### Test 1: Layout Preservation
```bash
pdftotext -layout /tmp/santander_test.pdf - | head -50
```
**Expected:** Columns aligned with whitespace.

### Test 2: Date Parsing
```php
$dates = ['10-ENE-2024', '31-DIC-2023', '01/01/2024'];
foreach ($dates as $d) {
    echo normalize_date($d) . "\n";
}
```
**Expected:** All dates normalized to DD-MM-YYYY.

### Test 3: Full Pipeline
```bash
php tabler.php /tmp/santander_test.pdf /tmp/output.txt --audit=/tmp/audit.json
```
**Expected:**
- 20-30+ rows extracted
- 0-2 reconciliation failures
- All dates in DD-MM-YYYY format

---

## Success Metrics

| Metric | Before | Target | Measurement |
|--------|--------|--------|-------------|
| Rows Extracted | 4 | 20-30+ | Count output lines |
| Reconciliation Failures | 3/4 (75%) | <10% | Audit JSON |
| Date Accuracy | ~50% | 95%+ | Manual verification |
| Amount Accuracy | ~60% | 95%+ | Manual verification |
| Header Detection | No | Yes | Audit JSON |

---

## Mermaid Diagram: Improved Pipeline

```mermaid
flowchart TD
    A[PDF Input] --> B[Stage 1: Ingest]
    B --> C{Text or Scanned?}
    
    C -->|Text| D[pdftotext -layout]
    C -->|Scanned| E[Rasterize + OCR]
    
    D --> F[Layout-Based Column Detection]
    E --> G[bbox-Based Column Detection]
    
    F --> H[Stage 3: Zoom-Out]
    G --> H
    
    H --> I[Transaction State Machine]
    I --> J[Multi-line Assembly]
    
    J --> K[Column-Position Amount Assignment]
    K --> L[Stage 5: Reconciliation]
    
    L --> M{Balance Math OK?}
    M -->|Yes| N[Export TSV]
    M -->|No| O[Infer from Balance Deltas]
    O --> N
    
    N --> P[audit.json]
```
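
The "Balance Math OK?" decision and the "Infer from Balance Deltas" fallback in the diagram can be sketched as below. `reconcile_rows()` is a hypothetical helper. It assumes amounts are integer cents, that an opening balance is available from the statement, and that the balance column is the most trustworthy value on each row.

```php
<?php
// Hypothetical helper: verifies prev_balance - debit + credit == balance for each
// row; where that fails, debit/credit are re-derived from the balance delta.
function reconcile_rows(array $rows, int $openingCents): array
{
    $prev = $openingCents;

    foreach ($rows as $i => $row) {
        $debit = $row['debit'] ?? 0;
        $credit = $row['credit'] ?? 0;
        $expected = $prev - $debit + $credit;

        if ($expected === $row['balance']) {
            $rows[$i]['status'] = 'ok';
        } else {
            // Fallback: trust the balance column and infer the movement from it.
            $delta = $row['balance'] - $prev;
            $rows[$i]['debit'] = $delta < 0 ? -$delta : null;
            $rows[$i]['credit'] = $delta > 0 ? $delta : null;
            $rows[$i]['status'] = 'inferred';
        }

        $prev = $row['balance'];
    }

    return $rows;
}

$rows = [
    ['debit' => 150000, 'credit' => null, 'balance' => 850000],
    // Credit misread as a debit: the balance math fails, so it is inferred.
    ['debit' => 200000, 'credit' => null, 'balance' => 1050000],
];
print_r(reconcile_rows($rows, 1000000));
```

Recording a status per row keeps inferred values distinguishable from parsed ones, which fits the plan's goal of an audit trail for failures.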

---

## Implementation Order

1. **Module 1:** Fix pdftotext command (5 min)
2. **Module 2:** Add Spanish month abbreviations (10 min)
3. **Module 3:** Add DD-MMM-YYYY date pattern (5 min)
4. **Module 4:** Implement layout-based column detection (20 min)
5. **Module 5:** Implement transaction state machine (30 min)
6. **Module 6:** Implement column-position amount assignment (20 min)

**Total Estimated Implementation:** ~90 minutes

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Layout varies by bank | Use heuristics + reconciliation to validate |
| Multi-page transactions | State machine must carry an open transaction across page breaks (pdftotext separates pages with a form feed) |
| Ambiguous columns | Use balance delta inference as fallback |
| OCR errors | Flag low-confidence rows in audit |

---

## Conclusion

These 6 surgical fixes will transform tabler.php from "works sometimes" to "undeniable". The key insight is that **layout preservation** (`-layout` flag) is the foundation - everything else builds on having proper column structure.

After implementation, the script should:
- Extract all transactions from the target statements, with reconciliation flagging any layout it cannot handle
- Correctly parse Spanish date formats
- Achieve 90%+ reconciliation success
- Produce audit trails for any failures

**No errors, just execution.**
