Critical Anti-Patterns: OCR/NLP (AVOID)

1. OCR Without a Fallback

# ❌ Crashes if PaddleOCR is unavailable
text = paddleocr.extract(file)

# ✅ Routing + Result pattern
result = await ocr_router.extract(file)
if result.is_failure():
    result = await tesseract_adapter.extract(file)
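
The fallback idea above can be sketched as a small router that tries engines in order. This is a minimal sketch, not the project's actual router: `OcrResult`, `extract_with_fallback`, and the two stand-in engines are hypothetical names introduced for illustration, and real adapters would wrap PaddleOCR and Tesseract.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence


@dataclass(frozen=True)
class OcrResult:
    """Hypothetical Result type: holds either extracted text or an error message."""
    text: Optional[str] = None
    error: Optional[str] = None

    def is_failure(self) -> bool:
        return self.error is not None


# An engine is any async callable that takes a file path and returns an OcrResult.
Engine = Callable[[str], Awaitable[OcrResult]]


async def extract_with_fallback(path: str, engines: Sequence[Engine]) -> OcrResult:
    """Try each engine in order; return the first success or the collected errors."""
    errors = []
    for engine in engines:
        try:
            result = await engine(path)
        except Exception as exc:  # an engine that raises counts as a failure
            result = OcrResult(error=f"{type(exc).__name__}: {exc}")
        if not result.is_failure():
            return result
        errors.append(result.error)
    return OcrResult(error="; ".join(errors) or "no OCR engine configured")


# Stand-in engines (assumptions for the demo, not real library calls).
async def fake_paddle(path: str) -> OcrResult:
    raise RuntimeError("PaddleOCR not available")


async def fake_tesseract(path: str) -> OcrResult:
    return OcrResult(text=f"text from {path}")


outcome = asyncio.run(extract_with_fallback("scan.pdf", [fake_paddle, fake_tesseract]))
print(outcome)
```

Keeping the engine list as data means a third engine can be added without touching the calling code, and a total failure still comes back as a value instead of an unhandled exception.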

2. NER Without Human Validation

# ❌ Trusting a model with 85% accuracy as if it were 100% reliable
demandante = ner.extract_entity("PERSON")[0]
save_to_db(demandante)

# ✅ Correction interface: the user confirms or corrects the extracted value
extracted = ner.extract_entity("PERSON")
user_confirmed = await gui.validate_field("Demandante", extracted)
if user_confirmed.is_corrected:
    log_correction(extracted, user_confirmed.value)
save_to_db(user_confirmed.value)
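
One way to model the confirmed value is a small record that knows whether the user changed the suggestion. The sketch below is an assumption about shape only: `ValidatedField`, `validate_field`, and the `ask_user` callback (standing in for the GUI) are hypothetical and not taken from the original codebase.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List

logger = logging.getLogger("ner.corrections")


@dataclass(frozen=True)
class ValidatedField:
    """A field value after human review (hypothetical type)."""
    name: str
    suggested: str  # what the NER model proposed
    value: str      # what the user confirmed

    @property
    def is_corrected(self) -> bool:
        return self.value != self.suggested


def validate_field(
    name: str,
    candidates: List[str],
    ask_user: Callable[[str, str], str],
) -> ValidatedField:
    """Show the top candidate to the user and record whether it was changed."""
    suggested = candidates[0] if candidates else ""
    value = ask_user(name, suggested).strip()
    field = ValidatedField(name=name, suggested=suggested, value=value)
    if field.is_corrected:
        # Logged corrections can later serve as evaluation or training data.
        logger.info("correction for %s: %r -> %r", name, suggested, value)
    return field


# The lambda stands in for the GUI prompt: here the user fixes a misspelling.
field = validate_field("Demandante", ["Juan Peres"], lambda name, suggestion: "Juan Perez")
print(field.value, field.is_corrected)
```

An empty candidate list is handled by suggesting an empty string, so the user is still asked instead of the code failing on `[0]`.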

3. Fixed Thresholds

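# ❌ 0.80 for every document type
if similarity_score > 0.80: alert_duplicate()

# ✅ Configurable per document type
THRESHOLDS = {"tutela": 0.75, "habeas_corpus": 0.85, "sentencia": 0.90}
threshold = THRESHOLDS.get(doc_type, 0.80)  # 0.80 is the default for unlisted types
if similarity_score > threshold:
    alert_duplicate()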
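
The per-type lookup can be wrapped in a single function so the comparison lives in one place. This is a minimal sketch: `is_duplicate` and `DEFAULT_THRESHOLD` are hypothetical names, and the threshold values are the ones listed above.

```python
DEFAULT_THRESHOLD = 0.80  # used for document types without a specific entry
THRESHOLDS = {"tutela": 0.75, "habeas_corpus": 0.85, "sentencia": 0.90}


def is_duplicate(doc_type: str, similarity_score: float) -> bool:
    """Compare a similarity score against the threshold for its document type."""
    threshold = THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD)
    return similarity_score > threshold


# The same score is a duplicate for one type and not for another.
print(is_duplicate("tutela", 0.78))
print(is_duplicate("sentencia", 0.78))
```

In practice the dictionary would likely be loaded from configuration so thresholds can be tuned without a code change.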

4. Forgetting Railway-Oriented Programming

# ❌ Nested try/except that silently discards errors
try:
    text = ocr.extract(file)
    try:
        entities = ner.extract(text)
    except NERError:
        pass
except OCRError:
    pass

# ✅ Fluent composition: the first failure short-circuits the remaining steps
result = (
    extract_text(file)
    .bind(normalize_text)
    .bind(extract_entities)
    .bind(detect_duplicates)
    .map(build_document)
)
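
For the chain above to work, each step must return an object that exposes `bind` and `map`. The following is a minimal sketch of such a type, assuming a plain string error channel. The `Result` class and the toy `extract_text` and `normalize_text` steps are illustrative and may differ from the project's own implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Result:
    """Minimal Result type: a success carries a value, a failure carries an error."""
    value: Any = None
    error: Optional[str] = None

    @staticmethod
    def ok(value: Any) -> "Result":
        return Result(value=value)

    @staticmethod
    def fail(error: str) -> "Result":
        return Result(error=error)

    def is_failure(self) -> bool:
        return self.error is not None

    def bind(self, fn: Callable[[Any], "Result"]) -> "Result":
        """Run a step that can fail; skip it if an earlier step already failed."""
        return self if self.is_failure() else fn(self.value)

    def map(self, fn: Callable[[Any], Any]) -> "Result":
        """Run a step that cannot fail and wrap its output in a success."""
        return self if self.is_failure() else Result.ok(fn(self.value))


# Toy steps for the demo (hypothetical, no real OCR involved).
def extract_text(path: str) -> Result:
    if not path.endswith(".pdf"):
        return Result.fail(f"unsupported file: {path}")
    return Result.ok("  JUAN   PEREZ  ")


def normalize_text(text: str) -> Result:
    cleaned = " ".join(text.split())
    return Result.ok(cleaned) if cleaned else Result.fail("empty text")


good = extract_text("a.pdf").bind(normalize_text).map(str.title)
bad = extract_text("a.docx").bind(normalize_text).map(str.title)
print(good)
print(bad)
```

The difference between the two methods is the return type of the step: `bind` is for functions that already return a `Result`, while `map` lifts an ordinary function into the chain.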

5. Silent Exceptions

# ❌ NEVER
try:
    process_document()
except:
    pass

# ✅ ALWAYS log
try:
    process_document()
except Exception as e:
    logger.error(f"Error: {e}", exc_info=True)

6. ML/NLP Decision Framework

Before integrating any ML/NLP technology:

Question	Must have an answer
Who resolves the issue if accuracy is low?	The user validates in the GUI
Which metric improves?	Undetected duplicates reduced from 15% to 5%
At what cost?	+2 GB RAM, +500 ms latency
Is there an A/B test against a baseline?	100 annotated documents
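
The metric question can be made concrete by scoring a baseline and a candidate detector on the same annotated set. This is a rough sketch under stated assumptions: `missed_duplicate_rate`, the two toy detectors, and the four sample pairs are invented for illustration and are not results from the project.

```python
from typing import Callable, Iterable, Tuple

# (doc_a, doc_b, is_duplicate according to a human annotator)
Pair = Tuple[str, str, bool]


def missed_duplicate_rate(
    pairs: Iterable[Pair],
    detector: Callable[[str, str], bool],
) -> float:
    """Fraction of annotated duplicates that the detector fails to flag."""
    duplicates = [(a, b) for a, b, label in pairs if label]
    if not duplicates:
        return 0.0
    missed = sum(1 for a, b in duplicates if not detector(a, b))
    return missed / len(duplicates)


def exact_match(a: str, b: str) -> bool:
    """Toy baseline: byte-for-byte equality."""
    return a == b


def normalized_match(a: str, b: str) -> bool:
    """Toy candidate: ignores case and repeated whitespace."""
    return a.lower().split() == b.lower().split()


# Invented sample; a real comparison would use the annotated document set.
annotated = [
    ("Tutela 123", "Tutela 123", True),
    ("Tutela 123", "tutela  123", True),
    ("Habeas 7", "HABEAS 7", True),
    ("Tutela 123", "Sentencia 9", False),
]

print(missed_duplicate_rate(annotated, exact_match))
print(missed_duplicate_rate(annotated, normalized_match))
```

Running both detectors over one fixed, human-labeled set is what makes the before and after figures comparable. Latency and memory would need to be measured separately.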

Current decisions

Technology	Status	Rationale
spaCy NER	✅ MVP	85-90% accuracy, <100 ms, no GPU
Local LLM (Ollama)	❌ Rejected	8 GB RAM, GPU, +5 s for +5% accuracy
Sentence-Transformers	📅 Phase 2	+15-27% semantic deduplication