
Main Class Diagram

Description

UML class diagram showing the main entities, value objects, ports (protocols), and adapters of the Sherlock-docs system.

Focused on the hexagonal architecture: Core (Domain) → Application → Ports ← Infrastructure.

Diagram

classDiagram
    %% ============================================
    %% DOMAIN LAYER (Core Entities & Value Objects)
    %% ============================================

    class Document {
        +String id
        +String file_path
        +String file_name
        +DocumentType document_type
        +DocumentStatus status
        +String content
        +float ocr_confidence
        +String file_hash
        +String? demandante
        +String? demandado
        +String? radicado
        +String? juzgado
        +datetime? fecha_documento
        +String hora_radicacion
        +String numero_acta
        +String? cedula
        +String? correo
        +String observaciones
        +String etiquetas
        +String direccion_demandante
        +String direccion_demandado
        +datetime created_at
        +datetime updated_at
        +datetime? imported_at
    }

    note for Document "33 total fields in DB\n(only the main ones shown)"

    class DocumentType {
        <<enumeration>>
        TUTELA
        HABEAS_CORPUS
        UNKNOWN
    }

    class DocumentStatus {
        <<enumeration>>
        PENDING
        PROCESSING
        PROCESSED
        VALIDATED
        ERROR
    }

    class OCRResult {
        <<value object>>
        +String text
        +float confidence
        +OCREngine engine_used
        +int pages_processed
        +float processing_time_ms
        +Tuple~String~ errors
    }

    class EntityResult {
        <<value object>>
        +String? demandante
        +String? demandado
        +String? radicado
        +String? juzgado
        +String? fecha
        +String? direccion_demandante
        +String? direccion_demandado
    }

    class DuplicateCandidate {
        <<value object>>
        +String document_id
        +String source_document_id
        +Dict~str,LevelScore~ level_scores
        +float final_score
        +DuplicateConfidence confidence
        +String detection_method
        +String explanation
        +Dict~String,bool~ matched_entities
    }

    Document --> DocumentType
    Document --> DocumentStatus

    %% ============================================
    %% APPLICATION LAYER (Use Cases & DTOs)
    %% ============================================

    class ProcessDocumentUseCase {
        -OCRPort _ocr_port
        -NERPort _ner_port
        -DuplicateDetectorPort _dedup_port
        -DocumentRepository _repository
        -Callable _detect_provisional_measure
        -Callable _detect_sections
        +execute(request: ProcessDocumentRequest) Result~ProcessDocumentResponse, ProcessingError~
        -_extract_text(file_path: Path) Result~OCRResult, ProcessingError~
        -_extract_entities(text: str) Result~EntityResult, ProcessingError~
        -_calculate_hash(content: str) String
        -_check_duplicate(hash: str) Document?
    }

    class ProcessDocumentRequest {
        <<DTO>>
        +String file_path
        +String document_type
    }

    class ProcessDocumentResponse {
        <<DTO>>
        +String document_id
        +String file_name
        +String content
        +float ocr_confidence
        +String? demandante
        +String? demandado
        +String? radicado
        +String? juzgado
        +String? fecha
        +float processing_time_ms
        +String ocr_engine
        +bool is_duplicate
        +String? duplicate_of
        +List~DuplicateCandidateDTO~? duplicate_candidates
    }

    class ProcessingMetrics {
        <<DTO>>
        +String document_id
        +float ocr_time_ms
        +float ner_time_ms
        +float dedup_time_ms
        +float total_time_ms
        +String ocr_engine
    }

    class ExportExcelRequest {
        <<DTO>>
        +String? start_date
        +String? end_date
        +String? document_type
    }

    class ExportExcelResponse {
        <<DTO>>
        +bytes file_content
        +String file_name
        +int total_exported
    }

    class ImportExcelRequest {
        <<DTO>>
        +String file_path
    }

    class ImportExcelResponse {
        <<DTO>>
        +int total_imported
        +int total_errors
        +List~str~ errors
    }

    class ServiceContainer {
        -Path db_path
        -OCRConfigV2 ocr_config
        -NERConfig ner_config
        -EnsembleConfig dedup_config
        -SQLiteDocumentRepository? _repository
        -OCRRouter? _ocr_router
        -SpaCyAdapter? _ner_adapter
        -HybridEnsembleDetector? _duplicate_detector
        +repository() DocumentRepository
        +ocr_router() OCRRouter
        +ner_adapter() SpaCyAdapter
        +duplicate_detector() HybridEnsembleDetector
        +process_document_use_case() ProcessDocumentUseCase
        +close() None
    }

    ProcessDocumentUseCase --> ProcessDocumentRequest
    ProcessDocumentUseCase --> ProcessDocumentResponse
    ServiceContainer ..> ProcessDocumentUseCase : creates

    %% ============================================
    %% PORTS (Protocols/Interfaces)
    %% ============================================

    class OCRPort {
        <<protocol>>
        +extract_text(file_path: Path) OCRResult
    }

    class DocumentReader {
        <<protocol - ISP>>
        +find_by_id(id: String) Document?
        +find_by_hash(hash: String) Document?
        +get_all(limit: int) List~Document~
        +count() int
    }

    class DocumentWriter {
        <<protocol - ISP>>
        +save(document: Document) Document
    }

    class DocumentSearcher {
        <<protocol - ISP>>
        +search(query: String, limit: int, offset: int) SearchResult
    }

    class CorrectionRepository {
        <<protocol - ISP>>
        +save_correction(correction: Correction) Correction
        +get_corrections_by_document(doc_id: String) List~Correction~
        +get_correction_stats() CorrectionStats
    }

    class MetricsRepository {
        <<protocol - ISP>>
        +save_processing_metrics(metrics: ProcessingMetrics) bool
        +get_average_processing_times() Dict~str,float~
        +get_daily_stats(days: int) Dict~str,object~
    }

    class DocumentRepository {
        <<protocol - combined>>
        %% Inherits from: DocumentReader, DocumentWriter, DocumentSearcher, CorrectionRepository, MetricsRepository
    }

    DocumentReader <|-- DocumentRepository : extends
    DocumentWriter <|-- DocumentRepository : extends
    DocumentSearcher <|-- DocumentRepository : extends
    CorrectionRepository <|-- DocumentRepository : extends
    MetricsRepository <|-- DocumentRepository : extends

    class DuplicateDetectorPort {
        <<protocol>>
        +detect_duplicates(content: str, doc_id: str) DuplicateDetectionResult
        +index_document(doc_id: str, content: str) None
        +index_document_with_fields(doc_id: str, content: str, cedula: str?, correo: str?, demandante: str?, demandado: str?) None
        +index_corpus(corpus: Dict~str,str~) None
    }

    ProcessDocumentUseCase --> OCRPort
    ProcessDocumentUseCase --> DocumentRepository
    ProcessDocumentUseCase --> DuplicateDetectorPort

    %% ============================================
    %% INFRASTRUCTURE LAYER (Adapters)
    %% ============================================

    class OCRRouter {
        -DocumentQualityAnalyzer quality_analyzer
        -TesseractAdapter tesseract_adapter
        -PaddleOCRAdapter paddle_adapter
        -OCRConfigV2 config
        +extract_text(file_path: Path) OCRResult
        -_select_engine(quality: DocumentQuality) RoutingDecision
        -_fallback_extract(file_path: Path) OCRResult
    }

    class BaseOCRAdapter {
        <<abstract - Template Method>>
        #OCRConfigV2 _config
        #Logger _logger
        +extract_text(file_path: Path) OCRResult
        +extract_from_image(image: ndarray) OCRResult
        #_extract_from_image_internal(image: ndarray)* OCRResult
        #_get_engine()* OCREngine
        #_get_engine_name()* str
        #_pre_extraction_check(file_path: Path) None
        +is_available()* bool
    }

    class TesseractAdapter {
        -OCRConfigV2 config
        #_extract_from_image_internal(image: ndarray) OCRResult
        #_get_engine() OCREngine
        #_get_engine_name() str
        +is_available() bool
    }

    class PaddleOCRAdapter {
        -OCRConfigV2 config
        -PaddleOCR? _paddle_model
        #_extract_from_image_internal(image: ndarray) OCRResult
        #_get_engine() OCREngine
        #_get_engine_name() str
        -_lazy_load_model() PaddleOCR
        +is_available() bool
    }

    class DocumentQualityAnalyzer {
        -OCRConfigV2 config
        +analyze_quality(file_path: Path) DocumentQuality
        -_calculate_metrics(image: ndarray) Dict
    }

    OCRPort <|.. OCRRouter : implements
    OCRPort <|.. BaseOCRAdapter : implements
    BaseOCRAdapter <|-- TesseractAdapter : extends
    BaseOCRAdapter <|-- PaddleOCRAdapter : extends
    OCRRouter --> TesseractAdapter
    OCRRouter --> PaddleOCRAdapter
    OCRRouter --> DocumentQualityAnalyzer
    OCRRouter --> OCRResult

    %% ============================================
    %% NER EXTRACTORS (Strategy Pattern) ✅
    %% ============================================

    class EntityExtractor {
        <<protocol>>
        +name() str
        +priority() int
        +extract(text: str, doc: Any?) List~Entity~
    }

    class BaseExtractor {
        <<abstract>>
        #Logger _logger
        +name()* str
        +priority()* int
        +extract(text: str, doc: Any?)* List~Entity~
        #_log_extraction_start(text_length: int) None
        #_log_extraction_result(count: int) None
    }

    class MarkerExtractor {
        -EntityTruncator _entity_truncator
        +name() str
        +priority() int = 1
        +extract(text: str, doc: Any?) List~Entity~
        -_extract_for_entity_type(text: str, type: LegalEntityType) List~Entity~
    }

    class SpaCyNERExtractor {
        -ContextScorer _context_scorer
        -int _context_window_before
        -int _context_window_after
        +name() str
        +priority() int = 2
        +extract(text: str, doc: Any?) List~Entity~
        -_classify_entity(...) Tuple~str,float~
        -_calculate_pos_penalty(...) float
    }

    class RegexExtractor {
        +name() str
        +priority() int = 3
        +extract(text: str, doc: Any?) List~Entity~
    }

    class AddressExtractor {
        +name() str
        +priority() int = 4
        +extract(text: str, doc: Any?) List~Entity~
    }

    class ContactExtractor {
        +name() str
        +priority() int = 5
        +extract(text: str, doc: Any?) List~Entity~
    }

    EntityExtractor <|.. BaseExtractor : implements
    BaseExtractor <|-- MarkerExtractor : extends
    BaseExtractor <|-- SpaCyNERExtractor : extends
    BaseExtractor <|-- RegexExtractor : extends
    BaseExtractor <|-- AddressExtractor : extends
    BaseExtractor <|-- ContactExtractor : extends

    %% ============================================
    %% NER VALIDATORS (Chain of Responsibility) ✅
    %% ============================================

    class EntityValidator {
        <<protocol>>
        +name() str
        +validate(entities: List~Entity~, context: ValidationContext) List~Entity~
    }

    class BaseValidator {
        <<abstract>>
        #Logger _logger
        +name()* str
        +validate(...)* List~Entity~
        #_log_validation_start(count: int) None
        #_log_validation_result(original: int, final: int) None
    }

    class ValidationContext {
        <<dataclass>>
        +str full_text
        +Any? doc
        +List~Section~? sections
        +List~Entity~? date_entities
        +Entity? radicado_entity
    }

    class ValidatorChain {
        -List~EntityValidator~ _validators
        -Logger _logger
        +validators() List~EntityValidator~
        +validate(entities: List~Entity~, context: ValidationContext) List~Entity~
    }

    class StructuralValidator {
        +name() str
        +validate(...) List~Entity~
    }

    class TemporalValidator {
        +name() str
        +validate(...) List~Entity~
    }

    class RadicadoFormatValidator {
        -RadicadoValidator _validator
        -bool _strict_mode
        +name() str
        +validate(...) List~Entity~
    }

    class DeduplicationValidator {
        -EntityLinker _entity_linker
        +name() str
        +validate(...) List~Entity~
    }

    EntityValidator <|.. BaseValidator : implements
    BaseValidator <|-- StructuralValidator : extends
    BaseValidator <|-- TemporalValidator : extends
    BaseValidator <|-- RadicadoFormatValidator : extends
    BaseValidator <|-- DeduplicationValidator : extends
    ValidatorChain o-- EntityValidator : aggregates

    %% ============================================
    %% SPACY ADAPTER (Orchestrator) ✅
    %% ============================================

    class SpaCyAdapter {
        -NERConfig _config
        -spacy.Language? _nlp
        -List~EntityExtractor~ _extractors
        -List~EntityValidator~ _validators
        -ValidatorChain _validator_chain
        -ContextScorer _context_scorer
        -EntityLinker _entity_linker
        -EntityTruncator _entity_truncator
        +extract_entities(text: str) Result~EntityResult,NERError~
        +is_available() bool
        -_get_nlp() spacy.Language
        -_create_default_extractors() List~EntityExtractor~
        -_create_default_validators() List~EntityValidator~
        -_get_first_by_type(entities: List, label: str) str
    }

    SpaCyAdapter o-- EntityExtractor : uses extractors
    SpaCyAdapter o-- ValidatorChain : uses validator chain
    SpaCyAdapter --> EntityResult

    %% ============================================
    %% NER UTILITIES ✅
    %% ============================================

    class EntityTruncator {
        <<utility>>
        -int max_length
        -Tuple COMMON_DELIMITERS
        -Tuple DEMANDADO_DELIMITERS
        -Tuple CORPORATE_SUFFIXES
        -Logger _logger
        +truncate(value: str, entity_type: LegalEntityType) str
        -_truncate_at_delimiter(value: str, delimiters: Tuple) str
        -_ensure_corporate_suffix(value: str, original: str) str
    }

    class ContextScorer {
        <<utility>>
        -Dict weights
        -float threshold
        +score_entity(entity: str, context: str) float
        +predict(entity_text: str, ...) ScoringResult
        -_extract_features(entity: str, context: str) ContextFeatures
    }

    class EntityLinker {
        <<utility>>
        -float similarity_threshold
        +link_mentions(entities: List) List~LinkedEntity~
        +deduplicate_to_entities(entities: List) List~Entity~
        +normalize_text(text: str) str
        -_jaccard_similarity(a: Set, b: Set) float
    }

    class RadicadoValidator {
        <<utility>>
        -bool strict_mode
        -Set DANE_CODES
        -Set CORPORATION_CODES
        +validate(radicado: str, context: str?) RadicadoValidation
        +extract_components(radicado: str) Dict
    }

    class DocumentStructureAnalyzer {
        <<analyzer>>
        -Dict SECTION_PATTERNS
        -List~Section~ _sections
        +detected_sections() List~str~
        +get_section_at_position(pos: int) Section?
        +get_confidence_boost(entity_type: str, position: int) float
    }

    class TemporalFeatures {
        <<module - functions>>
        +parse_spanish_date(text: str) DateParseResult?
        +normalize_date(date: str) str?
        +validate_year(year: int) bool
        +detect_temporal_context(text: str, position: int) TemporalContext?
        +get_date_confidence_adjustment(...) float
    }

    MarkerExtractor --> EntityTruncator
    SpaCyNERExtractor --> ContextScorer
    DeduplicationValidator --> EntityLinker
    RadicadoFormatValidator --> RadicadoValidator
    StructuralValidator --> DocumentStructureAnalyzer
    TemporalValidator --> TemporalFeatures

    %% ============================================
    %% ENSEMBLE (Implemented Sprint 08)
    %% ============================================

    class EntityExtractorEnsemble {
        <<implemented - Sprint 08>>
        -List~EntityExtractor~ extractors
        -Dict~str,float~ weights
        -str voting_strategy
        +extract(text: str) EntityResult
        -_collect_votes(text: str) Dict
        -_resolve_votes(votes: Dict) EntityResult
        -_weighted_vote(candidates: List) Entity
    }

    EntityExtractorEnsemble o-- EntityExtractor : orchestrates

    %% ============================================
    %% ACTIVE LEARNING (Infrastructure Sprint 08)
    %% ============================================

    class Correction {
        <<value object>>
        +String id
        +String document_id
        +String field_name
        +String original_value
        +String corrected_value
        +String correction_source
        +datetime created_at
    }

    class CorrectionStats {
        <<value object>>
        +int total_corrections
        +Dict~str,int~ corrections_by_field
        +float correction_rate
    }

    class ActiveLearningInfrastructure {
        <<module>>
        +save_correction(correction: Correction) Correction
        +get_corrections_by_document(doc_id: str) List~Correction~
        +get_correction_stats() CorrectionStats
        +analyze_patterns() Dict
    }

    CorrectionRepository <|.. ActiveLearningInfrastructure : implements
    ActiveLearningInfrastructure --> Correction
    ActiveLearningInfrastructure --> CorrectionStats

    class EnsembleConfig {
        <<frozen dataclass>>
        +Dict~str,float~ LEVEL_WEIGHTS
        +Dict~str,Dict~ THRESHOLDS
        +int MINHASH_NUM_PERM
        +float MINHASH_THRESHOLD
        +int TFIDF_MAX_FEATURES
        +float ENTITY_FUZZY_THRESHOLD
        +bool USE_SHORT_CIRCUIT
        +int TOP_K_CANDIDATES
    }

    class ConfigPersistence {
        <<module - dedup.config>>
        +save_config(config, path) bool
        +load_config(path) EnsembleConfig
    }

    ConfigPersistence --> EnsembleConfig

    class HybridEnsembleDetector {
        -EnsembleConfig config
        -ExactMatcher exact_matcher
        -MinHashMatcher minhash_matcher
        -TFIDFMatcher tfidf_matcher
        -EntityMatcher entity_matcher
        +detect_duplicates(content: str, doc_id: str) DuplicateDetectionResult
        +index_document(doc_id: str, content: str) None
        +index_corpus(corpus: Dict) None
        -_ensemble_scores(scores: List~LevelScore~) float
        -_classify_confidence(score: float) DuplicateConfidence
    }

    class ExactMatcher {
        -Dict~str,str~ hash_index
        +match(content: str) LevelScore?
        +index(doc_id: str, content: str) None
    }

    class MinHashMatcher {
        -MinHashLSH lsh_index
        -int n_perm
        -float threshold
        +match(content: str) List~LevelScore~
        +index(doc_id: str, content: str) None
    }

    class TFIDFMatcher {
        -TfidfVectorizer vectorizer
        -scipy.sparse.csr_matrix? tfidf_matrix
        -Dict~str,int~ doc_id_to_index
        +match(content: str) List~LevelScore~
        +index_corpus(corpus: Dict) None
    }

    class EntityMatcher {
        -Dict~str,Dict~ entity_index
        -float boost_value
        +match(entities: Dict, doc_id: str) List~LevelScore~
        +index(doc_id: str, entities: Dict) None
    }

    DuplicateDetectorPort <|.. HybridEnsembleDetector : implements
    HybridEnsembleDetector --> ExactMatcher
    HybridEnsembleDetector --> MinHashMatcher
    HybridEnsembleDetector --> TFIDFMatcher
    HybridEnsembleDetector --> EntityMatcher
    HybridEnsembleDetector --> DuplicateCandidate

    class ConnectionMixin {
        <<mixin>>
        -Path db_path
        -sqlite3.Connection? _connection
        +connection() sqlite3.Connection
        +close() None
        #_safe_get(row, key, default) str
        #_safe_get_int(row, key, default) int
        #_safe_get_bool(row, key, default) bool
        #_safe_get_datetime(row, key, default) datetime?
        #_row_to_document(row) Document
    }

    class CorrectionMixin {
        <<mixin>>
        +save_correction(correction) Correction
        +get_corrections_by_document(doc_id) List~Correction~
        +get_correction_stats() CorrectionStats
        +update_document_field(doc_id, field, value) Document?
    }

    class MetricsMixin {
        <<mixin>>
        +save_processing_metrics(metrics) bool
        +get_processing_metrics(doc_id) ProcessingMetrics?
        +get_average_processing_times() Dict~str,float~
        +get_daily_stats(days) Dict~str,object~
    }

    class SQLiteDocumentRepository {
        +save(document: Document) Document
        +find_by_id(id: String) Document?
        +find_by_hash(hash: String) Document?
        +find_by_radicado(radicado: String) Document?
        +search(query: String) SearchResult
        +get_all(limit: int) List~Document~
        +count() int
        +delete(doc_id: String) bool
    }

    ConnectionMixin <|-- SQLiteDocumentRepository : extends
    CorrectionMixin <|-- SQLiteDocumentRepository : extends
    MetricsMixin <|-- SQLiteDocumentRepository : extends
    DocumentRepository <|.. SQLiteDocumentRepository : implements
    SQLiteDocumentRepository --> Document

    %% ============================================
    %% RELATIONSHIPS
    %% ============================================

    ServiceContainer --> DocumentRepository
    ServiceContainer --> OCRRouter
    ServiceContainer --> SpaCyAdapter
    ServiceContainer --> HybridEnsembleDetector

Component Descriptions

Domain Layer (Core)

Entities (Aggregates)

  • Document: Main entity representing a processed legal document
  • Contains metadata, extracted content, and legal entities
  • Immutable after creation (except status)
  • ID generated with UUID4

Value Objects

  • OCRResult: Immutable OCR extraction result
  • EntityResult: Entities extracted by NER
  • DuplicateCandidate: Duplicate candidate with score and explanation
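
The value objects above are immutable frozen dataclasses. A minimal sketch of the idea using OCRResult (field names from the diagram; the simplified `engine_used` type is an assumption, the real code uses an OCREngine enum):

```python
from dataclasses import dataclass

# Illustrative sketch of a frozen value object like OCRResult.
@dataclass(frozen=True)
class OCRResult:
    text: str
    confidence: float
    engine_used: str          # simplified: the real code uses an OCREngine enum
    pages_processed: int = 1

a = OCRResult("hola", 0.95, "tesseract")
b = OCRResult("hola", 0.95, "tesseract")
assert a == b                 # compared by value, not identity
# a.text = "x" would raise dataclasses.FrozenInstanceError
```

Value semantics (equality by fields, no mutation) is what makes these objects safe to share across pipeline stages.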

Enums

  • DocumentType: Document type (TUTELA, HABEAS_CORPUS, UNKNOWN)
  • DocumentStatus: Processing status (PENDING, PROCESSING, PROCESSED, VALIDATED, ERROR)

Application Layer

Use Cases

  • ProcessDocumentUseCase: Orchestrates the full pipeline (OCR → NER → Dedup → Persist)
  • Railway-Oriented Programming with the Result pattern
  • Explicit error handling at each stage
  • Operation composition with bind and map
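
The railway-oriented flow can be sketched with a tiny Result type (an illustrative stand-in, not the project's actual Result implementation; `ocr` and `ner` are hypothetical mini-versions of `_extract_text` and `_extract_entities`):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")

@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T
    def bind(self, fn: Callable[[T], "Result"]) -> "Result":
        return fn(self.value)       # stay on the success track

@dataclass(frozen=True)
class Err(Generic[E]):
    error: E
    def bind(self, fn) -> "Err":
        return self                 # short-circuit: error propagates untouched

Result = Union[Ok, Err]

def ocr(path: str) -> Result:
    return Ok(f"text of {path}") if path.endswith(".pdf") else Err("OCRError")

def ner(text: str) -> Result:
    return Ok({"demandante": "JUAN PEREZ"})

result = ocr("demanda.pdf").bind(ner)
assert isinstance(result, Ok)
assert ocr("demanda.txt").bind(ner) == Err("OCRError")
```

Each stage returns `Ok` or `Err`; once an `Err` appears, every subsequent `bind` is skipped, which is the "railway" short-circuit the bullets describe.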

Service Container

  • ServiceContainer: Dependency injection container
  • Lazy loading of expensive resources (OCR/NER models)
  • Singleton pattern for configurations
  • Context manager for resource handling
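
The lazy-initialization idea behind ServiceContainer can be sketched like this (method name from the diagram; the body and the `build_count` counter are illustrative assumptions):

```python
class ServiceContainerSketch:
    """Builds expensive services on first access and caches them."""
    def __init__(self) -> None:
        self._repository = None
        self.build_count = 0             # only here to demonstrate laziness

    def repository(self):
        if self._repository is None:     # built once, on first use
            self.build_count += 1
            self._repository = object()  # stand-in for SQLiteDocumentRepository
        return self._repository

c = ServiceContainerSketch()
assert c.build_count == 0                # nothing built at construction time
r1, r2 = c.repository(), c.repository()
assert r1 is r2 and c.build_count == 1   # same instance, built exactly once
```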

DTOs (Data Transfer Objects)

  • ProcessDocumentRequest: Immutable request with file_path and document_type
  • ProcessDocumentResponse: Full response with metrics and duplicates
  • ProcessingMetrics: Per-stage timing metrics (OCR, NER, dedup)
  • ExportExcelRequest/Response: Bulk export to Excel with filters
  • ImportExcelRequest/Response: Bulk import from Excel

Ports (Protocols/Interfaces)

Contracts defined by Application and implemented by Infrastructure:

  • OCRPort: Interface for OCR engines
  • DocumentRepository: Interface for persistence
  • DuplicateDetectorPort: Interface for duplicate detectors

Principle: Dependency Inversion (DIP) - Application depends on abstractions, not implementations.
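
Since the ports are protocols, adapters satisfy them structurally. A sketch of OCRPort (signature from the diagram; OCRResult simplified to a string, and FakeOCRAdapter is a hypothetical test double):

```python
from pathlib import Path
from typing import Protocol

class OCRPort(Protocol):
    def extract_text(self, file_path: Path) -> str: ...

# An adapter needs no inheritance, only a matching signature:
class FakeOCRAdapter:
    def extract_text(self, file_path: Path) -> str:
        return f"contents of {file_path.name}"

def run_pipeline(ocr: OCRPort, path: Path) -> str:
    # Application code depends only on the abstraction (DIP)
    return ocr.extract_text(path)

assert run_pipeline(FakeOCRAdapter(), Path("demanda.pdf")) == "contents of demanda.pdf"
```

This is also why Use Cases are easy to unit-test: any object with the right method shape can stand in for a real adapter.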

Infrastructure Layer (Adapters)

OCR Adapters

  • OCRRouter: Intelligent routing between Tesseract and PaddleOCR
  • Analyzes document quality
  • Automatic fallback when confidence is low
  • BaseOCRAdapter ✅: Abstract class with the Template Method pattern
  • Implements extract_text() as the skeleton algorithm
  • Subclasses implement _extract_from_image_internal()
  • Hooks: _pre_extraction_check(), is_available()
  • TesseractAdapter: Inherits from BaseOCRAdapter, implements Tesseract-OCR
  • PaddleOCRAdapter: Inherits from BaseOCRAdapter, implements PaddleOCR (lazy-loaded)
  • DocumentQualityAnalyzer: Analyzes quality metrics (contrast, sharpness)

NER Adapters (Refactored 2026-02-04)

  • SpaCyAdapter ✅: NER extraction orchestrator (342 lines, a 53% reduction)
  • es_core_news_lg model with EntityRuler
  • Uses the Strategy Pattern for extractors
  • Uses Chain of Responsibility for validators
  • Lazy loading of the model

NER Extractors (Strategy Pattern)

  • EntityExtractor (Protocol): Interface for extraction strategies
  • BaseExtractor (ABC): Base class with shared logging
  • MarkerExtractor (priority=1): Extracts from explicit markers (ACCIONANTE:, etc.)
  • SpaCyNERExtractor (priority=2): Neural extraction with POS validation
  • RegexExtractor (priority=3): Regex patterns for radicados, dates, and courts
  • AddressExtractor (priority=4): Plaintiff and defendant addresses
  • ContactExtractor (priority=5): Email and ID number (CC/CE/NIT)

NER Validators (Chain of Responsibility)

  • EntityValidator (Protocol): Interface for chained validators
  • BaseValidator (ABC): Base class with shared logging
  • ValidatorChain: Orchestrates sequential execution of validators
  • ValidationContext: Dataclass with context shared among validators
  • StructuralValidator: Confidence boost based on document section
  • TemporalValidator: Date validation and year consistency
  • RadicadoFormatValidator: Format validation for Colombian radicados
  • DeduplicationValidator: Fuzzy deduplication with EntityLinker

NER Utilities

  • EntityTruncator: Smart truncation by entity type
  • ContextScorer: Configurable weights for entity classification
  • EntityLinker: Fuzzy grouping of similar mentions
  • RadicadoValidator: Validation of Colombian radicados (DANE format)
  • DocumentStructureAnalyzer: Document section detection
  • TemporalFeatures: Date normalization and validation
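
EntityLinker's fuzzy grouping rests on token-set Jaccard similarity (the diagram lists `_jaccard_similarity`); a sketch with assumed lowercase/whitespace normalization:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, as used to link entity mentions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Two mentions of the same plaintiff share most tokens:
assert jaccard("JUAN PEREZ GOMEZ", "juan perez") == 2 / 3
# Unrelated names score low:
assert jaccard("JUAN PEREZ", "MARIA LOPEZ") == 0.0
```

Mentions whose similarity exceeds `similarity_threshold` are merged into one linked entity.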

Implemented (Sprint 08)

  • EntityExtractorEnsemble ✅: Weighted voting across extractors

Duplicate Detection

  • HybridEnsembleDetector: Orchestrates 4 detection levels
  • Weighted ensemble of scores
  • Confidence classification
  • Short-circuit optimization
  • ExactMatcher: SHA-256 hashing
  • MinHashMatcher: MinHash + LSH
  • TFIDFMatcher: TF-IDF + cosine similarity
  • EntityMatcher: Boost based on legal entities
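
The ensemble step can be sketched as a normalized weighted average over whichever levels produced a score (weights and thresholds here are illustrative; the real values live in EnsembleConfig.LEVEL_WEIGHTS and EnsembleConfig.THRESHOLDS):

```python
# Illustrative weights; the real ones come from EnsembleConfig.LEVEL_WEIGHTS.
LEVEL_WEIGHTS = {"exact": 0.4, "minhash": 0.3, "tfidf": 0.2, "entity": 0.1}

def ensemble_score(level_scores: dict) -> float:
    """Weighted average over the levels that actually produced a score."""
    total_w = sum(LEVEL_WEIGHTS[lvl] for lvl in level_scores)
    if total_w == 0:
        return 0.0
    return sum(LEVEL_WEIGHTS[lvl] * s for lvl, s in level_scores.items()) / total_w

def classify(score: float) -> str:
    # Illustrative thresholds; the real ones live in EnsembleConfig.THRESHOLDS.
    return "HIGH" if score >= 0.85 else "MEDIUM" if score >= 0.6 else "LOW"

score = ensemble_score({"minhash": 0.9, "tfidf": 0.8})
assert abs(score - 0.86) < 1e-9
assert classify(score) == "HIGH"
```

Short-circuiting fits naturally on top: if ExactMatcher returns a hash hit, the remaining levels never run.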

Persistence

  • DocumentReader ✅: Segregated read interface (ISP)
  • DocumentWriter ✅: Segregated write interface (ISP)
  • DocumentSearcher ✅: Segregated search interface (ISP)
  • CorrectionRepository ✅: Segregated corrections interface (ISP)
  • MetricsRepository ✅: Segregated metrics interface (ISP)
  • DocumentRepository: Combined interface (inherits the 5 above)
  • SQLiteDocumentRepository: SQLite + FTS5 implementation
  • Composition via mixins: ConnectionMixin, CorrectionMixin, MetricsMixin
  • Full-text search with FTS5
  • Lazy, thread-safe connection management
  • Lookup by hash, ID, and radicado

Design Patterns Applied

1. Hexagonal Architecture (Ports & Adapters)

  • Ports: Interfaces (OCRPort, DocumentRepository, DuplicateDetectorPort)
  • Adapters: Concrete implementations (TesseractAdapter, SQLiteDocumentRepository)
  • Dependency flow: Infrastructure → Ports ← Application

2. Template Method Pattern ✅

  • BaseOCRAdapter: Defines the skeleton algorithm for OCR extraction
  • extract_text() is the template method
  • _extract_from_image_internal() is the abstract method subclasses implement
  • _pre_extraction_check() is an optional hook
  • Allows adding new OCR engines without duplicating the PDF→pages logic
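
A minimal sketch of that structure (method names from the diagram; the one-page PDF stand-in and EchoAdapter are assumptions for illustration):

```python
from abc import ABC, abstractmethod

class BaseOCRAdapterSketch(ABC):
    def extract_text(self, file_path: str) -> str:
        """Template method: fixed skeleton, variable extraction step."""
        self._pre_extraction_check(file_path)        # optional hook
        pages = [f"page-1 of {file_path}"]           # stand-in for PDF→pages
        return "\n".join(self._extract_from_image_internal(p) for p in pages)

    def _pre_extraction_check(self, file_path: str) -> None:
        pass                                         # hook: default no-op

    @abstractmethod
    def _extract_from_image_internal(self, page: str) -> str:
        ...                                          # each engine implements this

class EchoAdapter(BaseOCRAdapterSketch):
    def _extract_from_image_internal(self, page: str) -> str:
        return page.upper()

assert EchoAdapter().extract_text("demanda.pdf") == "PAGE-1 OF DEMANDA.PDF"
```

A new engine only supplies `_extract_from_image_internal`; the page-splitting skeleton is inherited unchanged.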

3. Dependency Injection

  • ServiceContainer: Injects configured dependencies
  • Constructor injection in Use Cases
  • Lazy loading for optimization

4. Strategy Pattern

  • OCRRouter: Selects a strategy (Tesseract vs PaddleOCR) based on quality
  • NER Extractors ✅: Interchangeable extraction strategies
  • MarkerExtractor (priority=1): Extracts from explicit markers
  • SpaCyNERExtractor (priority=2): Neural extraction with POS validation
  • RegexExtractor (priority=3): Regex patterns for structured entities
  • AddressExtractor (priority=4): Plaintiff and defendant addresses
  • ContactExtractor (priority=5): Email and ID number
  • Interchangeable without modifying the Use Case
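
The priority-ordered strategies can be sketched as interchangeable callables (hypothetical mini-versions of MarkerExtractor and RegexExtractor; the toy radicado pattern is an assumption, not the real format):

```python
import re
from typing import Callable, List, Tuple

def marker_extract(text: str) -> List[str]:
    # stand-in for MarkerExtractor: values after an explicit marker
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines() if line.startswith("ACCIONANTE:")]

def regex_extract(text: str) -> List[str]:
    # stand-in for RegexExtractor, with a toy radicado-like pattern
    return re.findall(r"\b\d{4}-\d{5}\b", text)

strategies: List[Tuple[int, Callable[[str], List[str]]]] = [
    (3, regex_extract),
    (1, marker_extract),
]

text = "ACCIONANTE: JUAN PEREZ\nRadicado 2026-00123"
results = []
for _prio, extract in sorted(strategies):    # lower priority number runs first
    results.extend(extract(text))
assert results == ["JUAN PEREZ", "2026-00123"]
```

Adding a new extractor is just appending another `(priority, callable)` pair; no orchestrator code changes.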

4.1 Chain of Responsibility Pattern ✅

  • NER Validators: Sequential chain of validators
  • StructuralValidator → Section-based confidence boost
  • TemporalValidator → Date/year consistency
  • RadicadoFormatValidator → Colombian radicado validation
  • DeduplicationValidator → Fuzzy deduplication
  • ValidatorChain orchestrates sequential execution
  • Each validator can modify/filter entities
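
The chain can be sketched as each validator receiving the previous one's output (`drop_empty` and `dedupe` are hypothetical stand-ins for the real validators):

```python
from typing import Callable, List

Entities = List[dict]

def drop_empty(entities: Entities) -> Entities:
    # hypothetical stand-in for a structural check
    return [e for e in entities if e["value"]]

def dedupe(entities: Entities) -> Entities:
    # hypothetical stand-in for DeduplicationValidator
    seen, out = set(), []
    for e in entities:
        if e["value"] not in seen:
            seen.add(e["value"])
            out.append(e)
    return out

def run_chain(entities: Entities,
              validators: List[Callable[[Entities], Entities]]) -> Entities:
    for validate in validators:       # each link filters/modifies, then passes on
        entities = validate(entities)
    return entities

ents = [{"value": "JUAN"}, {"value": ""}, {"value": "JUAN"}]
assert run_chain(ents, [drop_empty, dedupe]) == [{"value": "JUAN"}]
```

Order matters: a chain is just a list, so inserting a new validator anywhere requires no changes to the existing ones.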

5. Repository Pattern

  • DocumentRepository: Abstracts persistence
  • SQLite can be swapped for PostgreSQL without changing Application

6. Value Object Pattern

  • OCRResult, EntityResult, DuplicateCandidate: Immutable (@dataclass(frozen=True))
  • No identity; compared by value

7. Result Pattern (Railway-Oriented)

  • Use Case: Returns Result[Success, Failure]
  • Composition with .bind() and .map()
  • Errors as values, not exceptions

8. Lazy Loading

  • PaddleOCRAdapter: Loads the model only when needed
  • ServiceContainer: Instantiates services on demand
  • Optimizes memory and startup time

9. Interface Segregation (ISP) ✅

  • Persistence Layer: 5 segregated interfaces
  • DocumentReader: Read operations only
  • DocumentWriter: Write operations only
  • DocumentSearcher: Full-text search only
  • CorrectionRepository: Corrections management only
  • MetricsRepository: Processing metrics only
  • Clients depend only on the minimal interface they need
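
The segregation idea in miniature: small protocols, a combined port inheriting them, and a client that declares only what it needs (signatures simplified from the diagram; InMemoryRepo is a hypothetical test double):

```python
from typing import Optional, Protocol

class DocumentReader(Protocol):
    def find_by_id(self, id: str) -> Optional[dict]: ...

class DocumentWriter(Protocol):
    def save(self, document: dict) -> dict: ...

class DocumentRepository(DocumentReader, DocumentWriter, Protocol):
    """Combined port: inherits the segregated protocols."""

# A read-only client depends only on DocumentReader:
def show_document(reader: DocumentReader, doc_id: str) -> str:
    doc = reader.find_by_id(doc_id)
    return doc["file_name"] if doc else "not found"

class InMemoryRepo:                      # satisfies both protocols structurally
    def __init__(self) -> None:
        self._docs: dict = {}
    def save(self, document: dict) -> dict:
        self._docs[document["id"]] = document
        return document
    def find_by_id(self, id: str) -> Optional[dict]:
        return self._docs.get(id)

repo = InMemoryRepo()
repo.save({"id": "1", "file_name": "demanda.pdf"})
assert show_document(repo, "1") == "demanda.pdf"
assert show_document(repo, "2") == "not found"
```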

10. Mixin Composition (Sprint 17)

  • SQLiteDocumentRepository composed of 3 mixins:
  • ConnectionMixin: Lazy connection management, row mapping
  • CorrectionMixin: Corrections CRUD, statistics
  • MetricsMixin: Processing metrics, daily stats
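
The composition can be sketched with hypothetical mini-mixins (the real ones wrap sqlite3; the dict "connection" here is a stand-in):

```python
class ConnectionMixinSketch:
    """Provides a lazily created 'connection' shared by the other mixins."""
    _conn = None
    def connection(self):
        if self._conn is None:
            self._conn = {"rows": []}    # stand-in for sqlite3.connect(...)
        return self._conn

class MetricsMixinSketch:
    def save_processing_metrics(self, metrics: dict) -> bool:
        # relies on ConnectionMixinSketch being mixed into the same class
        self.connection()["rows"].append(metrics)
        return True

class RepoSketch(ConnectionMixinSketch, MetricsMixinSketch):
    """Concrete repository assembled from focused mixins."""

repo = RepoSketch()
assert repo.save_processing_metrics({"ocr_time_ms": 120.0})
assert repo.connection()["rows"] == [{"ocr_time_ms": 120.0}]
```

Each mixin stays small and testable, and the concrete class is little more than the base-class list.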

SOLID Principles

  • S (Single Responsibility): Each class has a single responsibility
  • ExactMatcher only does hashing
  • EntityTruncator only truncates entities
  • BaseOCRAdapter defines the skeleton; subclasses implement extraction
  • MarkerExtractor only extracts from markers ✅
  • StructuralValidator only applies section boosts ✅
  • O (Open/Closed): Extensible with new adapters without modifying Use Cases
  • Template Method in BaseOCRAdapter allows adding OCR engines
  • Strategy Pattern in NER allows adding extractors ✅
  • Chain of Responsibility allows adding validators ✅
  • L (Liskov Substitution): All adapters correctly implement their Ports
  • TesseractAdapter and PaddleOCRAdapter are interchangeable via BaseOCRAdapter
  • All extractors implement the EntityExtractor Protocol ✅
  • All validators implement the EntityValidator Protocol ✅
  • I (Interface Segregation): Interfaces segregated by specific responsibility
  • DocumentReader, DocumentWriter, DocumentSearcher, CorrectionRepository
  • EntityExtractor, EntityValidator (minimal protocols) ✅
  • Clients depend only on the interface they need
  • D (Dependency Inversion): Use Cases depend on Ports, not Adapters
  • GUI imports from core/application, not from persistence
  • SpaCyAdapter depends on the EntityExtractor and EntityValidator protocols ✅

Key Relationships

  1. ServiceContainer creates and injects all dependencies
  2. ProcessDocumentUseCase orchestrates the pipeline using Ports
  3. OCRRouter delegates to concrete adapters (Tesseract/PaddleOCR)
  4. HybridEnsembleDetector coordinates 4 independent matchers
  5. SQLiteDocumentRepository persists Document entities

Last updated: 2026-03-05 (Sprint 19: Document +cedula/correo/imported_at, import/export DTOs, index_document_with_fields NER)