Main Class Diagram¶
Description¶
UML class diagram showing the main entities, value objects, ports (protocols), and adapters of the Sherlock-docs system.
It focuses on the hexagonal architecture: Core (Domain) → Application → Ports ← Infrastructure.
Diagram¶
classDiagram
%% ============================================
%% DOMAIN LAYER (Core Entities & Value Objects)
%% ============================================
class Document {
+String id
+String file_path
+String file_name
+DocumentType document_type
+DocumentStatus status
+String content
+float ocr_confidence
+String file_hash
+String? demandante
+String? demandado
+String? radicado
+String? juzgado
+datetime? fecha_documento
+String hora_radicacion
+String numero_acta
+String? cedula
+String? correo
+String observaciones
+String etiquetas
+String direccion_demandante
+String direccion_demandado
+datetime created_at
+datetime updated_at
+datetime? imported_at
}
note for Document "33 fields total in the DB\n(only the main ones are shown)"
class DocumentType {
<<enumeration>>
TUTELA
HABEAS_CORPUS
UNKNOWN
}
class DocumentStatus {
<<enumeration>>
PENDING
PROCESSING
PROCESSED
VALIDATED
ERROR
}
class OCRResult {
<<value object>>
+String text
+float confidence
+OCREngine engine_used
+int pages_processed
+float processing_time_ms
+Tuple~String~ errors
}
class EntityResult {
<<value object>>
+String? demandante
+String? demandado
+String? radicado
+String? juzgado
+String? fecha
+String? direccion_demandante
+String? direccion_demandado
}
class DuplicateCandidate {
<<value object>>
+String document_id
+String source_document_id
+Dict~str,LevelScore~ level_scores
+float final_score
+DuplicateConfidence confidence
+String detection_method
+String explanation
+Dict~String,bool~ matched_entities
}
Document --> DocumentType
Document --> DocumentStatus
%% ============================================
%% APPLICATION LAYER (Use Cases & DTOs)
%% ============================================
class ProcessDocumentUseCase {
-OCRPort _ocr_port
-NERPort _ner_port
-DuplicateDetectorPort _dedup_port
-DocumentRepository _repository
-Callable _detect_provisional_measure
-Callable _detect_sections
+execute(request: ProcessDocumentRequest) Result~ProcessDocumentResponse, ProcessingError~
-_extract_text(file_path: Path) Result~OCRResult, ProcessingError~
-_extract_entities(text: str) Result~EntityResult, ProcessingError~
-_calculate_hash(content: str) String
-_check_duplicate(hash: str) Document?
}
class ProcessDocumentRequest {
<<DTO>>
+String file_path
+String document_type
}
class ProcessDocumentResponse {
<<DTO>>
+String document_id
+String file_name
+String content
+float ocr_confidence
+String? demandante
+String? demandado
+String? radicado
+String? juzgado
+String? fecha
+float processing_time_ms
+String ocr_engine
+bool is_duplicate
+String? duplicate_of
+List~DuplicateCandidateDTO~? duplicate_candidates
}
class ProcessingMetrics {
<<DTO>>
+String document_id
+float ocr_time_ms
+float ner_time_ms
+float dedup_time_ms
+float total_time_ms
+String ocr_engine
}
class ExportExcelRequest {
<<DTO>>
+String? start_date
+String? end_date
+String? document_type
}
class ExportExcelResponse {
<<DTO>>
+bytes file_content
+String file_name
+int total_exported
}
class ImportExcelRequest {
<<DTO>>
+String file_path
}
class ImportExcelResponse {
<<DTO>>
+int total_imported
+int total_errors
+List~str~ errors
}
class ServiceContainer {
-Path db_path
-OCRConfigV2 ocr_config
-NERConfig ner_config
-EnsembleConfig dedup_config
-SQLiteDocumentRepository? _repository
-OCRRouter? _ocr_router
-SpaCyAdapter? _ner_adapter
-HybridEnsembleDetector? _duplicate_detector
+repository() DocumentRepository
+ocr_router() OCRRouter
+ner_adapter() SpaCyAdapter
+duplicate_detector() HybridEnsembleDetector
+process_document_use_case() ProcessDocumentUseCase
+close() None
}
ProcessDocumentUseCase --> ProcessDocumentRequest
ProcessDocumentUseCase --> ProcessDocumentResponse
ServiceContainer ..> ProcessDocumentUseCase : creates
%% ============================================
%% PORTS (Protocols/Interfaces)
%% ============================================
class OCRPort {
<<protocol>>
+extract_text(file_path: Path) OCRResult
}
class DocumentReader {
<<protocol - ISP>>
+find_by_id(id: String) Document?
+find_by_hash(hash: String) Document?
+get_all(limit: int) List~Document~
+count() int
}
class DocumentWriter {
<<protocol - ISP>>
+save(document: Document) Document
}
class DocumentSearcher {
<<protocol - ISP>>
+search(query: String, limit: int, offset: int) SearchResult
}
class CorrectionRepository {
<<protocol - ISP>>
+save_correction(correction: Correction) Correction
+get_corrections_by_document(doc_id: String) List~Correction~
+get_correction_stats() CorrectionStats
}
class MetricsRepository {
<<protocol - ISP>>
+save_processing_metrics(metrics: ProcessingMetrics) bool
+get_average_processing_times() Dict~str,float~
+get_daily_stats(days: int) Dict~str,object~
}
class DocumentRepository {
<<protocol - combined>>
%% Inherits from: DocumentReader, DocumentWriter, DocumentSearcher, CorrectionRepository, MetricsRepository
}
DocumentReader <|-- DocumentRepository : extends
DocumentWriter <|-- DocumentRepository : extends
DocumentSearcher <|-- DocumentRepository : extends
CorrectionRepository <|-- DocumentRepository : extends
MetricsRepository <|-- DocumentRepository : extends
class DuplicateDetectorPort {
<<protocol>>
+detect_duplicates(content: str, doc_id: str) DuplicateDetectionResult
+index_document(doc_id: str, content: str) None
+index_document_with_fields(doc_id: str, content: str, cedula: str?, correo: str?, demandante: str?, demandado: str?) None
+index_corpus(corpus: Dict~str,str~) None
}
ProcessDocumentUseCase --> OCRPort
ProcessDocumentUseCase --> DocumentRepository
ProcessDocumentUseCase --> DuplicateDetectorPort
%% ============================================
%% INFRASTRUCTURE LAYER (Adapters)
%% ============================================
class OCRRouter {
-DocumentQualityAnalyzer quality_analyzer
-TesseractAdapter tesseract_adapter
-PaddleOCRAdapter paddle_adapter
-OCRConfigV2 config
+extract_text(file_path: Path) OCRResult
-_select_engine(quality: DocumentQuality) RoutingDecision
-_fallback_extract(file_path: Path) OCRResult
}
class BaseOCRAdapter {
<<abstract - Template Method>>
#OCRConfigV2 _config
#Logger _logger
+extract_text(file_path: Path) OCRResult
+extract_from_image(image: ndarray) OCRResult
#_extract_from_image_internal(image: ndarray)* OCRResult
#_get_engine()* OCREngine
#_get_engine_name()* str
#_pre_extraction_check(file_path: Path) None
+is_available()* bool
}
class TesseractAdapter {
-OCRConfigV2 config
#_extract_from_image_internal(image: ndarray) OCRResult
#_get_engine() OCREngine
#_get_engine_name() str
+is_available() bool
}
class PaddleOCRAdapter {
-OCRConfigV2 config
-PaddleOCR? _paddle_model
#_extract_from_image_internal(image: ndarray) OCRResult
#_get_engine() OCREngine
#_get_engine_name() str
-_lazy_load_model() PaddleOCR
+is_available() bool
}
class DocumentQualityAnalyzer {
-OCRConfigV2 config
+analyze_quality(file_path: Path) DocumentQuality
-_calculate_metrics(image: ndarray) Dict
}
OCRPort <|.. OCRRouter : implements
OCRPort <|.. BaseOCRAdapter : implements
BaseOCRAdapter <|-- TesseractAdapter : extends
BaseOCRAdapter <|-- PaddleOCRAdapter : extends
OCRRouter --> TesseractAdapter
OCRRouter --> PaddleOCRAdapter
OCRRouter --> DocumentQualityAnalyzer
OCRRouter --> OCRResult
%% ============================================
%% NER EXTRACTORS (Strategy Pattern) ✅
%% ============================================
class EntityExtractor {
<<protocol>>
+name() str
+priority() int
+extract(text: str, doc: Any?) List~Entity~
}
class BaseExtractor {
<<abstract>>
#Logger _logger
+name()* str
+priority()* int
+extract(text: str, doc: Any?)* List~Entity~
#_log_extraction_start(text_length: int) None
#_log_extraction_result(count: int) None
}
class MarkerExtractor {
-EntityTruncator _entity_truncator
+name() str
+priority() int = 1
+extract(text: str, doc: Any?) List~Entity~
-_extract_for_entity_type(text: str, type: LegalEntityType) List~Entity~
}
class SpaCyNERExtractor {
-ContextScorer _context_scorer
-int _context_window_before
-int _context_window_after
+name() str
+priority() int = 2
+extract(text: str, doc: Any?) List~Entity~
-_classify_entity(...) Tuple~str,float~
-_calculate_pos_penalty(...) float
}
class RegexExtractor {
+name() str
+priority() int = 3
+extract(text: str, doc: Any?) List~Entity~
}
class AddressExtractor {
+name() str
+priority() int = 4
+extract(text: str, doc: Any?) List~Entity~
}
class ContactExtractor {
+name() str
+priority() int = 5
+extract(text: str, doc: Any?) List~Entity~
}
EntityExtractor <|.. BaseExtractor : implements
BaseExtractor <|-- MarkerExtractor : extends
BaseExtractor <|-- SpaCyNERExtractor : extends
BaseExtractor <|-- RegexExtractor : extends
BaseExtractor <|-- AddressExtractor : extends
BaseExtractor <|-- ContactExtractor : extends
%% ============================================
%% NER VALIDATORS (Chain of Responsibility) ✅
%% ============================================
class EntityValidator {
<<protocol>>
+name() str
+validate(entities: List~Entity~, context: ValidationContext) List~Entity~
}
class BaseValidator {
<<abstract>>
#Logger _logger
+name()* str
+validate(...)* List~Entity~
#_log_validation_start(count: int) None
#_log_validation_result(original: int, final: int) None
}
class ValidationContext {
<<dataclass>>
+str full_text
+Any? doc
+List~Section~? sections
+List~Entity~? date_entities
+Entity? radicado_entity
}
class ValidatorChain {
-List~EntityValidator~ _validators
-Logger _logger
+validators() List~EntityValidator~
+validate(entities: List~Entity~, context: ValidationContext) List~Entity~
}
class StructuralValidator {
+name() str
+validate(...) List~Entity~
}
class TemporalValidator {
+name() str
+validate(...) List~Entity~
}
class RadicadoFormatValidator {
-RadicadoValidator _validator
-bool _strict_mode
+name() str
+validate(...) List~Entity~
}
class DeduplicationValidator {
-EntityLinker _entity_linker
+name() str
+validate(...) List~Entity~
}
EntityValidator <|.. BaseValidator : implements
BaseValidator <|-- StructuralValidator : extends
BaseValidator <|-- TemporalValidator : extends
BaseValidator <|-- RadicadoFormatValidator : extends
BaseValidator <|-- DeduplicationValidator : extends
ValidatorChain o-- EntityValidator : aggregates
%% ============================================
%% SPACY ADAPTER (Orchestrator) ✅
%% ============================================
class SpaCyAdapter {
-NERConfig _config
-spacy.Language? _nlp
-List~EntityExtractor~ _extractors
-List~EntityValidator~ _validators
-ValidatorChain _validator_chain
-ContextScorer _context_scorer
-EntityLinker _entity_linker
-EntityTruncator _entity_truncator
+extract_entities(text: str) Result~EntityResult,NERError~
+is_available() bool
-_get_nlp() spacy.Language
-_create_default_extractors() List~EntityExtractor~
-_create_default_validators() List~EntityValidator~
-_get_first_by_type(entities: List, label: str) str
}
SpaCyAdapter o-- EntityExtractor : uses extractors
SpaCyAdapter o-- ValidatorChain : uses validator chain
SpaCyAdapter --> EntityResult
%% ============================================
%% NER UTILITIES ✅
%% ============================================
class EntityTruncator {
<<utility>>
-int max_length
-Tuple COMMON_DELIMITERS
-Tuple DEMANDADO_DELIMITERS
-Tuple CORPORATE_SUFFIXES
-Logger _logger
+truncate(value: str, entity_type: LegalEntityType) str
-_truncate_at_delimiter(value: str, delimiters: Tuple) str
-_ensure_corporate_suffix(value: str, original: str) str
}
class ContextScorer {
<<utility>>
-Dict weights
-float threshold
+score_entity(entity: str, context: str) float
+predict(entity_text: str, ...) ScoringResult
-_extract_features(entity: str, context: str) ContextFeatures
}
class EntityLinker {
<<utility>>
-float similarity_threshold
+link_mentions(entities: List) List~LinkedEntity~
+deduplicate_to_entities(entities: List) List~Entity~
+normalize_text(text: str) str
-_jaccard_similarity(a: Set, b: Set) float
}
class RadicadoValidator {
<<utility>>
-bool strict_mode
-Set DANE_CODES
-Set CORPORATION_CODES
+validate(radicado: str, context: str?) RadicadoValidation
+extract_components(radicado: str) Dict
}
class DocumentStructureAnalyzer {
<<analyzer>>
-Dict SECTION_PATTERNS
-List~Section~ _sections
+detected_sections() List~str~
+get_section_at_position(pos: int) Section?
+get_confidence_boost(entity_type: str, position: int) float
}
class TemporalFeatures {
<<module - functions>>
+parse_spanish_date(text: str) DateParseResult?
+normalize_date(date: str) str?
+validate_year(year: int) bool
+detect_temporal_context(text: str, position: int) TemporalContext?
+get_date_confidence_adjustment(...) float
}
MarkerExtractor --> EntityTruncator
SpaCyNERExtractor --> ContextScorer
DeduplicationValidator --> EntityLinker
RadicadoFormatValidator --> RadicadoValidator
StructuralValidator --> DocumentStructureAnalyzer
TemporalValidator --> TemporalFeatures
%% ============================================
%% ENSEMBLE (Implemented in Sprint 08)
%% ============================================
class EntityExtractorEnsemble {
<<implemented - Sprint 08>>
-List~EntityExtractor~ extractors
-Dict~str,float~ weights
-str voting_strategy
+extract(text: str) EntityResult
-_collect_votes(text: str) Dict
-_resolve_votes(votes: Dict) EntityResult
-_weighted_vote(candidates: List) Entity
}
EntityExtractorEnsemble o-- EntityExtractor : orchestrates
%% ============================================
%% ACTIVE LEARNING (Infrastructure, Sprint 08)
%% ============================================
class Correction {
<<value object>>
+String id
+String document_id
+String field_name
+String original_value
+String corrected_value
+String correction_source
+datetime created_at
}
class CorrectionStats {
<<value object>>
+int total_corrections
+Dict~str,int~ corrections_by_field
+float correction_rate
}
class ActiveLearningInfrastructure {
<<module>>
+save_correction(correction: Correction) Correction
+get_corrections_by_document(doc_id: str) List~Correction~
+get_correction_stats() CorrectionStats
+analyze_patterns() Dict
}
CorrectionRepository <|.. ActiveLearningInfrastructure : implements
ActiveLearningInfrastructure --> Correction
ActiveLearningInfrastructure --> CorrectionStats
class EnsembleConfig {
<<frozen dataclass>>
+Dict~str,float~ LEVEL_WEIGHTS
+Dict~str,Dict~ THRESHOLDS
+int MINHASH_NUM_PERM
+float MINHASH_THRESHOLD
+int TFIDF_MAX_FEATURES
+float ENTITY_FUZZY_THRESHOLD
+bool USE_SHORT_CIRCUIT
+int TOP_K_CANDIDATES
}
class ConfigPersistence {
<<module - dedup.config>>
+save_config(config, path) bool
+load_config(path) EnsembleConfig
}
ConfigPersistence --> EnsembleConfig
class HybridEnsembleDetector {
-EnsembleConfig config
-ExactMatcher exact_matcher
-MinHashMatcher minhash_matcher
-TFIDFMatcher tfidf_matcher
-EntityMatcher entity_matcher
+detect_duplicates(content: str, doc_id: str) DuplicateDetectionResult
+index_document(doc_id: str, content: str) None
+index_corpus(corpus: Dict) None
-_ensemble_scores(scores: List~LevelScore~) float
-_classify_confidence(score: float) DuplicateConfidence
}
class ExactMatcher {
-Dict~str,str~ hash_index
+match(content: str) LevelScore?
+index(doc_id: str, content: str) None
}
class MinHashMatcher {
-MinHashLSH lsh_index
-int n_perm
-float threshold
+match(content: str) List~LevelScore~
+index(doc_id: str, content: str) None
}
class TFIDFMatcher {
-TfidfVectorizer vectorizer
-scipy.sparse.csr_matrix? tfidf_matrix
-Dict~str,int~ doc_id_to_index
+match(content: str) List~LevelScore~
+index_corpus(corpus: Dict) None
}
class EntityMatcher {
-Dict~str,Dict~ entity_index
-float boost_value
+match(entities: Dict, doc_id: str) List~LevelScore~
+index(doc_id: str, entities: Dict) None
}
DuplicateDetectorPort <|.. HybridEnsembleDetector : implements
HybridEnsembleDetector --> ExactMatcher
HybridEnsembleDetector --> MinHashMatcher
HybridEnsembleDetector --> TFIDFMatcher
HybridEnsembleDetector --> EntityMatcher
HybridEnsembleDetector --> DuplicateCandidate
class ConnectionMixin {
<<mixin>>
-Path db_path
-sqlite3.Connection? _connection
+connection() sqlite3.Connection
+close() None
#_safe_get(row, key, default) str
#_safe_get_int(row, key, default) int
#_safe_get_bool(row, key, default) bool
#_safe_get_datetime(row, key, default) datetime?
#_row_to_document(row) Document
}
class CorrectionMixin {
<<mixin>>
+save_correction(correction) Correction
+get_corrections_by_document(doc_id) List~Correction~
+get_correction_stats() CorrectionStats
+update_document_field(doc_id, field, value) Document?
}
class MetricsMixin {
<<mixin>>
+save_processing_metrics(metrics) bool
+get_processing_metrics(doc_id) ProcessingMetrics?
+get_average_processing_times() Dict~str,float~
+get_daily_stats(days) Dict~str,object~
}
class SQLiteDocumentRepository {
+save(document: Document) Document
+find_by_id(id: String) Document?
+find_by_hash(hash: String) Document?
+find_by_radicado(radicado: String) Document?
+search(query: String) SearchResult
+get_all(limit: int) List~Document~
+count() int
+delete(doc_id: String) bool
}
ConnectionMixin <|-- SQLiteDocumentRepository : extends
CorrectionMixin <|-- SQLiteDocumentRepository : extends
MetricsMixin <|-- SQLiteDocumentRepository : extends
DocumentRepository <|.. SQLiteDocumentRepository : implements
SQLiteDocumentRepository --> Document
%% ============================================
%% RELATIONSHIPS
%% ============================================
ServiceContainer --> DocumentRepository
ServiceContainer --> OCRRouter
ServiceContainer --> SpaCyAdapter
ServiceContainer --> HybridEnsembleDetector
Component Description¶
Domain Layer (Core)¶
Entities (Aggregates)
- Document: Main entity representing a processed legal document
- Holds metadata, extracted content, and legal entities
- Immutable after creation (except status)
- ID generated with UUID4
Value Objects
- OCRResult: Immutable result of an OCR extraction
- EntityResult: Entities extracted by NER
- DuplicateCandidate: Duplicate candidate with score and explanation
Enums
- DocumentType: Document type (TUTELA, HABEAS_CORPUS, UNKNOWN)
- DocumentStatus: Processing status (PENDING, PROCESSING, PROCESSED, VALIDATED, ERROR)
Application Layer¶
Use Cases
- ProcessDocumentUseCase: Orchestrates the full pipeline (OCR → NER → Dedup → Persist)
- Railway-Oriented Programming with the `Result` pattern
- Explicit error handling at every stage
- Operation composition with `bind` and `map`
Service Container
- ServiceContainer: Dependency Injection container
- Lazy loading of expensive resources (OCR/NER models)
- Singleton pattern for configurations
- Context manager for resource handling
DTOs (Data Transfer Objects)
- ProcessDocumentRequest: Immutable request with file_path and document_type
- ProcessDocumentResponse: Full response including metrics and duplicates
- ProcessingMetrics: Per-stage timing metrics (OCR, NER, dedup)
- ExportExcelRequest/Response: Bulk export to Excel with filters
- ImportExcelRequest/Response: Bulk import from Excel
Ports (Protocols/Interfaces)¶
Contracts defined by Application and implemented by Infrastructure:
- OCRPort: Interface for OCR engines
- DocumentRepository: Interface for persistence
- DuplicateDetectorPort: Interface for duplicate detectors
Principle: Dependency Inversion (DIP) - Application depends on abstractions, not on implementations.
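A minimal sketch of how such a port can be declared with `typing.Protocol`; the trimmed-down `OCRResult` and the `FakeOCRAdapter` below are illustrative stand-ins, not the real classes:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol

# Trimmed-down OCRResult for illustration; the real value object has more fields.
@dataclass(frozen=True)
class OCRResult:
    text: str
    confidence: float

class OCRPort(Protocol):
    """Contract defined by the Application layer."""
    def extract_text(self, file_path: Path) -> OCRResult: ...

# An adapter satisfies the port structurally; no inheritance required.
class FakeOCRAdapter:
    def extract_text(self, file_path: Path) -> OCRResult:
        return OCRResult(text=f"contents of {file_path.name}", confidence=1.0)

def run_ocr(port: OCRPort, path: Path) -> OCRResult:
    # The caller depends only on the abstraction (DIP).
    return port.extract_text(path)
```

Because `Protocol` uses structural typing, any adapter exposing a matching `extract_text` signature plugs in without touching the Application layer.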
Infrastructure Layer (Adapters)¶
OCR Adapters
- OCRRouter: Intelligent routing between Tesseract and PaddleOCR
- Analyzes document quality
- Automatic fallback when confidence is low
- BaseOCRAdapter ✅: Abstract class using the Template Method pattern
- Implements `extract_text()` as the skeleton algorithm
- Subclasses implement `_extract_from_image_internal()`
- Hooks: `_pre_extraction_check()`, `is_available()`
- TesseractAdapter: Inherits from BaseOCRAdapter, wraps Tesseract-OCR
- PaddleOCRAdapter: Inherits from BaseOCRAdapter, wraps PaddleOCR (lazy-loaded)
- DocumentQualityAnalyzer: Analyzes quality metrics (contrast, sharpness)
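The routing decision can be pictured roughly as follows; the metric names and the 0.5 threshold are assumptions made for the sketch, not the actual `OCRConfigV2` values:

```python
from dataclasses import dataclass

# Illustrative quality metrics; the real analyzer computes more of them.
@dataclass(frozen=True)
class DocumentQuality:
    contrast: float   # 0..1, higher is better
    sharpness: float  # 0..1, higher is better

def select_engine(quality: DocumentQuality, threshold: float = 0.5) -> str:
    # Clean scans go to the fast engine; degraded ones to the robust engine.
    score = (quality.contrast + quality.sharpness) / 2
    return "tesseract" if score >= threshold else "paddleocr"
```
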
NER Adapters (Refactored 2026-02-04)
- SpaCyAdapter ✅: NER extraction orchestrator (342 lines, a 53% reduction)
- `es_core_news_lg` model with an EntityRuler
- Uses the Strategy Pattern for extractors
- Uses Chain of Responsibility for validators
- Lazy loading of the model
NER Extractors (Strategy Pattern) ✅
- EntityExtractor (Protocol): Interface for extraction strategies
- BaseExtractor (ABC): Base class with shared logging
- MarkerExtractor (priority=1): Extracts from explicit markers (ACCIONANTE:, etc.)
- SpaCyNERExtractor (priority=2): Neural extraction with POS validation
- RegexExtractor (priority=3): Regex patterns for radicados, dates, and courts
- AddressExtractor (priority=4): Addresses of the demandante and demandado
- ContactExtractor (priority=5): Email and ID number (CC/CE/NIT)
NER Validators (Chain of Responsibility) ✅
- EntityValidator (Protocol): Interface for chained validators
- BaseValidator (ABC): Base class with shared logging
- ValidatorChain: Orchestrates sequential execution of the validators
- ValidationContext: Dataclass with context shared across validators
- StructuralValidator: Confidence boost based on the document section
- TemporalValidator: Date validation and year consistency
- RadicadoFormatValidator: Format validation for Colombian radicados
- DeduplicationValidator: Fuzzy deduplication via EntityLinker
NER Utilities ✅
- EntityTruncator: Smart truncation per entity type
- ContextScorer: Configurable weights for entity classification
- EntityLinker: Fuzzy grouping of similar mentions
- RadicadoValidator: Validation of Colombian radicados (DANE format)
- DocumentStructureAnalyzer: Detection of document sections
- TemporalFeatures: Date normalization and validation
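The diagram names `_jaccard_similarity` inside EntityLinker; a sketch of fuzzy mention grouping along those lines, where the 0.5 threshold and the greedy grouping strategy are assumptions:

```python
def _normalize(text: str) -> set[str]:
    # Lowercase token set; the real normalization likely strips accents too.
    return set(text.lower().split())

def jaccard_similarity(a: str, b: str) -> float:
    sa, sb = _normalize(a), _normalize(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def link_mentions(mentions: list[str], threshold: float = 0.5) -> list[list[str]]:
    # Greedy grouping: attach each mention to the first group it resembles.
    groups: list[list[str]] = []
    for mention in mentions:
        for group in groups:
            if jaccard_similarity(mention, group[0]) >= threshold:
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups
```
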
Implemented (Sprint 08)
- EntityExtractorEnsemble ✅: Weighted voting across extractors
Duplicate Detection
- HybridEnsembleDetector: Orchestrates 4 detection levels
- Weighted ensemble of scores
- Confidence classification
- Short-circuit optimization
- ExactMatcher: SHA-256 hash
- MinHashMatcher: MinHash + LSH
- TFIDFMatcher: TF-IDF + cosine similarity
- EntityMatcher: Boost based on legal entities
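The weighted-ensemble idea can be sketched like this; the level weights and confidence bands are illustrative assumptions (the real values live in `EnsembleConfig`):

```python
# Illustrative weights per detection level; assumed, not the real LEVEL_WEIGHTS.
LEVEL_WEIGHTS = {"exact": 0.4, "minhash": 0.25, "tfidf": 0.25, "entity": 0.1}

def ensemble_score(level_scores: dict[str, float]) -> float:
    # Weighted average over the levels that actually produced a score,
    # renormalizing so that missing levels do not drag the result down.
    active = {k: v for k, v in level_scores.items() if k in LEVEL_WEIGHTS}
    total_weight = sum(LEVEL_WEIGHTS[k] for k in active)
    if total_weight == 0:
        return 0.0
    return sum(LEVEL_WEIGHTS[k] * v for k, v in active.items()) / total_weight

def classify_confidence(score: float) -> str:
    # Threshold bands are assumptions; the real ones sit in EnsembleConfig.THRESHOLDS.
    if score >= 0.9:
        return "HIGH"
    if score >= 0.7:
        return "MEDIUM"
    return "LOW"
```
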
Persistence
- DocumentReader ✅: Segregated read interface (ISP)
- DocumentWriter ✅: Segregated write interface (ISP)
- DocumentSearcher ✅: Segregated search interface (ISP)
- CorrectionRepository ✅: Segregated corrections interface (ISP)
- MetricsRepository ✅: Segregated metrics interface (ISP)
- DocumentRepository: Combined interface (inherits from the 5 above)
- SQLiteDocumentRepository: SQLite + FTS5 implementation
- Composed via mixins: ConnectionMixin, CorrectionMixin, MetricsMixin
- Full-text search with FTS5
- Lazy, thread-safe connection management
- Lookup by hash, ID, and radicado
Applied Design Patterns¶
1. Hexagonal Architecture (Ports & Adapters)¶
- Ports: Interfaces (`OCRPort`, `DocumentRepository`, `DuplicateDetectorPort`)
- Adapters: Concrete implementations (`TesseractAdapter`, `SQLiteDocumentRepository`)
- Dependency flow: Infrastructure → Ports ← Application
2. Template Method Pattern ✅¶
- BaseOCRAdapter: Defines the skeleton algorithm for OCR extraction
- `extract_text()` is the template method
- `_extract_from_image_internal()` is the abstract method that subclasses implement
- `_pre_extraction_check()` is an optional hook
- Allows adding new OCR engines without duplicating the PDF→pages logic
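A compact sketch of the pattern as described; the page-string representation is a stand-in so the example runs without OCR libraries:

```python
from abc import ABC, abstractmethod

# Minimal Template Method sketch mirroring BaseOCRAdapter: the base class
# owns the skeleton loop, subclasses fill in the engine-specific step.
class BaseOCRAdapter(ABC):
    def extract_text(self, pages: list[str]) -> str:
        # Template method: shared pre-check and page loop.
        self._pre_extraction_check(pages)
        return "\n".join(self._extract_page(page) for page in pages)

    def _pre_extraction_check(self, pages: list[str]) -> None:
        # Optional hook; default behavior only guards against empty input.
        if not pages:
            raise ValueError("no pages to process")

    @abstractmethod
    def _extract_page(self, page: str) -> str: ...

class UppercaseAdapter(BaseOCRAdapter):
    # Stand-in "engine" so the sketch stays runnable without OCR dependencies.
    def _extract_page(self, page: str) -> str:
        return page.upper()
```

Adding another engine means writing one more subclass; the PDF→pages skeleton never gets duplicated.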
3. Dependency Injection¶
- ServiceContainer: Injects configured dependencies
- Constructor injection in Use Cases
- Lazy loading for optimization
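Constructor injection plus lazy construction can be sketched as follows; only the names echo the diagram, the internals are assumptions:

```python
from functools import cached_property

class Repository:
    # Stand-in for SQLiteDocumentRepository.
    def save(self, doc: str) -> str:
        return f"saved:{doc}"

class ProcessDocumentUseCase:
    def __init__(self, repository: Repository) -> None:
        self._repository = repository  # constructor injection

    def execute(self, doc: str) -> str:
        return self._repository.save(doc)

class ServiceContainer:
    @cached_property
    def repository(self) -> Repository:
        # Built on first access only (lazy loading), then reused (singleton-like).
        return Repository()

    def process_document_use_case(self) -> ProcessDocumentUseCase:
        return ProcessDocumentUseCase(self.repository)
```
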
4. Strategy Pattern¶
- OCRRouter: Selects a strategy (Tesseract vs PaddleOCR) based on document quality
- NER Extractors ✅: Interchangeable extraction strategies
- `MarkerExtractor` (priority=1): Extracts from explicit markers
- `SpaCyNERExtractor` (priority=2): Neural extraction with POS validation
- `RegexExtractor` (priority=3): Regex patterns for structured entities
- `AddressExtractor` (priority=4): Addresses of the demandante and demandado
- `ContactExtractor` (priority=5): Email and cédula
- Interchangeable without modifying the Use Case
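A runnable sketch of the priority-ordered strategy pipeline; the extractor bodies are toy stand-ins for the real marker and regex logic:

```python
import re

class MarkerExtractor:
    priority = 1
    def extract(self, text: str) -> dict[str, str]:
        # Toy marker rule: grab the uppercase run after "ACCIONANTE:".
        m = re.search(r"ACCIONANTE:\s*([A-ZÁÉÍÓÚÑ ]+)", text)
        return {"demandante": m.group(1).strip()} if m else {}

class RegexExtractor:
    priority = 3
    def extract(self, text: str) -> dict[str, str]:
        # Toy date rule: ISO-style date anywhere in the text.
        m = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)
        return {"fecha": m.group(0)} if m else {}

def run_extractors(text: str, extractors: list) -> dict[str, str]:
    entities: dict[str, str] = {}
    # Lower priority number runs first; earlier strategies win on conflicts.
    for extractor in sorted(extractors, key=lambda e: e.priority):
        for key, value in extractor.extract(text).items():
            entities.setdefault(key, value)
    return entities
```

A new strategy only needs `priority` and `extract`; the pipeline itself never changes.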
4.1 Chain of Responsibility Pattern ✅¶
- NER Validators: Sequential chain of validators
- `StructuralValidator` → boost based on the document section
- `TemporalValidator` → date/year consistency
- `RadicadoFormatValidator` → validation of Colombian radicados
- `DeduplicationValidator` → fuzzy deduplication
- `ValidatorChain` orchestrates the sequential execution
- Each validator may modify or filter entities
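The chain can be sketched with plain callables; the 23-digit radicado rule and the 0.5 confidence cutoff are simplifying assumptions:

```python
from typing import Callable

Entity = dict  # e.g. {"type": ..., "value": ..., "confidence": ...}
Validator = Callable[[list[Entity]], list[Entity]]

def drop_low_confidence(entities: list[Entity]) -> list[Entity]:
    # Each validator takes the current list and returns a new, filtered one.
    return [e for e in entities if e["confidence"] >= 0.5]

def radicado_format(entities: list[Entity]) -> list[Entity]:
    # Toy rule: keep radicados only when they are exactly 23 digits.
    return [
        e for e in entities
        if e["type"] != "radicado"
        or (e["value"].isdigit() and len(e["value"]) == 23)
    ]

def run_chain(entities: list[Entity], chain: list[Validator]) -> list[Entity]:
    # Sequential execution: the output of one validator feeds the next.
    for validate in chain:
        entities = validate(entities)
    return entities
```
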
5. Repository Pattern¶
- DocumentRepository: Abstracts persistence
- SQLite can be swapped for PostgreSQL without changing Application
6. Value Object Pattern¶
- OCRResult, EntityResult, DuplicateCandidate: Immutable (`@dataclass(frozen=True)`)
- No identity; compared by value
7. Result Pattern (Railway-Oriented)¶
- Use Cases return `Result[Success, Failure]`
- Composition with `.bind()` and `.map()`
- Errors as values, not exceptions
8. Lazy Loading¶
- PaddleOCRAdapter: Loads the model only when it is needed
- ServiceContainer: Instantiates services on demand
- Optimizes memory and startup time
9. Interface Segregation (ISP) ✅¶
- Persistence Layer: 5 segregated interfaces
- `DocumentReader`: Read operations only
- `DocumentWriter`: Write operations only
- `DocumentSearcher`: Full-text search only
- `CorrectionRepository`: Correction management only
- `MetricsRepository`: Processing metrics only
- Clients depend only on the minimal interface they need
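Sketched with `typing.Protocol`, keeping just two of the five roles for brevity; the method bodies are stand-ins:

```python
from typing import Optional, Protocol

# Small, role-specific protocols...
class DocumentReader(Protocol):
    def find_by_id(self, doc_id: str) -> Optional[str]: ...

class DocumentWriter(Protocol):
    def save(self, doc_id: str, content: str) -> None: ...

# ...plus a combined one for callers that genuinely need both roles.
class DocumentRepository(DocumentReader, DocumentWriter, Protocol):
    pass

class InMemoryRepository:
    # Satisfies the combined protocol; a dict stands in for SQLite.
    def __init__(self) -> None:
        self._docs: dict[str, str] = {}
    def find_by_id(self, doc_id: str) -> Optional[str]:
        return self._docs.get(doc_id)
    def save(self, doc_id: str, content: str) -> None:
        self._docs[doc_id] = content

def show(reader: DocumentReader, doc_id: str) -> str:
    # This client depends only on the read role, never on the writer.
    return reader.find_by_id(doc_id) or "<missing>"
```
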
10. Mixin Composition (Sprint 17)¶
- SQLiteDocumentRepository is composed of 3 mixins:
- `ConnectionMixin`: Lazy connection management, row mapping
- `CorrectionMixin`: Correction CRUD and statistics
- `MetricsMixin`: Processing metrics, daily stats
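A toy sketch of the mixin composition; an in-memory dict stands in for the SQLite connection:

```python
class ConnectionMixin:
    def connection(self) -> dict:
        # Lazily create the "connection" (here just a dict) on first use.
        if not hasattr(self, "_conn"):
            self._conn = {"documents": {}, "metrics": []}
        return self._conn

class MetricsMixin:
    # Relies on connection() provided by a sibling mixin in the final class.
    def save_processing_metrics(self, elapsed_ms: float) -> None:
        self.connection()["metrics"].append(elapsed_ms)

    def get_average_processing_times(self) -> float:
        metrics = self.connection()["metrics"]
        return sum(metrics) / len(metrics) if metrics else 0.0

class DocumentRepositoryImpl(ConnectionMixin, MetricsMixin):
    # The concrete class just composes the focused mixins.
    def save(self, doc_id: str, content: str) -> None:
        self.connection()["documents"][doc_id] = content
```

Each mixin stays small and testable on its own, which is the point of splitting the repository this way.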
SOLID Principles¶
- S (Single Responsibility): Each class has one responsibility
- `ExactMatcher` only does hashing
- `EntityTruncator` only truncates entities
- `BaseOCRAdapter` defines the skeleton; subclasses implement extraction
- `MarkerExtractor` only extracts from markers ✅
- `StructuralValidator` only applies section-based boosts ✅
- O (Open/Closed): Extensible with new adapters without modifying Use Cases
- Template Method in `BaseOCRAdapter` allows adding OCR engines
- Strategy Pattern in NER allows adding extractors ✅
- Chain of Responsibility allows adding validators ✅
- L (Liskov Substitution): All adapters correctly implement their Ports
- `TesseractAdapter` and `PaddleOCRAdapter` are interchangeable via `BaseOCRAdapter`
- All extractors implement the `EntityExtractor` Protocol ✅
- All validators implement the `EntityValidator` Protocol ✅
- I (Interface Segregation): Interfaces segregated by specific responsibility
- `DocumentReader`, `DocumentWriter`, `DocumentSearcher`, `CorrectionRepository`
- `EntityExtractor`, `EntityValidator` (minimal protocols) ✅
- Clients depend only on the interface they need
- D (Dependency Inversion): Use Cases depend on Ports, not on Adapters
- The GUI imports from `core/application`, not from `persistence`
- `SpaCyAdapter` depends on the `EntityExtractor` and `EntityValidator` protocols ✅
Key Relationships¶
- ServiceContainer creates and injects all dependencies
- ProcessDocumentUseCase orchestrates the pipeline through Ports
- OCRRouter delegates to concrete adapters (Tesseract/PaddleOCR)
- HybridEnsembleDetector coordinates 4 independent matchers
- SQLiteDocumentRepository persists Document entities
Last updated: 2026-03-05 (Sprint 19: Document +cedula/correo/imported_at, import/export DTOs, NER index_document_with_fields)