Client-Server Architecture
This document details the client-server architecture of DeepFix, including component responsibilities, communication protocols, and design decisions.
Overview
DeepFix follows a client-server architecture that separates artifact computation (client) from AI-powered analysis (server). This design enables scalability, flexibility, and maintainability.
Architecture Decision
Server State Management: Hybrid (Stateless + In-Memory Cache)
- Stateless core API for horizontal scalability
- In-memory LRU cache for KnowledgeBridge (upgradeable to Redis)
- MLflow as the persistent artifact store
- Local-first deployment model
Key Design Choices:
| Aspect | Decision | Rationale |
|---|---|---|
| Communication | REST API | Simple, HTTP-based, widely supported |
| Artifact Storage | Server pulls from MLflow | Simplifies client, centralizes access control |
| State Management | Stateless + cache | Enables scaling, simpler deployment |
| Deployment | Local-first | Matches current usage, easier migration |
Server Responsibilities
The DeepFix server is responsible for:
1. Artifact Retrieval
- Connect to MLflow tracking server using provided URI
- Fetch artifacts for specified run_id or dataset_name
- Download and cache artifacts locally (temporary storage)
- Handle missing or incomplete artifacts gracefully
Implementation Notes:
- Reuse ArtifactsManager from deepfix-core/artifacts/
- Use MLflow Python client for artifact downloads
- Implement parallel downloads for multiple artifact types
- Store downloaded artifacts in ephemeral temp directory (clean up after analysis)
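A minimal sketch of this retrieval step, assuming the public MLflow Python client and a thread pool for the parallel downloads; `ARTIFACT_PATHS` and `fetch_artifacts` are illustrative names, and the real subdirectory layout is owned by ArtifactsManager:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor

import mlflow

# Hypothetical artifact subdirectories; the actual layout comes from
# ArtifactsManager in deepfix-core/artifacts/.
ARTIFACT_PATHS = ["dataset", "deepchecks", "checkpoint", "training"]

def fetch_artifacts(run_id: str, tracking_uri: str) -> dict[str, str | None]:
    """Download each artifact type in parallel into an ephemeral temp dir."""
    mlflow.set_tracking_uri(tracking_uri)
    tmp_dir = tempfile.mkdtemp(prefix=f"deepfix-{run_id}-")

    def download(path: str) -> tuple[str, str | None]:
        try:
            local = mlflow.artifacts.download_artifacts(
                run_id=run_id, artifact_path=path, dst_path=tmp_dir
            )
            return path, local
        except Exception:
            # Missing artifacts are tolerated: analysis degrades gracefully.
            return path, None

    with ThreadPoolExecutor(max_workers=len(ARTIFACT_PATHS)) as pool:
        results = dict(pool.map(download, ARTIFACT_PATHS))
    # Caller deletes tmp_dir after the analysis completes (ephemeral storage).
    return results
```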
2. AI-Powered Analysis
- Execute multi-agent analysis pipeline
- Coordinate agent execution (parallel where possible)
- Aggregate agent results into unified output
- Generate natural language summaries
NOT responsible for: Training models or computing metrics
Implementation Notes:
- Reuse agent system from deepfix-server/agents/
- Maintain agent execution order:
  - Parallel: TrainingAnalyzer, DatasetAnalyzer, DeepchecksAnalyzer
  - Sequential: CrossArtifactIntegration → OptimizationAdvisor
- Implement timeout handling per agent
- Support partial results when some agents fail
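A sketch of the two-stage orchestration described above, using asyncio for the parallel stage with per-agent timeouts; the agent names match the list, but `AGENT_TIMEOUT_S` and the `Agent` call signature are assumptions:

```python
import asyncio
from typing import Any, Awaitable, Callable

Agent = Callable[[dict], Awaitable[Any]]
AGENT_TIMEOUT_S = 120  # assumed per-agent budget; the overall request cap is 300 s

async def run_agent(name: str, agent: Agent, ctx: dict) -> tuple[str, Any]:
    try:
        return name, await asyncio.wait_for(agent(ctx), timeout=AGENT_TIMEOUT_S)
    except Exception as exc:
        # A failed or timed-out agent yields a partial result, not a hard failure.
        return name, {"error": str(exc)}

async def run_pipeline(agents: dict[str, Agent], ctx: dict) -> dict[str, Any]:
    # Stage 1: independent analyzers run concurrently.
    parallel = ["TrainingAnalyzer", "DatasetAnalyzer", "DeepchecksAnalyzer"]
    results = dict(
        await asyncio.gather(*(run_agent(n, agents[n], ctx) for n in parallel))
    )

    # Stage 2: integration then advice, each seeing all earlier outputs.
    for name in ["CrossArtifactIntegration", "OptimizationAdvisor"]:
        ctx = {**ctx, "prior_results": results}
        key, value = await run_agent(name, agents[name], ctx)
        results[key] = value
    return results
```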
3. Knowledge Retrieval
- Query knowledge base using KnowledgeBridge
- Cache knowledge retrieval results
- Validate retrieved knowledge against agent context
- Provide knowledge citations in responses
NOT responsible for: Knowledge base updates or curation
Implementation Notes:
- Reuse KnowledgeBridge from deepfix-kb/
- Implement in-memory LRU cache for query results
  - Cache key: hash(query + domain + query_type)
  - TTL: 24 hours
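A sketch of the in-memory cache described in these notes (LRU eviction, hashed cache key, 24-hour TTL); the class name and the 1000-entry default (from the resource constraints below) are illustrative:

```python
import hashlib
import time
from collections import OrderedDict

class KnowledgeCache:
    """In-memory LRU cache with TTL for knowledge query results (sketch)."""

    def __init__(self, max_entries: int = 1000, ttl_s: float = 24 * 3600):
        self._store: OrderedDict[str, tuple[float, object]] = OrderedDict()
        self.max_entries = max_entries
        self.ttl_s = ttl_s

    @staticmethod
    def key(query: str, domain: str, query_type: str) -> str:
        # Cache key: hash(query + domain + query_type)
        return hashlib.sha256(f"{query}|{domain}|{query_type}".encode()).hexdigest()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.monotonic() - ts > self.ttl_s:
            del self._store[key]          # expired
            return None
        self._store.move_to_end(key)      # mark as recently used
        return value

    def put(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```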
4. Result Formatting
- Transform agent results into API response format
- Generate natural language summaries
- Prioritize findings by severity and confidence
- Format recommendations with actionable steps
NOT responsible for: Result persistence or visualization
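For illustration, prioritization could be a simple sort over the severity and confidence fields shown in the response format below; the severity ranking itself is an assumption:

```python
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}  # assumed levels

def prioritize(findings: list[dict]) -> list[dict]:
    """Order findings by severity first, then by descending confidence."""
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_RANK.get(f.get("severity", "low"), 3),
            -f.get("confidence", 0.0),
        ),
    )
```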
5. Error Handling
- Validate incoming requests against schema
- Handle MLflow connection failures
- Manage agent execution errors
- Provide detailed error messages with recovery suggestions
NOT responsible for: Client-side error recovery
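As a sketch of request validation, assuming a Pydantic model that mirrors the request format shown under Communication Protocol (field names come from that example; which fields are optional is an assumption):

```python
from pydantic import BaseModel, ValidationError

class AnalyseRequest(BaseModel):
    dataset_name: str
    model_name: str
    dataset_artifacts: dict | None = None
    deepchecks_artifacts: dict | None = None
    model_checkpoint_artifacts: dict | None = None
    training_artifacts: dict | None = None
    language: str = "english"

def validate_request(payload: dict) -> AnalyseRequest:
    try:
        return AnalyseRequest(**payload)
    except ValidationError as exc:
        # Surface field-level details so the client can fix its request (400).
        raise ValueError(f"400 Bad Request: {exc.errors()}") from exc
```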
Server Constraints
Performance Constraints
| Metric | Constraint | Rationale |
|---|---|---|
| Response Time | <60s (typical) | User expectation for interactive analysis |
| Timeout | 300s (max) | Hard limit to prevent resource exhaustion |
| Concurrent Requests | 10 simultaneous | Local-first deployment assumption |
| Memory Usage | <4GB per analysis | Supports deployment on standard machines |
| Knowledge Query Time | <2s | Interactive query experience |
Resource Constraints
- Cache Size: Max 1GB or 1000 entries (LRU eviction)
- Artifact Storage: Temporary only, cleaned after analysis
- Database: No persistent state (optional SQLite for artifact tracking)
- Network: Must handle an MLflow server running on a different host
Operational Constraints
- Stateless Design: No session state between requests
- No User Authentication: Relies on network security (authentication to be added later)
- No Artifact Storage: Server doesn't persist artifacts beyond analysis
- Single Tenant: No multi-tenancy support in v1
Compatibility Constraints
- Python Version: 3.11+
- MLflow Version: Compatible with 2.0+
- Artifact Formats: Must handle legacy and current formats
- API Version: Semantic versioning (v1 = stable)
Server Boundaries
What Server DOES:
- ✅ Fetch artifacts from MLflow
- ✅ Run AI analysis on artifacts
- ✅ Query and cache knowledge
- ✅ Return structured results
- ✅ Report health status for monitoring
What Server DOES NOT:
- ❌ Compute or generate artifacts
- ❌ Store artifacts permanently
- ❌ Log to MLflow
- ❌ Train models
- ❌ Manage user sessions
- ❌ Persist analysis history (v1)
- ❌ Update knowledge base (v1)
Client Responsibilities
The DeepFix SDK (client) is responsible for:
1. Artifact Computation
- Generate datasets, deepchecks reports, model checkpoints
- Compute training metrics and logs
- Run data quality checks
- Extract dataset statistics
Implementation Notes:
- Use existing data loading and processing pipelines
- Integrate with PyTorch Lightning for training artifacts
- Use Deepchecks for data quality reports
- Generate artifacts in MLflow-compatible formats
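One possible flow for the data quality step, using the public Deepchecks tabular API; the function name and report path are illustrative, and the SDK's actual pipelines may differ:

```python
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

def build_quality_report(df: pd.DataFrame, label_col: str, out_path: str) -> str:
    """Run a Deepchecks integrity suite and save an MLflow-loggable report."""
    ds = Dataset(df, label=label_col)
    result = data_integrity().run(ds)
    result.save_as_html(out_path)  # later logged via mlflow.log_artifact
    return out_path
```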
2. Artifact Recording
- Store artifacts in MLflow tracking server
- Tag artifacts with metadata (dataset_name, model_name, etc.)
- Version artifacts appropriately
- Handle artifact storage failures
Implementation Notes:
- Use MLflow Python API for artifact logging
- Implement retry logic for failed uploads
- Support offline mode with local artifact storage
- Clean up temporary artifacts after successful upload
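A sketch of retry-wrapped logging with the MLflow Python API; the helper name and backoff parameters are assumptions:

```python
import time

import mlflow

def log_with_retry(local_path: str, artifact_path: str, retries: int = 3) -> None:
    """Log an artifact to the active MLflow run, retrying transient failures."""
    delay = 1.0
    for attempt in range(retries):
        try:
            mlflow.log_artifact(local_path, artifact_path=artifact_path)
            return
        except Exception:
            if attempt == retries - 1:
                raise  # an offline fallback would keep the local copy instead
            time.sleep(delay)
            delay *= 2  # exponential backoff

# Usage, inside an active run:
# with mlflow.start_run():
#     log_with_retry("reports/data_quality.html", "deepchecks")
```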
3. Workflow Integration
- Integrate with PyTorch Lightning callbacks
- Integrate with MLflow experiments
- Support Jupyter notebooks and scripts
- Provide command-line interface
Implementation Notes:
- Create Lightning callback for automatic analysis
- Provide context managers for MLflow integration
- Support both synchronous and asynchronous workflows
- Handle workflow-specific errors gracefully
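A sketch of the Lightning callback idea, assuming a hypothetical SDK client object with an `analyse` method:

```python
import pytorch_lightning as pl

class DeepFixCallback(pl.Callback):
    """Trigger a DeepFix analysis when training finishes (sketch)."""

    def __init__(self, client, dataset_name: str, model_name: str):
        self.client = client  # hypothetical DeepFix SDK client
        self.dataset_name = dataset_name
        self.model_name = model_name

    def on_fit_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        # With MLFlowLogger, the run id ties the analysis to the logged artifacts.
        run_id = getattr(trainer.logger, "run_id", None)
        self.client.analyse(
            dataset_name=self.dataset_name,
            model_name=self.model_name,
            run_id=run_id,
        )
```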
4. Client Communication
- Send analysis requests to DeepFix server
- Handle server responses and errors
- Implement retry logic for transient failures
- Support offline mode with graceful degradation
Implementation Notes:
- Use requests library for HTTP communication
- Implement exponential backoff retry strategy
- Cache server responses when appropriate
- Provide clear error messages to users
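A sketch of the request path with exponential backoff; the server URL, retry count, and which status codes count as transient are assumptions:

```python
import time

import requests

SERVER_URL = "http://localhost:8000"  # assumed local-first default

def request_analysis(payload: dict, retries: int = 3, timeout: float = 300.0) -> dict:
    """POST to /v1/analyse, retrying transient failures with backoff."""
    delay = 1.0
    for attempt in range(retries):
        try:
            resp = requests.post(
                f"{SERVER_URL}/v1/analyse", json=payload, timeout=timeout
            )
            if resp.status_code in (503, 504):  # transient: worth retrying
                raise requests.ConnectionError(f"transient {resp.status_code}")
            resp.raise_for_status()             # other 4xx/5xx → exception
            return resp.json()
        except (requests.ConnectionError, requests.Timeout):
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff
```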
5. Result Processing
- Parse server responses
- Format results for display
- Integrate results into workflows
- Store results in MLflow (optional)
NOT responsible for: Running AI analysis or querying knowledge base
Client Boundaries
What Client DOES:
- ✅ Compute artifacts (datasets, checks, metrics)
- ✅ Store artifacts in MLflow
- ✅ Send analysis requests
- ✅ Display results to users
- ✅ Integrate with ML workflows
What Client DOES NOT:
- ❌ Run AI analysis
- ❌ Query knowledge base directly
- ❌ Manage server state
- ❌ Persist analysis results (beyond the optional MLflow logging noted above)
Communication Protocol
REST API
The client and server communicate via REST API:
Endpoint: POST /v1/analyse
Request Format:
```json
{
  "dataset_name": "my-dataset",
  "model_name": "my-model",
  "dataset_artifacts": {
    "metadata": {...},
    "statistics": {...}
  },
  "deepchecks_artifacts": {
    "reports": [...],
    "checks": [...]
  },
  "model_checkpoint_artifacts": {
    "checkpoint_path": "...",
    "metadata": {...}
  },
  "training_artifacts": {
    "metrics": {...},
    "logs": [...]
  },
  "language": "english"
}
```
Response Format:
```json
{
  "agent_results": {
    "DatasetArtifactsAnalyzer": {
      "findings": [...],
      "confidence": 0.95,
      "severity": "high"
    },
    "DeepchecksArtifactsAnalyzer": {...},
    ...
  },
  "summary": "Cross-artifact summary...",
  "additional_outputs": {
    "recommendations": [...],
    "citations": [...]
  },
  "error_messages": {}
}
```
Error Handling
Server Errors:
- 400 Bad Request: Invalid request format
- 404 Not Found: Artifacts not found in MLflow
- 500 Internal Server Error: Server processing error
- 503 Service Unavailable: Server overloaded or unavailable
- 504 Gateway Timeout: Request timeout
Client Errors:
- Connection errors: Retry with exponential backoff
- Timeout errors: Increase timeout or retry
- Validation errors: Fix request format
- Server errors: Log error and notify user
Workflow Patterns
1. Synchronous Analysis
Use Case: Immediate feedback after training
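A minimal illustration of the synchronous pattern (server URL assumed to be a local default):

```python
import requests

payload = {"dataset_name": "my-dataset", "model_name": "my-model", "language": "english"}

# Block until the server answers (bounded by the 300 s server timeout).
resp = requests.post("http://localhost:8000/v1/analyse", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["summary"])
```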
2. Asynchronous Analysis
Use Case: Long-running analysis
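One way to sketch this on the client: run the blocking request in a background thread and join only when the result is needed:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def analyse(payload: dict) -> dict:
    resp = requests.post("http://localhost:8000/v1/analyse", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

# Submit the request in the background and keep working; join when needed.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(analyse, {"dataset_name": "my-dataset", "model_name": "my-model"})
    # ... other post-training work here ...
    result = future.result()  # blocks only when the summary is actually needed
```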
3. Batch Analysis
Use Case: Analyzing multiple experiments
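An illustrative batch loop over hypothetical experiments; sequential submission stays within the server's assumed 10-concurrent-request limit, while a small thread pool could parallelize within that bound:

```python
import requests

EXPERIMENTS = [("dataset-a", "model-a"), ("dataset-b", "model-b")]  # hypothetical list

summaries = {}
for dataset_name, model_name in EXPERIMENTS:
    payload = {"dataset_name": dataset_name, "model_name": model_name, "language": "english"}
    resp = requests.post("http://localhost:8000/v1/analyse", json=payload, timeout=300)
    if resp.ok:
        summaries[(dataset_name, model_name)] = resp.json()["summary"]
```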
Design Rationale
Why Client-Server?
- Separation of Concerns: Clear separation of computation and analysis
- Scalability: Independent scaling of analysis service
- Flexibility: Client can work offline with graceful degradation
- Maintainability: Easier to update and maintain components
Why Stateless Server?
- Scalability: Easy horizontal scaling
- Reliability: No session state to manage
- Simplicity: Easier deployment and maintenance
- Fault Tolerance: No state corruption issues
Why MLflow for Artifacts?
- Standardization: Industry-standard artifact storage
- Integration: Works with existing ML workflows
- Persistence: Reliable artifact storage and versioning
- Tooling: Rich ecosystem of tools
Why In-Memory Cache?
- Performance: Fast knowledge retrieval
- Simplicity: No external cache dependency
- Upgradeable: Can migrate to Redis if needed
- Cost: No additional infrastructure needed
Related Documentation
- Architecture Overview - High-level system architecture
- Agent System - Agent architecture
- API Reference - API documentation
Note: Workflow and Service specifications are available in the specs/ directory at the repository root.