Architecture Overview
This document provides a high-level overview of the DeepFix architecture, including system components, data flow, and design principles.
System Overview
DeepFix is a distributed system for AI-powered ML artifact analysis. It follows a client-server architecture that separates artifact computation from intelligent analysis.
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ DeepFix SDK │────────▶│ DeepFix Server │────────▶│ LLM Provider │
│ (Client) │ │ (Analysis) │ │ (OpenAI/etc) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ MLflow │ │ Knowledge Base │
│ (Artifact Store)│ │ (Best Practices)│
└─────────────────┘ └──────────────────┘
Core Components
1. DeepFix SDK (Client)
The SDK is responsible for:
- Artifact Computation: Generating datasets, running checks, collecting metrics
- Artifact Recording: Storing artifacts in MLflow
- Workflow Integration: Integrating with PyTorch Lightning, MLflow, etc.
- Client Communication: Sending analysis requests to the server
Location: deepfix-sdk/
2. DeepFix Server
The server is responsible for:
- Artifact Retrieval: Fetching artifacts from MLflow
- AI-Powered Analysis: Running specialized analysis agents
- Knowledge Retrieval: Querying best practices knowledge base
- Result Synthesis: Combining agent results into actionable insights
Location: deepfix-server/
3. DeepFix Core
Shared models and types:
- Data Models: APIRequest, APIResponse, artifact types
- Type Definitions: DataType, ArtifactPath, etc.
Location: deepfix-core/
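The shared models can be pictured as plain data containers. The sketch below uses stdlib dataclasses in place of pydantic (which deepfix-core actually uses) so it stays dependency-free; field names mirror the request/response formats documented in the Communication Protocol section, but defaults and types here are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Illustrative stand-ins for the deepfix-core data models. The real
# models are pydantic classes; dataclasses keep this sketch self-contained.

@dataclass
class APIRequest:
    dataset_name: str
    dataset_artifacts: Optional[dict[str, Any]] = None
    deepchecks_artifacts: Optional[dict[str, Any]] = None
    model_checkpoint_artifacts: Optional[dict[str, Any]] = None
    training_artifacts: Optional[dict[str, Any]] = None
    language: str = "english"

@dataclass
class APIResponse:
    agent_results: dict[str, Any] = field(default_factory=dict)
    summary: str = ""
    additional_outputs: dict[str, Any] = field(default_factory=dict)
    error_messages: dict[str, str] = field(default_factory=dict)

req = APIRequest(dataset_name="my-dataset")
print(req.language)
```

Keeping these shapes in deepfix-core means client and server validate against the same contract.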
4. Knowledge Base
Stores best practices and domain knowledge:
- Architecture Best Practices: Model design patterns
- Data Quality Best Practices: Dataset quality standards
- Training Best Practices: Training optimization strategies
Location: deepfix-kb/, documents/
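Conceptually, knowledge retrieval ranks best-practice snippets against an agent's query. The server uses a BM25 retriever for this; the sketch below substitutes a toy term-overlap scorer so it runs standalone, and the snippet texts are invented examples, not actual knowledge-base content.

```python
# Toy stand-in for the BM25-based knowledge retriever: rank best-practice
# snippets by how many query terms they share with the snippet. The real
# server uses llama-index-retrievers-bm25; these documents are invented.

KNOWLEDGE_BASE = [
    "Prefer batch normalization after convolution layers.",
    "Check for class imbalance before training a classifier.",
    "Use early stopping to avoid overfitting during training.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

print(retrieve("overfitting during training"))
```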
Architecture Principles
1. Separation of Concerns
- Client: Handles computation and workflow integration
- Server: Focuses on AI-powered analysis
- Clear boundaries between components
2. Stateless Server
- No session state between requests
- Enables horizontal scaling
- Easier deployment and maintenance
3. Artifact Storage
- MLflow as the single source of truth for artifacts
- Server pulls artifacts on-demand
- Client handles artifact generation and storage
4. Agentic Analysis
- Specialized agents for different artifact types
- Parallel agent execution where possible
- Cross-artifact reasoning for holistic insights
5. Local-First Design
- Designed for local deployment
- Can scale to cloud if needed
- Minimal external dependencies
Data Flow
Analysis Request Flow
1. Client computes artifacts (datasets, checks, metrics)
↓
2. Client stores artifacts in MLflow
↓
3. Client sends analysis request to server
↓
4. Server retrieves artifacts from MLflow
↓
5. Server runs analysis agents in parallel
↓
6. Server queries knowledge base
↓
7. Server synthesizes results
↓
8. Server returns structured response to client
↓
9. Client displays results to user
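On the client side, steps 1-3 of this flow amount to assembling a payload and POSTing it to /v1/analyse. The sketch below builds the request body only (no MLflow or network call); the helper name and artifact contents are illustrative placeholders.

```python
import json

# Steps 1-3 of the analysis request flow, client side: compute artifacts
# (MLflow storage omitted) and build the POST body for /v1/analyse.
# Artifact contents here are illustrative placeholders.

def build_analysis_request(dataset_name: str, artifacts: dict) -> str:
    payload = {
        "dataset_name": dataset_name,
        "dataset_artifacts": artifacts.get("dataset"),
        "deepchecks_artifacts": artifacts.get("deepchecks"),
        "model_checkpoint_artifacts": artifacts.get("checkpoint"),
        "training_artifacts": artifacts.get("training"),
        "language": "english",
    }
    return json.dumps(payload)

body = build_analysis_request(
    "my-dataset",
    {"dataset": {"n_rows": 1000}, "training": {"epochs": 10}},
)
print(body)
```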
Agent Execution Flow
AnalyseArtifactsAPI
↓
AgentContext (decode request)
↓
ArtifactAnalysisCoordinator
↓
┌─────────────────────────────────────┐
│ Parallel Agent Execution │
│ - DatasetArtifactsAnalyzer │
│ - DeepchecksArtifactsAnalyzer │
│ - ModelCheckpointArtifactsAnalyzer │
│ - TrainingArtifactsAnalyzer │
└─────────────────────────────────────┘
↓
CrossArtifactReasoningAgent (sequential)
↓
Synthesize results
↓
APIResponse
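The parallel-then-sequential pattern above can be sketched with asyncio: the four artifact analyzers run concurrently via gather, then the cross-artifact step runs once over their combined output. The agent bodies here are stubs, not the real analyzers.

```python
import asyncio

# Parallel artifact analyzers followed by a sequential cross-artifact
# step, mirroring the agent execution flow. Each agent is a stub that
# returns a placeholder finding in place of real LLM-backed analysis.

async def run_agent(name: str) -> tuple[str, str]:
    await asyncio.sleep(0)  # stand-in for I/O-bound LLM calls
    return name, f"findings from {name}"

async def analyse() -> dict:
    agents = [
        "DatasetArtifactsAnalyzer",
        "DeepchecksArtifactsAnalyzer",
        "ModelCheckpointArtifactsAnalyzer",
        "TrainingArtifactsAnalyzer",
    ]
    # Parallel agent execution
    results = dict(await asyncio.gather(*(run_agent(a) for a in agents)))
    # Cross-artifact reasoning runs sequentially over all agent results
    summary = f"cross-artifact summary over {len(results)} agents"
    return {"agent_results": results, "summary": summary}

response = asyncio.run(analyse())
print(response["summary"])
```

Because the analyzers are independent until the cross-artifact step, gather gives the parallelism for free while keeping the final synthesis strictly ordered.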
Technology Stack
Client (SDK)
- Language: Python 3.11+
- Key Libraries:
  - `requests` for HTTP communication
  - `mlflow` for artifact tracking
  - `pydantic` for data validation
Server
- Language: Python 3.11+
- Framework: FastAPI (via LitServe)
- Key Libraries:
  - `dspy` for LLM orchestration
  - `litserve` for API serving
  - `pydantic` v2 for validation
  - `llama-index-retrievers-bm25` for knowledge retrieval
Core
- Language: Python 3.11+
- Key Libraries:
  - `pydantic` for data models
Communication Protocol
REST API
- Protocol: HTTP/HTTPS
- Format: JSON
- Endpoints:
`POST /v1/analyse` - Analyze artifacts
Request Format
{
"dataset_name": "my-dataset",
"dataset_artifacts": {...},
"deepchecks_artifacts": {...},
"model_checkpoint_artifacts": {...},
"training_artifacts": {...},
"language": "english"
}
Response Format
{
"agent_results": {
"DatasetArtifactsAnalyzer": {...},
"DeepchecksArtifactsAnalyzer": {...},
...
},
"summary": "Cross-artifact summary",
"additional_outputs": {...},
"error_messages": {...}
}
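On the client, handling this response mostly means surfacing per-agent findings and any error messages. A minimal parsing sketch, assuming the JSON shape shown above (the sample payload content is invented):

```python
import json

# Minimal client-side handling of the /v1/analyse response shape shown
# above. Field names follow the documented format; the sample content
# is invented for illustration.

def summarise_response(raw: str) -> list[str]:
    resp = json.loads(raw)
    lines = [f"Summary: {resp.get('summary', '')}"]
    for agent, result in resp.get("agent_results", {}).items():
        lines.append(f"{agent}: {result}")
    for agent, err in resp.get("error_messages", {}).items():
        lines.append(f"ERROR in {agent}: {err}")
    return lines

sample = json.dumps({
    "agent_results": {"DatasetArtifactsAnalyzer": "no issues found"},
    "summary": "Cross-artifact summary",
    "additional_outputs": {},
    "error_messages": {},
})
print("\n".join(summarise_response(sample)))
```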
Deployment Architecture
Local Deployment
┌─────────────────────────────────────┐
│ Local Machine │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Client │───▶│ Server │ │
│ └──────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ MLflow │ │Knowledge Base│ │
│ └──────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────┘
Docker Deployment
┌─────────────────────────────────────┐
│ Docker Compose │
│ │
│ ┌──────────────┐ ┌─────────────┐ │
│ │ deepfix- │ │ mlflow │ │
│ │ server │ │ server │ │
│ │ container │ │ container │ │
│ └──────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────┘
Design Decisions
Why Client-Server?
- Scalability: Independent scaling of analysis service
- Separation: Clear separation of computation and analysis
- Flexibility: Client can work offline with graceful degradation
Why MLflow for Artifacts?
- Standardization: Industry-standard artifact storage
- Integration: Works with existing ML workflows
- Persistence: Reliable artifact storage and versioning
Why Stateless Server?
- Scalability: Easy horizontal scaling
- Reliability: No session state to manage
- Simplicity: Easier to deploy and maintain
Why Agentic Architecture?
- Specialization: Each agent focuses on specific artifact type
- Parallelism: Agents can run in parallel
- Extensibility: Easy to add new agents
Future Extensions
Planned Enhancements
- Multi-Tenancy: Support for multiple users/tenants
- Authentication: User authentication and authorization
- Result Persistence: Store analysis history
- Streaming Analysis: Real-time analysis updates
- Cloud Deployment: Support for cloud platforms
Scalability Path
- Current: Single server, local deployment
- Next: Horizontal scaling with load balancer
- Future: Distributed agents with message queue
Related Documentation
- Client-Server Architecture - Detailed client-server design
- Agent System - Agent architecture and execution
- API Reference - API documentation
- Deployment Guide - Deployment instructions