Architecture Overview

This document provides a high-level overview of the DeepFix architecture, including system components, data flow, and design principles.

System Overview

DeepFix is a distributed system for AI-powered ML artifact analysis. It follows a client-server architecture that separates artifact computation from intelligent analysis.

┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│   DeepFix SDK   │────────▶│  DeepFix Server  │────────▶│  LLM Provider   │
│   (Client)      │         │   (Analysis)     │         │   (OpenAI/etc)  │
└─────────────────┘         └──────────────────┘         └─────────────────┘
        │                            │
        │                            │
        ▼                            ▼
┌─────────────────┐         ┌──────────────────┐
│     MLflow      │         │  Knowledge Base  │
│ (Artifact Store)│         │ (Best Practices) │
└─────────────────┘         └──────────────────┘

Core Components

1. DeepFix SDK (Client)

The SDK is responsible for:

  • Artifact Computation: Generating datasets, running checks, collecting metrics
  • Artifact Recording: Storing artifacts in MLflow
  • Workflow Integration: Integrating with PyTorch Lightning, MLflow, etc.
  • Client Communication: Sending analysis requests to the server
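
The first two responsibilities can be sketched with a minimal, dependency-free example. The function names and artifact fields here are illustrative, not the SDK's real API; the real SDK would hand the file to MLflow (e.g. via mlflow.log_artifact) rather than just writing it to a temp directory.

```python
import json
import tempfile
from pathlib import Path

def compute_training_artifacts(history: list[dict]) -> dict:
    """Collect summary metrics from a training history (illustrative only)."""
    final = history[-1]
    return {
        "epochs": len(history),
        "final_train_loss": final["train_loss"],
        "best_val_loss": min(h["val_loss"] for h in history),
    }

def record_artifact(artifact: dict, name: str, root: Path) -> Path:
    """Write the artifact to disk; the real SDK would log this file
    to MLflow instead of a local directory."""
    path = root / f"{name}.json"
    path.write_text(json.dumps(artifact, indent=2))
    return path

history = [
    {"train_loss": 0.9, "val_loss": 1.0},
    {"train_loss": 0.4, "val_loss": 0.5},
]
artifacts = compute_training_artifacts(history)
path = record_artifact(artifacts, "training_artifacts", Path(tempfile.mkdtemp()))
```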

Location: deepfix-sdk/

2. DeepFix Server

The server is responsible for:

  • Artifact Retrieval: Fetching artifacts from MLflow
  • AI-Powered Analysis: Running specialized analysis agents
  • Knowledge Retrieval: Querying best practices knowledge base
  • Result Synthesis: Combining agent results into actionable insights

Location: deepfix-server/

3. DeepFix Core

Shared models and types:

  • Data Models: APIRequest, APIResponse, artifact types
  • Type Definitions: DataType, ArtifactPath, etc.

Location: deepfix-core/

4. Knowledge Base

Stores best practices and domain knowledge:

  • Architecture Best Practices: Model design patterns
  • Data Quality Best Practices: Dataset quality standards
  • Training Best Practices: Training optimization strategies

Location: deepfix-kb/, documents/

Architecture Principles

1. Separation of Concerns

  • Client: Handles computation and workflow integration
  • Server: Focuses on AI-powered analysis
  • Clear boundaries between components

2. Stateless Server

  • No session state between requests
  • Enables horizontal scaling
  • Easier deployment and maintenance

3. Artifact Storage

  • MLflow as the single source of truth for artifacts
  • Server pulls artifacts on-demand
  • Client handles artifact generation and storage

4. Agentic Analysis

  • Specialized agents for different artifact types
  • Parallel agent execution where possible
  • Cross-artifact reasoning for holistic insights

5. Local-First Design

  • Designed for local deployment
  • Can scale to cloud if needed
  • Minimal external dependencies

Data Flow

Analysis Request Flow

1. Client computes artifacts (datasets, checks, metrics)
2. Client stores artifacts in MLflow
3. Client sends analysis request to server
4. Server retrieves artifacts from MLflow
5. Server runs analysis agents in parallel
6. Server queries knowledge base
7. Server synthesizes results
8. Server returns structured response to client
9. Client displays results to user
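
The nine steps above can be sketched end to end. Every function below is a stub standing in for the real SDK or server behaviour; only the shape of the flow (client computes and stores, server retrieves, analyses, and synthesizes) is taken from the document.

```python
# End-to-end sketch of the analysis request flow; all functions are stubs.

def compute_artifacts() -> dict:                      # step 1 (client)
    return {"training_artifacts": {"final_loss": 0.12}}

def store_in_mlflow(artifacts: dict) -> str:          # step 2 (client)
    return "run-123"  # run id the server can use to fetch artifacts

def retrieve_from_mlflow(run_id: str) -> dict:        # step 4 (server)
    return {"training_artifacts": {"final_loss": 0.12}}

def run_agents(artifacts: dict) -> dict:              # step 5 (server)
    return {name: f"analysed {name}" for name in artifacts}

def synthesize(agent_results: dict) -> dict:          # steps 6-8 (server)
    return {"agent_results": agent_results,
            "summary": f"{len(agent_results)} artifact type(s) analysed"}

# Steps 3 and 9: the client sends the request and displays the response.
run_id = store_in_mlflow(compute_artifacts())
response = synthesize(run_agents(retrieve_from_mlflow(run_id)))
```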

Agent Execution Flow

AnalyseArtifactsAPI
         │
         ▼
AgentContext (decode request)
         │
         ▼
ArtifactAnalysisCoordinator
         │
         ▼
┌─────────────────────────────────────┐
│  Parallel Agent Execution           │
│  - DatasetArtifactsAnalyzer         │
│  - DeepchecksArtifactsAnalyzer      │
│  - ModelCheckpointArtifactsAnalyzer │
│  - TrainingArtifactsAnalyzer        │
└─────────────────────────────────────┘
         │
         ▼
CrossArtifactReasoningAgent (sequential)
         │
         ▼
Synthesize results
         │
         ▼
APIResponse
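
The two-phase pattern (parallel per-artifact agents, then a sequential cross-artifact pass) can be illustrated with stub agents and a thread pool. The agent bodies are placeholders; the real agents wrap LLM calls via dspy.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub analysis agents; the real agents wrap LLM calls.
def dataset_agent(artifacts):    return {"issues": []}
def deepchecks_agent(artifacts): return {"issues": ["train/test drift"]}
def training_agent(artifacts):   return {"issues": ["lr too high"]}

AGENTS = {
    "DatasetArtifactsAnalyzer": dataset_agent,
    "DeepchecksArtifactsAnalyzer": deepchecks_agent,
    "TrainingArtifactsAnalyzer": training_agent,
}

def coordinate(artifacts: dict) -> dict:
    # Phase 1: run the per-artifact agents in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(agent, artifacts)
                   for name, agent in AGENTS.items()}
        results = {name: f.result() for name, f in futures.items()}
    # Phase 2: cross-artifact reasoning runs sequentially over all results.
    all_issues = [i for r in results.values() for i in r["issues"]]
    results["CrossArtifactReasoningAgent"] = {"summary": all_issues}
    return results

results = coordinate({"training_artifacts": {}})
```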

Technology Stack

Client (SDK)

  • Language: Python 3.11+
  • Key Libraries:
      • requests for HTTP communication
      • mlflow for artifact tracking
      • pydantic for data validation

Server

  • Language: Python 3.11+
  • Framework: FastAPI (via LitServe)
  • Key Libraries:
      • dspy for LLM orchestration
      • litserve for API serving
      • pydantic v2 for validation
      • llama-index-retrievers-bm25 for knowledge retrieval

Core

  • Language: Python 3.11+
  • Key Libraries:
      • pydantic for data models

Communication Protocol

REST API

  • Protocol: HTTP/HTTPS
  • Format: JSON
  • Endpoints:
      • POST /v1/analyse - Analyze artifacts
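
A client call to this endpoint might look like the following standard-library sketch. The server URL and port are assumptions; the real SDK may use requests rather than urllib.

```python
import json
import urllib.request

def analyse(server_url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for the /v1/analyse endpoint.
    urllib.request.urlopen(req) would actually send it."""
    return urllib.request.Request(
        f"{server_url}/v1/analyse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = analyse("http://localhost:8000",
              {"dataset_name": "my-dataset", "language": "english"})
```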

Request Format

{
  "dataset_name": "my-dataset",
  "dataset_artifacts": {...},
  "deepchecks_artifacts": {...},
  "model_checkpoint_artifacts": {...},
  "training_artifacts": {...},
  "language": "english"
}
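
A dependency-free mirror of this schema can be written with dataclasses; note this is a sketch for illustration, since the real APIRequest model in deepfix-core uses pydantic for validation.

```python
from dataclasses import dataclass, field, asdict

# Illustrative mirror of the request schema (the real model is pydantic).
@dataclass
class AnalysisRequest:
    dataset_name: str
    dataset_artifacts: dict = field(default_factory=dict)
    deepchecks_artifacts: dict = field(default_factory=dict)
    model_checkpoint_artifacts: dict = field(default_factory=dict)
    training_artifacts: dict = field(default_factory=dict)
    language: str = "english"

req = AnalysisRequest("my-dataset",
                      training_artifacts={"final_loss": 0.12})
payload = asdict(req)  # JSON-ready dict matching the request format above
```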

Response Format

{
  "agent_results": {
    "DatasetArtifactsAnalyzer": {...},
    "DeepchecksArtifactsAnalyzer": {...},
    ...
  },
  "summary": "Cross-artifact summary",
  "additional_outputs": {...},
  "error_messages": {...}
}

Deployment Architecture

Local Deployment

┌─────────────────────────────────────┐
│         Local Machine               │
│                                     │
│  ┌──────────┐    ┌──────────────┐  │
│  │  Client  │───▶│    Server    │  │
│  └──────────┘    └──────────────┘  │
│       │                 │           │
│       ▼                 ▼           │
│  ┌──────────┐    ┌──────────────┐  │
│  │  MLflow  │    │Knowledge Base│  │
│  └──────────┘    └──────────────┘  │
│                                     │
└─────────────────────────────────────┘

Docker Deployment

┌─────────────────────────────────────┐
│      Docker Compose                 │
│                                     │
│  ┌──────────────┐  ┌─────────────┐ │
│  │ deepfix-     │  │ mlflow      │ │
│  │ server       │  │ server      │ │
│  │ container    │  │ container   │ │
│  └──────────────┘  └─────────────┘ │
│                                     │
└─────────────────────────────────────┘
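
A compose file for this setup might look like the fragment below. The service names match the diagram, but the images, ports, and environment variables are assumptions, not the shipped configuration.

```yaml
# Illustrative docker-compose.yml; images, ports, and env vars are guesses.
services:
  deepfix-server:
    build: ./deepfix-server
    ports:
      - "8000:8000"
    environment:
      MLFLOW_TRACKING_URI: http://mlflow:5000
      OPENAI_API_KEY: ${OPENAI_API_KEY}
    depends_on:
      - mlflow
  mlflow:
    image: ghcr.io/mlflow/mlflow
    command: mlflow server --host 0.0.0.0 --port 5000
    ports:
      - "5000:5000"
```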

Design Decisions

Why Client-Server?

  • Scalability: Independent scaling of analysis service
  • Separation: Clear separation of computation and analysis
  • Flexibility: Client can work offline with graceful degradation

Why MLflow for Artifacts?

  • Standardization: Industry-standard artifact storage
  • Integration: Works with existing ML workflows
  • Persistence: Reliable artifact storage and versioning

Why Stateless Server?

  • Scalability: Easy horizontal scaling
  • Reliability: No session state to manage
  • Simplicity: Easier to deploy and maintain

Why Agentic Architecture?

  • Specialization: Each agent focuses on specific artifact type
  • Parallelism: Agents can run in parallel
  • Extensibility: Easy to add new agents

Future Extensions

Planned Enhancements

  1. Multi-Tenancy: Support for multiple users/tenants
  2. Authentication: User authentication and authorization
  3. Result Persistence: Store analysis history
  4. Streaming Analysis: Real-time analysis updates
  5. Cloud Deployment: Support for cloud platforms

Scalability Path

  1. Current: Single server, local deployment
  2. Next: Horizontal scaling with load balancer
  3. Future: Distributed agents with message queue