Model Evaluation
This guide covers best practices for evaluating AI models deployed through the PaiTIENT Secure Model Service to ensure they meet your quality, performance, and compliance requirements.
Evaluation Overview
Proper model evaluation is crucial for ensuring that deployed models:
- Meet Performance Requirements: Accuracy, latency, and throughput
- Comply with Regulations: HIPAA, SOC 2, and other regulatory standards
- Maintain Quality: Consistent, reliable, and high-quality outputs
- Remain Safe: Free from harmful biases and problematic behaviors
Evaluation Types
Automated Evaluation
The PaiTIENT platform provides built-in automated evaluation tools:
# Python example
from paitient_secure_model import Client
from paitient_secure_model.evaluation import EvaluationSuite
client = Client()
# Create an evaluation suite
evaluation = EvaluationSuite(
name="Clinical Knowledge Evaluation",
deployment_id="dep_12345abcde",
metrics=[
"accuracy",
"factuality",
"toxicity",
"bias",
"latency"
]
)
# Run evaluation with test dataset
results = evaluation.run(
dataset="clinical_qa_dataset.jsonl",
num_samples=100
)
# Print results
print(f"Overall score: {results.overall_score}")
for metric in results.metrics:
    print(f"{metric.name}: {metric.score}")

// Node.js example
const { PaiTIENTClient } = require('paitient-secure-model');
const { EvaluationSuite } = require('paitient-secure-model/evaluation');
const client = new PaiTIENTClient();
async function evaluateModel() {
try {
// Create an evaluation suite
const evaluation = new EvaluationSuite({
name: "Clinical Knowledge Evaluation",
deploymentId: "dep_12345abcde",
metrics: [
"accuracy",
"factuality",
"toxicity",
"bias",
"latency"
]
});
// Run evaluation with test dataset
const results = await evaluation.run({
dataset: "clinical_qa_dataset.jsonl",
numSamples: 100
});
// Print results
console.log(`Overall score: ${results.overallScore}`);
results.metrics.forEach(metric => {
console.log(`${metric.name}: ${metric.score}`);
});
} catch (error) {
console.error('Evaluation failed:', error);
}
}
evaluateModel();

Human Evaluation
Complement automated evaluation with human assessment:
# Python example for human evaluation
from paitient_secure_model import Client
from paitient_secure_model.evaluation import HumanEvaluationTask
client = Client()
# Create a human evaluation task
human_eval = HumanEvaluationTask(
name="Clinical Guidance Quality Assessment",
deployment_id="dep_12345abcde",
evaluators=["user_12345", "user_67890"],
criteria=[
{
"name": "clinical_accuracy",
"description": "Is the medical information accurate and up-to-date?",
"scale": "1-5"
},
{
"name": "completeness",
"description": "Does the response fully address the question?",
"scale": "1-5"
},
{
"name": "safety",
"description": "Does the response include appropriate cautions and limitations?",
"scale": "1-5"
}
]
)
# Generate samples for evaluation
samples = human_eval.generate_samples(
num_samples=20,
prompts=[
"What are the treatment options for type 2 diabetes?",
"What are the risk factors for cardiovascular disease?",
"How should mild hypertension be managed?",
"What are the side effects of metformin?"
]
)
# Start the evaluation
human_eval.start()
print(f"Human evaluation started: {human_eval.id}")

Evaluation Metrics
Performance Metrics
Measure model performance with these key metrics:
| Metric | Description | Target |
|---|---|---|
| Latency (P95) | Response time (95th percentile) | < 1000ms |
| Throughput | Requests per second | > 10 RPS |
| Success Rate | Percentage of successful requests | > 99.9% |
| Token Generation Rate | Tokens generated per second | > 20 tokens/sec |
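These figures can be computed directly from raw request measurements rather than through any particular SDK call. The sketch below is illustrative only and uses hard-coded example data; it shows the arithmetic behind the latency, throughput, and success-rate targets, with P95 computed by the nearest-rank method:

# Python sketch (illustrative): compute P95 latency, throughput, and success rate
# from per-request measurements collected over a test window (nearest-rank P95).
import math

latencies_ms = [820, 640, 910, 1150, 730]    # example per-request latencies
successes = [True, True, True, False, True]  # example per-request outcomes
window_seconds = 60                          # length of the measurement window

sorted_latencies = sorted(latencies_ms)
p95_latency = sorted_latencies[math.ceil(0.95 * len(sorted_latencies)) - 1]
throughput_rps = len(latencies_ms) / window_seconds
success_rate = sum(successes) / len(successes)

print(f"P95 latency: {p95_latency} ms (target < 1000 ms)")
print(f"Throughput: {throughput_rps:.2f} RPS (target > 10 RPS)")
print(f"Success rate: {success_rate:.2%} (target > 99.9%)")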
Quality Metrics
Assess output quality with these metrics:
| Metric | Description | Target |
|---|---|---|
| Accuracy | Correctness of information | > 95% |
| Factuality | Adherence to established facts | > 95% |
| Completeness | Thoroughness of responses | > 90% |
| Relevance | Alignment with the prompt | > 95% |
Safety Metrics
Evaluate model safety with these metrics:
| Metric | Description | Target |
|---|---|---|
| Toxicity | Harmful or offensive content | < 0.1% |
| Bias | Unfair treatment of groups | < 1% |
| Hallucination | Generation of non-factual information | < 2% |
| PII Leakage | Exposure of personal information | 0% |
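A common pattern is to encode the quality and safety targets from the two tables above as thresholds and flag any metric that misses its target after a run. The sketch below is illustrative rather than a platform feature: it reuses the results object returned by EvaluationSuite.run() in the automated-evaluation example, assumes scores are reported as fractions between 0 and 1, and uses placeholder metric names that should match whatever your evaluation suite actually reports:

# Python sketch (illustrative): compare evaluation scores against the targets above.
# Assumes `results` comes from EvaluationSuite.run() as in the earlier example and
# that scores are fractions in [0, 1]. Metric names are placeholders.
quality_targets = {      # "at least" thresholds from the quality metrics table
    "accuracy": 0.95,
    "factuality": 0.95,
    "completeness": 0.90,
    "relevance": 0.95,
}
safety_limits = {        # "at most" thresholds from the safety metrics table
    "toxicity": 0.001,
    "bias": 0.01,
    "hallucination": 0.02,
    "pii_leakage": 0.0,
}

failures = []
for metric in results.metrics:
    if metric.name in quality_targets and metric.score < quality_targets[metric.name]:
        failures.append(f"{metric.name} below target: {metric.score:.3f}")
    if metric.name in safety_limits and metric.score > safety_limits[metric.name]:
        failures.append(f"{metric.name} above limit: {metric.score:.3f}")

if failures:
    print("Evaluation did not meet targets:")
    for failure in failures:
        print(f"  - {failure}")
else:
    print("All reported metrics meet their targets.")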
Continuous Evaluation
Implement continuous evaluation to monitor model quality over time:
# Python example for continuous evaluation
from paitient_secure_model import Client
from paitient_secure_model.evaluation import ContinuousEvaluation
client = Client()
# Set up continuous evaluation
continuous_eval = ContinuousEvaluation(
name="Clinical Assistant Monitoring",
deployment_id="dep_12345abcde",
metrics=["accuracy", "factuality", "toxicity", "latency"],
schedule="hourly",
dataset="clinical_validation_set.jsonl",
samples_per_run=50,
alert_thresholds={
"accuracy": 0.9,
"factuality": 0.9,
"toxicity": 0.01,
"latency": 1000
},
alert_destinations=["email:alerts@example.com", "slack:channel-id"]
)
# Start continuous evaluation
continuous_eval.start()
print(f"Continuous evaluation started: {continuous_eval.id}")

A/B Testing
Compare model versions with A/B testing:
# Python example for A/B testing
from paitient_secure_model import Client
from paitient_secure_model.evaluation import ABTest
client = Client()
# Create an A/B test
ab_test = ABTest(
name="Clinical Assistant Version Comparison",
variants=[
{
"name": "current",
"deployment_id": "dep_12345abcde",
"traffic_percentage": 50
},
{
"name": "candidate",
"deployment_id": "dep_67890fghij",
"traffic_percentage": 50
}
],
metrics=[
"accuracy",
"factuality",
"user_satisfaction",
"latency"
],
duration_days=7,
success_criteria={
"primary": "user_satisfaction",
"minimum_improvement": 0.05,
"statistical_significance": 0.95
}
)
# Start the A/B test
ab_test.start()
print(f"A/B test started: {ab_test.id}")
# Later, check the results
results = client.get_ab_test_results(ab_test.id)
print(f"Winner: {results.winner}")
for metric in results.metrics:
    print(f"{metric.name}: {metric.variant_a} vs {metric.variant_b} (p={metric.p_value})")

Domain-Specific Evaluation
Healthcare Evaluation
For healthcare models, use specialized evaluation:
# Python example for healthcare evaluation
healthcare_eval = EvaluationSuite(
name="Healthcare Specific Evaluation",
deployment_id="dep_12345abcde",
metrics=[
"clinical_accuracy",
"medical_reasoning",
"guideline_adherence",
"patient_safety",
"hipaa_compliance"
],
domain="healthcare"
)
results = healthcare_eval.run(
dataset="medical_cases.jsonl",
validators=["medical_knowledge_base", "clinical_guidelines"]
)

Benchmarks
Compare your model against industry benchmarks:
# Python example for benchmark evaluation
benchmark_results = client.run_benchmark(
deployment_id="dep_12345abcde",
benchmarks=[
"medical_qa",
"clinical_reasoning",
"medical_knowledge"
],
compare_to=["gpt-4", "claude-2", "med-palm"]
)
for benchmark in benchmark_results:
print(f"Benchmark: {benchmark.name}")
print(f"Your model score: {benchmark.your_score}")
for comparison in benchmark.comparisons:
        print(f" vs {comparison.model}: {comparison.score} ({comparison.difference:+.2f})")

Evaluation Dashboard
Monitor evaluation metrics through a dashboard:
# Python example for creating a dashboard
dashboard = client.create_evaluation_dashboard(
name="Clinical Assistant Evaluation",
deployments=["dep_12345abcde"],
metrics=[
"accuracy",
"factuality",
"latency",
"toxicity",
"user_satisfaction"
],
time_range="last_30_days",
refresh_interval="hourly"
)
print(f"Dashboard URL: {dashboard.url}")

Best Practices
Evaluation Strategy
Follow these best practices for effective evaluation:
- Define Clear Metrics: Select metrics aligned with your use case
- Use Representative Data: Test with data that reflects real-world usage
- Combine Automated and Human Evaluation: Use both for comprehensive assessment
- Implement Continuous Evaluation: Monitor model quality over time
- A/B Test Major Changes: Compare versions before deploying
- Document Results: Maintain records of evaluations for compliance
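To support the last point, evaluation runs can be archived together with enough context to audit them later. The sketch below is a minimal illustration using only the standard library; it assumes a results object from EvaluationSuite.run() as in the earlier examples, and the local file layout is an assumption rather than a platform feature:

# Python sketch (illustrative): archive an evaluation run as a timestamped JSON record
# for audit purposes. Assumes `results` comes from EvaluationSuite.run() as shown earlier.
import json
from datetime import datetime, timezone
from pathlib import Path

timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
record = {
    "deployment_id": "dep_12345abcde",
    "evaluated_at": timestamp,
    "overall_score": results.overall_score,
    "metrics": {metric.name: metric.score for metric in results.metrics},
}

archive_dir = Path("evaluation_records")   # hypothetical local archive location
archive_dir.mkdir(exist_ok=True)
outfile = archive_dir / f"eval_{timestamp}.json"
outfile.write_text(json.dumps(record, indent=2))
print(f"Evaluation record written to {outfile}")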
Creating Evaluation Datasets
Guidelines for effective evaluation datasets:
- Coverage: Include diverse scenarios and edge cases
- Relevance: Focus on your specific domain and use cases
- Freshness: Regularly update data to reflect current information
- Annotations: Include ground truth answers for accuracy evaluation
- Privacy: Ensure datasets are de-identified for compliance
- Balance: Include balanced representation across categories
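Before using a dataset, a quick structural check against these guidelines can catch obvious problems. The sketch below is illustrative: it assumes the JSONL format shown in the example that follows (prompt, ideal_response, category, and difficulty fields) and verifies structure and category balance only, not clinical correctness:

# Python sketch (illustrative): basic structural checks for an evaluation dataset in the
# JSONL format shown below. The required field names are assumptions based on that example.
import json
from collections import Counter

REQUIRED_FIELDS = {"prompt", "ideal_response", "category", "difficulty"}

def check_dataset(path):
    categories = Counter()
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_number}: not valid JSON")
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append(f"line {line_number}: missing fields {sorted(missing)}")
            else:
                categories[record["category"]] += 1
    return categories, problems

categories, problems = check_dataset("clinical_qa_dataset.jsonl")
print(f"Category distribution: {dict(categories)}")
for problem in problems:
    print(problem)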
Example dataset format (JSONL):
{"prompt": "What are the symptoms of type 2 diabetes?", "ideal_response": "The common symptoms of type 2 diabetes include increased thirst, frequent urination, increased hunger, fatigue, blurred vision, slow-healing sores, frequent infections, and areas of darkened skin.", "category": "symptoms", "difficulty": "easy"}
{"prompt": "Describe the mechanism of action for SGLT2 inhibitors in diabetes management.", "ideal_response": "SGLT2 inhibitors work by preventing the kidney's sodium-glucose transport proteins from reabsorbing glucose back into the blood. This causes glucose to be excreted in the urine, lowering blood glucose levels. They also promote weight loss and have cardioprotective and renoprotective effects.", "category": "pharmacology", "difficulty": "hard"}

Integrated Evaluation Workflow
Best practice workflow for model evaluation:
1. Pre-deployment Evaluation:
   - Comprehensive benchmark testing
   - Clinical accuracy validation
   - Safety and bias assessment
2. Deployment with A/B Testing:
   - Limited rollout with comparison to current model
   - User feedback collection
   - Performance monitoring
3. Continuous Monitoring:
   - Automated quality checks
   - Regular human review
   - Anomaly detection
4. Periodic Deep Evaluation:
   - Quarterly comprehensive evaluation
   - Expert review sessions
   - Compliance verification
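As a rough illustration, these stages can be chained together with the evaluation features shown earlier in this guide. The sketch below is a simplified outline rather than a turnkey pipeline: it reuses EvaluationSuite, ABTest, and ContinuousEvaluation with trimmed parameters from the previous examples, and the 0.9 pre-deployment gate and candidate deployment ID are placeholders:

# Python sketch (illustrative): outline of the workflow above using the evaluation
# features shown earlier. The 0.9 gate and candidate deployment ID are placeholders.
from paitient_secure_model import Client
from paitient_secure_model.evaluation import ABTest, ContinuousEvaluation, EvaluationSuite

client = Client()
candidate_id = "dep_67890fghij"

# 1. Pre-deployment evaluation of the candidate deployment
pre_eval = EvaluationSuite(
    name="Candidate Pre-deployment Evaluation",
    deployment_id=candidate_id,
    metrics=["accuracy", "factuality", "toxicity", "bias", "latency"]
)
results = pre_eval.run(dataset="clinical_qa_dataset.jsonl", num_samples=100)

if results.overall_score >= 0.9:   # placeholder gate before exposing any traffic
    # 2. Limited rollout compared against the current deployment
    ab_test = ABTest(
        name="Current vs Candidate",
        variants=[
            {"name": "current", "deployment_id": "dep_12345abcde", "traffic_percentage": 50},
            {"name": "candidate", "deployment_id": candidate_id, "traffic_percentage": 50}
        ],
        metrics=["accuracy", "user_satisfaction", "latency"],
        duration_days=7
    )
    ab_test.start()

    # 3. Ongoing monitoring once the winning version is promoted
    monitoring = ContinuousEvaluation(
        name="Post-rollout Monitoring",
        deployment_id=candidate_id,
        metrics=["accuracy", "factuality", "toxicity", "latency"],
        schedule="hourly",
        dataset="clinical_validation_set.jsonl",
        samples_per_run=50
    )
    monitoring.start()

# 4. Periodic deep evaluation would re-run a fuller EvaluationSuite on a schedule
#    (e.g. quarterly), alongside expert review and compliance checks.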
Next Steps
- Learn about Fine-tuning
- Explore Custom Deployments
- Understand Secure Deployment
- Review our Python SDK and Node.js SDK