# PHsymm: Protein Symmetry Type Prediction Using Path Homology and ESM2

## Project Overview

### Model Functionality

PHsymm is a deep learning model that predicts the three-dimensional symmetry type of a protein directly from its amino acid sequence. It is the first method to apply path homology to protein sequence feature extraction, and it requires neither structural information nor a Multiple Sequence Alignment (MSA).

### Input and Output

**Input**:
- A protein sequence file in FASTA format, or a sequence string
- A FASTA file may contain a single sequence or multiple sequences (batch prediction)

**Output**:
- Predicted symmetry type label (one of 15 categories)
- Prediction confidence (probability value)
- Top-5 candidate predictions and their probability distribution (optional)
- Batch prediction results CSV file (optional)

**Output Format Example**:
```
Predicted Symmetry Type: C2
Confidence: 0.8234 (82.34%)

Top 5 Predictions:
Rank     Symmetry Type        Probability      Percentage
----------------------------------------------------------------------
 ←  1    C2                   0.8234           82.34%
    2    C4                   0.0567           5.67%
    3    D3                   0.0321           3.21%
    ...
```
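How a ranked list like the one above can be derived from raw class scores is sketched below. This is a minimal illustration in plain Python, not the project's own code: the function names `softmax` and `top_k` and the example logits are hypothetical, and it assumes the model emits one score per class in the index order of the category table in this README.

```python
import math

# The 15 class names in index order, as listed in the category table of this README.
CLASS_NAMES = ["C1", "C10-C17", "C2", "C3", "C4", "C5", "C6", "C7-C9",
               "D2", "D3", "D4", "D5", "D6-D8", "H", "T"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(CLASS_NAMES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical logits for one sequence; index 2 (C2) has the largest score.
    example_logits = [0.1, -1.0, 4.0, 0.5, 1.3, -0.2, 0.0, -1.5,
                      0.3, 0.8, -0.7, -0.9, -2.0, -1.1, -0.4]
    for rank, (label, prob) in enumerate(top_k(example_logits), start=1):
        print(f"{rank:<5}{label:<12}{prob:.4f}  {prob * 100:.2f}%")
```

The reported confidence corresponds to the probability of the top-ranked class.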

## Environment Dependencies

### Python Version
- Python 3.7+

### Main Dependencies and Versions

```text
# Deep Learning Framework
torch>=1.9.0
torch-geometric>=2.0.0

# Data Processing
numpy>=1.19.0
pandas>=1.2.0
scipy>=1.6.0
scikit-learn>=0.24.0

# ESM2 Model
fair-esm>=2.0.0  # ESM2 models were added in fair-esm 2.0.0
# or
transformers>=4.20.0

# Bioinformatics
biopython>=1.78

# Configuration and Utilities
pyyaml>=5.4.0
tqdm>=4.60.0
matplotlib>=3.3.0

# Training Framework
pytorch-ignite>=0.4.0  # For training loop and evaluation
```

## Installation

### Method 1: Using Conda Environment (Recommended)

```bash
# Create conda environment
conda create -n PHsymm python=3.8
conda activate PHsymm

# Install PyTorch (select according to CUDA version)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install other dependencies
pip install torch-geometric
pip install fair-esm
pip install numpy pandas scipy scikit-learn biopython pyyaml tqdm matplotlib
pip install pytorch-ignite
```

### Method 2: Direct pip Installation

```bash
# Install PyTorch (select according to your system)
pip install torch torchvision torchaudio

# Install other dependencies
pip install torch-geometric numpy pandas scipy scikit-learn
pip install fair-esm biopython pyyaml tqdm matplotlib
pip install pytorch-ignite
```

### Verify Installation

```bash
python -c "import torch; import esm; print('Installation successful!')"
```

## Data Description

### Data Source

The model is trained using protein sequences and their corresponding symmetry type labels.

### Data Format

#### Label File (CSV Format)

**Format**: Two columns, with optional header

```csv
protein_id,label
1A0P,C2
2ABC,D3
3DEF,C4
...
```

**Column Description**:
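- `protein_id`: Protein ID (case-insensitive)
- `label`: Symmetry type label. C1 to C6 and D2 to D5 are individual classes; C7-C9, C10-C17, and D6-D8 are merged classes; H and T complete the set (see the category table in "15 Symmetry Type Categories")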
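A label file in this format can be read with the standard library alone. The sketch below is illustrative rather than the project's loader: `read_labels` is a hypothetical name, and normalizing IDs to upper case is an assumption made here to honor the case-insensitive `protein_id` column.

```python
import csv
import io


def read_labels(handle):
    """Parse 'protein_id,label' rows into a dict keyed by upper-case protein ID.

    A row whose first cell is 'protein_id' is treated as the optional header.
    """
    labels = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        protein_id, label = row[0].strip(), row[1].strip()
        if protein_id.lower() == "protein_id":
            continue  # optional header row
        labels[protein_id.upper()] = label  # assumption: upper case as canonical form
    return labels


if __name__ == "__main__":
    sample = io.StringIO("protein_id,label\n1a0p,C2\n2ABC,D3\n")
    print(read_labels(sample))
```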

#### ESM2 Feature Files

**Location**: `data/esm2_dir/` or directory specified in config file

**Format**: CSV file with a single row of 1280 values (comma-separated)

**Filename**: `<protein_id_lower>.csv` (protein ID in lowercase)

**Example**: `1a0p.csv`
```
0.123, -0.456, 0.789, ..., 0.321  # 1280 values
```

#### Path Homology Feature Files

**Location**: `data/path_homology_dir/` or directory specified in config file

**Format**: CSV file with 360 values (comma-separated, no header)

**Filename**: `<protein_id_upper>.csv` (protein ID in uppercase)

**Example**: `1A0P.csv`
```
0.12, 3.45, 0.67, ..., 1.23  # 360 values
```
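The two feature files use different filename cases (lowercase for ESM2, uppercase for Path Homology), which is easy to get wrong on case-sensitive filesystems. The following sketch shows one way to load both files for a protein and check their lengths. The helper names are hypothetical, and joining the two vectors by simple concatenation into 1640 values is an assumption for illustration; the actual fusion is defined in `model.py`.

```python
import csv
import os

ESM2_DIM = 1280   # values per ESM2 file, as stated above
PH_DIM = 360      # values per Path Homology file, as stated above


def read_vector(path, expected_dim):
    """Read all comma-separated numbers in a CSV file into one flat list."""
    values = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            values.extend(float(cell) for cell in row if cell.strip())
    if len(values) != expected_dim:
        raise ValueError(f"{path}: expected {expected_dim} values, found {len(values)}")
    return values


def load_features(protein_id, esm2_dir, ph_dir):
    """Return the ESM2 vector followed by the Path Homology vector for one protein."""
    esm2_path = os.path.join(esm2_dir, protein_id.lower() + ".csv")  # lowercase filename
    ph_path = os.path.join(ph_dir, protein_id.upper() + ".csv")      # uppercase filename
    # Assumption: plain concatenation, used here only to show the combined size.
    return read_vector(esm2_path, ESM2_DIM) + read_vector(ph_path, PH_DIM)
```

For example, `load_features("1A0P", "data/esm2_dir", "data/path_homology_dir")` would read `1a0p.csv` and `1A0P.csv` and raise a `ValueError` if either file has an unexpected number of values.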

## Usage

### 1. Inference and Prediction

#### Single Sequence Prediction

```bash
# From FASTA file
./predict_protein.sh -f protein.fasta

# From sequence string
./predict_protein.sh -s "MKTAYIAKQR..."

# Using GPU
./predict_protein.sh -f protein.fasta --gpu
```

#### Batch Prediction

When a FASTA file contains multiple sequences, use the `--output` parameter to save batch results:

```bash
./predict_protein.sh -f proteins.fasta --output results.csv
```

The output CSV has one row per sequence, with the sequence name, predicted label, confidence, and related fields.
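To make the notion of a multi-sequence FASTA file concrete, here is a minimal reader written with the standard library only. It is a sketch, not the parser PHsymm uses; `read_fasta` is a hypothetical name, and taking the first word of the header line as the sequence name is an assumption. In practice, Biopython's `SeqIO.parse(handle, "fasta")` is the usual tool for this job.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the previous record
            header = line[1:].strip()
            name = header.split()[0] if header else ""  # assumption: first word is the name
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    text = ">seq1 first protein\nMKTAY\nIAKQR\n>seq2\nGSHM\n"
    for name, sequence in read_fasta(io.StringIO(text)):
        print(name, len(sequence))
```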

#### Parameter Description

| Parameter | Description | Required | Default |
|-----------|-------------|----------|---------|
| `-f, --fasta` | FASTA file path | Either `-f` or `-s` | - |
| `-s, --sequence` | Protein sequence string | Either `-f` or `-s` | - |
| `-m, --model` | Model checkpoint path | No | `trained_models/.../best_checkpoint_1_macro_auc_pr=0.5918.pt` |
| `-c, --config` | Config file path | No | `config/protein_classifier_config.yaml` |
| `-o, --output` | Output CSV file path | No | - |
| `-g, --gpu` | Use GPU acceleration | No | CPU |
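The options in the table can be mirrored with `argparse` as follows. This is a sketch for readers who want to call the prediction logic from their own Python code, not the parser shipped with PHsymm: `build_parser` is a hypothetical name, treating `-f` and `-s` as mutually exclusive is an assumption based on the "Either" wording, and the model default is left as `None` because the full checkpoint path is abbreviated in the table.

```python
import argparse


def build_parser():
    """Build a parser with the options listed in the parameter table."""
    parser = argparse.ArgumentParser(description="Predict protein symmetry type (sketch)")
    # Assumption: exactly one of -f / -s must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--fasta", help="FASTA file path")
    source.add_argument("-s", "--sequence", help="Protein sequence string")
    # The real default checkpoint path is abbreviated in the table, so none is set here.
    parser.add_argument("-m", "--model", default=None, help="Model checkpoint path")
    parser.add_argument("-c", "--config", default="config/protein_classifier_config.yaml",
                        help="Config file path")
    parser.add_argument("-o", "--output", default=None, help="Output CSV file path")
    parser.add_argument("-g", "--gpu", action="store_true", help="Use GPU acceleration")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-f", "proteins.fasta", "--output", "results.csv"])
    print(args)
```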

### 2. Model Training

#### Prepare Data

1. Prepare the label file (CSV format) with columns `protein_id,label`
2. Prepare the ESM2 feature files (if needed)
3. Prepare the Path Homology feature files (computed using `ph-protein-sequence.py`)

#### Configuration File Setup

Edit `config/protein_classifier_config.yaml`:

```yaml
data:
  csv_file: 'path/to/labels.csv'
  esm2_dir: 'path/to/esm2/features/'
  path_homology_dir: 'path/to/path_homology/features/'
  train_ratio: 0.8
  val_ratio: 0.1
  test_ratio: 0.1

training:
  batch_size: 8
  learning_rate: 0.0002
  epoch: 200
  ...
```
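The three ratios in the `data` section must sum to 1. The sketch below shows how such ratios could translate into train, validation, and test lists. It assumes a seeded random shuffle; whether the training script splits randomly, by class, or by some other rule is not stated here, and `split_ids` is a hypothetical helper.

```python
import random


def split_ids(protein_ids, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle IDs and cut them into train/val/test lists by the given ratios."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1")
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_ratio)
    n_val = int(len(ids) * val_ratio)
    # The test set takes whatever remains, so no ID is dropped by rounding.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


if __name__ == "__main__":
    train, val, test = split_ids([f"P{i:03d}" for i in range(100)])
    print(len(train), len(val), len(test))
```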

#### Start Training

```bash
python train_protein_improved.py \
    --config config/protein_classifier_config.yaml \
    --gpu 0
```

### 3. Compute Path Homology Features

To compute Path Homology features for FASTA files:

```bash
# Modify the paths in ph-protein-sequence.py, then run
python ph-protein-sequence.py
```

The script processes all FASTA files in the specified directory and computes a 360-dimensional Path Homology feature vector for each sequence.

## Experimental Results

### Main Performance Metrics

The best checkpoint records its validation score in its filename (`best_checkpoint_1_macro_auc_pr=0.5918.pt`). The best model's performance and setup are:

- **Macro AUC-PR**: **0.5918**
- **Model Architecture**: MLP-based hybrid feature fusion network
- **Features**: ESM2 (1280-dim) + Path Homology (360-dim)
- **Number of Classes**: 15 symmetry type categories
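For readers unfamiliar with the metric, Macro AUC-PR averages the area under the precision-recall curve over classes, with each class treated one-vs-rest and weighted equally. The sketch below uses the step-wise average precision estimator and skips classes with no positive example; both choices are assumptions, and the training script may compute the metric differently (for instance in how it handles tied scores).

```python
def average_precision(scores, is_positive):
    """Step-wise area under the precision-recall curve for one class."""
    n_pos = sum(is_positive)
    if n_pos == 0:
        return None  # undefined when the class has no positive example
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def macro_auc_pr(probabilities, labels, num_classes):
    """Unweighted mean of one-vs-rest average precision over classes."""
    per_class = []
    for c in range(num_classes):
        scores = [row[c] for row in probabilities]
        positives = [1 if y == c else 0 for y in labels]
        ap = average_precision(scores, positives)
        if ap is not None:
            per_class.append(ap)
    return sum(per_class) / len(per_class)
```

Because every class counts equally, rare symmetry types influence this score as much as the common C1 and C2 classes.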

### Training Configuration

- **Loss Function**: Focal Loss (γ=3.0, α=effective_number)
- **Optimizer**: Adam (lr=0.0002, weight_decay=2e-4)
- **Learning Rate Schedule**: Cosine Annealing (T_max=200)
- **Early Stopping Monitor**: Macro AUC-PR (patience=30)
- **Data Sampling Strategy**: 
  - C1: Undersample 60%
  - C2: No sampling
  - C3, D2: Oversample 5x
  - Others: Oversample 20x
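The loss and learning-rate settings above can be illustrated with a few lines of plain Python. This is a sketch of the standard formulas, not a copy of `focal_loss.py` or the training script: the function names are hypothetical, `beta=0.999` for the effective-number weights is an assumed value that this README does not specify, and `eta_min=0.0` is assumed as the floor of the cosine schedule.

```python
import math


def effective_number_weights(class_counts, beta=0.999):
    """Class weights proportional to (1 - beta) / (1 - beta**n), scaled to sum to the class count."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def focal_loss(probs, target, alpha, gamma=3.0):
    """Focal loss for one sample: -alpha[y] * (1 - p_y)**gamma * log(p_y)."""
    p = max(probs[target], 1e-12)  # guard against log(0)
    return -alpha[target] * (1.0 - p) ** gamma * math.log(p)


def cosine_annealing_lr(epoch, base_lr=0.0002, t_max=200, eta_min=0.0):
    """Closed-form cosine annealing from base_lr at epoch 0 to eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


if __name__ == "__main__":
    # Hypothetical class counts: one frequent class and one rare class.
    alpha = effective_number_weights([5000, 50])
    print("alpha:", alpha)
    print("loss, confident:", focal_loss([0.9, 0.1], 0, alpha))
    print("loss, uncertain:", focal_loss([0.5, 0.5], 0, alpha))
    print("lr at epoch 100:", cosine_annealing_lr(100))
```

With γ=3.0, well-classified samples contribute very little to the loss, so training focuses on hard examples, while the effective-number weights give rare classes a larger α than frequent ones.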

### Model Checkpoints

Trained models are saved in the `trained_models/protein_classifier_unified/` directory:

- `best_checkpoint_1_macro_auc_pr=0.5918.pt` - Best model (Macro AUC-PR=0.5918)
- `final_model.pth` - Final model
- `*.png` - Training curves and confusion matrix visualizations

## Key Scripts

| File | Function |
|------|----------|
| `predict_single_protein.py` | Predict symmetry type from FASTA file or sequence string |
| `train_protein_improved.py` | Main training script supporting multiple loss functions and training strategies |
| `model.py` | QCformer model definition, implementing ESM2 and Path Homology feature fusion |
| `ph-protein-sequence.py` | Script to compute Path Homology features |
| `GLMYnonregular.py` | Core implementation of path homology algorithm |
| `protein_dataset.py` | Dataset loading, processing, and batching |
| `focal_loss.py` | Multi-class Focal Loss implementation |

## 15 Symmetry Type Categories

The model predicts the following 15 symmetry types:

| Class Index | Label Name | Description | Original Label Mapping |
|------------|-----------|-------------|------------------------|
| 0 | C1 | No symmetry | C1 |
| 1 | C10-C17 | 10-17-fold rotational symmetry (merged) | C10, C11, C12, C13, C14, C15, C16, C17 |
| 2 | C2 | 2-fold rotational symmetry | C2 |
| 3 | C3 | 3-fold rotational symmetry | C3 |
| 4 | C4 | 4-fold rotational symmetry | C4 |
| 5 | C5 | 5-fold rotational symmetry | C5 |
| 6 | C6 | 6-fold rotational symmetry | C6 |
| 7 | C7-C9 | 7-9-fold rotational symmetry (merged) | C7, C8, C9 |
| 8 | D2 | Dihedral 2-fold symmetry | D2 |
| 9 | D3 | Dihedral 3-fold symmetry | D3 |
| 10 | D4 | Dihedral 4-fold symmetry | D4 |
| 11 | D5 | Dihedral 5-fold symmetry | D5 |
| 12 | D6-D8 | Dihedral 6-8-fold symmetry (merged) | D6, D7, D8 |
| 13 | H | Helical symmetry | H |
| 14 | T | Tetrahedral symmetry | T |
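The table can be turned into a lookup from original labels to class indices, as sketched below. The names `expand`, `ORIGINAL_TO_INDEX`, and `class_index` are hypothetical and do not come from `protein_dataset.py`. Note that the index order matches a plain string sort of the label names, which is why `C10-C17` sits at index 1.

```python
import re

# Label names in class-index order, copied from the table above.
CLASS_NAMES = ["C1", "C10-C17", "C2", "C3", "C4", "C5", "C6", "C7-C9",
               "D2", "D3", "D4", "D5", "D6-D8", "H", "T"]


def expand(name):
    """Expand a merged name such as 'C7-C9' into ['C7', 'C8', 'C9']."""
    match = re.fullmatch(r"([CD])(\d+)-\1(\d+)", name)
    if match is None:
        return [name]  # single label, nothing to expand
    family, low, high = match.group(1), int(match.group(2)), int(match.group(3))
    return [f"{family}{n}" for n in range(low, high + 1)]


# Every original label mapped to the index of the class that contains it.
ORIGINAL_TO_INDEX = {original: index
                     for index, name in enumerate(CLASS_NAMES)
                     for original in expand(name)}


def class_index(original_label):
    """Map an original label (e.g. 'C8') to its class index (e.g. 7)."""
    return ORIGINAL_TO_INDEX[original_label.strip().upper()]


if __name__ == "__main__":
    for label in ["C1", "C8", "C12", "D7", "T"]:
        print(label, "->", class_index(label), CLASS_NAMES[class_index(label)])
```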

## Citation and License

### Citation

If you use PHsymm in your research, please cite:

```bibtex
@article{phsymm2025,
  title={PHsymm: Protein Symmetry Type Prediction Using Path Homology and ESM2},
  author={...},
  journal={...},
  year={2025},
  ...
}
```


## Contact

For questions or suggestions, please contact:

- Email: 2022000744@ruc.edu.cn

---

**Note**: On first run, the ESM2 model is downloaded automatically (approximately 1.3 GB), so a stable internet connection is required.
