# recognize-anything
[Hugging Face Demo](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text)
[Open in Colab](https://colab.research.google.com/github/mhd-medfa/recognize-anything/blob/main/recognize_anything_demo.ipynb)
Official PyTorch implementation of *Recognize Anything: A Strong Image Tagging Model* and *Tag2Text: Guiding Vision-Language Model via Image Tagging*.
- **Recognize Anything Model (RAM)** is an image tagging model that can recognize any common category with high accuracy.
- **Tag2Text** is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.

Both Tag2Text and RAM exhibit strong recognition ability.
We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) project.

## :bulb: Highlight of RAM
RAM is a strong image tagging model that can recognize any common category with high accuracy.
- **Strong and general.** RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
  - RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
  - RAM even surpasses fully supervised models such as ML-Decoder.
  - RAM is competitive with the Google tagging API.
- **Reproducible and affordable.** RAM has a low reproduction cost, thanks to its open-source and annotation-free training dataset.
- **Flexible and versatile.** RAM offers remarkable flexibility, catering to various application scenarios.
(In the comparison figures, green indicates fully supervised learning and blue indicates zero-shot performance.)
RAM significantly improves the tagging ability of the Tag2Text framework on which it is built.
- **Accuracy.** RAM utilizes a **data engine** to **generate** additional annotations and **clean** incorrect ones, resulting in **higher accuracy** compared to Tag2Text.
- **Scope.** RAM upgrades the number of fixed tags from 3,400+ to **[6,400+](./ram/data/ram_tag_list.txt)** (synonymous reduction to 4,500+ different semantic tags), covering **more valuable categories**.
Moreover, RAM is equipped with **open-set capability** and can recognize tags not seen during training.
## :sunrise: Highlight of Tag2Text
Tag2Text is an efficient and controllable vision-language model with tagging guidance.
- **Tagging.** Tag2Text recognizes **[3,400+](./ram/data/tag_list.txt)** commonly human-used categories without manual annotations.
- **Captioning.** Tag2Text integrates **tag information** into text generation as the **guiding elements**, resulting in **more controllable and comprehensive descriptions**.
- **Retrieval.** Tag2Text provides **tags** as **additional visible alignment indicators** for image-text retrieval.
## :writing_hand: TODO
- [x] Release Tag2Text demo.
- [x] Release checkpoints.
- [x] Release inference code.
- [x] Release RAM demo and checkpoints.
- [x] Release training codes.
- [ ] Release training datasets.
## :toolbox: Checkpoints
| No. | Name | Backbone | Data | Illustration | Checkpoint |
| --- | ---- | -------- | ---- | ------------ | ---------- |
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provides strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Supports comprehensive captioning and tagging. | Download link |
## :running: Model Inference
### **Setting Up** ###
1. Install the dependencies:
```bash
pip install -r requirements.txt
```
2. Download RAM pretrained checkpoints.
3. (Optional) To use RAM and Tag2Text in other projects, it is recommended to install recognize-anything as a package:
```bash
pip install -e .
```
The RAM and Tag2Text models can then be imported in other projects:
```python
from ram.models import ram, tag2text
```
### **RAM Inference** ##
Get the English and Chinese tagging outputs for an image:
```bash
python inference_ram.py --image images/demo/demo1.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
```
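The same result can be obtained from Python once `recognize-anything` is installed as a package (step 3 above). The following is a minimal sketch, assuming the checkpoint path above and the `get_transform`/`inference_ram` helpers used by `inference_ram.py`; the factory arguments (`image_size=384`, `vit="swin_l"`) are taken from that script, which remains the authoritative reference:
```python
import torch
from PIL import Image

from ram import get_transform, inference_ram
from ram.models import ram

device = "cuda" if torch.cuda.is_available() else "cpu"

# Preprocessing used by the RAM checkpoints (384x384 input).
transform = get_transform(image_size=384)

# Build the model and load the pretrained weights (path from the download step).
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384,
            vit="swin_l")
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

# inference_ram returns the recognized tags in English and in Chinese.
english_tags, chinese_tags = inference_ram(image, model)
print("Image Tags:", english_tags)
print("图像标签(中文):", chinese_tags)
```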
### **RAM Inference on Unseen Categories (Open-Set)** ##
First, customize the recognition categories in [build_openset_label_embedding](./ram/utils/openset_utils.py), then get the tags of the images:
```bash
python inference_ram_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
```
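For illustration, here is a hypothetical sketch of what that customization could look like. It assumes `build_openset_label_embedding` accepts a list of category names and returns the label embeddings together with the categories, and that the swapped model attributes (`tag_list`, `label_embed`, `num_class`, `class_threshold`) mirror those touched by `inference_ram_openset.py`; the category names are placeholders, so verify the actual interface in [openset_utils.py](./ram/utils/openset_utils.py):
```python
import numpy as np
import torch
import torch.nn as nn

from ram.models import ram
from ram.utils import build_openset_label_embedding

# Hypothetical custom categories not covered by the fixed tag list.
custom_categories = ["helmet", "traffic cone", "excavator"]

# Assumed interface: embeds the category names with the text encoder.
label_embed, categories = build_openset_label_embedding(custom_categories)

model = ram(pretrained="pretrained/ram_swin_large_14m.pth", image_size=384, vit="swin_l")

# Swap the fixed tag head for the custom open-set categories.
model.tag_list = np.array(categories)
model.label_embed = nn.Parameter(label_embed.float())
model.num_class = len(categories)
model.class_threshold = torch.ones(model.num_class) * 0.5
model.eval()
```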
### **Tag2Text Inference** ##
Get the tagging and captioning results:
```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth
```
Or get the tagging and specified captioning results (optional):
```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
```
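As with RAM, Tag2Text can be driven from Python. Below is a minimal sketch, assuming the `inference_tag2text` helper and the `tag2text` factory arguments (`image_size=384`, `vit="swin_b"`) follow `inference_tag2text.py`; the three returned values (model tags, user-specified tags, caption) are assumed from that script:
```python
import torch
from PIL import Image

from ram import get_transform, inference_tag2text
from ram.models import tag2text

device = "cuda" if torch.cuda.is_available() else "cpu"
transform = get_transform(image_size=384)

# Build Tag2Text and load the pretrained weights (path from the download step).
model = tag2text(pretrained="pretrained/tag2text_swin_14m.pth",
                 image_size=384,
                 vit="swin_b")
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

# Optionally guide the caption with user-specified tags (matching --specified-tags above).
model_tags, user_tags, caption = inference_tag2text(image, model, "cloud,sky")
print("Model Identified Tags:", model_tags)
print("User Specified Tags:", user_tags)
print("Image Caption:", caption)
```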
### **Batch Inference and Evaluation** ##
We release two datasets `OpenImages-common` (214 seen classes) and `OpenImages-rare` (200 unseen classes). Copy or sym-link test images of [OpenImages v6](https://storage.googleapis.com/openimages/web/download_v6.html) to `datasets/openimages_common_214/imgs/` and `datasets/openimages_rare_200/imgs`.
To evaluate RAM on `OpenImages-common`:
```bash
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram
```
To evaluate RAM open-set capability on `OpenImages-rare`:
```bash
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_openset
```
To evaluate Tag2Text on `OpenImages-common`:
```bash
python batch_inference.py \
--model-type tag2text \
--checkpoint pretrained/tag2text_swin_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/tag2text
```
Please refer to `batch_inference.py` for more options. To reproduce the P/R numbers in Table 3 of our paper, pass `--threshold=0.86` for RAM and `--threshold=0.68` for Tag2Text.
To run batch inference on custom images, set up your own dataset following the structure of the two datasets above.
## :golfing: Model Training/Finetuning
### **Tag2Text** ##
At present, we can only open-source [the forward function of Tag2Text](./ram/models/tag2text.py#L141).
To train/finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of [BLIP](https://github.com/salesforce/BLIP/tree/main) and make the following modifications:
1. Replace the "models/blip.py" file with the current [tag2text.py](./ram/models/tag2text.py) model file;
2. Extend the original dataloader to load the additional tags (see the sketch below).
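The sketch below illustrates the second modification. It is not part of this repository: the dataset class, annotation layout, and field names are hypothetical, and the comma-separated tag string alongside each caption is only one possible convention; adapt it to BLIP's actual dataloader:
```python
import json

from PIL import Image
from torch.utils.data import Dataset


class TaggedCaptionDataset(Dataset):
    """Hypothetical dataset returning (image, caption, tags) triples for Tag2Text finetuning."""

    def __init__(self, annotation_file, transform):
        # Each record is assumed to look like:
        # {"image": "path.jpg", "caption": "a dog on the grass", "tags": "dog,grass"}
        with open(annotation_file) as f:
            self.records = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        image = self.transform(Image.open(record["image"]).convert("RGB"))
        # The tags are the extra supervision Tag2Text consumes on top of the caption.
        return image, record["caption"], record["tags"]
```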
### **RAM** ##
The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal release process.