# recognize-anything [![Web Demo](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mhd-medfa/recognize-anything/blob/main/recognize_anything_demo.ipynb)

Official PyTorch implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.

- **Recognize Anything Model (RAM)** is an image tagging model that can recognize any common category with high accuracy.
- **Tag2Text** is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.

Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) project.

![](./images/ram_grounded_sam.jpg)

## :bulb: Highlight of RAM

RAM is a strong image tagging model, which can recognize any common category with high accuracy.

- **Strong and general.** RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
  - RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
  - RAM even surpasses fully supervised models (ML-Decoder).
  - RAM exhibits performance competitive with the Google tagging API.
- **Reproducible and affordable.** RAM has a low reproduction cost thanks to its open-source and annotation-free dataset.
- **Flexible and versatile.** RAM offers remarkable flexibility, catering to various application scenarios.

(In the comparison results, green indicates fully supervised learning and blue indicates zero-shot performance.)

RAM significantly improves the tagging ability of the Tag2Text framework:

- **Accuracy.** RAM utilizes a **data engine** to **generate** additional annotations and **clean** incorrect ones, achieving **higher accuracy** than Tag2Text.
- **Scope.** RAM expands the number of fixed tags from 3,400+ to **[6,400+](./ram/data/ram_tag_list.txt)** (reduced by synonym merging to 4,500+ distinct semantic tags), covering **more valuable categories**. Moreover, RAM is equipped with **open-set capability** and can recognize tags not seen during training.

## :sunrise: Highlight of Tag2Text

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

- **Tagging.** Tag2Text recognizes **[3,400+](./ram/data/tag_list.txt)** commonly used categories without manual annotations.
- **Captioning.** Tag2Text integrates **tag information** into text generation as **guiding elements**, resulting in **more controllable and comprehensive descriptions**.
- **Retrieval.** Tag2Text provides **tags** as **additional visible alignment indicators** for image-text retrieval.

## :writing_hand: TODO

- [x] Release Tag2Text demo.
- [x] Release checkpoints.
- [x] Release inference code.
- [x] Release RAM demo and checkpoints.
- [x] Release training codes.
- [ ] Release training datasets.

## :toolbox: Checkpoints
| # | Name | Backbone | Data | Illustration | Checkpoint |
|---|------|----------|------|--------------|------------|
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provides strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Supports comprehensive captioning and tagging. | Download link |
## :running: Model Inference

### **Setting Up**

1. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Download the RAM pretrained checkpoints.

3. (Optional) To use RAM and Tag2Text in other projects, it is better to install recognize-anything as a package:

   ```bash
   pip install -e .
   ```

   Then the RAM and Tag2Text models can be imported in other projects:

   ```python
   from ram.models import ram, tag2text
   ```

### **RAM Inference**

Get the English and Chinese outputs of the images:
```bash
python inference_ram.py --image images/demo/demo1.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
```
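If you installed recognize-anything as a package, RAM can also be called from Python. Below is a minimal sketch that mirrors `inference_ram.py`; the `get_transform` and `inference_ram` helpers and the `image_size`/`vit` arguments are taken from that script, so check it if your checkout differs.

```python
import torch
from PIL import Image

from ram import get_transform
from ram import inference_ram as inference
from ram.models import ram

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Preprocessing used by the inference script (384x384 by default).
transform = get_transform(image_size=384)

# Swin-Large RAM with the 14M-image pretrained checkpoint.
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l")
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # The inference helper returns English and Chinese tag strings.
    english_tags, chinese_tags = inference(image, model)

print("Image Tags:", english_tags)
print("图像标签:", chinese_tags)
```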
### **RAM Inference on Unseen Categories (Open-Set)**

First, customize the recognition categories in [build_openset_label_embedding](./ram/utils/openset_utils.py), then get the tags of the images:
```bash
python inference_ram_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
```
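For reference, the open-set script works by swapping the model's closed-set label embeddings for ones built from your custom categories. Below is a sketch of that pattern; the `build_openset_label_embedding` helper and the `tag_list`/`label_embed`/`num_class`/`class_threshold` attributes are assumptions based on `inference_ram_openset.py` and `ram/utils/openset_utils.py`, so verify them against your checkout.

```python
import numpy as np
import torch

from ram.models import ram
from ram.utils import build_openset_label_embedding

# Load RAM as usual, then replace its closed-set label embeddings.
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l")
model.eval()

# Build text embeddings for the categories configured in
# ram/utils/openset_utils.py (edit that file to add your own tags).
openset_label_embedding, openset_categories = build_openset_label_embedding()

model.tag_list = np.array(openset_categories)
model.label_embed = torch.nn.Parameter(openset_label_embedding.float())
model.num_class = len(openset_categories)
# Open-set inference uses a single global threshold for all tags.
model.class_threshold = torch.ones(model.num_class) * 0.5
```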
### **Tag2Text Inference**

Get the tagging and captioning results:
```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth
```
Or get the tagging and specified captioning results (optional):
```bash
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
```
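Tag2Text can likewise be driven from Python once the package is installed. This sketch follows `inference_tag2text.py`; the `threshold` attribute, the third `inference_tag2text` argument (user-specified tags, `"None"` to disable), and the returned triple are assumptions based on that script, so double-check them against your checkout.

```python
import torch
from PIL import Image

from ram import get_transform
from ram import inference_tag2text as inference
from ram.models import tag2text

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = get_transform(image_size=384)

# Swin-Base Tag2Text with the 14M-image pretrained checkpoint.
model = tag2text(pretrained="pretrained/tag2text_swin_14m.pth",
                 image_size=384, vit="swin_b")
model.threshold = 0.68  # tagging threshold used by the inference script
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Pass comma-separated tags to guide the caption, or "None" to let
    # the model predict the tags itself.
    tags, user_tags, caption = inference(image, model, "cloud,sky")

print("Model Identified Tags:", tags)
print("Image Caption:", caption)
```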
### **Batch Inference and Evaluation**

We release two datasets, `OpenImages-common` (214 seen classes) and `OpenImages-rare` (200 unseen classes). Copy or symlink test images of [OpenImages v6](https://storage.googleapis.com/openimages/web/download_v6.html) to `datasets/openimages_common_214/imgs/` and `datasets/openimages_rare_200/imgs/`.

To evaluate RAM on `OpenImages-common`:

```bash
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram
```

To evaluate RAM's open-set capability on `OpenImages-rare`:

```bash
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset
```

To evaluate Tag2Text on `OpenImages-common`:

```bash
python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text
```

Please refer to `batch_inference.py` for more options. To reproduce the precision/recall numbers in Table 3 of our paper, pass `--threshold=0.86` for RAM and `--threshold=0.68` for Tag2Text.

To batch-inference custom images, you can set up your own dataset following the two given datasets.

## :golfing: Model Training/Finetuning

### **Tag2Text**

At present, we can only open-source [the forward function of Tag2Text](./ram/models/tag2text.py#L141). To train/finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of [BLIP](https://github.com/salesforce/BLIP/tree/main) and make the following modifications:

1. Replace the "models/blip.py" file with the current "[tag2text.py](./ram/models/tag2text.py)" model file;
2. Load additional tags on top of the original dataloader (see the sketch at the end of this README).

### **RAM**

The training code of RAM cannot be open-sourced for now, as it is still going through the company's release process.
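For modification 2 above, here is a minimal sketch of a BLIP-style dataset that additionally yields tags. Everything in it is hypothetical: the JSON annotation format, the `tag_ids` field, and the `TaggedCaptionDataset` name are placeholders for whatever your custom dataset provides, not an API of this repo.

```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset


class TaggedCaptionDataset(Dataset):
    """BLIP-style caption dataset that also returns a multi-hot tag vector.

    Hypothetical annotation format: a JSON list of records such as
    {"image": "path/to.jpg", "caption": "...", "tag_ids": [3, 17, 512]},
    where tag_ids index into the entries of ram/data/tag_list.txt.
    """

    def __init__(self, annotation_file, transform, num_tags):
        with open(annotation_file) as f:
            self.annotations = json.load(f)
        self.transform = transform
        self.num_tags = num_tags  # size of your tag vocabulary

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        ann = self.annotations[index]
        image = self.transform(Image.open(ann["image"]).convert("RGB"))
        # Multi-hot tag vector to be consumed alongside the caption by
        # Tag2Text's forward function.
        tags = torch.zeros(self.num_tags)
        tags[ann["tag_ids"]] = 1.0
        return image, ann["caption"], tags
```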