Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
Updated Jan 26, 2026 - Python
Tag manager and captioner for image datasets
AI-Powered Watermark Remover using Florence-2 and LaMA: Remove watermarks from images and videos, including AI-generated content from Sora, Runway, and others. Features a modern PyWebview GUI.
[ICLR 2026] The official implementation of "Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model"
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
Use Segment Anything 2, grounded with Florence-2, to auto-label data for use in training vision models.
VLM-driven tool that processes surveillance videos, extracts frames, and generates annotations using a fine-tuned Florence-2 vision-language model. Includes a Gradio-based interface for querying and analyzing video footage.
Watermark remover tool that leverages the capabilities of Microsoft Florence and Lama Cleaner models.
Use Florence 2 to auto-label data for use in training fine-tuned object detection models.
Vision-language model fine-tuning notebooks and use cases (MedGemma, PaliGemma, Florence, ...)
A Python-based CLI tool for captioning images with WD series, Joy-caption-pre-alpha, Meta Llama 3.2 Vision Instruct, and Qwen2 VL Instruct models.
Local LLM Discord Bot
Run SOTA Vision-Language Model Florence-2 on your data!
Simple video summarization using Text-to-Segment Anything (Florence2 + SAM2). This project provides a video processing tool that uses Florence2 and SAM2 to detect and segment specific objects or activities in a video based on textual descriptions.
ONNX deployment of the Florence-2 vision multimodal model
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images.
An MCP server for processing images using Florence-2
Simple Gradio application integrated with Hugging Face multimodal models to support a visual question answering chatbot and other features