Implementation Overview

Model Overview

The implementation fine-tunes the Qwen2.5-VL-7B-Instruct model. This model leverages a Transformer-based architecture, integrating a vision tower for image processing with a decoder-only language model for text generation. The vision tower encodes images into embeddings, which are combined with text token embeddings for processing by the Transformer layers.

The base model is loaded unchanged via Hugging Face’s Transformers library, and no structural changes are made to the Transformer backbone. Training uses bfloat16 mixed precision.
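
As a rough illustration of the loading step, the sketch below shows one way to load the model in bfloat16. It assumes the `Qwen2_5_VLForConditionalGeneration` and `AutoProcessor` classes from a recent Transformers release and the `Qwen/Qwen2.5-VL-7B-Instruct` Hub ID. The helper name `load_model_and_processor` is hypothetical, and the actual implementation may load the model differently.

```python
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # Hub ID assumed from the model name above


def load_model_and_processor(model_id=MODEL_ID):
    """Load the base model in bfloat16 together with its processor (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the module can be inspected without them.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    # The architecture is used as published; only the weight dtype is chosen here.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
    )
    # The processor bundles the tokenizer and the image preprocessing.
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Calling `load_model_and_processor()` downloads the weights on first use, so it is best run once on the training machine.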

Fine-Tuning Approach

The implementation uses supervised fine-tuning with low-rank adaptation (LoRA) to adapt the pre-trained model to a custom dataset. LoRA adds trainable low-rank matrices to selected layers, such as attention or feed-forward layers, while the original weights stay frozen. This sharply reduces the number of parameters that need updating during training, so fine-tuning uses less memory while preserving the model's pre-trained abilities.
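
To make the parameter savings concrete, the arithmetic can be sketched as follows. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `d_out x r` and `r x d_in`. The layer width and rank below are illustrative assumptions, not values taken from this implementation.

```python
def full_param_count(d_in, d_out):
    """Parameters in a dense weight matrix of shape d_out x d_in."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative values (assumptions): a square projection layer and a small rank.
d_model = 3584
rank = 8

full = full_param_count(d_model, d_model)
lora = lora_param_count(d_model, d_model, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction trained: {fraction:.4%}")
```

With these numbers, the adapter trains well under 1% of the parameters of the layer it wraps, which is where the memory saving comes from.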

Training Data and Preprocessing

The implementation downloads a Reflow events dataset file from Hugging Face containing screenshot-text pairs. Preprocessing sorts each record into one of two task types, "action_description" (with click/bounding box annotations) or "context_extraction", splits the records 90% training / 10% validation, and saves the processed images as JPEGs with JSONL annotations.
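
The preprocessing steps above can be sketched roughly as follows. The record fields (`image`, `text`, `bbox`, `click`) and the helper names are hypothetical, since the dataset schema is not described here, and the rule used to tell the two task types apart is an assumption.

```python
import json
import random


def classify_record(record):
    """Return the task type for one record (field names are hypothetical)."""
    # Assumption: records carrying click or bounding-box annotations are action descriptions.
    if record.get("bbox") is not None or record.get("click") is not None:
        return "action_description"
    return "context_extraction"


def split_records(records, train_fraction=0.9, seed=42):
    """Shuffle deterministically, then split into (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


# Toy records standing in for the real screenshot-text pairs.
records = [
    {"image": f"shot_{i}.jpg", "text": f"event {i}",
     "bbox": [0, 0, 10, 10] if i % 2 == 0 else None}
    for i in range(20)
]
for r in records:
    r["task"] = classify_record(r)

train, val = split_records(records)
train_jsonl = to_jsonl(train)
```

Fixing the shuffle seed keeps the training and validation sets stable across runs, which makes validation losses comparable between experiments.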

Training Setup

Training runs on an NVIDIA A40 GPU. It uses PyTorch Lightning, Transformers, and Accelerate, with the AdamW optimizer.

I used this guide (https://arxiv.org/html/2408.13296v1#Ch7.S3) to put together this training setup.

    training_config = {
        "max_epochs": 10,              # Train for 10 full passes over the dataset
        "batch_size": 1,               # Kept small due to model size: one sample per step
        "lr": 2e-4,                    # Learning rate of 0.0002 (scales how much the weights change per update)
        "check_val_every_n_epoch": 2,  # Validate every 2 epochs to monitor progress
        "gradient_clip_val": 1.0,      # Clip gradients at 1.0 to prevent extreme updates that could destabilize training
        "accumulate_grad_batches": 8,  # Accumulate gradients over 8 steps (effective batch size 8)
        "warmup_steps": 50,            # Learning rate ramps up over the first 50 steps
    }

In each step, the model processes one image-text pair, computes the loss, and calculates gradients. Instead of triggering an immediate update, these gradients are added to those from the next 7 steps. After the 8th step, the optimizer applies a single weight update, simulating a batch size of 8. Gradients from individual samples can be noisy and lead to unstable training. Accumulating over 8 steps averages out this noise, approximating the stability of a larger batch without holding 8 samples in memory at once.
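
The equivalence described above can be checked on a toy problem. The sketch below uses a one-parameter model `y = w * x` with squared-error loss and a plain gradient-descent update in place of AdamW, purely to keep it short. Scaling each per-sample gradient by 1/8 is an assumption of the sketch, chosen so that the accumulated gradient matches the mean over the batch.

```python
import math


def sample_grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_grad(w, samples, accumulate_grad_batches=8):
    """Add up per-sample gradients one at a time, each scaled by 1/N."""
    total = 0.0
    for x, y in samples[:accumulate_grad_batches]:
        total += sample_grad(w, x, y) / accumulate_grad_batches
    return total


def full_batch_grad(w, samples):
    """Gradient of the mean squared error over the whole batch at once."""
    return sum(sample_grad(w, x, y) for x, y in samples) / len(samples)


# Eight toy (x, y) pairs, processed one per step as with batch_size = 1.
samples = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w = 0.5

g_acc = accumulated_grad(w, samples)
g_full = full_batch_grad(w, samples)
assert math.isclose(g_acc, g_full, rel_tol=1e-9)

# A single weight update is applied only after the 8th sample.
lr = 2e-4
w_new = w - lr * g_acc
```

Only one sample's activations need to be held in memory at a time, yet the resulting update equals the one a true batch of 8 would produce.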