Released capabilities

#42
by ludeksvoboda

Hi,
As I understand it, the released model is not capable of OCR, bbox_to_text, and text_to_bbox, correct?
Are there any resources on how to go about fine-tuning the model for this?
Nice work and thank you!

Hi @ludeksvoboda , with the recent transformers release (run a pip install --upgrade transformers) the model should be! Given bbox coordinates, it will perform OCR within that bbox.

from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor

pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')


# bbox_to_text: given a bounding box, the model performs OCR on its contents
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n<box>388, 428, 404, 488</box>"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')


# generate a short completion and keep only the newly generated tokens
generated_tokens = model.generate(**model_inputs, max_new_tokens=10)[:, -10:]
model_outputs = processor.batch_decode(generated_tokens, skip_special_tokens=True)[0]
# the answer follows the \x04 "beginning of answer" character
prediction = model_outputs.split('\x04 ', 1)[1] if '\x04' in model_outputs else ''

This should output Williams, the text contained within the given coordinates. text_to_bbox should work as well, with processor.post_process_box_coordinates. Have fun!
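For the reverse direction, a minimal sketch reusing the processor, model, and bbox_image_pil from above (the exact prompt wording and the max_new_tokens value are illustrative assumptions, not a fixed recipe):

# text_to_bbox: given a piece of text, ask the model for its bounding box (illustrative sketch)
text_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams"
model_inputs = processor(text=text_prompt, images=bbox_image_pil).to('cuda')

generated_tokens = model.generate(**model_inputs, max_new_tokens=20)[:, -20:]
# convert the model's raw box tokens back into image-space <box> coordinates
generated_tokens = processor.post_process_box_coordinates(generated_tokens)
print(processor.batch_decode(generated_tokens, skip_special_tokens=True)[0])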

That is absolutely awesome @Molbap !
Thank you for the comprehensive reply!

Hi, nice work!
I wonder how to use text_to_bbox to locate items. I tried:

bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n 561 Dillman"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')

model_outputs = model.generate(**model_inputs, max_new_tokens=20)[:, -20:]
model_outputs = processor.post_process_box_coordinates(model_outputs)
model_outputs = processor.batch_decode(model_outputs, skip_special_tokens=True)[0]
print(model_outputs)

And it outputs "text, generate the corresponding bounding box.\n Williams<box>388, 428, 404, 900</box>". Is this the right way to use it?

@cckevinn try having a look at this: https://huggingface.co/adept/fuyu-8b/discussions/38 , but essentially I think you have it correct.
I have tried the linked solution and it works somewhat on the resized image (at 1/2 of the original size); it very likely does even better on the full-sized image. I also tried cropping the test image so it contains only the part filled with text (removing the white space on both sides), and then it fails to generate any bbox: I either get an empty string or part of some text. I think the model has problems with different image sizes.
The only thing I had to tinker with was permuting the coordinates for plotting.

import numpy as np
from PIL import Image, ImageDraw

def permute_bbox(bbox):
    # swap from the model's (y1, x1, y2, x2) order to PIL's (x1, y1, x2, y2)
    return (bbox[1], bbox[0], bbox[3], bbox[2])

def plot_bbox(img, bbox):
    """Simplest way to plot a bounding box on the image."""
    if isinstance(img, np.ndarray):
        img = Image.fromarray(img)
    draw = ImageDraw.Draw(img)
    draw.rectangle(bbox, outline='red')
    return img
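
For completeness, a hedged usage sketch tying this to the generated output (the decoded string, the regex, and the output filename are illustrative; bbox_image_pil comes from the earlier snippets):

import re

# example decoded output from the text_to_bbox prompt above (illustrative)
decoded = "Williams<box>388, 428, 404, 900</box>"
match = re.search(r"<box>([\d\s,]+)</box>", decoded)
if match:
    bbox = tuple(int(c) for c in match.group(1).split(","))
    plot_bbox(bbox_image_pil, permute_bbox(bbox)).save("bbox_preview.png")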

The bounding box tasks are very sensitive to input resolution, because the model was trained on screenshots with a height of 1080 and not fine-tuned on other content. The best way to use these features is to scale your input image so its height is close to 1080. If the image is smaller, then padding to 1920x1080 works well.

This is the strategy used in the demo for this task; these are the lines that rescale and pad so the input to the model is always 1920x1080: https://huggingface.co/spaces/adept/fuyu-8b-demo/blob/main/app.py#L71-L72
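
A minimal sketch of that rescale-and-pad step (the white fill colour and top-left alignment are assumptions based on the description above, not the exact demo code):

from PIL import Image

def resize_and_pad(img, target_width=1920, target_height=1080, fill="white"):
    # scale so the image fits inside 1920x1080 while preserving aspect ratio
    scale = min(target_width / img.width, target_height / img.height)
    resized = img.resize((int(img.width * scale), int(img.height * scale)))
    # paste onto a blank 1920x1080 canvas
    canvas = Image.new("RGB", (target_width, target_height), fill)
    canvas.paste(resized, (0, 0))
    return canvas

padded_image = resize_and_pad(bbox_image_pil)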

@pcuenq Oh, thank you for the clarification!
