brails.processors.vlm_image_classifier.clip.clip module

brails.processors.vlm_image_classifier.clip.clip.available_models()

Returns the names of available CLIP models
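
A minimal usage sketch (assuming the module is importable from the path shown above; the exact names returned depend on the CLIP release bundled with BRAILS):

    from brails.processors.vlm_image_classifier.clip import clip

    # Print the list of model names that can be passed to clip.load()
    print(clip.available_models())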

brails.processors.vlm_image_classifier.clip.clip.load(name, device='cuda', jit=False, download_root=None)

Load a CLIP model

Parameters:
  • name (str) – A model name listed by clip.available_models(), or the path to a model checkpoint containing the state_dict

  • device (Union[str, torch.device]) – The device on which to place the loaded model

  • jit (bool) – Whether to load the optimized JIT model or the more hackable non-JIT model (default).

  • download_root (str) – Path to which the model files are downloaded; defaults to "~/.cache/clip"

Returns:

  • model (torch.nn.Module) – The CLIP model

  • preprocess (Callable[[PIL.Image], torch.Tensor]) – A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
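
A minimal usage sketch (assuming "ViT-B/32" is among the names returned by clip.available_models(), that the loaded model exposes the standard CLIP encode_image method, and that "house.png" is a placeholder image path):

    import torch
    from PIL import Image
    from brails.processors.vlm_image_classifier.clip import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # preprocess converts a PIL image into the tensor format the model expects
    image = preprocess(Image.open("house.png")).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)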

brails.processors.vlm_image_classifier.clip.clip.tokenize(texts, context_length=77, truncate=False)

Returns the tokenized representation of the given input string(s)

Parameters:
  • texts (Union[str, List[str]]) – An input string or a list of input strings to tokenize

  • context_length (int) – The context length to use; all CLIP models use 77 as the context length

  • truncate (bool) – Whether to truncate the text in case its encoding is longer than the context length

Returns:

  • A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].

  • A LongTensor is returned when the torch version is < 1.8.0, since the older index_select requires indices to be long.
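
A minimal usage sketch (assuming "ViT-B/32" is an available model name and that the loaded model exposes the standard CLIP encode_text method):

    import torch
    from brails.processors.vlm_image_classifier.clip import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    # Tokenize two prompts; the result has shape [2, 77]
    tokens = clip.tokenize(["a photo of a house", "a photo of a bridge"]).to(device)

    with torch.no_grad():
        text_features = model.encode_text(tokens)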