brails.processors.vlm_image_classifier.clip.clip module

brails.processors.vlm_image_classifier.clip.clip.available_models()

Returns the names of available CLIP models
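
A minimal usage sketch (assuming the module is importable from the path shown above; the exact names returned depend on the CLIP release bundled with BRAILS):

    from brails.processors.vlm_image_classifier.clip import clip

    # Print the list of model names that can be passed to clip.load()
    print(clip.available_models())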

brails.processors.vlm_image_classifier.clip.clip.load(name, device='cuda', jit=False, download_root=None)

Load a CLIP model

Parameters:
  • name (str) – A model name listed by clip.available_models(), or the path to a model checkpoint containing the state_dict

  • device (Union[str, torch.device]) – The device on which to place the loaded model

  • jit (bool) – Whether to load the optimized JIT model or the more hackable non-JIT model (default).

  • download_root (str) – Path to which the model files are downloaded; defaults to "~/.cache/clip"

Returns:

  • model (torch.nn.Module) – The CLIP model

  • preprocess (Callable[[PIL.Image], torch.Tensor]) – A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
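
A minimal usage sketch (assuming "ViT-B/32" is among the names returned by clip.available_models(), that the loaded model exposes the standard CLIP encode_image method, and that "house.png" is a placeholder image path):

    import torch
    from PIL import Image
    from brails.processors.vlm_image_classifier.clip import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # preprocess converts a PIL image into the tensor format the model expects
    image = preprocess(Image.open("house.png")).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)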

brails.processors.vlm_image_classifier.clip.clip.tokenize(texts, context_length=77, truncate=False)

Returns the tokenized representation of the given input string(s)

Parameters:
  • texts (Union[str, List[str]]) – An input string or a list of input strings to tokenize

  • context_length (int) – The context length to use; all CLIP models use 77 as the context length

  • truncate (bool) – Whether to truncate the text in case its encoding is longer than the context length

Returns:

  • A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].

  • A LongTensor is returned when the torch version is < 1.8.0, since the older index_select requires indices to be long.
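
A minimal usage sketch (assuming "ViT-B/32" is an available model name and that the loaded model exposes the standard CLIP encode_text method):

    import torch
    from brails.processors.vlm_image_classifier.clip import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    # Tokenize two prompts; the result has shape [2, 77]
    tokens = clip.tokenize(["a photo of a house", "a photo of a bridge"]).to(device)

    with torch.no_grad():
        text_features = model.encode_text(tokens)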