brails.processors.vlm_segmenter.segment_anything.predictor module
- class brails.processors.vlm_segmenter.segment_anything.predictor.SamPredictor(sam_model: Sam)
Bases: object
- property device: device
- get_image_embedding() Tensor
Returns the image embeddings for the currently set image, with shape 1xCxHxW, where C is the embedding dimension and (H,W) are the spatial dimensions of the SAM embedding (typically C=256, H=W=64).
- predict(point_coords: ndarray | None = None, point_labels: ndarray | None = None, box: ndarray | None = None, mask_input: ndarray | None = None, multimask_output: bool = True, return_logits: bool = False, hq_token_only: bool = False) Tuple[ndarray, ndarray, ndarray]
Predict masks for the given input prompts, using the currently set image.
- Arguments:
- point_coords (np.ndarray or None): An Nx2 array of point prompts to the
model. Each point is given as (X,Y) in pixels.
- point_labels (np.ndarray or None): A length N array of labels for the
point prompts. 1 indicates a foreground point and 0 indicates a background point.
- box (np.ndarray or None): A length 4 array giving a box prompt to the
model, in XYXY format.
- mask_input (np.ndarray or None): A low resolution mask input to the model, typically
coming from a previous prediction iteration. Has form 1xHxW, where for SAM, H=W=256.
- multimask_output (bool): If true, the model will return three masks.
For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results.
- return_logits (bool): If true, returns unthresholded mask logits
instead of a binary mask.
- Returns:
- (np.ndarray): The output masks in CxHxW format, where C is the
number of masks, and (H, W) is the original image size.
- (np.ndarray): An array of length C containing the model’s
predictions for the quality of each mask.
- (np.ndarray): An array of shape CxHxW, where C is the number
of masks and H=W=256. These low resolution logits can be passed to a subsequent iteration as mask input.
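The arguments and return values above suggest a two-pass workflow: request three candidate masks, keep the one with the highest quality score, then pass its low resolution logits back as mask_input. The following is a minimal sketch of that idea, not the library's own recipe. The helper names select_best_mask and predict_with_refinement are hypothetical, and predictor is assumed to be a SamPredictor on which set_image has already been called.

```python
import numpy as np


def select_best_mask(masks, scores, low_res_logits):
    """Pick the candidate with the highest predicted quality score.

    masks: CxHxW, scores: length C, low_res_logits: CxHxW (H=W=256 for SAM).
    Returns the best mask, its score, and its logits reshaped to 1xHxW,
    which is the form expected by the mask_input argument.
    """
    best = int(np.argmax(scores))
    return masks[best], float(scores[best]), low_res_logits[best][None, :, :]


def predict_with_refinement(predictor, point_coords, point_labels):
    """Hypothetical helper: ambiguous first pass, then one refined mask."""
    point_coords = np.asarray(point_coords, dtype=np.float32)
    point_labels = np.asarray(point_labels, dtype=np.int64)
    if point_coords.ndim != 2 or point_coords.shape[1] != 2:
        raise ValueError("point_coords must have shape Nx2")
    if point_labels.shape != (point_coords.shape[0],):
        raise ValueError("point_labels must have length N")

    # First pass: ask for three candidates and keep the highest-scoring one.
    masks, scores, logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )
    _, _, mask_input = select_best_mask(masks, scores, logits)

    # Second pass: feed the low resolution logits back in as mask_input
    # and request a single mask.
    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        mask_input=mask_input,
        multimask_output=False,
    )
    return masks[0], float(scores[0])
```

A call might look like predict_with_refinement(predictor, [[500, 375]], [1]) for a single foreground click; the coordinates here are placeholders.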
- predict_torch(point_coords: Tensor | None, point_labels: Tensor | None, boxes: Tensor | None = None, mask_input: Tensor | None = None, multimask_output: bool = True, return_logits: bool = False, hq_token_only: bool = False) Tuple[Tensor, Tensor, Tensor]
Predict masks for the given input prompts, using the currently set image. Input prompts are batched torch tensors and are expected to already be transformed to the input frame using ResizeLongestSide.
- Arguments:
- point_coords (torch.Tensor or None): A BxNx2 array of point prompts to the
model. Each point is given as (X,Y) in pixels.
- point_labels (torch.Tensor or None): A BxN array of labels for the
point prompts. 1 indicates a foreground point and 0 indicates a background point.
- boxes (torch.Tensor or None): A Bx4 array giving a box prompt to the
model, in XYXY format.
- mask_input (torch.Tensor or None): A low resolution mask input to the model, typically
coming from a previous prediction iteration. Has form Bx1xHxW, where for SAM, H=W=256. Masks returned by a previous iteration of the predict method do not need further transformation.
- multimask_output (bool): If true, the model will return three masks.
For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model’s predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results.
- return_logits (bool): If true, returns unthresholded mask logits
instead of a binary mask.
- Returns:
- (torch.Tensor): The output masks in BxCxHxW format, where C is the
number of masks, and (H, W) is the original image size.
- (torch.Tensor): An array of shape BxC containing the model’s
predictions for the quality of each mask.
- (torch.Tensor): An array of shape BxCxHxW, where C is the number
of masks and H=W=256. These low resolution logits can be passed to a subsequent iteration as mask input.
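To illustrate what "already transformed to the input frame" means for prompts, the sketch below scales pixel coordinates from the original image to a frame whose longest side has a fixed length. It is written in NumPy for clarity and is only an approximation of such a transform. The functions resized_shape and to_input_frame are hypothetical, and the longest-side length of 1024 and the rounding rule are assumptions. In practice the ResizeLongestSide transform should be used, and the results converted to torch tensors on the predictor's device before calling predict_torch.

```python
import numpy as np


def resized_shape(original_size, long_side=1024):
    """Return the (H, W) obtained by scaling the longest side to long_side.

    Assumption: long_side=1024 and round-half-up via int(x + 0.5).
    """
    old_h, old_w = original_size
    scale = long_side / max(old_h, old_w)
    return int(old_h * scale + 0.5), int(old_w * scale + 0.5)


def to_input_frame(coords, original_size, long_side=1024):
    """Scale (X, Y) pixel coordinates into the resized frame.

    coords has shape (..., 2): BxNx2 points directly, or XYXY boxes
    after reshaping Bx4 to Bx2x2.
    """
    old_h, old_w = original_size
    new_h, new_w = resized_shape(original_size, long_side)
    out = np.array(coords, dtype=np.float32)  # copy, leave the input untouched
    out[..., 0] *= new_w / old_w  # X scales with width
    out[..., 1] *= new_h / old_h  # Y scales with height
    return out


# A 480x640 (H, W) image with two prompts of one point each: B=2, N=1.
points = np.array([[[320, 240]], [[0, 0]]])                  # B x N x 2
labels = np.ones(points.shape[:2], dtype=np.int64)           # B x N, all foreground
boxes = np.array([[0, 0, 640, 480], [160, 120, 480, 360]])   # B x 4, XYXY

points_in = to_input_frame(points, (480, 640))
boxes_in = to_input_frame(boxes.reshape(-1, 2, 2), (480, 640)).reshape(-1, 4)
```

The labels array needs no transformation; only coordinates and boxes are scaled.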
- reset_image() None
Resets the currently set image.
- set_image(image: ndarray, image_format: str = 'RGB') None
Calculates the image embeddings for the provided image, allowing masks to be predicted with the ‘predict’ method.
- Arguments:
- image (np.ndarray): The image for calculating masks. Expects an
image in HWC uint8 format, with pixel values in [0, 255].
- image_format (str): The color format of the image, in [‘RGB’, ‘BGR’].
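Since set_image expects an HWC uint8 image, it can help to check the array before passing it in. The sketch below shows one way to do that; as_hwc_uint8 and bgr_to_rgb are hypothetical helpers, and the handling of float images assumes their values lie in [0, 1].

```python
import numpy as np


def as_hwc_uint8(image):
    """Return an HxWx3 uint8 array, converting float images if needed."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an HxWx3 (HWC) image")
    if image.dtype == np.uint8:
        return image
    if np.issubdtype(image.dtype, np.floating):
        # Assumption: float images are scaled to [0, 1].
        return np.clip(np.rint(image * 255.0), 0, 255).astype(np.uint8)
    raise TypeError(f"unsupported dtype: {image.dtype}")


def bgr_to_rgb(image):
    """Reverse the channel axis, e.g. for images loaded with OpenCV."""
    return np.ascontiguousarray(image[..., ::-1])


# A tiny float image in [0, 1] converted to the expected uint8 format.
float_image = np.zeros((2, 3, 3), dtype=np.float32)
float_image[0, 0] = [1.0, 0.0, 1.0]
ready = as_hwc_uint8(float_image)
```

Images loaded with OpenCV are in BGR order, so either pass image_format='BGR' to set_image or reverse the channels first and keep the default 'RGB'.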
- set_torch_image(transformed_image: Tensor, original_image_size: Tuple[int, ...]) None
Calculates the image embeddings for the provided image, allowing masks to be predicted with the ‘predict’ method. Expects the input image to be already transformed to the format expected by the model.
- Arguments:
- transformed_image (torch.Tensor): The input image, with shape
1x3xHxW, which has been transformed with ResizeLongestSide.
- original_image_size (tuple(int, int)): The size of the image
before transformation, in (H, W) format.
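The two arguments of set_torch_image describe the same image in different frames: transformed_image uses the 1x3xHxW layout after resizing, while original_image_size records (H, W) before resizing. The NumPy sketch below illustrates only the layout change and the bookkeeping. The helper hwc_to_batched_chw is hypothetical, and the zero-filled resized array is a stand-in for the output of ResizeLongestSide; the result would still need to be converted to a torch.Tensor before being passed to the method.

```python
import numpy as np


def hwc_to_batched_chw(resized_image):
    """Rearrange an HxWx3 array into the 1x3xHxW layout."""
    if resized_image.ndim != 3 or resized_image.shape[2] != 3:
        raise ValueError("expected an HxWx3 (HWC) image")
    chw = np.ascontiguousarray(resized_image.transpose(2, 0, 1))
    return chw[None, ...]  # add the leading batch dimension


original = np.zeros((480, 640, 3), dtype=np.uint8)
original_image_size = original.shape[:2]  # (H, W) before transformation

# Stand-in for the resized image; the 768x1024 size is an assumption.
resized = np.zeros((768, 1024, 3), dtype=np.uint8)
transformed = hwc_to_batched_chw(resized)
```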