Upgrading to full-body recognition
In Part 1 of this series we covered the essentials of face encoding and recognition and hinted at a body-level upgrade. In this post we deliver more than promised: we add TorchReID for full-body person recognition and pair it with a hardened face encoder that tolerates real-world images (odd color channels, unusual strides, blur, small boxes). The result is a less basic but more reliable two-signal system that can identify people even when faces are small, averted, or occluded.
Why body ReID?
Faces aren’t always usable, so body re-identification (Re-ID) adds a second, complementary signal based on body shape, clothing, and silhouette. FastReID is an excellent research toolbox, but we chose TorchReID, which ships the OSNet convolutional neural network (CNN) architecture introduced in 2019. Since then, a family of OSNet models has been developed that is well suited to person re-identification. We picked osnet_ain_x1_0 and built a wrapper (TorchreidBodyExtractor) around it that hides the tricky bits (preprocessing, model loading, L2 normalization), so you can focus on using the features instead of babysitting them.
Why a robust face encoder?
Standard pipelines break on everyday pitfalls: non-contiguous arrays, BGR or RGBA inputs where RGB is expected, overly blurry faces, or detectors that need a second pass. That’s why our encoder normalizes the input, escalates from the histogram of oriented gradients (HOG) detector to the CNN detector when needed, aligns chips to a canonical frame, and (optionally) returns embeddings, so that face recognition is consistent and auditable.
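To make the "non-contiguous array" pitfall concrete, here is a minimal sketch using only NumPy and a synthetic image (the array contents are made up for illustration). It shows how an innocent channel flip produces a strided view, and how copying into a fresh buffer restores the dense layout that native libraries generally expect.

```python
import numpy as np

# A synthetic 4x6 RGB image stored contiguously in memory.
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Reversing the channel axis (a common BGR -> RGB trick) returns a view
# with a negative stride, so the result is no longer C-contiguous.
flipped = img[..., ::-1]

# np.ascontiguousarray copies the data into a fresh, densely packed buffer.
fixed = np.ascontiguousarray(flipped)

print(flipped.flags.c_contiguous, flipped.strides)
print(fixed.flags.c_contiguous, fixed.strides)
```

The pixel values are identical in both arrays; only the memory layout differs. The face encoder further down goes one step further and re-encodes the image to obtain a fresh buffer.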
Before moving on to the main topic, here is the source code for our two encoders.
TorchreidBodyExtractor (consistent L2-normalized features)
The source code provides a Torchreid body feature extractor with consistent L2-normalized outputs:
- picks CPU/GPU automatically;
- applies the right 256×128 resize & ImageNet normalization;
- returns L2-normalized float32 embeddings;
- exposes __call__, extract_batch, and extract_paths, plus cosine helpers.
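As a small illustration of why the extractor returns L2-normalized vectors, the sketch below uses made-up 3-D vectors instead of real embeddings. The helper name l2_normalize is ours and not part of TorchReID. Once vectors have unit length, cosine similarity is a plain dot product, and Euclidean distance carries the same ranking information.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    v = np.asarray(v, dtype=np.float32)
    return v / max(float(np.linalg.norm(v)), eps)

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([30.0, 40.0, 5.0])  # similar direction, much larger magnitude

cos = float(np.dot(a, b))            # cosine similarity is a plain dot product
dist = float(np.linalg.norm(a - b))  # Euclidean distance between unit vectors

# For unit vectors dist^2 == 2 - 2*cos, so both measures rank matches identically.
print(cos, dist, 2.0 - 2.0 * cos)
```

This is why the cosine helper in the wrapper can get away with a bare np.dot.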
L2 normalization scales each feature vector to unit Euclidean length. The details need not concern us here, but the step matters: it keeps the magnitude of the feature vectors consistent, which is crucial for effective similarity computation in person re-identification tasks. To see what the network actually looks at when it builds those features, you can visualize an activation map. Given an input image, the activation map shows where the CNN focuses when extracting features. It is computed by taking the sum of absolute-valued feature maps along the channel dimension, followed by a spatial L2 normalization. An example obtained by OSNet is shown below. Image regions with warmer colors have higher activation values and contribute the most to the final feature vector, whereas regions with colder colors are likely to be less important or less reliable for Re-ID. (See: Torchreid: A Library for Deep Learning Person Re-Identification in Pytorch, Kaiyang Zhou, Tao Xiang, University of Surrey)
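The activation map recipe quoted above (channel-wise sum of absolute values, then spatial L2 normalization) can be sketched in a few lines. This is only an illustration on random data: the feature maps here are synthetic stand-ins, and extracting real ones from OSNet is not shown.

```python
import numpy as np

def activation_map(feature_maps, eps=1e-12):
    """feature_maps: array of shape (C, H, W) taken from a convolutional layer."""
    fm = np.asarray(feature_maps, dtype=np.float32)
    amap = np.abs(fm).sum(axis=0)  # sum of absolute values over channels -> (H, W)
    # Spatial L2 normalization: divide by the L2 norm over all H*W positions.
    return amap / max(float(np.linalg.norm(amap)), eps)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 16, 8))  # synthetic stand-in for real feature maps
fmap[:, 4:8, 2:6] *= 5.0            # pretend this region responds strongly
amap = activation_map(fmap)
row, col = np.unravel_index(np.argmax(amap), amap.shape)
print(amap.shape, (row, col))
```

Rendering amap with a color map and overlaying it on the input image gives the warm/cold picture described above.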

Since it was first published a little over a fortnight ago, a more polished version of this wrapper has been developed. It imports the TorchReID models from either the previous or the current package layout (with instructions for building TorchReID yourself) while silencing some annoying console messages. Inline documentation explaining parameters and return values has also been added.
# Torchreid body feature extractor with consistent L2-normalized outputs,
# with adjusted import/init to suppress noisy log prints from TorchReID.
from __future__ import annotations
from typing import Iterable, List, Optional, Callable
import contextlib
import importlib
import importlib.util
import io
import numpy as np
import torch
from torch.utils.data import DataLoader
from PIL import Image
from torchvision import transforms
# Import the TorchReID models module from either the current or the legacy layout,
# while silencing the console messages TorchReID prints on import.
def _quiet_import_torchreid_models(quiet: bool = True, log_fn: Optional[Callable[[str], None]] = None):
    def capture_import(modname: str):
        if not quiet:
            return importlib.import_module(modname)
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
            mod = importlib.import_module(modname)
        msg = buf.getvalue().strip()
        if msg and log_fn:
            log_fn(f"[ReID] {msg.splitlines()[-1]}")
        return mod

    # Ensure base package exists
    if importlib.util.find_spec("torchreid") is None:
        raise ImportError(
            "TorchReID not found in this Python environment. Install with:\n"
            " py -m pip install torch torchvision\n"
            " py -m pip install \"git+https://github.com/KaiyangZhou/deep-person-reid.git\"\n"
            "note: be sure MSVC buildtools are installed!"
        )

    # Try modern then legacy module paths
    candidates = ["torchreid.models", "torchreid.reid.models"]
    for name in candidates:
        try:
            # find_spec imports the parent package and raises if it is missing,
            # so it has to sit inside the try block as well.
            if importlib.util.find_spec(name) is None:
                continue
            if name != "torchreid.models" and log_fn:
                log_fn(f"[ReID] Using fallback module path: {name}")
            return capture_import(name)
        except Exception:
            pass  # try next

    # Last resort: import base and look for attribute
    pkg = capture_import("torchreid")
    for attr_path in ("models", "reid.models"):
        obj = pkg
        ok = True
        for part in attr_path.split("."):
            obj = getattr(obj, part, None)
            if obj is None:
                ok = False
                break
        if ok:
            if log_fn:
                log_fn(f"[ReID] Using models via torchreid.{attr_path}")
            return obj

    raise ImportError(
        "Could not import TorchReID models (tried torchreid.models and torchreid.reid.models). "
        "Reinstall from GitHub to get the canonical layout:\n"
        " py -m pip uninstall -y torchreid\n"
        " py -m pip install \"git+https://github.com/KaiyangZhou/deep-person-reid.git\"\n"
        "note: be sure MSVC buildtools are installed!"
    )

class TorchreidBodyExtractor:
    """
    Thin wrapper around a TorchReID backbone to produce L2-normalized 1-D float32 embeddings.

    __call__(PIL.Image) -> (D,) float32
    extract_batch(Iterable[PIL.Image], batch_size=32) -> (N, D) float32
    extract_paths(Iterable[str], batch_size=32) -> (N, D) float32

    Parameters:
        model_name : str  e.g., "osnet_ain_x1_0" (CUDA/CPU) or "osnet_x1_0" (good for DirectML).
        device : Optional[str]  "cuda:0", "cpu", etc. If None, picks automatically.
        height, width : int  Resize for ReID models (default 256x128).
        use_inference_mode : bool  Use torch.inference_mode() when available (otherwise no_grad()).
        quiet : bool  Capture stdout/stderr during TorchReID import + model construction to silence prints.
        log_fn : Optional[Callable[[str], None]]  If provided, receives a short one-line summary of any captured output.
    """

    def __init__(self, model_name: str = "osnet_ain_x1_0", device: Optional[str] = None, height: int = 256, width: int = 128,
                 use_inference_mode: bool = True, quiet: bool = True, log_fn: Optional[Callable[[str], None]] = None, ) -> None:
        self.model_name = model_name
        self.use_inference_mode = bool(use_inference_mode)
        self.quiet = bool(quiet)
        self._log_fn = log_fn
        if device is None:
            device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        # --- Quiet import of torchreid.models
        models = _quiet_import_torchreid_models(self.quiet, self._log_fn)
        # --- Quiet model construction
        if self.quiet:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
                self._build_model(models, height, width)
            msg = buf.getvalue().strip()
            if msg and self._log_fn:
                self._log_fn(f"[ReID] {msg.splitlines()[-1]}")
        else:
            self._build_model(models, height, width)

    # Internal: build model + transform
    def _build_model(self, models_module, height: int, width: int) -> None:
        if self.model_name not in models_module.__dict__:
            raise ValueError(f"Unknown model '{self.model_name}'. Available: {sorted(models_module.__dict__.keys())}")
        self.model = models_module.__dict__[self.model_name](pretrained=True)  # downloads weights if needed
        self.model.eval().to(self.device)
        # Preprocess consistent with TorchReID
        self.transform = transforms.Compose([
            transforms.Resize((height, width), interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    # -----------------------------
    # Public API
    # -----------------------------
    # Extract a single body embedding. Returns shape (D,) float32, L2-normalized.
    def __call__(self, image: Image.Image) -> np.ndarray:
        t = self.transform(image.convert("RGB")).unsqueeze(0).to(self.device, non_blocking=True)
        ctx = torch.inference_mode() if self.use_inference_mode and hasattr(torch, "inference_mode") else torch.no_grad()
        with ctx:
            feat = self.model(t)  # [1, D]
            feat = torch.nn.functional.normalize(feat, dim=1)
        return feat.squeeze(0).detach().cpu().numpy().astype("float32")

    # Vectorized extraction from a list/iterable of PIL images → (N, D) float32, L2-normalized.
    def extract_batch(self, images: Iterable[Image.Image], batch_size: int = 32, num_workers: int = 0) -> np.ndarray:
        tensors = []
        for im in images:
            if im is None:
                continue
            tensors.append(self.transform(im.convert("RGB")))
        if not tensors:
            return np.zeros((0, 0), dtype=np.float32)
        loader = DataLoader(
            tensors,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
            pin_memory=(self.device.type == "cuda"),
            collate_fn=lambda x: torch.stack(x, dim=0),
        )
        ctx = torch.inference_mode() if self.use_inference_mode and hasattr(torch, "inference_mode") else torch.no_grad()
        feats: List[torch.Tensor] = []
        with ctx:
            for batch in loader:
                batch = batch.to(self.device, non_blocking=True)
                f = self.model(batch)  # [B, D]
                f = torch.nn.functional.normalize(f, dim=1)
                feats.append(f.detach().cpu())
        feats = torch.cat(feats, dim=0).numpy().astype("float32", copy=False)
        return feats

    # Read images from paths and call extract_batch. Unreadable files are skipped silently.
    def extract_paths(self, paths: Iterable[str], batch_size: int = 32, num_workers: int = 0) -> np.ndarray:
        imgs: List[Image.Image] = []
        for p in paths:
            try:
                with Image.open(p) as im:
                    imgs.append(im.convert("RGB").copy())
            except Exception:
                continue
        return self.extract_batch(imgs, batch_size=batch_size, num_workers=num_workers)

    # Convenience for cosine similarity (on already L2-normalized features)
    @staticmethod
    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b))

    @staticmethod
    def euclid_dist(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.linalg.norm(a - b, ord=2))

Face Encoder
The face encoder and aligner provides code that:
- uses robust Dlib input, providing HOG→CNN fallback;
- normalizes images into a Dlib-friendly RGB buffer (and thus fixes Windows stride, ownership quirks);
- filters blurry faces via Laplacian variance;
- can directly compute a face embedding per aligned chip.
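The blur filter in the list above relies on a simple idea: sharp images have strong second derivatives, so the variance of the Laplacian response is high, while blur flattens it. The sketch below re-implements that measure in plain NumPy on a synthetic checkerboard. It is an illustration, not the encoder's code, which uses cv2.Laplacian; border handling differs, so the absolute numbers will not match OpenCV's.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian response; low values suggest blur."""
    g = np.asarray(gray, dtype=np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def box_blur(gray):
    """3x3 mean filter on the interior pixels, used here only to simulate blur."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    acc = sum(g[dy:h - 2 + dy, dx:w - 2 + dx] for dy in range(3) for dx in range(3))
    return acc / 9.0

# Synthetic checkerboard with 4-pixel squares: lots of sharp edges.
yy, xx = np.indices((64, 64))
sharp = (((yy // 4) + (xx // 4)) % 2) * 255.0
blurred = box_blur(sharp)

print(laplacian_variance(sharp), laplacian_variance(blurred))
```

The blurred copy scores clearly lower than the sharp one, which is exactly the property the threshold exploits. Suitable threshold values depend on image size and content, so treat any fixed number as a starting point.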
The encoder combines two detectors: one based on histograms of oriented gradients (HOG) and one based on a convolutional neural network (CNN). HOG captures local shape and edge information by analyzing gradient orientations in localized image regions, which is particularly useful for modeling facial structures, and it is fast on a CPU. The CNN detector learns hierarchical representations and tends to find faces the HOG detector misses, at a higher computational cost. Some research pipelines go further and feed HOG features, represented as an n×n matrix, into a CNN; our encoder does not fuse the two in that way. It tries HOG first and escalates to the CNN detector only when HOG finds nothing, which combines the strengths of both methods. See the image below for an example of the use of HOG.

from __future__ import annotations
import math
import warnings
from io import BytesIO
from typing import Any, Dict, List, Tuple, Optional

# silence deprecation noise from face_recognition_models (must run before the import below)
warnings.filterwarnings(
    "ignore",
    category=UserWarning,
    message=r"pkg_resources is deprecated as an API.*",
    module=r"face_recognition_models(\.|$)"
)

import cv2
import numpy as np
import face_recognition
from PIL import Image

# Produce a *fresh*, C-contiguous, writeable uint8 RGB array (H,W,3) by reloading via face_recognition.load_image_file.
# This sidesteps all stride/ownership weirdness that can upset dlib on Windows.
def _fr_ready_rgb(img: Any) -> np.ndarray:
    if hasattr(img, "mode"):  # PIL.Image
        pil = img.convert("RGB")
    else:
        arr = np.asarray(img)
        if arr.ndim == 2:
            pil = Image.fromarray(arr, mode="L").convert("RGB")
        elif arr.ndim == 3 and arr.shape[2] == 3:
            # Heuristic: if the first channel is clearly brighter than the third,
            # assume BGR and flip to RGB first
            if float(arr[..., 0].mean() or 0) > 1.1 * float(arr[..., 2].mean() or 1e-6):
                arr = cv2.cvtColor(arr, cv2.COLOR_BGR2RGB)
            pil = Image.fromarray(arr, mode="RGB")
        elif arr.ndim == 3 and arr.shape[2] == 4:
            # 4-channel input is assumed to be BGRA (OpenCV order)
            try:
                arr = cv2.cvtColor(arr, cv2.COLOR_BGRA2RGB)
            except Exception:
                arr = arr[..., :3]
            pil = Image.fromarray(arr, mode="RGB")
        else:
            raise RuntimeError(f"Unsupported input for dlib: shape={arr.shape}")
    buf = BytesIO()
    pil.save(buf, format="PNG")  # lossless, fast
    buf.seek(0)
    arr = face_recognition.load_image_file(buf)  # -> uint8 RGB (H,W,3), contiguous
    return np.require(arr, dtype=np.uint8, requirements=["C", "O", "W"])


# Blur measure: variance of the Laplacian (low = blurry)
def _lap_var(img_rgb: np.ndarray) -> float:
    g = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    return float(cv2.Laplacian(g, cv2.CV_64F).var())


# Brightness measure: mean gray level
def _mean_luma(img_rgb: np.ndarray) -> float:
    g = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    return float(np.mean(g))

def detect_and_align_faces(
    image: Any,
    model: str = "hog",
    upsample: int = 0,
    desired_size: int = 160,
    min_box: int = 40,
    lap_var_thresh: Optional[float] = 80.0,
    eye_pos: Tuple[float, float] = (0.5, 0.4),
    eye_dist_ratio: float = 0.35,
    resize_max: Optional[int] = 800,
    adaptive_blur_factor: float = 0.5,
    retry_if_empty: bool = True,
    compute_embedding: bool = False,
    embedding_model: str = "small",
    num_jitters: int = 1,
) -> List[Dict[str, Any]]:
    # 1) Normalize to a dlib-friendly RGB buffer
    rgb_full = _fr_ready_rgb(image)
    H_full, W_full = rgb_full.shape[:2]

    # 2) Optional downscale for detection speed
    scale = 1.0
    max_side = max(H_full, W_full)
    if resize_max is not None and max_side > resize_max:
        scale = resize_max / float(max_side)
        new_w, new_h = int(W_full * scale), int(H_full * scale)
        rgb_det = cv2.resize(rgb_full, (new_w, new_h), interpolation=cv2.INTER_AREA)
        rgb_det = _fr_ready_rgb(rgb_det)  # ensure fresh buffer after resize
    else:
        rgb_det = rgb_full

    # 3) Detection with robust fallbacks
    try:
        locs = face_recognition.face_locations(rgb_det, number_of_times_to_upsample=upsample, model=model)
    except Exception:
        if model == "hog":
            # HOG also supports 8-bit gray; retry there
            gray = cv2.cvtColor(rgb_det, cv2.COLOR_RGB2GRAY)
            gray = np.require(gray, dtype=np.uint8, requirements=["C", "O", "W"])
            locs = face_recognition.face_locations(gray, number_of_times_to_upsample=upsample, model="hog")
        else:
            raise
    if not locs and model == "hog":
        # escalate to CNN on RGB
        try:
            locs = face_recognition.face_locations(rgb_det, number_of_times_to_upsample=max(upsample, 1), model="cnn")
        except Exception as e:
            raise RuntimeError(
                f"dlib CNN detector rejected image: dtype={rgb_det.dtype}, shape={rgb_det.shape}, "
                f"C={rgb_det.flags.c_contiguous}, strides={rgb_det.strides}"
            ) from e
    if not locs and retry_if_empty and upsample == 0:
        locs = face_recognition.face_locations(rgb_det, number_of_times_to_upsample=1, model=model)
    if not locs:
        return []

    # 4) Landmarks (always on RGB)
    all_landmarks = face_recognition.face_landmarks(rgb_det, face_locations=locs, model="large") or []

    # 5) Align chips on full-res image
    results: List[Dict[str, Any]] = []
    Wt = Ht = int(desired_size)
    dest_eye_x = Wt * eye_pos[0]
    dest_eye_y = Ht * eye_pos[1]
    desired_dist = eye_dist_ratio * Wt
    global_lap_var = _lap_var(rgb_full)
    effective_blur_thresh: Optional[float] = (adaptive_blur_factor * global_lap_var if lap_var_thresh is None else float(lap_var_thresh))
    for (top, right, bottom, left), lm in zip(locs, all_landmarks):
        # back-map bbox to original scale
        top_o = int(round(top / scale))
        right_o = int(round(right / scale))
        bottom_o = int(round(bottom / scale))
        left_o = int(round(left / scale))
        w_o, h_o = (right_o - left_o), (bottom_o - top_o)
        if w_o < min_box or h_o < min_box:
            continue
        if not lm or ("left_eye" not in lm or "right_eye" not in lm):
            continue
        # eye centers in detection scale
        left_eye = np.mean(np.array(lm["left_eye"]), axis=0)
        right_eye = np.mean(np.array(lm["right_eye"]), axis=0)
        dY = right_eye[1] - left_eye[1]
        dX = right_eye[0] - left_eye[0]
        angle = math.degrees(math.atan2(dY, dX))
        dist = (dX ** 2 + dY ** 2) ** 0.5
        if dist < 1e-6:
            continue
        scale_aff = desired_dist / dist
        eyes_center = (float((left_eye[0] + right_eye[0]) * 0.5), float((left_eye[1] + right_eye[1]) * 0.5))
        M = cv2.getRotationMatrix2D(eyes_center, angle, scale_aff)
        M[0, 2] += (dest_eye_x - eyes_center[0])
        M[1, 2] += (dest_eye_y - eyes_center[1])
        # M maps detection-scale coordinates to chip coordinates. A full-res point p
        # corresponds to the detection-scale point scale * p, so fold that factor into
        # the linear part; the translation stays as it is.
        M_full = M.copy()
        if scale != 1.0:
            M_full[:, :2] *= scale
        aligned = cv2.warpAffine(rgb_full, M_full, (Wt, Ht), flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)
        blur_var = _lap_var(aligned)
        if effective_blur_thresh is not None and blur_var < effective_blur_thresh:
            continue
        out: Dict[str, Any] = {
            "aligned": aligned,
            "bbox": (top_o, right_o, bottom_o, left_o),
            "landmarks": {k: [(int(round(px / scale)), int(round(py / scale))) for (px, py) in v] for k, v in lm.items()},
            "transform": M_full.astype("float32"),
            "scale": float(scale),
            "blur_var": float(blur_var),
            "mean_luma": _mean_luma(aligned),
        }
        if compute_embedding:
            # the chip is the face: bbox order is (top, right, bottom, left) = (0, Wt, Ht, 0)
            enc = face_recognition.face_encodings(aligned, known_face_locations=[(0, Wt, Ht, 0)], num_jitters=num_jitters, model=embedding_model)
            if enc:
                out["embedding"] = enc[0].astype("float32")
        results.append(out)

    # One more pass if everything got filtered
    if not results and retry_if_empty and upsample == 0:
        return detect_and_align_faces(
            image=image, model=model, upsample=1, desired_size=desired_size, min_box=min_box,
            lap_var_thresh=lap_var_thresh, eye_pos=eye_pos, eye_dist_ratio=eye_dist_ratio,
            resize_max=resize_max, adaptive_blur_factor=adaptive_blur_factor,
            retry_if_empty=False, compute_embedding=compute_embedding,
            embedding_model=embedding_model, num_jitters=num_jitters,
        )
    return results

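The detection stage of the encoder boils down to a cascade: try the cheap detector first and escalate only on a miss. The sketch below isolates that pattern with hypothetical stub detectors (fast_hog and slow_cnn are placeholders, not dlib calls). It is a simplification: unlike the encoder, it treats a detector exception as a miss instead of re-raising it with diagnostics.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, right, bottom, left)
Detector = Callable[[object], List[Box]]

def detect_with_fallback(image: object, detectors: Sequence[Tuple[str, Detector]]) -> Tuple[str, List[Box]]:
    """Try each detector in order and return the first non-empty result."""
    for name, detect in detectors:
        try:
            boxes = detect(image)
        except Exception:
            continue  # a detector that rejects the input counts as a miss
        if boxes:
            return name, boxes
    return "none", []

# Hypothetical stand-ins for the real detectors: the fast one misses, the slow one hits.
def fast_hog(image):
    return []

def slow_cnn(image):
    return [(10, 60, 60, 10)]

result = detect_with_fallback("fake-image", [("hog", fast_hog), ("cnn", slow_cnn)])
print(result)
```

Returning the name of the detector that succeeded is handy for auditing: it shows how often the expensive path is actually needed.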
What we’ll build
We use the toolbox shown above to:
- Create a knowledge base of both face and body embeddings for your named people.
- Recognize a new image by face distance and body cosine similarity, with simple score fusion.
- Run a batch pass over a folder with a bunch of photos. File each image under predicted names (or Unknown when recognition fails).
We’ll start from the dataset layout introduced in Part I and extend it with body features. Processing follows the same road laid out in Part I: encode → recognize → batch (face + body). The code snippets below use the modules above as-is, saved as torchreid_extractor.py and face_encoder.py. Paths and thresholds are easy to tweak.
Encode known people (both signals)
# build_kb.py
import pathlib, pickle
import numpy as np
from PIL import Image
from face_encoder import detect_and_align_faces  # our robust face pipeline
from torchreid_extractor import TorchreidBodyExtractor  # our TorchReID wrapper

DATASET = pathlib.Path("persons_dataset")  # PersonName/ *.jpg
OUT_FACE = "kb_faces.pkl"
OUT_BODY = "kb_bodies.pkl"
VALID_EXT = {".jpg", ".jpeg", ".png", ".webp"}

# --- Face KB ---
face_vecs, face_names = [], []
for person in sorted(p for p in DATASET.iterdir() if p.is_dir()):
    for img in person.iterdir():
        if img.suffix.lower() not in VALID_EXT:
            continue
        res = detect_and_align_faces(Image.open(img), compute_embedding=True, embedding_model="small")
        for r in res:
            if "embedding" in r:
                face_vecs.append(r["embedding"])
                face_names.append(person.name)
with open(OUT_FACE, "wb") as f:
    pickle.dump({"emb": np.asarray(face_vecs, dtype="float32"),
                 "names": np.asarray(face_names)}, f)

# --- Body KB ---
body_extractor = TorchreidBodyExtractor(model_name="osnet_ain_x1_0", device=None)
body_vecs, body_names = [], []
for person in sorted(p for p in DATASET.iterdir() if p.is_dir()):
    # collect this person's image paths
    imgs = [str(p) for p in person.iterdir() if p.suffix.lower() in VALID_EXT]
    if not imgs:
        continue
    E = body_extractor.extract_paths(imgs, batch_size=16)  # already L2-normalized by the wrapper
    if len(E) == 0:
        continue  # none of the files could be read
    body_vecs.append(E)
    body_names += [person.name] * len(E)
if body_vecs:
    B = np.vstack(body_vecs).astype("float32", copy=False)
else:
    B = np.zeros((0, 0), dtype="float32")
with open(OUT_BODY, "wb") as f:
    pickle.dump({"emb": B, "names": np.asarray(body_names)}, f)
print("KB built:",
      OUT_FACE, len(face_vecs), "faces;",
      OUT_BODY, (0 if B.size == 0 else len(B)), "bodies")

Recognize a single image (fusion)
# recognize_image_fused.py
import pickle, numpy as np
from PIL import Image
from face_encoder import detect_and_align_faces
from torchreid_extractor import TorchreidBodyExtractor

# Load KBs
with open("kb_faces.pkl", "rb") as f:
    F = pickle.load(f)
with open("kb_bodies.pkl", "rb") as f:
    B = pickle.load(f)
F_EMB, F_NAMES = F["emb"], F["names"]
B_EMB, B_NAMES = B["emb"], B["names"]

# Thresholds (good starting points)
FACE_TOL = 0.55      # Euclidean distance (<= match)
BODY_SIM_TOL = 0.78  # cosine similarity (>= match)
W_FACE, W_BODY = 0.6, 0.4
body_extractor = TorchreidBodyExtractor(model_name="osnet_ain_x1_0", device=None)


def face_match_score(face_vec):
    if F_EMB.size == 0:
        return ("Unknown", 0.0)
    d = np.linalg.norm(F_EMB - face_vec[None, :], axis=1)
    j = int(np.argmin(d))
    dist = float(d[j])
    if dist <= FACE_TOL:
        # convert to similarity for fusion
        return (F_NAMES[j], 1.0 / (1.0 + dist))
    return ("Unknown", 0.0)


def body_match_score(image_path):
    if B_EMB.size == 0:
        return ("Unknown", 0.0)
    E = body_extractor.extract_paths([image_path])
    if len(E) == 0:  # the image could not be read
        return ("Unknown", 0.0)
    e = E[0]  # (512,) L2-normalized
    sims = (B_EMB @ e)
    j = int(np.argmax(sims))
    sim = float(sims[j])
    if sim >= BODY_SIM_TOL:
        return (B_NAMES[j], sim)
    return ("Unknown", 0.0)


def recognize(image_path):
    # 1) Face: pick the best chip (highest face similarity)
    face_best_name, face_sim = "Unknown", 0.0
    chips = detect_and_align_faces(Image.open(image_path), compute_embedding=True, embedding_model="small")
    for r in chips:
        if "embedding" in r:
            nm, sc = face_match_score(r["embedding"])
            if sc > face_sim:
                face_best_name, face_sim = nm, sc
    # 2) Body (full frame; optionally replace with a person crop later)
    body_name, body_sim = body_match_score(image_path)
    # 3) Fuse: weighted scores add up when both signals name the same person
    cand = {}
    if face_best_name != "Unknown":
        cand[face_best_name] = cand.get(face_best_name, 0) + W_FACE * face_sim
    if body_name != "Unknown":
        cand[body_name] = cand.get(body_name, 0) + W_BODY * body_sim
    if not cand:
        return "Unknown", {"face": (face_best_name, face_sim), "body": (body_name, body_sim)}
    name = max(cand, key=cand.get)
    score = cand[name]
    return name, {"face": (face_best_name, face_sim), "body": (body_name, body_sim), "fused": score}


if __name__ == "__main__":
    print(recognize("test_image.jpg"))

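To get a feel for the fusion step, here is a small worked example that mirrors the scoring logic above with made-up numbers. No models are involved; the names and distances are hypothetical, and the helper names face_similarity and fuse are ours, introduced only for this sketch.

```python
W_FACE, W_BODY = 0.6, 0.4

def face_similarity(dist):
    """Map a Euclidean face distance to a (0, 1] similarity, as in face_match_score."""
    return 1.0 / (1.0 + dist)

def fuse(face, body):
    """face and body are (name, similarity) pairs; "Unknown" contributes nothing."""
    scores = {}
    for (name, sim), weight in ((face, W_FACE), (body, W_BODY)):
        if name != "Unknown":
            scores[name] = scores.get(name, 0.0) + weight * sim
    if not scores:
        return "Unknown", 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]

# Both signals agree: the weighted scores add up.
agree = fuse(("Alice", face_similarity(0.42)), ("Alice", 0.81))
# The signals disagree: each name keeps only its own weighted score.
disagree = fuse(("Alice", face_similarity(0.42)), ("Bob", 0.81))
print(agree, disagree)
```

With these weights, a face match at the 0.55 distance limit still scores about 0.39, while a body match alone can score at most 0.4. A disagreeing body match therefore only wins when the face match is borderline and the body similarity is close to 1. If that is not the behavior you want, adjust W_FACE and W_BODY.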
Models & thresholds
We start with the osnet_ain_x1_0 model and set BODY_SIM_TOL = 0.78. Try values between 0.78 and 0.82 (higher = stricter), and tune on a small validation set to find the best trade-off between missed and false matches.
For FACE_TOL, try values between 0.50 and 0.60 (lower = stricter).
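Tuning on a validation set can be as simple as sweeping the threshold and counting both kinds of error. The sketch below does this for the body similarity threshold. The score lists are invented for illustration; in practice they would come from comparing validation images of the same person ("genuine") and of different people ("impostor").

```python
def sweep_similarity_threshold(genuine, impostor, thresholds):
    """Report false-reject and false-accept rates for each candidate threshold."""
    rows = []
    for t in thresholds:
        frr = sum(s < t for s in genuine) / len(genuine)     # same person, rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # different person, accepted
        rows.append((t, frr, far))
    return rows

# Made-up validation scores, for illustration only.
genuine = [0.91, 0.86, 0.83, 0.80, 0.79, 0.75]
impostor = [0.81, 0.77, 0.70, 0.66, 0.60, 0.52]

rows = sweep_similarity_threshold(genuine, impostor, [0.78, 0.80, 0.82])
for t, frr, far in rows:
    print(f"threshold={t:.2f}  FRR={frr:.2f}  FAR={far:.2f}")
```

Raising the threshold trades false accepts for false rejects. For FACE_TOL the comparison flips, because a distance matches when it is at or below the tolerance.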
You can explore other models from the TorchReID Model Zoo. The OSNet family is a strong default. The documentation will suggest plausible candidates.
Later, replace the full-frame pass with a crop from a person detector for further gains. For now, using the full image still works and demonstrates the upgrade path clearly.
Batch process a folder (upgrade of Part I)
# batch_fused.py
import shutil, pathlib, time
from recognize_image_fused import recognize  # the function above

INPUT_DIR = pathlib.Path("E:/persons_unknown")
OUTPUT_DIR = pathlib.Path("E:/persons_processed")
VALID_EXT = {".jpg", ".jpeg", ".png", ".webp"}
MOVE_FILES = False  # False = copy, True = move


def save_to_bucket(img_path, label):
    # file the image under a folder named after the predicted person
    d = OUTPUT_DIR / label
    d.mkdir(parents=True, exist_ok=True)
    dst = d / img_path.name
    (shutil.move if MOVE_FILES else shutil.copy2)(str(img_path), dst)


def main():
    start = time.time()
    images = [p for p in INPUT_DIR.rglob("*") if p.suffix.lower() in VALID_EXT]
    for i, p in enumerate(images, 1):
        try:
            name, dbg = recognize(str(p))
            save_to_bucket(p, name)
            print(f"{i}/{len(images)} → {name} | face={dbg['face']} body={dbg['body']}")
        except Exception as e:
            print(f"ERROR {p}: {e}")
    print(f"Done in {time.time()-start:0.1f}s")


if __name__ == "__main__":
    main()

Remarks
This keeps the original ‘file-to-person-folder’ behavior from Part I, now powered by fused face and body signals. The new face encoder aligns chips and screens out blurry detections, which should reduce false positives downstream. You can improve body encoding further by cropping to a person box: even a simple person detector improves body similarity by removing background. For now, full-frame input is enough to demonstrate the concept.
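Once a person detector is available, the crop itself is straightforward. The helper below is a hypothetical sketch: it assumes the detector returns a (left, top, right, bottom) pixel box, adds a small margin, and clamps the result to the image bounds. The detector is not shown, and the 5% margin is an arbitrary starting value.

```python
def padded_crop_box(box, image_size, pad=0.05):
    """Expand a (left, top, right, bottom) box by a relative margin and clamp it to the image."""
    left, top, right, bottom = box
    width, height = image_size
    dx = (right - left) * pad
    dy = (bottom - top) * pad
    return (
        max(0, int(round(left - dx))),
        max(0, int(round(top - dy))),
        min(width, int(round(right + dx))),
        min(height, int(round(bottom + dy))),
    )

# Hypothetical detection on a 640x480 image, touching the bottom edge.
crop = padded_crop_box((200, 100, 400, 480), (640, 480))
print(crop)
```

The resulting tuple uses the same (left, upper, right, lower) order that Pillow's Image.crop expects, so the cropped image can be passed straight to the body extractor.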
Also consider keeping an audit log: a .txt or .csv file with one row per processed image, for instance {file_name, face_dist, face_name, body_sim, body_name, fused_score}, for transparent threshold tuning.
In the next post we’ll introduce a graphical user interface (GUI) and use it to develop an extensible, modular application for person encoding and recognition.
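A minimal sketch of such an audit log, using only the standard csv module and the field names suggested above. The row values are made up, and the helper name write_audit_rows is ours. The sketch writes to an in-memory buffer so it can run anywhere.

```python
import csv
import io

FIELDS = ["file_name", "face_dist", "face_name", "body_sim", "body_name", "fused_score"]

def write_audit_rows(stream, rows):
    """Write one CSV row per processed image to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Made-up values for illustration; in practice pass open("audit.csv", "w", newline="").
buf = io.StringIO()
write_audit_rows(buf, [{
    "file_name": "IMG_0001.jpg", "face_dist": 0.42, "face_name": "Alice",
    "body_sim": 0.81, "body_name": "Alice", "fused_score": 0.75,
}])
print(buf.getvalue())
```

Note that recognize as written reports a face similarity rather than the raw distance, so logging face_dist would require returning the distance as well.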