Hi FME Innovators
I was wondering whether there are any plans to release FME transformers and tools for the AI era. Object detection is evolving a lot, and instead of relying on an external API, it would be great to have an FME-owned transformer with labelling support.
I have kept a few notes on object detection from my own exploration, and I was thinking FME could have its own API, tools and methods for labelling (training a model within FME) rather than depending on external API connectors.
What is Object Detection?
Object Detection is a method in computer vision that detects and identifies objects in an image or video. While image classification predicts a single label for an entire image, object detection finds several objects in a single image, giving each of them a bounding box and a class label.
Object detection takes care of two main functions:
- Localization - Where is the object?
- Classification - What is the object?
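To make the two functions concrete, here is a minimal sketch of what a detector's output could look like. The `Detection` class, its field names, and the sample values are all hypothetical, made up for illustration only; they do not come from any specific library or from FME.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Localization: where the object is, as (x_min, y_min, x_max, y_max) in pixels.
    box: tuple[float, float, float, float]
    # Classification: what the object is.
    label: str
    # Confidence of the prediction, between 0 and 1.
    score: float


# Illustrative values only: one image can yield several detections.
detections = [
    Detection((34.0, 50.0, 180.0, 220.0), "car", 0.92),
    Detection((200.0, 40.0, 260.0, 210.0), "person", 0.81),
]

labels = [d.label for d in detections]
for d in detections:
    print(f"{d.label} at {d.box} (score {d.score:.2f})")
```

An image classifier would return just one label for the whole image, whereas here each object gets its own box and label.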
Traditional Machine Learning for Object Detection
Before the emergence of deep learning, object detection relied on handcrafted features and classical ML algorithms. These techniques require manual feature extraction and struggle with variation in lighting, scale, and background.
Haar Cascades
- Introduced by Viola and Jones (2001).
- Used for early face detection (e.g. OpenCV’s face detector).
- Based on Haar-like features and a cascade of classifiers.
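As a rough illustration of a Haar-like feature, the sketch below computes a two-rectangle feature (left half minus right half of a window) using an integral image, which is the trick Viola and Jones used to make rectangle sums cheap. The function names and the tiny sample image are my own assumptions for illustration; this is not OpenCV's implementation and it leaves out the cascade of classifiers entirely.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a w-by-h window (w is assumed to be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical 4x4 grayscale patch: bright on the left, dark on the right.
img = [[9, 9, 1, 1] for _ in range(4)]
ii = integral_image(img)
feature = two_rect_feature(ii, 0, 0, 4, 4)
print(feature)
```

A large absolute value means a strong left/right contrast inside the window, which is the kind of cue a cascade stage thresholds on.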
Histogram of Oriented Gradients (HOG) + SVM
- Detects objects using histograms of gradient orientations.
- Popularised by Dalal and Triggs for pedestrian detection.
- More compact and robust than Haar features, but computationally expensive.
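The gradient-orientation idea behind HOG can be sketched for a single cell as below. This is a simplified version under my own assumptions: it uses central differences, 9 unsigned orientation bins, and hard bin assignment, and it skips the bin interpolation and block normalisation of the full method. The function name and the sample patch are hypothetical.

```python
import math


def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Only interior pixels, so the central differences stay inside the cell.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# Hypothetical 4x4 patch with a vertical edge: intensity changes only along x.
patch = [[0, 0, 1, 1] for _ in range(4)]
hist = cell_orientation_histogram(patch)
print(hist)
```

For this patch all the gradient energy lands in the first bin (orientation near 0 degrees). In the full pipeline, histograms from many cells are concatenated into one feature vector and passed to an SVM.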
Selective Search + SVM
- Generates region proposals, which are then classified.
- Helped bridge the gap between traditional machine learning and deep learning.
While these methods laid the groundwork, they could not match the accuracy and scalability of modern deep learning models.
Deep Learning for Object Detection
Deep learning has transformed object detection by automating feature extraction via Convolutional Neural Networks (CNNs). Deep learning models automatically learn progressively abstract features from the data, improving speed and accuracy.
Two-Stage Detectors
Two-stage detectors separate the region proposal from classification.
- R-CNN (Regions with CNN Features)
  - Uses Selective Search to generate region proposals.
  - Uses a CNN to extract features from each proposed region and classify it.
  - Very accurate, but slow (each region is processed independently).
- Fast R-CNN
  - Shares convolutional computation across the whole image.
  - Adds an ROI pooling layer to extract per-region features from the shared feature maps.
  - Faster than R-CNN, but still short of real-time.
- Faster R-CNN
  - Introduces a Region Proposal Network (RPN) for end-to-end training and prediction.
  - Achieves high accuracy at close to real-time speed.
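The ROI pooling step mentioned for Fast R-CNN can be sketched roughly as follows: a region of the shared feature map is split into a fixed grid and each grid bin is max-pooled, so every proposal yields the same output size. This is a simplified single-channel sketch with integer ROI coordinates; the function name, the bin-splitting rule, and the sample feature map are my own assumptions, not the reference implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the ROI (x0, y0, x1, y1; end-exclusive, in feature-map cells) to out_h x out_w."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin boundaries along y; each bin covers at least one cell.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + max(((i + 1) * roi_h) // out_h, (i * roi_h) // out_h + 1)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + max(((j + 1) * roi_w) // out_w, (j * roi_w) // out_w + 1)
            row.append(max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 single-channel feature map with values 0..15.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
pooled_full = roi_max_pool(feature_map, (0, 0, 4, 4))
pooled_part = roi_max_pool(feature_map, (1, 1, 4, 3))
print(pooled_full, pooled_part)
```

Both regions come out as 2x2 even though they have different sizes, which is what lets one classifier head handle every proposal while the convolutional features are computed only once per image.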
Single-Stage Detectors
Single-stage detectors eliminate the need for region proposal and are capable of predicting bounding boxes and class labels directly.
- YOLO (You Only Look Once)
  Targeted at real-time detection. YOLO divides the image into a grid and predicts bounding boxes for each grid cell. The family began with the original YOLO (v1) and progressed through YOLOv3, YOLOv4, YOLOv5 and YOLOv8, with further releases since; some recent variants add attention or Transformer-based components.
- SSD (Single Shot MultiBox Detector)
  SSD uses feature maps from multiple convolutional layers to perform detection. SSD offers a good tradeoff between speed and accuracy.
- RetinaNet
  RetinaNet introduced 'Focal Loss', which down-weights the loss from easy, well-classified examples to address class imbalance during training. RetinaNet shows good results across a range of benchmarks.
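For reference, the focal loss used by RetinaNet can be written for a single binary prediction as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below is a plain-Python version for one example, assuming the commonly quoted defaults gamma = 2 and alpha = 0.25; the function name is my own, and real implementations work on tensors of logits rather than single probabilities.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction: p is the predicted probability of class 1, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: contributes very little
hard = focal_loss(0.1, 1)  # confident and wrong: dominates the loss
print(easy, hard)
```

With gamma set to 0 the modulating factor disappears and this reduces to alpha-weighted cross-entropy, so gamma controls how strongly easy examples are down-weighted.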
Innovative Architectures and Trends (2025)
Modern architectures combine CNNs, Transformers, and self-supervised learning techniques for better generalisation.
- DETR (Detection Transformer)
  - An end-to-end object detection pipeline built on Transformers.
  - Removes the need for anchor boxes and Non-Max Suppression (NMS).
  - Very accurate, but less computationally efficient than YOLO.
- Vision Transformers (ViT)
  - Use attention for global feature extraction.
  - Often combined with a CNN backbone in hybrid designs for efficiency.
- Self-supervised learning (SSL)
  - Models pretrained on unlabelled data (e.g. MAE, SimCLR) transfer better when labelled data is limited.
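For context on the Non-Max Suppression step that DETR removes, here is a minimal sketch of greedy NMS as it is typically applied after anchor-based detectors. Boxes are assumed to be (x_min, y_min, x_max, y_max); the function names, the 0.5 threshold, and the sample boxes are illustrative assumptions only.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box is suppressed because it overlaps the higher-scoring first box, while the third box survives since it does not overlap anything kept so far.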
Tools and Frameworks
Here are some popular frameworks for implementing object detection:
- TensorFlow Object Detection API
- PyTorch + TorchVision
- Ultralytics YOLOv8
- Detectron2 (by Meta AI)
- MMDetection
Thanks


