Publications
List of recent publications
2023
- ICCV’23: Single Depth-Image 3D Reflection Symmetry and Shape Prediction. In IEEE International Conference on Computer Vision (ICCV), 2023.
In this paper, we present the Iterative Symmetry Completion Network (ISCNet), a single depth-image shape completion method that exploits reflective symmetry cues to obtain more detailed shapes. The efficacy of single depth-image shape completion methods is often sensitive to the accuracy of the symmetry plane. ISCNet therefore jointly estimates the symmetry plane and the completed shape in an iterative fashion; more complete shapes contribute to more robust symmetry plane estimates and vice versa. Furthermore, our shape completion method operates in the image domain, enabling more efficient high-resolution, detailed geometry reconstruction. We perform the shape completion from pairs of viewpoints, reflected across the symmetry plane and predicted by a reinforcement learning agent, to improve robustness and to explicitly leverage symmetry. We demonstrate the effectiveness of ISCNet on a variety of object categories on both synthetic and real-scanned datasets.
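The alternating estimate-then-complete loop described above can be pictured with a minimal, self-contained sketch. This is not the ISCNet pipeline (which completes shapes in the image domain with a learned network and an RL-selected viewpoint pair); the naive plane-fitting heuristic and reflection-based completion below are placeholders that only illustrate the iteration structure.

```python
import numpy as np

def reflect(points, plane_point, plane_normal):
    """Mirror points across the plane given by a point and a unit normal."""
    d = (points - plane_point) @ plane_normal
    return points - 2.0 * d[:, None] * plane_normal

def estimate_plane(points):
    """Naive symmetry-plane guess (placeholder for a learned estimator):
    a plane through the centroid, normal to the least-variance axis."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]

def iterative_symmetry_completion(partial, n_iters=3):
    """Alternate symmetry-plane estimation and reflection-based completion."""
    completed = partial
    for _ in range(n_iters):
        c, n = estimate_plane(completed)            # 1) plane from current shape
        mirrored = reflect(partial, c, n)           # 2) complete by reflection
        completed = np.vstack([partial, mirrored])  # 3) fuse and iterate
    return completed, (c, n)

partial = np.random.default_rng(0).normal(size=(500, 3))   # stand-in for a partial scan
completed, (c, n) = iterative_symmetry_completion(partial)
print(completed.shape, c.round(2), n.round(2))
```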
- ICCV’23: Multi-view Spectral Polarization Propagation for Video Glass Segmentation. Qiao, Yu, Dong, Bo, Jin, Ao, Fu, Yu, Baek, Seung-Hwan, Heide, Felix, Peers, Pieter, Wei, Xiaopeng, and Yang, Xin. In IEEE International Conference on Computer Vision (ICCV), 2023.
In this paper, we present the first polarization-guided video glass segmentation propagation solution (PGVS-Net) that can robustly and coherently propagate glass segmentation in RGB-P video sequences. By leveraging spatio-temporal polarization and color information, our method combines multi-view polarization cues and thus alleviates the view dependence of single-input intensity variations on glass objects. We demonstrate that our model outperforms glass segmentation methods that operate on RGB-only video sequences, and produces more robust segmentation than per-frame RGB-P single-image segmentation methods. To train and validate PGVS-Net, we introduce a novel RGB-P Glass Video dataset (PGV-117) containing 117 video sequences of scenes captured with different types of camera paths, lighting conditions, dynamics, and glass types.
- TPAMI: Point Cloud Scene Completion with Joint Color and Semantic Estimation from Single RGB-D Image. Zhang, Zhaoxuan, Han, Xiaoguang, Dong, Bo, Li, Tong, Yin, Baocai, and Yang, Xin. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
We present a deep reinforcement learning method of progressive view inpainting for colored semantic point cloud scene completion under volume guidance, achieving high-quality scene reconstruction from only a single RGB-D image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D RGB-D and segmentation image inpainting, and multi-view selection for completion. Given a single RGB-D image, our method first predicts its semantic segmentation map and passes it through the 3D volume branch to obtain a volumetric scene reconstruction, which guides the next view inpainting step that fills in the missing information; the third step projects the volume under the same view as the input, concatenates them to complete the current-view RGB-D and segmentation maps, and integrates all RGB-D and segmentation maps into the point cloud. Since the occluded areas are unavailable, we resort to an A3C network to glance around and progressively pick the next best view for large hole completion until the scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the 3D-FUTURE dataset, obtaining better results than state-of-the-art methods.
- SIGGRAPH’23: In the Blink of an Eye: Event-based Emotion Recognition. Zhang, Haiwei, Zhang, Jiqing, Dong, Bo, Peers, Pieter, Wu, Wenwei, Wei, Xiaopeng, Heide, Felix, and Yang, Xin. In ACM SIGGRAPH, 2023.
We introduce a wearable single-eye emotion recognition device and a real-time approach to recognizing emotions from partial observations of an emotion that is robust to changes in lighting conditions. At the heart of our method is a bio-inspired event-based camera setup and a newly designed lightweight Spiking Eye Emotion Network (SEEN). Compared to conventional cameras, event-based cameras offer a higher dynamic range (up to 140 dB vs. 80 dB) and a higher temporal resolution (on the order of microseconds vs. tens of milliseconds). Thus, the captured events can encode rich temporal cues under challenging lighting conditions. However, these events lack texture information, posing problems in decoding temporal information effectively. SEEN tackles this issue from two different perspectives. First, we adopt convolutional spiking layers to take advantage of the spiking neural network’s ability to decode pertinent temporal information. Second, SEEN learns to extract essential spatial cues from corresponding intensity frames and leverages a novel weight-copy scheme to convey spatial attention to the convolutional spiking layers during training and inference. We extensively validate and demonstrate the effectiveness of our approach on a specially collected Single-eye Event-based Emotion (SEE) dataset. To the best of our knowledge, our method is the first eye-based emotion recognition method that leverages event-based cameras and spiking neural networks.
- TOMM: A Geometrical Approach to Evaluate the Adversarial Robustness of Deep Neural Networks. Wang, Yang, Dong, Bo, Xu, Ke, Piao, Haiyin, Ding, Yufei, Yin, Baocai, and Yang, Xin. ACM Trans. Multimedia Comput. Commun. Appl., 2023.
Deep neural networks (DNNs) are widely used for computer vision tasks. However, it has been shown that deep models are vulnerable to adversarial attacks; that is, their performance drops when imperceptible perturbations are made to the original inputs, which may further degrade downstream visual tasks or raise new data-security and privacy concerns. Hence, metrics for evaluating the robustness of deep models against adversarial attacks are desired. However, previous metrics are mainly proposed for evaluating the adversarial robustness of shallow networks on small-scale datasets. Although the Cross Lipschitz Extreme Value for nEtwork Robustness (CLEVER) metric has been proposed for large-scale datasets (e.g., the ImageNet dataset), it is computationally expensive and its performance relies on a tractable number of samples. In this article, we propose the Adversarial Converging Time Score (ACTS), an attack-dependent metric that quantifies the adversarial robustness of a DNN on a specific input. Our key observation is that local neighborhoods on a DNN’s output surface have different shapes for different inputs. Hence, the time required to converge to an adversarial sample also differs across inputs. Based on this geometric insight, ACTS measures the converging time as an adversarial robustness metric. We validate the effectiveness and generalization of the proposed ACTS metric against different adversarial attacks on the large-scale ImageNet dataset using state-of-the-art deep networks. Extensive experiments show that our ACTS metric is a more efficient and effective adversarial metric than the previous CLEVER metric.
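To make the converging-time intuition concrete, the sketch below counts how many iterative-FGSM steps are needed before a toy classifier's prediction flips; fewer steps indicate a less robust input. This only illustrates the idea, not the ACTS formulation from the paper, and the model, step size, and budget are placeholders.

```python
import torch
import torch.nn as nn

def converging_steps(model, x, label, step=0.01, max_steps=100):
    """Count iterative-FGSM steps until the predicted label flips.
    Fewer steps suggest the input sits closer to the decision boundary,
    which is the intuition behind a converging-time style score."""
    x_adv = x.clone().detach().requires_grad_(True)
    for t in range(1, max_steps + 1):
        loss = nn.functional.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step * grad.sign()).detach().requires_grad_(True)
        if model(x_adv).argmax(dim=1).item() != label.item():
            return t                      # converged to an adversarial sample
    return max_steps                      # did not flip within the budget

# usage on a toy classifier (stand-in for an ImageNet-scale network)
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(1, 16)
label = model(x).argmax(dim=1)
print("steps to flip:", converging_steps(model, x, label))
```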
2022
- NeurIPS’22: Biologically Inspired Dynamic Thresholds for Spiking Neural Networks. Ding, Jianchuan, Dong, Bo, Heide, Felix, Ding, Yufei, Zhou, Yunduo, Yin, Baocai, and Yang, Xin. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
The dynamic membrane potential threshold, as one of the essential properties of a biological neuron, is a spontaneous regulation mechanism that maintains neuronal homeostasis, i.e., the constant overall spiking firing rate of a neuron. As such, the neuron firing rate is regulated by a dynamic spiking threshold, which has been extensively studied in biology. Existing work in the machine learning community does not employ bioplausible spiking threshold schemes. This work aims at bridging this gap by introducing a novel bioinspired dynamic energy-temporal threshold (BDETT) scheme for spiking neural networks (SNNs). The proposed BDETT scheme mirrors two bioplausible observations: a dynamic threshold has 1) a positive correlation with the average membrane potential and 2) a negative correlation with the preceding rate of depolarization. We validate the effectiveness of the proposed BDETT on robot obstacle avoidance and continuous control tasks under both normal conditions and various degraded conditions, including noisy observations, weights, and dynamic environments. We find that the BDETT outperforms existing static and heuristic threshold approaches by significant margins in all tested conditions, and we confirm that the proposed bioinspired dynamic threshold scheme offers bioplausible homeostasis to SNNs in complex real-world tasks.
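The two correlations above can be illustrated with a toy leaky integrate-and-fire neuron whose threshold tracks them. This is a hand-written illustration of the qualitative behavior only, not the actual BDETT scheme or its parameterization.

```python
import numpy as np

def lif_dynamic_threshold(inputs, tau=0.9, base_theta=1.0, a=0.5, b=0.5):
    """Leaky integrate-and-fire neuron with a dynamic firing threshold.
    Illustrative only: theta grows with the running mean membrane potential
    (positive correlation) and shrinks with the preceding depolarization
    rate (negative correlation), echoing the two BDETT observations."""
    v, spikes, thetas = 0.0, [], []
    mean_v, prev_v = 0.0, 0.0
    for t, x in enumerate(inputs):
        prev_v, v = v, tau * v + x                 # leaky integration
        mean_v = 0.9 * mean_v + 0.1 * v            # running average potential
        depol_rate = max(v - prev_v, 0.0)          # preceding depolarization
        theta = base_theta + a * mean_v - b * depol_rate
        if v >= theta:                             # emit a spike and reset
            spikes.append(t)
            v = 0.0
        thetas.append(theta)
    return spikes, thetas

spikes, thetas = lif_dynamic_threshold(np.random.default_rng(0).uniform(0, 1, 50))
print(len(spikes), "spikes; threshold range:",
      round(min(thetas), 2), "-", round(max(thetas), 2))
```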
- ECCV’22: All You Need is RAW: Defending Against Adversarial Attacks with Camera Image Pipelines. Zhang, Yuxuan, Dong, Bo, and Heide, Felix. In European Conference on Computer Vision (ECCV), 2022.
Existing neural networks for computer vision tasks are vulnerable to adversarial attacks: adding imperceptible perturbations to the input images can fool these models into making a false prediction on an image that was correctly predicted without the perturbation. Various defense methods have proposed image-to-image mappings that either include these perturbations in the training process or remove them in a preprocessing step. In doing so, existing methods often ignore that the natural RGB images in today’s datasets are not captured but, in fact, recovered from RAW color filter array captures that are subject to various degradations in the capture. In this work, we exploit this RAW data distribution as an empirical prior for adversarial defense. Specifically, we propose a model-agnostic adversarial defensive method, which maps the input RGB images to Bayer RAW space and back to output RGB using a learned camera image signal processing (ISP) pipeline to eliminate potential adversarial patterns. The proposed method acts as an off-the-shelf preprocessing module and, unlike model-specific adversarial training methods, does not require adversarial images to train. As a result, the method generalizes to unseen tasks without additional re-training. Experiments on large-scale datasets (e.g., ImageNet, COCO) for different vision tasks (e.g., classification, semantic segmentation, object detection) validate that the method significantly outperforms existing methods across task domains.
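The round trip at the core of the defense (RGB to Bayer RAW and back through an ISP, used as an off-the-shelf preprocessing step) can be sketched as follows. Here the learned ISP of the paper is replaced by a fixed, toy mosaic/demosaic pair, and `classifier` is a hypothetical downstream model; the snippet only illustrates where the defense sits in the pipeline.

```python
import numpy as np

def mosaic_rggb(rgb):
    """Simulate a Bayer RGGB color-filter-array capture from an RGB image
    (assumes even image height and width)."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B
    return raw

def demosaic_nearest(raw):
    """Toy demosaic: spread each 2x2 Bayer block's samples over the block."""
    h, w = raw.shape
    r = raw[0::2, 0::2]
    g = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])
    b = raw[1::2, 1::2]
    rgb = np.zeros((h, w, 3), dtype=raw.dtype)
    for c, plane in enumerate((r, g, b)):
        rgb[..., c] = np.repeat(np.repeat(plane, 2, axis=0), 2, axis=1)
    return rgb

def defend(image_rgb, classifier):
    """RGB -> RAW -> RGB round trip as a model-agnostic preprocessing step."""
    return classifier(demosaic_nearest(mosaic_rggb(image_rgb)))

rng = np.random.default_rng(0)
img = rng.uniform(0, 1, (64, 64, 3))
classifier = lambda x: x.mean(axis=(0, 1))   # placeholder for a real model
print(defend(img, classifier))
```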
- CVPR’22: Glass Segmentation using Intensity and Spectral Polarization Cues. Mei, Haiyang, Dong, Bo, Dong, Wen, Yang, Jiaxi, Baek, Seung-Hwan, Heide, Felix, Peers, Pieter, Wei, Xiaopeng, and Yang, Xin. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Transparent and semi-transparent materials pose significant challenges for existing scene understanding and segmentation algorithms due to their lack of RGB texture which impedes the extraction of meaningful features. In this work, we exploit that the light-matter interactions on glass materials provide unique intensity-polarization cues for each observed wavelength of light. We present a novel learning-based glass segmentation network that leverages both trichromatic (RGB) intensities as well as trichromatic linear polarization cues from a single photograph captured without making any assumption on the polarization state of the illumination. Our novel network architecture dynamically fuses and weights both the trichromatic color and polarization cues using a novel global-guidance and multi-scale self-attention module, and leverages global cross-domain contextual information to achieve robust segmentation. We train and extensively validate our segmentation method on a new large-scale RGB-Polarization dataset (RGBP-Glass), and demonstrate that our method outperforms state-of-the-art segmentation approaches by a significant margin.
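For readers unfamiliar with polarization cues: the degree and angle of linear polarization that such methods build on are computed per pixel (and per color channel for trichromatic cues) from captures behind four polarizer orientations. The sketch below shows only that standard Stokes computation; it is not the RGBP-Glass capture pipeline or the segmentation network.

```python
import numpy as np

def linear_polarization_cues(i0, i45, i90, i135, eps=1e-6):
    """Per-pixel linear Stokes parameters from four polarizer-angle captures,
    plus degree (DoLP) and angle (AoLP) of linear polarization."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)        # total intensity
    s1 = i0 - i90                             # 0/90 degree difference
    s2 = i45 - i135                           # 45/135 degree difference
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)
    aolp = 0.5 * np.arctan2(s2, s1)           # radians
    return dolp, aolp

# usage: random stand-ins for the four captures (one color channel)
rng = np.random.default_rng(0)
i0, i45, i90, i135 = rng.uniform(0, 1, (4, 64, 64))
dolp, aolp = linear_polarization_cues(i0, i45, i90, i135)
print(dolp.shape, float(dolp.max()))
```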
- CVPR’22: Spiking Transformers for Event-based Single Object Tracking. Zhang, Jiqing, Dong, Bo, Zhang, Haiwei, Ding, Jianchuan, Heide, Felix, Yin, Baocai, and Yang, Xin. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Event-based cameras bring a unique capability to tracking, being able to function in challenging real-world conditions as a direct result of their high temporal resolution and high dynamic range. These imagers asynchronously capture events that encode rich temporal and spatial information. However, effectively extracting this information from events remains an open challenge. In this work, we propose a spiking transformer network, STNet, for single object tracking. STNet dynamically extracts and fuses information from both temporal and spatial domains. In particular, the proposed architecture features a transformer module to provide global spatial information and a spiking neural network (SNN) module for extracting temporal cues. The spiking threshold of the SNN module is dynamically adjusted based on the statistical cues of the spatial information, which we find essential in providing robust SNN features. We fuse both feature branches dynamically with a novel cross-domain attention fusion algorithm. Extensive experiments on two event-based datasets, FE240hz and EED, validate that the proposed STNet outperforms existing state-of-the-art methods by a significant margin in both tracking accuracy and speed.
2021
- AI Letters: Explainable, Interactive Content-Based Image Retrieval. Vasu, Bhavan, Hu, Brian, Dong, Bo, Collins, Roddy, and Hoogs, Anthony. In Applied AI Letters, 2021.
Quantifying the value of explanations in a human-in-the-loop (HITL) system is difficult. Previous methods either measure explanation-specific values that do not correspond to user tasks and needs or poll users on how useful they find the explanations to be. In this work, we quantify how much explanations help the user through a utility-based paradigm that measures change in task performance when using explanations versus not. Our chosen task is content-based image retrieval (CBIR), which has well-established baselines and performance metrics independent of explainability. We extend an existing HITL image retrieval system that incorporates user feedback with similarity-based saliency maps (SBSM) that indicate to the user which parts of the retrieved images are most similar to the query image. The system helps the user understand what it is paying attention to through saliency maps, and the user helps the system understand their goal through saliency-guided relevance feedback. Using the MS-COCO dataset, a standard object detection and segmentation dataset, we conducted extensive, crowd-sourced experiments validating that SBSM improves interactive image retrieval. Although the performance increase is modest in the general case, in more difficult cases such as cluttered scenes, using explanations yields a 6.5% increase in accuracy. To the best of our knowledge, this is the first large-scale user study showing that visual saliency map explanations improve performance on a real-world, interactive task. Our utility-based evaluation paradigm is general and potentially applicable to any task for which explainability can be incorporated.
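A common way to realize a similarity-based saliency map is occlusion: mask a patch of the retrieved image and measure how much its embedding similarity to the query drops. The sketch below follows that recipe with a placeholder embedding function; it illustrates the idea rather than the exact SBSM formulation.

```python
import numpy as np

def similarity_saliency(query_feat, image, embed, patch=16):
    """Occlusion-style similarity saliency: the score of each patch is the
    drop in cosine similarity to the query when that patch is masked out."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    base = cosine(query_feat, embed(image))
    h, w = image.shape[:2]
    saliency = np.zeros((h // patch, w // patch))
    for i in range(saliency.shape[0]):
        for j in range(saliency.shape[1]):
            occluded = image.copy()
            occluded[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
            saliency[i, j] = base - cosine(query_feat, embed(occluded))
    return saliency

# usage with a placeholder embedding (mean-pooled pixels); a real system
# would use the retrieval network's feature extractor here
embed = lambda img: img.reshape(-1, img.shape[-1]).mean(axis=0)
rng = np.random.default_rng(0)
query_img, retrieved_img = rng.uniform(0, 1, (2, 64, 64, 3))
print(similarity_saliency(embed(query_img), retrieved_img, embed).shape)
```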
- CGF’21: Luminance Attentive Networks for HDR Image and Panorama Reconstruction. Yu, Hanning, Liu, Wentao, Long, Chengjiang, Dong, Bo, Zou, Qin, and Xiao, Chunxia. In Computer Graphics Forum, 2021.
Reconstructing a high dynamic range (HDR) image from a low dynamic range (LDR) image is a challenging, ill-posed problem. This paper proposes a luminance attentive network named LANet for HDR reconstruction from a single LDR image. Our method is based on two fundamental observations: (1) HDR images stored in relative luminance are scale-invariant, which means the HDR images hold the same information when multiplied by any positive real number. Based on this observation, we propose a novel normalization method called "HDR calibration" for HDR images stored in relative luminance, calibrating HDR images into a similar luminance scale according to the LDR images. (2) The main difference between HDR images and LDR images is in under-/over-exposed areas, especially highlight areas. Following this observation, we propose a luminance attention module with a two-stream structure for LANet to pay more attention to the under-/over-exposed areas. In addition, we propose an extended network called panoLANet for HDR panorama reconstruction from an LDR panorama and build a dual-net structure for panoLANet to solve the distortion problem caused by the equirectangular panorama. Extensive experiments show that our proposed LANet can reconstruct visually convincing HDR images and demonstrate its superiority over state-of-the-art approaches in terms of all inverse tone mapping metrics. The image-based lighting application with our proposed panoLANet also demonstrates that our method can simulate natural scene lighting using only an LDR panorama. Our source code is available at https://github.com/LWT3437/LANet.
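The scale-invariance observation behind HDR calibration can be illustrated with a small sketch: rescale a relative-luminance HDR image so that its well-exposed pixels agree with the (roughly linearized) LDR input. The thresholds and gamma below are placeholders, and the paper's calibration may differ in detail.

```python
import numpy as np

def calibrate_hdr(hdr, ldr, low=0.1, high=0.9):
    """Rescale a relative-luminance HDR image so that its well-exposed pixels
    match the (roughly linearized) LDR image, exploiting the scale-invariance
    of relative luminance. Illustrative version of the HDR-calibration idea."""
    ldr_lin = ldr ** 2.2                       # rough inverse gamma
    mask = (ldr >= low) & (ldr <= high)        # well-exposed pixels only
    scale = ldr_lin[mask].mean() / (hdr[mask].mean() + 1e-8)
    return hdr * scale

rng = np.random.default_rng(0)
hdr = rng.uniform(0, 50, (32, 32))             # arbitrary relative luminance
ldr = np.clip(hdr / 50.0, 0, 1)                # a fake LDR rendering of it
print(calibrate_hdr(hdr, ldr).mean())
```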
- ICCV’21: Object Tracking by Jointly Exploiting Frame and Event Domain. Zhang, Jiqing, Yang, Xin, Fu, Yingkai, Wei, Xiaopeng, Yin, Baocai, and Dong, Bo. In IEEE International Conference on Computer Vision (ICCV), 2021.
Inspired by the complementarity between conventional frame-based and bio-inspired event-based cameras, we propose a multi-modal approach that fuses visual cues from the frame and event domains to enhance single object tracking performance, especially in degraded conditions (e.g., scenes with high dynamic range, low light, and fast-moving objects). The proposed approach can effectively and adaptively combine meaningful information from both domains. Its effectiveness comes from a newly designed cross-domain attention scheme, which enhances features based on self- and cross-domain attention; its adaptiveness is guaranteed by a specially designed weighting scheme, which adaptively balances the contribution of the two domains. To exploit event-based visual cues in single-object tracking, we construct a large-scale frame-event-based dataset, which we subsequently employ to train a novel frame-event fusion based model. Extensive experiments show that the proposed approach outperforms state-of-the-art frame-based tracking methods by at least 10.4% and 11.9% in terms of representative success rate and precision rate, respectively. In addition, the effectiveness of each key component of our approach is evidenced by a thorough ablation study.
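The adaptive balancing of the two domains can be pictured as a small gating network that predicts per-sample weights for the frame and event feature branches. The sketch below is a generic gated-fusion module, not the paper's cross-domain attention scheme.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of frame- and event-domain features: a small
    network predicts per-sample weights that balance the two branches."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, frame_feat, event_feat):
        w = self.gate(torch.cat([frame_feat, event_feat], dim=-1))
        return w[:, :1] * frame_feat + w[:, 1:] * event_feat

fusion = GatedFusion(dim=128)
frame_feat, event_feat = torch.randn(2, 4, 128)
print(fusion(frame_feat, event_feat).shape)    # torch.Size([4, 128])
```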
- CGI’21: Multi-domain Collaborative Feature Representation for Robust Visual Object Tracking. Zhang, Jiqing, Zhao, Kai, Dong, Bo, Fu, Yingkai, Wang, Yuxin, Yang, Xin, and Yin, Baocai. In Computer Graphics International (CGI), 2021.
Jointly exploiting multiple different yet complementary sources of domain information has been proven to be an effective way to perform robust object tracking. This paper focuses on effectively representing and utilizing complementary features from the frame domain and event domain to boost object tracking performance in challenging scenarios. Specifically, we propose a common feature extractor to learn potential common representations from the RGB domain and event domain. For learning the unique features of the two domains, we utilize a unique extractor for events, based on spiking neural networks, to extract edge cues in the event domain that may be missed in RGB under challenging conditions, and a unique extractor for RGB, based on deep convolutional neural networks, to extract texture and semantic information in the RGB domain. Extensive experiments on a standard RGB benchmark and a real event tracking dataset demonstrate the effectiveness of the proposed approach. We show that our approach outperforms all compared state-of-the-art tracking algorithms and verify that event-based data is a powerful cue for tracking in challenging scenes.
- CVPR’21 Oral: Depth-Aware Mirror Segmentation. Mei, Haiyang, Dong, Bo, Dong, Wen, Peers, Pieter, Yang, Xin, Zhang, Qiang, and Wei, Xiaopeng. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Oral, 2021.
We present a novel mirror segmentation method that leverages depth estimates from ToF-based cameras as an additional cue to disambiguate challenging cases where the contrast or relation in RGB colors between the mirror reflection and the surrounding scene is subtle. A key observation is that ToF depth estimates do not report the true depth of the mirror surface, but instead return the total length of the reflected light paths, thereby creating obvious depth discontinuities at the mirror boundaries. To exploit depth information in mirror segmentation, we first construct a large-scale RGB-D mirror segmentation dataset, which we subsequently employ to train a novel depth-aware mirror segmentation framework. Our mirror segmentation framework first locates the mirrors based on color and depth discontinuities and correlations. Next, our model further refines the mirror boundaries through contextual contrast taking into account both color and depth information. We extensively validate our depth-aware mirror segmentation method and demonstrate that our model outperforms state-of-the-art RGB and RGB-D based methods for mirror segmentation. Experimental results also show that depth is a powerful cue for mirror segmentation.
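The depth cue itself is simple to illustrate: because a ToF camera reports the reflected path length inside a mirror, the depth map jumps sharply at the mirror rim. The sketch below just thresholds the depth-gradient magnitude to expose those discontinuities; the actual segmentation network and refinement are not reproduced.

```python
import numpy as np

def depth_discontinuities(depth, thresh=0.3):
    """Boundary cue from a ToF depth map: mirrors report the reflected path
    length rather than the surface depth, so their rims show strong jumps.
    This sketch simply thresholds the depth-gradient magnitude."""
    gy, gx = np.gradient(depth)
    return np.hypot(gx, gy) > thresh

# usage: synthetic scene at depth 2 m with a "mirror" region reporting 3.5 m
depth = np.full((64, 64), 2.0)
depth[16:48, 20:44] = 3.5                      # reflected-path depth inside mirror
print(depth_discontinuities(depth).sum(), "boundary pixels")
```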
2020
- MobiSys’20: EMO: Real-Time Emotion Recognition from Single-Eye Images for Resource-Constrained Eyewear Devices. Wu, Hao, Feng, Jinghao, Tian, Xuejin, Sun, Edward, Liu, Yunxin, Dong, Bo, Xu, Fengyuan, and Zhong, Sheng. In Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, 2020.
Real-time user emotion recognition is highly desirable for many applications on eyewear devices like smart glasses. However, it is very challenging to enable this capability on such devices due to tightly constrained image contents (only eye-area images are available from the on-device eye-tracking camera) and the limited computing resources of the embedded system. In this paper, we propose and develop a novel system called EMO that can recognize, on top of a resource-limited eyewear device, the real-time emotions of the user who wears it. Unlike most existing solutions that require whole-face images to recognize emotions, EMO only utilizes the single-eye-area images captured by the eye-tracking camera of the eyewear. To achieve this, we design a customized deep-learning network to effectively extract emotional features from input single-eye images and a personalized feature classifier to accurately identify a user’s emotions. EMO also exploits the temporal locality and feature similarity among consecutive video frames of the eye-tracking camera to further reduce the recognition latency and system resource usage. We implement EMO on two hardware platforms and conduct comprehensive experimental evaluations. Our results demonstrate that EMO can continuously recognize seven types of emotions at 12.8 frames per second with a mean accuracy of 72.2%, significantly outperforming the state-of-the-art approach while consuming far fewer system resources.
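The temporal-locality optimization can be sketched as a frame-skipping wrapper: when consecutive eye frames are nearly identical, reuse the previous prediction instead of re-running the network. The similarity measure, threshold, and classifier below are placeholders, not the actual EMO implementation.

```python
import numpy as np

def recognize_stream(frames, classify, sim_thresh=0.98):
    """Skip the classifier when consecutive eye frames are nearly identical
    and reuse the previous label, trading a little accuracy for latency."""
    labels, prev_frame, prev_label = [], None, None
    for frame in frames:
        if prev_frame is not None:
            a, b = prev_frame.ravel(), frame.ravel()
            sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            if sim > sim_thresh:
                labels.append(prev_label)      # reuse cached prediction
                continue
        prev_frame, prev_label = frame, classify(frame)
        labels.append(prev_label)
    return labels

# usage with a placeholder classifier (a real system runs the emotion network here)
classify = lambda f: int(f.mean() > 0.5)
frames = [np.full((32, 32), 0.6) + 0.001 * i for i in range(10)]
print(recognize_stream(frames, classify))
```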
2019
- ISBI’19: XAI-CBIR: Explainable AI System for Content based Retrieval of Video Frames from Minimally Invasive Surgery Videos. Chittajallu, D. R., Dong, Bo, Tunison, P., Collins, R., Wells, K., Fleshman, J., Sankaranarayanan, G., Schwaitzberg, S., Cavuoto, L., and Enquobahrie, A. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019.
In this paper, we present a human-in-the-loop explainable AI (XAI) system for content based image retrieval (CBIR) of video frames similar to a query image from minimally invasive surgery (MIS) videos for surgical education. It extracts semantic descriptors from MIS video frames using a self-supervised deep learning model. It then employs an iterative query refinement strategy, wherein a binary classifier, trained online from the user's relevance feedback, is used to iteratively refine the search results. Lastly, it uses an XAI technique to generate a saliency map that provides a visual explanation of why the system considers a retrieved image to be similar to the query image. We evaluated the proposed XAI-CBIR system on the public Cholec80 dataset containing 80 videos of minimally invasive cholecystectomy surgeries with encouraging results.
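The relevance-feedback loop can be sketched with an online re-ranker: rank by descriptor distance first, then retrain a binary classifier on the user's labels and re-rank by its score. The descriptors and classifier below are placeholders; the paper's self-supervised features and saliency-based explanation step are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def refine_ranking(query_feat, gallery_feats, feedback):
    """One relevance-feedback round: rank by distance to the query, then,
    if the user has labeled some results, re-rank by an online classifier
    trained on that feedback (higher score = more relevant)."""
    scores = -np.linalg.norm(gallery_feats - query_feat, axis=1)
    if feedback:                                  # {gallery index: 0/1 relevance}
        idx = np.array(list(feedback.keys()))
        y = np.array(list(feedback.values()))
        clf = LogisticRegression().fit(gallery_feats[idx], y)
        scores = clf.decision_function(gallery_feats)
    return np.argsort(-scores)                    # best matches first

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))
query = gallery[3] + 0.05 * rng.normal(size=64)
print(refine_ranking(query, gallery, feedback={3: 1, 7: 1, 50: 0, 60: 0})[:5])
```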
2018
- arXiv: An Explainable Adversarial Robustness Metric for Deep Learning Neural Networks. Agarwal, Chirag, Dong, Bo, Schonfeld, Dan, and Hoogs, Anthony. CoRR, 2018.
2015
- ACNS’15: Exploiting Eye Tracking for Smartphone Authentication. Liu, Dachuan, Dong, Bo, Gao, Xing, and Wang, Haining. In Applied Cryptography and Network Security, 2015.
Traditional user authentication methods using a passcode or finger movement on smartphones are vulnerable to shoulder surfing, smudge, and keylogger attacks. These attacks can infer a passcode from information collected about the user's finger movements or tapping input. As an alternative user authentication approach, eye tracking can effectively reduce the risk of these attacks because no hand input is required. However, most existing eye tracking techniques are designed for large-screen devices. Many of them depend on special hardware, such as a high-resolution eye tracker, and special procedures, such as calibration, which are not readily available to smartphone users. In this paper, we propose a new eye tracking method for user authentication on a smartphone. It utilizes the smartphone’s front camera to capture a user’s eye movement trajectories, which are used as the input for user authentication. No special hardware or calibration process is needed. We develop a prototype and evaluate its effectiveness on an Android smartphone. We recruit a group of volunteers to participate in the user study. Our evaluation results show that the proposed eye tracking technique achieves very high accuracy in user authentication.
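At its core, authentication of this kind compares a captured gaze trajectory against enrolled ones. The sketch below uses a simple resample-and-average point-wise distance as the comparison; the features and matching used in the paper may differ.

```python
import numpy as np

def resample(traj, n=50):
    """Resample a 2D gaze trajectory to a fixed number of points."""
    t = np.linspace(0, 1, len(traj))
    ti = np.linspace(0, 1, n)
    return np.stack([np.interp(ti, t, traj[:, 0]),
                     np.interp(ti, t, traj[:, 1])], axis=1)

def trajectory_distance(a, b, n=50):
    """Mean point-wise distance between two resampled trajectories; a small
    value suggests the same user tracing the same on-screen stimulus."""
    return float(np.linalg.norm(resample(a, n) - resample(b, n), axis=1).mean())

rng = np.random.default_rng(0)
enrolled = np.cumsum(rng.normal(size=(80, 2)), axis=0)      # enrolled trajectory
attempt = enrolled[::2] + 0.1 * rng.normal(size=(40, 2))    # noisy login attempt
print("distance:", round(trajectory_distance(enrolled, attempt), 3))
```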
- SIGGRAPH’15: Measurement-Based Editing of Diffuse Albedo with Consistent Interreflections. ACM Trans. Graph., 2015.
We present a novel measurement-based method for editing the albedo of diffuse surfaces with consistent interreflections in a photograph of a scene under natural lighting. Key to our method is a novel technique for decomposing a photograph of a scene into several images that encode how much of the observed radiance has interacted a specified number of times with the target diffuse surface. Altering the albedo of the target area is then simply a weighted sum of the decomposed components. We estimate the interaction components by recursively applying the light transport operator and formulate the resulting radiance in each recursion as a linear expression in terms of the relevant interaction components. Our method only requires a camera-projector pair, and the number of required measurements per scene is linearly proportional to the decomposition degree for a single target area. Our method does not impose restrictions on the lighting or on the material properties in the unaltered part of the scene. Furthermore, we extend our method to accommodate editing of the albedo in multiple target areas with consistent interreflections, and we introduce a prediction model for reducing the acquisition cost. We demonstrate our method on a variety of scenes and validate the accuracy on both synthetic and real examples.
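The "weighted sum of the decomposed components" can be made concrete under a simple reading: if component k collects radiance that interacted k times with the target surface, then changing the target's albedo from rho to rho' rescales that component by (rho'/rho)^k. The sketch below implements only this recombination step; the decomposition and acquisition themselves are the paper's contribution and are not reproduced.

```python
import numpy as np

def edit_albedo(components, rho_old, rho_new):
    """Recombine interaction components after an albedo edit: radiance that
    interacted k times with the target surface is rescaled by
    (rho_new / rho_old) ** k. components[k] holds the k-times-interacted
    image (k = 0 is light that never touched the target)."""
    ratio = rho_new / rho_old
    return sum((ratio ** k) * comp for k, comp in enumerate(components))

rng = np.random.default_rng(0)
components = [rng.uniform(0, 1, (32, 32, 3)) * 0.5 ** k for k in range(4)]
original = sum(components)
edited = edit_albedo(components, rho_old=0.6, rho_new=0.3)
print(original.mean(), edited.mean())
```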
2014
- CVPR’14: Scattering Parameters and Surface Normals from Homogeneous Translucent Materials using Photometric Stereo. Dong, Bo, Moore, Kathleen D., Zhang, Weiyi, and Peers, Pieter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
This paper proposes a novel photometric stereo solution to jointly estimate surface normals and scattering parameters from a globally planar, homogeneous, translucent object. Similar to classic photometric stereo, our method only requires as few as three observations of the translucent object under directional lighting. Naively applying classic photometric stereo results in blurred photometric normals. We develop a novel blind deconvolution algorithm based on inverse rendering for recovering the sharp surface normals and the material properties. We demonstrate our method on a variety of translucent objects.
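For reference, the classic Lambertian photometric-stereo step that this method builds on solves a small per-pixel least-squares system for the scaled normal; the sketch below shows that baseline only (the blind deconvolution for translucent materials described above is not included).

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Classic Lambertian photometric stereo: per pixel, solve L @ (rho * n) = I
    in the least-squares sense, then split albedo and unit normal. This is the
    baseline step; recovering sharp normals for translucent materials requires
    the deconvolution described in the paper."""
    h, w = images[0].shape
    I = np.stack([im.ravel() for im in images], axis=0)      # (num_lights, h*w)
    L = np.asarray(light_dirs, dtype=float)                  # (num_lights, 3)
    g, *_ = np.linalg.lstsq(L, I, rcond=None)                # (3, h*w) scaled normals
    albedo = np.linalg.norm(g, axis=0)
    normals = (g / (albedo + 1e-8)).T.reshape(h, w, 3)
    return normals, albedo.reshape(h, w)

# usage on a synthetic flat patch with true normal (0, 0, 1)
lights = [(0.3, 0.0, 1.0), (0.0, 0.3, 1.0), (-0.3, -0.3, 1.0)]
lights = [np.array(l) / np.linalg.norm(l) for l in lights]
n_true = np.array([0.0, 0.0, 1.0])
images = [np.full((8, 8), float(l @ n_true)) for l in lights]
normals, albedo = photometric_stereo(images, lights)
print(normals[0, 0], albedo.mean())
```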