HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks

Burak Ercan 1,2 Onur Eker 1,2 Canberk Sağlam 1,3 Aykut Erdem 4,5 Erkut Erdem 1
1 Hacettepe University, Computer Engineering Department
2 HAVELSAN Inc. 3 ROKETSAN Inc.
4 Koç University, Computer Engineering Department 5 Koç University, KUIS AI Center
IEEE Transactions on Image Processing, 2024

Paper

Code

Video

Here we present HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach extends existing static architectures with hypernetworks and dynamic convolutions, generating per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We show that this dynamic architecture can generate higher-quality videos than the previous state of the art, while also reducing memory consumption and inference time.
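As a quick illustration of the event representation mentioned above, the sketch below builds an event voxel grid from a stream of events using the temporal bilinear binning common to E2VID-style methods. It is a minimal PyTorch sketch; the bin count, normalization, and function name are illustrative assumptions rather than the exact HyperE2VID preprocessing.

import torch

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate events (pixel coordinates xs/ys, timestamps ts, polarities ps in
    {-1, +1}) into a (num_bins, H, W) voxel grid with temporal bilinear weighting."""
    ts, ps = ts.float(), ps.float()
    voxel = torch.zeros(num_bins * height * width, dtype=torch.float32)

    # Normalize timestamps to the range [0, num_bins - 1].
    t_norm = (ts - ts[0]) / max(float(ts[-1] - ts[0]), 1e-9) * (num_bins - 1)

    left_bin = t_norm.floor().long()
    right_frac = t_norm - left_bin.float()        # weight of the next temporal bin
    pix = ys.long() * width + xs.long()           # flattened pixel index

    # Each event splits its polarity between the two nearest temporal bins.
    voxel.index_add_(0, left_bin * height * width + pix, ps * (1.0 - right_frac))
    voxel.index_add_(0, (left_bin + 1).clamp(max=num_bins - 1) * height * width + pix,
                     ps * right_frac)
    return voxel.view(num_bins, height, width)

# Example call with assumed settings: 5 temporal bins for a 180 x 240 sensor.
# voxel = events_to_voxel_grid(xs, ys, ts, ps, num_bins=5, height=180, width=240)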

Approach Overview

Since events are generated asynchronously and only when the intensity of a pixel changes, the resulting event voxel grid is a sparse tensor that carries information only from the changing parts of the scene. The sparsity of these voxel grids also varies greatly over time. This makes it hard for neural networks to adapt to new data and leads to unsatisfactory video reconstructions containing blur, low contrast, or smearing artifacts. Unlike previous methods, which process this highly varying event data with static networks whose parameters are kept fixed after training, our proposed model, HyperE2VID, employs a dynamic neural network architecture. Specifically, we enhance the main network (a convolutional encoder-decoder architecture similar to E2VID) with dynamic convolutions whose parameters are generated dynamically at inference time via hypernetworks.
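To make the dynamic part of the architecture concrete, the following PyTorch sketch shows one way to realize a per-pixel dynamic convolution driven by a small hypernetwork: a toy 1x1 fusion layer (standing in for the context fusion module) combines the event voxel grid with the previous reconstruction, a filter-generating head predicts a k x k kernel for every pixel, and the kernels are applied to the feature map via unfold. The layer sizes, the softmax normalization, and the sharing of one filter across feature channels are simplifying assumptions, not the exact HyperE2VID design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PerPixelDynamicConv(nn.Module):
    """Toy per-pixel dynamic convolution: a hypernetwork head predicts a k x k
    filter for every spatial location from a fused context tensor, and the
    filter is applied to the input features at that location."""

    def __init__(self, context_channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Context fusion stand-in: mix voxel grid and previous frame with a 1x1 conv.
        self.fuse = nn.Conv2d(context_channels, 32, kernel_size=1)
        # Hypernetwork head: one k*k filter per pixel, shared across feature channels.
        self.filter_head = nn.Conv2d(32, kernel_size * kernel_size, kernel_size=3, padding=1)

    def forward(self, feats, voxel_grid, prev_frame):
        b, c, h, w = feats.shape
        context = torch.cat([voxel_grid, prev_frame], dim=1)
        filters = self.filter_head(F.relu(self.fuse(context)))   # (B, k*k, H, W)
        filters = torch.softmax(filters, dim=1)                  # normalize each per-pixel filter

        # Gather k x k neighborhoods of the features and apply the per-pixel filters.
        patches = F.unfold(feats, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        weighted = patches * filters.view(b, 1, self.k * self.k, h * w)
        return weighted.sum(dim=2).view(b, c, h, w)

# Usage with assumed sizes: 5 voxel-grid bins, a grayscale previous frame, 32 feature channels.
layer = PerPixelDynamicConv(context_channels=5 + 1)
out = layer(torch.randn(1, 32, 64, 64), torch.randn(1, 5, 64, 64), torch.randn(1, 1, 64, 64))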

For more details on the key aspects of our approach, please see our paper.

Results

To evaluate our method, we use sequences from three real-world datasets: the Event Camera Dataset (ECD), the Multi Vehicle Stereo Event Camera (MVSEC) dataset, and the High-Quality Frames (HQF) dataset. When high-quality, distortion-free ground truth frames are available, we evaluate the methods with three full-reference metrics: mean squared error (MSE), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS). To assess image quality under challenging scenarios such as low light and fast motion, where ground truth frames are of low quality, we use the no-reference metric BRISQUE.
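As a rough guide to how these full-reference metrics are typically computed, the sketch below uses scikit-image for SSIM and the lpips package for LPIPS on grayscale frames. The exact evaluation protocol (frame alignment, cropping, tone mapping) is handled by EVREAL and is not reproduced here; a BRISQUE implementation is available in, e.g., the piq library.

import lpips
import numpy as np
import torch
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance network from the lpips package

def full_reference_scores(pred, gt):
    """pred, gt: grayscale frames as float arrays in [0, 1] with shape (H, W)."""
    mse = float(np.mean((pred - gt) ** 2))
    ssim = structural_similarity(pred, gt, data_range=1.0)

    # LPIPS expects 3-channel tensors in [-1, 1] with shape (N, 3, H, W).
    to_tensor = lambda im: torch.from_numpy(im).float().repeat(3, 1, 1).unsqueeze(0) * 2 - 1
    with torch.no_grad():
        lp = float(lpips_fn(to_tensor(pred), to_tensor(gt)))
    return mse, ssim, lp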

Quantitative Results

Qualitative Results

EVREAL Result Analysis Tool

For more results and experimental analyses of HyperE2VID, please see the interactive result analysis tool of EVREAL (Event-based Video Reconstruction Evaluation and Analysis Library).

BibTeX

@article{ercan2024hypere2vid,
  title={{HyperE2VID}: Improving Event-Based Video Reconstruction via Hypernetworks},
  author={Ercan, Burak and Eker, Onur and Saglam, Canberk and Erdem, Aykut and Erdem, Erkut},
  journal={IEEE Transactions on Image Processing},
  year={2024},
  volume={33},
  pages={1826--1837},
  doi={10.1109/TIP.2024.3372460},
  publisher={IEEE}
}

Acknowledgements
This work was supported in part by KUIS AI Center Research Award, TUBITAK-1001 Program Award No. 121E454, and BAGEP 2021 Award of the Science Academy to A. Erdem.