This page lists my 30+ research publications, including IEEE/ACM/CVF/arXiv papers.
Tiny object detection is a challenging task. Many datasets for this task have been released in recent years, spanning from natural scenes to remote sensing images. However, wind turbines in satellite images, a significant category of tiny objects, have not been well covered. To help complete the landscape of tiny object datasets, we release TinyWT, a large-scale year-round tiny wind turbine dataset of satellite images. It contains 8k+ images and 700k+ annotations in total, with very tiny objects of 3-6 pixels, refined through extensive human correction. Unlike other tiny object datasets of aerial/satellite images that are limited to academic research only, our dataset is free for commercial use. Every pixel's geographic coordinates are also explicitly extracted for researchers without related domain knowledge. Meanwhile, we reposition tiny object detection as a localizing-and-counting problem, incorporate segmentation techniques, and propose a novel design that exploits the strengths of a contextual similarity constraint and supervised contrastive learning. Experimental results for both baseline models (CNN-based and Transformer-based) and our design are presented. Without bells and whistles, our design effectively improves the baseline models' performance, achieving a maximum mIoU gain of 4.94%, where 21.15% of false negatives are recalled and 22.02% of false positives are removed. TinyWT is available at https://github.com/MingyeZhu123/TinyWT-dataset.
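For readers who want a concrete starting point, below is a minimal sketch of the standard supervised contrastive loss that the design builds on. It is not the paper's full method (the contextual similarity constraint is omitted), and the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss (Khosla et al., 2020) sketch.
    features: (N, D) embeddings; labels: (N,) integer class labels."""
    features = F.normalize(features, dim=1)          # work in cosine space
    sim = features @ features.T / temperature        # (N, N) pairwise logits
    n = features.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    sim = sim.masked_fill(~not_self, float('-inf'))  # drop self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                           # anchors with positives
    sum_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    mean_log_prob_pos = sum_pos[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```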
Cropland segmentation of satellite images is an essential basis for crop area and yield estimation in the remote sensing and computer vision interdisciplinary community. Instead of common pixel-level segmentation results with salt-and-pepper effects, clients require a parcel-level output conforming to human recognition during model deployment. However, leveraging CNN-based models requires fine-grained parcel-level labels, an unacceptable annotation burden. To address these practical pain points, in this paper, we present PARCS, a holistic deployment-oriented AI system for PARcel-level Cropland Segmentation. By consolidating multi-disciplinary knowledge, PARCS has two algorithm branches. The first branch performs pixel-level crop segmentation by learning from limited labeled pixel samples with an active learning strategy, avoiding parcel-level annotation costs. The second branch generates the parcel regions without a learning procedure. The final parcel-level segmentation result is achieved by integrating the outputs of these two branches in tandem. The robust effectiveness of PARCS is demonstrated by its outstanding performance on public and in-house datasets (an overall accuracy of 85.3% and an mIoU of 61.7% on the public PASTIS dataset, and an mIoU of 65.16% on the in-house dataset). We also include subjective feedback from clients and discuss the lessons learned from deployment.
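PARCS's specific active learning strategy is detailed in the paper itself; as a generic illustration of the idea of spending the pixel-labeling budget where the model is least certain, here is a hedged margin-sampling sketch (the function name and margin criterion are assumptions, not PARCS internals).

```python
import numpy as np

def select_pixels_to_label(probs, budget):
    """Margin-based active learning: request labels for the candidate pixels
    whose top-1 and top-2 class probabilities are closest.
    probs: (N, C) softmax outputs for N unlabeled pixel samples."""
    top2 = np.sort(probs, axis=1)[:, -2:]      # two largest probabilities
    margin = top2[:, 1] - top2[:, 0]           # small margin = uncertain
    return np.argsort(margin)[:budget]         # indices to send for labeling
```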
Automated analysis of remote sensing (RS) imagery is the key to monitoring global issues. Hundreds of satellites collect plentiful RS data on a daily basis. However, most images remain unlabeled, so supervised learning algorithms are unable to make full use of the massive amounts of RS data. To address this issue, we leverage the benefits of generative methods and build a multispectral masked autoencoder (MAE) to learn RS representations from RGB and Near-infrared (RGBN) data. The results indicate that the features extracted from RS images are more effective than those from natural images for RS tasks. This domain gap means that self-supervised pre-training on RS imagery generalizes better to RS tasks. Moreover, the multispectral feature learned with a near-infrared signal increases the top-1 validation accuracy by 3.8%, showing that the multispectral feature is crucial in RS representation learning.
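As a sketch of the masked-autoencoder mechanics, the snippet below shows the standard random patch masking, assuming the RGBN image has already been tokenized into patch embeddings upstream; the 75% mask ratio follows common MAE practice rather than this paper's exact configuration.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style random masking. tokens: (B, L, D) patch embeddings.
    Returns visible tokens, a binary mask (1 = masked), and restore indices."""
    B, L, D = tokens.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L)                   # one random key per token
    ids_shuffle = noise.argsort(dim=1)         # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)   # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, L)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)  # back to original token order
    return visible, mask, ids_restore
```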
Earth observation satellites have been continuously monitoring the earth environment for years at different locations and spectral bands with different modalities. Due to complex satellite sensing conditions (e.g., weather, cloud, atmosphere, orbit), some observations for certain modalities, bands, locations, and times may not be available. The MultiEarth Matrix Completion Challenge in CVPR 2022 [1] provides multimodal satellite data for addressing such data sparsity challenges, with the Amazon Rainforest as the region of interest. This work proposes an adaptive real-time multimodal regression and generation framework and achieves superior performance on unseen test queries in this challenge, with an LPIPS of 0.2226, a PSNR of 123.0372, and an SSIM of 0.6347.
The MultiEarth 2022 Image-to-Image Translation challenge provides a well-constrained test bed for generating the corresponding RGB Sentinel-2 imagery from given Sentinel-1 VV & VH imagery. In this challenge, we designed various generation models and found that the SPADE [1] and pix2pixHD [2] models produced our best results. In our self-evaluation, the SPADE-2 model with L1 loss achieves an MAE of 0.02194 and a PSNR of 31.092 dB. In our final submission, the best model achieves an MAE of 0.02795, ranking No. 1 on the leaderboard.
The Agriculture-Vision Challenge at CVPR is one of the most prominent and competitive challenges bridging the computer vision and agriculture sectors, aiming at agricultural pattern recognition from aerial images. In this paper, we propose our solution to the third Agriculture-Vision Challenge in CVPR 2022. We leverage a data pre-processing scheme, several Transformer-based models, and data augmentation techniques to achieve an mIoU of 0.582, securing 2nd place in this challenge.
Single image dehazing is a challenging vision problem aiming to provide clear images for downstream computer vision applications (e.g., semantic segmentation, object detection, and super-resolution). Most existing methods leverage either the physical scattering model or convolutional neural networks (CNNs) for haze removal, yet ignore the complementary advantages between the two. In particular, lacking marginal and visual prior instructions, CNN-based methods still fall short in detail and color recovery. To address this, we propose a prior-based dehazing GAN network with decoupling ability (PDD-GAN), which is built on PeleeNet with an attention module (CBAM). The prior-based decoupling approach consists of two parts: high- and low-frequency filtering and an HSV contrastive loss. We process the image via a band-stop filter and add the result as a fourth channel (RGBFHL) to decouple the hazy image at the structural level. Besides, a novel prior loss with contrastive regularization is proposed at the visual level. Extensive experiments demonstrate that PDD-GAN outperforms state-of-the-art methods by up to 0.86 dB in PSNR. In particular, RGBFHL improves PSNR by 0.99 dB over the original three-channel data (RGB), and the extra HSV prior loss adds another 2.0 dB. Above all, PDD-GAN indeed has the decoupling ability and improves dehazing results.
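A rough sketch of the band-stop filtering idea behind the fourth channel: suppress a mid-frequency band in the Fourier domain and keep the rest. The single-channel input convention and the cutoff fractions are illustrative assumptions; the paper's actual filter design may differ.

```python
import numpy as np

def bandstop_channel(gray, low=0.05, high=0.35):
    """Suppress normalized radial frequencies in [low, high].
    gray: (H, W) float image in [0, 1]; returns the filtered channel."""
    H, W = gray.shape
    spec = np.fft.fftshift(np.fft.fft2(gray))
    yy, xx = np.mgrid[0:H, 0:W]
    radius = np.hypot((yy - H / 2) / H, (xx - W / 2) / W)
    keep = ~((radius >= low) & (radius <= high))   # pass lows and highs only
    out = np.fft.ifft2(np.fft.ifftshift(spec * keep)).real
    return np.clip(out, 0.0, 1.0)

# Hypothetical usage: stack as the extra channel of an RGBFHL-style input.
# rgbf = np.dstack([rgb, bandstop_channel(rgb.mean(axis=2))])
```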
Accurate parcel segmentation of remote sensing images plays an important role in supporting various downstream tasks. Traditionally, parcel segmentation is based on supervised learning using precise parcel-level ground truth information, which is difficult to obtain. In this paper, we propose an end-to-end unsupervised Graph Convolutional Network (GCN)-based framework for superpixel-driven parcel segmentation of remote sensing images. The key component is a novel graph-based superpixel aggregation model, which effectively learns superpixels' latent affinities and better aggregates similar ones in the spatial and spectral spaces. We construct a multi-temporal multi-location testing dataset using Sentinel-2 images and ground truth annotations in four different regions. Extensive experiments demonstrate the efficacy and robustness of our proposed model, which achieves the best performance among the competing methods.
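The paper learns superpixel affinities with a GCN; as a much simpler stand-in that conveys the aggregation step, the sketch below greedily merges adjacent superpixels whose mean spectral features are similar. The function names and the cosine threshold are illustrative, not the paper's model.

```python
import numpy as np

def merge_superpixels(labels, features, threshold=0.95):
    """Greedy aggregation stand-in. labels: (H, W) superpixel id map;
    features: (n_superpixels, d) mean spectral feature per superpixel."""
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]),
                 (labels[:-1, :], labels[1:, :])):
        edge = a != b                          # horizontally/vertically adjacent
        pairs |= set(zip(a[edge].tolist(), b[edge].tolist()))
    parent = list(range(len(features)))
    def find(x):                               # union-find representative
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    for i, j in pairs:
        if f[i] @ f[j] > threshold:            # spectrally similar neighbors
            parent[find(i)] = find(j)          # merge the two parcels
    return np.vectorize(find)(labels)
```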
Fashion compatibility learning is important to many fashion markets, such as outfit composition and online fashion recommendation. Unlike previous work, we argue that fashion compatibility is not only a visual-appearance problem but also a theme-matters problem. An outfit, which consists of a set of fashion items (e.g., shirt, suit, shoes, etc.), may be considered compatible for a “dating” event, yet not for a “business” occasion. In this paper, we aim at solving the fashion compatibility problem given specific themes. To this end, we built the first real-world theme-aware fashion dataset, comprising around 14K outfits labeled with 32 themes. In this dataset, more than 40K fashion items are labeled with 152 fine-grained categories. We also propose an attention model that learns fashion compatibility given a specific theme. It starts with category-specific subspace learning, which projects compatible outfit items in certain categories to be close in the subspace. Thanks to strong connections between fashion themes and categories, we then build a theme-attention model over the category-specific embedding space. This model associates themes with the pairwise compatibilities through attention, and thus computes the outfit-wise compatibility. To the best of our knowledge, this is the first attempt to estimate outfit compatibility conditioned on a theme. We conduct extensive qualitative and quantitative experiments on our new dataset, and our method outperforms state-of-the-art approaches.
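A deliberately small, hypothetical sketch of the theme-attention idea: the theme attends over category pairs, and the outfit score is the attention-weighted sum of pairwise compatibilities. The tensor shapes and the dot-product attention are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def outfit_score(pair_scores, pair_category_emb, theme_emb):
    """pair_scores: (P,) pairwise compatibilities of item pairs in an outfit;
    pair_category_emb: (P, d) embedding of each category pair;
    theme_emb: (d,) embedding of the given theme."""
    attn = F.softmax(pair_category_emb @ theme_emb, dim=0)  # (P,) weights
    return (attn * pair_scores).sum()                       # outfit-wise score
```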
To reduce the significant redundancy in deep Convolutional Neural Networks (CNNs), most existing methods prune neurons by only considering the statistics of an individual layer or two consecutive layers (e.g., pruning one layer to minimize the reconstruction error of the next layer), ignoring the effect of error propagation in deep networks. In contrast, we argue that it is essential to prune neurons in the entire network jointly based on a unified goal: minimizing the reconstruction error of important responses in the “final response layer” (FRL), which is the second-to-last layer before classification, for a pruned network to retain its predictive power. Specifically, we apply feature ranking techniques to measure the importance of each neuron in the FRL, formulate network pruning as a binary integer optimization problem, and derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network. The CNN is pruned by removing neurons with the least importance, and then fine-tuned to retain its predictive power. NISP is evaluated on several datasets with multiple CNN models and demonstrated to achieve significant acceleration and compression with negligible accuracy loss.
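The propagation rule has a compact closed form: a neuron's importance is the absolute-weight-weighted sum of the importance scores of the neurons it feeds. Below is a minimal NumPy sketch for fully connected layers (convolutional layers require unrolling, omitted here); variable names are illustrative.

```python
import numpy as np

def propagate_importance(weights, frl_scores):
    """NISP-style backward propagation of importance scores.
    weights: list of (out_dim, in_dim) matrices, first layer to FRL;
    frl_scores: (out_dim,) importance of the final response layer."""
    scores = [None] * (len(weights) + 1)
    scores[-1] = frl_scores
    for l in range(len(weights) - 1, -1, -1):
        # Each input neuron inherits importance through absolute weights.
        scores[l] = np.abs(weights[l]).T @ scores[l + 1]
    return scores

# Prune: per layer, keep the top-k neurons by score, then fine-tune.
```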
Modern brain mapping techniques are producing increasingly large datasets of anatomical or functional connection patterns. Recently, it became possible to record detailed live imaging videos of a mammalian brain while the subject engages in routine activity. We analyze videos recorded from ten mice to describe how to detect neurons, extract neuron signals, map the correlation of neuron signals to mouse activity, detect the network topology of active neurons, and analyze network topology characteristics. We propose a neuron position alignment method to compensate for the distortion and movement of the cerebral cortex in the live mouse brain, and background luminance compensation to extract and model neuron activity. To infer the network topology, a cross-correlation-based method and a causal Bayesian network method are proposed and used for analysis. Afterwards, we conduct a preliminary analysis of the resulting network topologies. The significance of this paper lies in extracting neuron activities from live mouse brain imaging videos and in a network analysis method for studying their topology.
Modern brain mapping techniques are producing increasingly large datasets of anatomical or functional connection patterns. Recently, it became possible to record detailed live imaging videos of a mammalian brain while the subject engages in routine activity. We analyze a dataset of videos recorded from ten mice to describe how to detect neurons, extract neuron signals, map the correlation of neuron signals to mouse activity, detect the network topology of active neurons, and analyze network topology characteristics. We propose neuron position alignment to compensate for the distortion and movement of the cerebral cortex in the live mouse brain, and background luminance compensation to extract and model neuron activity. To infer the network topology as an undirected graph model, a cross-correlation-based method is proposed and used for analysis. Afterwards, we conduct a preliminary analysis of the resulting network topologies. The significance of this paper lies in extracting neuron activities from live mouse brain imaging videos and in a network analysis method that can potentially provide insight into how neurons are actively connected under stimulus, rather than analyzing static neural networks.
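As a minimal illustration of the cross-correlation analysis (reduced here to the zero-lag case), the sketch below links two neurons when the correlation of their activity traces exceeds a threshold; the threshold value and the zero-lag simplification are assumptions.

```python
import numpy as np

def correlation_network(signals, thresh=0.6):
    """Build an undirected functional network from neuron activity traces.
    signals: (n_neurons, T) array of extracted signals."""
    z = (signals - signals.mean(1, keepdims=True)) \
        / (signals.std(1, keepdims=True) + 1e-8)
    corr = z @ z.T / signals.shape[1]          # zero-lag Pearson correlations
    adj = np.abs(corr) > thresh                # strong positive or negative links
    np.fill_diagonal(adj, False)               # no self-loops
    return adj
```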
The advances of mobile computing and sensor technology have turned mobile devices into powerful instruments. The integration of thermal and visual cameras extends the capability of computer vision, since the two modalities reveal different characteristics of a scene; however, image alignment is a challenge. This paper proposes an effective approach to align image pairs for event detection on mobile devices through image recognition. We leverage thermal and visual cameras as multi-modality sources for image recognition. By analyzing the heat pattern, the proposed app can identify heating sources and help users inspect their house heating system; on the other hand, by applying image recognition, the proposed app can further help field workers identify asset conditions and provide guidance to solve their issues.
The difficulty of vision-based posture estimation is greatly decreased with the aid of commercial depth cameras, such as the Microsoft Kinect. However, there is still much to do to bridge the results of human posture estimation and the understanding of human movements. Human movement assessment is an important technique for exercise learning in the field of healthcare. In this paper, we propose an action tutor system which enables the user to interactively retrieve a learning exemplar of the target action movement and to immediately acquire motion instructions while learning it in front of the Kinect. The proposed system is composed of two stages. In the retrieval stage, nonlinear time warping algorithms are designed to retrieve video segments similar to the query movement roughly performed by the user. In the learning stage, the user learns according to the selected video exemplar, and the motion assessment, including both static and dynamic differences, is presented to the user in an effective and organized way, helping him/her to perform the action movement correctly. Experiments are conducted on videos of ten action types, and the results show that the proposed human action descriptor is representative for action video retrieval and that the tutor system can effectively help the user learn action movements.
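To make the retrieval stage concrete, here is a minimal dynamic time warping sketch over pose-feature sequences, one common non-linear time warping formulation; the paper's exact warping algorithms and features may differ.

```python
import numpy as np

def dtw_distance(query, candidate):
    """DTW between two sequences of pose features.
    query: (n, d) array; candidate: (m, d) array."""
    n, m = len(query), len(candidate)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(query[i - 1] - candidate[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Retrieval: rank stored video segments by dtw_distance(query, segment).
```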
Graph technologies have been widely utilized for building big data analytics systems. Since those systems are typically wrapped as service providers in industry, it is critical to handle concurrent queries at runtime by incorporating a set of parallel processing units. In many cases, such queries result in local subgraph traversals, which essentially require an efficient scheduling scheme to explore the tradeoff between workload balance and task affinity. In this paper, we present an auction-based approach for allocating concurrent subgraph traversals onto processors. A dynamic weighted bipartite graph is built to model both the affinity between subgraph traversals and processors and the workload of processors. In particular, an edge between a task and a processor in the bipartite graph represents that the data needed by this task is likely cached by this processor. Task vertices and edges are dynamically added or removed, and a heavier edge weight represents a stronger belief in the affinity. The edge weight is also governed by the current workload of the corresponding processor. We perform a parallel auction algorithm to compute a near-optimal assignment of subgraph traversal tasks onto processors, which therefore addresses both workload balance and task affinity. The auction algorithm is performed incrementally, so as to capture changes in the bipartite graph structure. Our experiments show the superior performance of the proposed method for various real-world use cases based on concurrent subgraph traversals.
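For intuition, below is a serial sketch of the classic auction assignment algorithm on a static benefit matrix; the paper's incremental, parallel variant over a dynamic bipartite graph (with affinity- and workload-driven weights) is not shown, and the square-matrix setup is a simplifying assumption.

```python
import numpy as np

def auction_assignment(benefit, eps=0.01):
    """Bertsekas-style auction: tasks (rows) bid for processors (columns)
    to maximize total benefit. benefit: (n, n) array, n >= 2."""
    n = benefit.shape[0]
    prices = np.zeros(n)
    owner = -np.ones(n, dtype=int)             # owner[j]: task holding proc j
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices           # net value of each processor
        j = int(values.argmax())
        top2 = np.partition(values, -2)[-2:]   # [second best, best]
        prices[j] += top2[1] - top2[0] + eps   # bid up the winning processor
        if owner[j] != -1:
            unassigned.append(owner[j])        # evict the previous holder
        owner[j] = i
    return owner                               # processor -> task assignment
```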
Filtering is widely used in image and video processing for various applications. Recently, the guided filter was proposed and has become one of the most popular filtering methods. In this paper, to meet the computational demand of guided filtering on Full-HD video, a double integral image architecture for a guided filter ASIC design is proposed. In addition, a reformulation of the guided filter formula is proposed, which prevents errors resulting from truncation of the fractional part and allows the regularization parameter ε to be modified on the user's demand. The hardware architecture of the guided image filter is then proposed and can be embedded in mobile devices to achieve real-time HD applications. To the best of our knowledge, this work is also the first ASIC design for the guided image filter. With the TSMC 90nm cell library, the design can operate at 100MHz and support Full-HD (1920x1080) video at 30 fps with 92.9K gate counts and 3.2KB of on-chip memory. Moreover, in terms of hardware efficiency, our architecture also surpasses previous bilateral filter designs.
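For reference, a software sketch of the grayscale guided filter computed with integral images, i.e., the arithmetic the double integral image architecture accelerates in hardware; the radius and ε below are illustrative.

```python
import numpy as np

def box(img, r):
    """Mean filter of radius r using an integral image (summed-area table)."""
    H, W = img.shape
    s = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    y0 = np.clip(np.arange(H) - r, 0, H)
    y1 = np.clip(np.arange(H) + r + 1, 0, H)
    x0 = np.clip(np.arange(W) - r, 0, W)
    x1 = np.clip(np.arange(W) + r + 1, 0, W)
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return (s[y1][:, x1] - s[y1][:, x0] - s[y0][:, x1] + s[y0][:, x0]) / area

def guided_filter(I, p, r=8, eps=1e-3):
    """He et al.'s guided filter: q = mean(a)*I + mean(b), a = cov/(var+eps)."""
    mean_I, mean_p = box(I, r), box(p, r)
    var_I = box(I * I, r) - mean_I ** 2
    cov_Ip = box(I * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I
    return box(a, r) * I + box(b, r)
```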
Efficient image query is a fundamental challenge in many large-scale multimedia applications, especially when handling many queries concurrently. In this paper, we propose a novel approach called graph local random walk for high-performance concurrent image query. Specifically, we organize the massive image set into a large-scale graph using a graph database, according to the similarity between images. A heuristic method is used to map each query image to some vertex in the graph, followed by a local search that refines the query results using a variant of local random walk on the graph. The local random walk process is essentially a weighted partial traversal of the local subgraphs to find a better match for the query images. We organize the graph of the image set in a parallelization-amenable way, so that a set of partial graph traversals for local random walks can be performed concurrently, taking advantage of the multithreading capability of processors. We implemented the proposed method on state-of-the-art multicore platforms. The experimental results show that the graph local random walk based approach outperforms baseline methods in terms of both throughput and scalability.
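A toy sketch of the local random walk from the mapped vertex: mass spreads along similarity-weighted edges with a restart back to the query vertex, and high-scoring vertices become refined matches. The adjacency format, restart probability, and step count are illustrative, and the concurrent partial-traversal machinery is omitted.

```python
def local_random_walk(adj, seed, restart=0.15, steps=20):
    """Random walk with restart over a local subgraph.
    adj: dict vertex -> list of (neighbor, similarity weight)."""
    scores = {seed: 1.0}
    for _ in range(steps):
        nxt = {seed: restart}                  # restart mass at the query
        for v, mass in scores.items():
            nbrs = adj.get(v, [])
            if not nbrs:                       # dangling vertex: return to seed
                nxt[seed] = nxt.get(seed, 0.0) + (1 - restart) * mass
                continue
            total = sum(w for _, w in nbrs)
            for u, w in nbrs:                  # spread mass by similarity
                nxt[u] = nxt.get(u, 0.0) + (1 - restart) * mass * w / total
        scores = nxt
    return sorted(scores.items(), key=lambda kv: -kv[1])  # best matches first
```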
In this work, we propose a convenient system for trip planning, aiming to change the behavior of trip planners from exhaustively searching for information to receiving useful travel recommendations. Given the essential and optional user inputs, our system automatically recommends a route that suits the traveler based on a real-time route planning algorithm and allows the user to make adjustments according to their preferences. We construct a traveling database by collecting photos taken around famous attractions and analyzing these photos to extract each attraction's travel information, including popularity, typical stay time, available visiting time in a day, and visual scenes at different times. All the extracted travel information is presented to the user to help him/her efficiently learn about different attractions, so that he/she can modify the inputs to obtain a more favorable travel route. The experimental results show that our system can effectively help the user plan a journey.
Tennis Real Play (TRP) is an interactive tennis game system constructed with models extracted from videos of real matches. The key techniques proposed for TRP include player modeling and video-based player/court rendering. For player model creation, we propose a database normalization process and a behavioral transition model of tennis players, which might be a good alternative to motion capture in conventional video games. For player/court rendering, we propose a framework for rendering vivid game characters in real time; image-based rendering leads to a more interactive and realistic result. Experiments show that video games with vivid viewing effects and characteristic players can be generated from match videos without much user intervention. Because the player model can adequately record the ability and condition of a player in the real world, it can then be used to roughly predict the results of real tennis matches in the coming days. The results of a user study reveal that subjects like the increased interaction, immersive experience, and enjoyment of playing TRP.
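As a minimal illustration of a behavioral transition model, the sketch below treats player behavior as a first-order Markov chain over action states estimated from match videos; this representation is an assumption for illustration, not necessarily the paper's exact model.

```python
import numpy as np

def next_action(current, transition, rng=None):
    """Sample the player's next action from a row-stochastic transition
    matrix. current: action index; transition: (A, A) matrix whose rows
    are estimated from annotated match videos."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(transition), p=transition[current])
```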
Scalable video research aims to provide video bitstreams of different sizes under different transmission bandwidths. In this paper, semantic scalability is proposed to provide scalable videos in the semantic domain, with tennis videos used in the experiments. Instead of decreasing video quality to reduce the bitrate, a smaller bitstream is achieved by discarding video content of lower semantic importance. The experimental results show that the proposed semantic scalability provides four levels of scalable video and maintains visual quality when watching the game video. In a user study, evaluators found the visual quality of semantic scalability more acceptable and the game information clearer than with Scalable Video Coding. The proposed semantic-domain scalability provides a new perspective on scalable video.
With the aid of depth cameras, such as the Microsoft Kinect, the difficulty of vision-based posture estimation is greatly decreased, and human action analysis has achieved a wide range of applications. However, there is still much to do to develop effective movement assessment techniques that bridge the results of human posture estimation and the understanding of human action performance. In this work, we propose an action tutor system which enables the user to interactively retrieve a learning exemplar of the target action movement and to immediately acquire motion instructions while learning it in front of the Kinect. In the retrieval stage, non-linear time warping algorithms are designed to retrieve video segments similar to the query movement roughly performed by the user. In the learning stage, the user learns according to the selected video exemplar, and the motion assessment, including both static and dynamic differences, is presented to the user in an effective and organized way, helping him/her to perform the action movement correctly.
Estimating an object's pose from a camera is a well-studied topic in computer vision. In theory, the pose from a calibrated camera can be uniquely determined. In practice, however, most real-time pose estimation algorithms suffer from pose ambiguity due to the low measurement accuracy of the target object. We argue that pose ambiguity (two distinct local minima of the corresponding error function) exists because of the phenomenon of geometric illusions, and both ambiguous poses are plausible. After obtaining the two minima (pose candidates), we develop a real-time algorithm for stable pose estimation of a target object with a motion model. In the experimental results, the proposed algorithm effectively diminishes pose jumping and pose jittering. To the best of our knowledge, this is the first work to solve the pose ambiguity problem with a motion model in real-time applications.
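A minimal sketch of disambiguation with a motion model: extrapolate the recent pose trajectory and keep the candidate closest to the prediction. The rotation-vector pose representation and the constant-velocity model are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def pick_pose(candidates, history, alpha=1.0):
    """Choose between the two ambiguous pose candidates.
    candidates: list of pose vectors (e.g., rotation vectors);
    history: list of previously chosen pose vectors (at least two)."""
    prev, curr = history[-2], history[-1]
    predicted = curr + alpha * (curr - prev)   # constant-velocity extrapolation
    errors = [np.linalg.norm(c - predicted) for c in candidates]
    return candidates[int(np.argmin(errors))]  # the smoother continuation
```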
Spectral graph methods are widely employed in image segmentation and exhibit excellent performance. However, for high-resolution images, it is impractical to directly calculate the eigenvectors of the affinity matrix owing to the high computational requirements. The Nyström method provides an efficient way to approximate the large-scale affinity matrix by low-rank approximation. In the machine learning field, previous studies have mainly focused on fewer data points with high-dimensional features. To the best of our knowledge, this is the first study to discuss the performance of sampling methods for the Nyström approximation, focusing on the pixel-wise affinity matrix of a single image. In this paper, we propose a mean-shift segmentation-based Nyström sampling technique for image analysis. The experimental results show that for images with simple compositions and backgrounds, k-means sampling performs better, whereas for images with more complicated compositions and backgrounds, the proposed method performs better.
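For concreteness, a one-shot Nyström sketch that approximates the leading eigenvectors of a pixel-wise Gaussian affinity matrix from a sampled pixel subset (as the proposed mean-shift-based sampling would supply); the orthogonalization step used in some variants is omitted, and σ is illustrative.

```python
import numpy as np

def nystrom_eigenvectors(pixels, sample_idx, sigma=0.1, k=4):
    """pixels: (n, d) per-pixel features; sample_idx: m sampled indices.
    Returns (n, k) approximate leading eigenvectors of the affinity matrix."""
    S = pixels[sample_idx]
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))             # (m, m) sampled block
    d2_all = ((pixels[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    C = np.exp(-d2_all / (2 * sigma ** 2))         # (n, m) cross affinities
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(evals)[-k:]                   # largest eigenvalues of A
    U = C @ evecs[:, top] / evals[top]             # Nystrom extension u = Cv/lambda
    return U / np.linalg.norm(U, axis=0, keepdims=True)
```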
Tennis videos are used as an example for the implementation of a viewing program called Tennis Video 2.0. For video analysis, background generation that considers the temporal and spatial distribution of pixels is proposed, along with foreground segmentation combining automatic trimap generation and a matting model. To provide more functions for watching videos, the rendering flow of video contents and semantic scalability are proposed. With the new analysis and rendering tools, the presentation of sports videos has three properties: Structure, Interactivity, and Scalability. Several broadcast game videos are employed in experiments to evaluate the robustness and performance of the proposed system. In a user study, 20 evaluators strongly agreed that Tennis Video 2.0 is a new presentation of sports videos that gives people a better viewing experience.
Tennis Real Play (TRP) is an interactive tennis game system constructed with models extracted from videos of real matches. The key techniques proposed for TRP include player modeling and video-based player/court rendering. For player model creation, we propose a database normalization process and a behavioral transition model of tennis players, which might be a good alternative to motion capture in conventional video games. For player/court rendering, we propose a framework for rendering vivid game characters in real time; image-based rendering leads to a more interactive and realistic result. Experiments show that video games with vivid viewing effects and characteristic players can be generated from match videos without much user intervention. Because the player model can adequately record the ability and condition of a player in the real world, it can then be used to roughly predict the results of real tennis matches in the coming days. The results of a user study reveal that subjects like the increased interaction, immersive experience, and enjoyment of playing TRP.
Image-based rendering (IBR) is a technique to render video from images, offering users more interaction and an immersive experience when watching a video. In this paper, we integrate the computation of several IBR applications, analyze the memory access bandwidth, and design an architecture to process the IBR computation. Experimental results show that the proposed IBR Engine is able to render video at 720×480 resolution and 30 frames per second, which is 12.7 times faster than a Core 2 Duo 2.83 GHz CPU. As an extension, the IBR Engine can be embedded in television systems to let viewers enjoy IBR functions.
Image segmentation is a well-developed topic in image processing, and a number of previous works have achieved high performance. However, most previous works need user assistance to provide prior information about the target object for segmentation. In this paper, we propose an unsupervised scheme, combining salient object detection and a segmentation method, to segment the target object without any prior information from users. The experimental results show that the proposed salient color model, derived from salient features, can provide prior information with high confidence to generate precise segmentations automatically. The proposed color model of salient objects can not only be applied with the Min-Cut algorithm but also extended to other segmentation algorithms, such as matting or non-parametric models.
Tennis Real Play (TRP) is an interactive tennis game system constructed with models extracted from real game videos. The key techniques proposed for TRP include player modeling and video-based player/court rendering. Experiments show that vivid rendering results can be generated.
Image-based rendering has been highly developed for its wide applications, such as view synthesis and special effects in movies. In this paper, we propose a tennis player rendering system that synthesizes diverse player actions/motions based on a database extracted from broadcast game videos. The system builds the database by retrieving the player from videos and synthesizes various kinds of player actions/motions according to the user's instructions. The results show that the proposed rendering system can render smooth action/motion transitions with satisfactory visual effects. For further applications, the proposed system can be used in interactive tennis games with image textures.
Scalable video research aims to provide video bitstreams of different sizes under different transmission bandwidths. In this paper, semantic scalability is proposed to provide scalable videos in the semantic domain, with tennis videos used in the experiments. Instead of decreasing video quality to reduce the bitrate, a smaller bitstream is achieved by discarding video content of lower semantic importance. The experimental results show that the proposed semantic scalability provides four levels of scalable video and maintains visual quality when watching the game video. This study of scalability in the semantic domain provides a new perspective on scalable video.
A sprite is an image constructed from video clips and is also a medium for multimedia applications. An automatic sprite generation method with foreground removal and super-resolution is proposed in this paper. To remove foreground objects, each pixel value in the sprite is iteratively updated to the value with the maximum appearance probability over the temporal and spatial distribution. By storing half-pixel values, the super-resolution sprite suffers fewer blurring defects than the source video. As a result, the generated sprite preserves the complete background scene with higher image quality; it can be used to increase the visual quality of current sprite applications and also to facilitate video segmentation.
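A sketch of the maximum-appearance-probability update, reduced to a per-pixel temporal histogram mode over frames already warped into the sprite coordinate system; grayscale input and the bin count are simplifying assumptions.

```python
import numpy as np

def sprite_background(aligned_frames, bins=32):
    """aligned_frames: (T, H, W) grayscale stack in [0, 255].
    Returns the per-pixel most frequently observed value (foreground-free)."""
    q = aligned_frames.astype(np.int32) * bins // 256   # quantize to bins
    T, H, W = aligned_frames.shape
    mode = np.zeros((H, W), dtype=np.int32)
    best = np.zeros((H, W), dtype=np.int32)
    for b in range(bins):                               # per-pixel histogram mode
        count = (q == b).sum(axis=0)
        better = count > best
        best[better] = count[better]
        mode[better] = b
    return (mode * 256 + 128) // bins                   # back to bin-center value
```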
Sports video enrichment can provide viewers with more interaction and better user experiences. In this paper, with tennis videos as an example, two techniques are proposed for video enrichment: content layer separation and real-time rendering. The video content is decomposed into different layers, such as the field, players, and ball, and the enriched video is rendered by re-integrating the information of these layers. Both are executed in the sprite plane to avoid complex 3D model construction and rendering. Experiments show that the system can generate natural and seamlessly edited video according to viewers' requests, and a real-time processing speed of 30 frames per second at 720×480 resolution can be achieved on a 3 GHz CPU.
Sports video annotation can help viewers easily browse video content and quickly find the hot events and highlights in a game. Although many annotation algorithms have been proposed, they are not suitable for practical implementation due to their unacceptably high complexity and low precision rates. In this paper, a method of sports video temporal structure decomposition, which decomposes the video into many clips, is proposed. Score box information and additional semantic information then serve as important clues for event annotation. Experimental results show that the proposed algorithm can successfully and effectively decompose video into clips. The annotation results also have extremely high precision and recall rates for both baseball and tennis videos.
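As a simple stand-in for the temporal structure decomposition step, the sketch below detects clip boundaries as abrupt changes in frame color histograms; the bin count and threshold are illustrative, and the paper additionally exploits domain cues such as the score box.

```python
import numpy as np

def clip_boundaries(frames, bins=16, thresh=0.4):
    """frames: (T, H, W, 3) uint8 video. Returns indices starting new clips."""
    hists = []
    for f in frames:
        h, _ = np.histogramdd(f.reshape(-1, 3), bins=(bins,) * 3,
                              range=((0, 256),) * 3)
        hists.append(h.ravel() / h.sum())               # normalized color hist
    boundaries = [0]
    for t in range(1, len(hists)):
        if np.abs(hists[t] - hists[t - 1]).sum() > thresh:  # L1 distance in [0, 2]
            boundaries.append(t)                        # abrupt content change
    return boundaries
```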
This video demo presents a new framework of sports video applications called Tennis Video 2.0. The proposed information extraction scheme retrieves the temporal structure of a video and separates the video's foreground and background objects into different layers. With the structure and layer information, new multimedia content is generated. Unlike conventional video content, the proposed new multimedia enables users to generate their own content and send requests back to the video player for more interaction. Users can even share their created content with friends under different transmission bandwidths by considering the semantics.