🚦 Computer Vision for ITS
Younggun Kim*, Mohamed Abdel-Aty, Keechoo Choi, Zubayer Islam, Dongdong Wang, and Shaoyan Zhai
🎉 IEEE Open Journal of Intelligent Transportation Systems (OJ-ITS), 2025. [Impact Factor: 5.3, JCR Quartiles: Q1]
Motivation: With the rise of smart intersections, accurately predicting a pedestrian’s crossing direction at the intersection level is essential for enhancing pedestrian safety and optimizing traffic signal control. However, this task is challenging due to the diverse configurations of real-world intersections, each with different crosswalk orientations, geographic layouts, and CCTV placements and angles.
Methodology: To address this, we propose a geometric-invariant space embedding method. As illustrated in the video, our approach enables pedestrians captured at different intersections, with varying CCTV angles and crosswalk orientations, to be embedded into a unified spatial representation. This standardization allows robust learning of crossing behaviors. Furthermore, we adopt a Transformer-based encoder, achieving an accuracy of 94.10% and an F1-score of 92.35%.
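A minimal sketch of how such a Transformer encoder could be wired up in PyTorch, assuming pedestrian tracks have already been mapped into the unified, intersection-agnostic coordinate frame; all layer sizes and the two-class output head are illustrative, not the published configuration:

```python
# Minimal sketch (not the published code): a Transformer encoder that classifies
# crossing direction from pedestrian tracks expressed in a shared,
# intersection-agnostic coordinate frame. All dimensions are illustrative.
import torch
import torch.nn as nn

class CrossingDirectionClassifier(nn.Module):
    def __init__(self, in_dim=2, d_model=64, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)           # per-frame (x, y) in the unified space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)         # e.g., crossing vs. not crossing

    def forward(self, tracks):                            # tracks: (batch, time, in_dim)
        h = self.encoder(self.embed(tracks))              # temporal self-attention over each track
        return self.head(h.mean(dim=1))                   # pool over time, then classify

logits = CrossingDirectionClassifier()(torch.randn(8, 30, 2))  # 8 tracks, 30 frames each
```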
Impact: The proposed method demonstrates strong potential for real-world deployment. Its robustness to intersection variability makes it suitable for integration into city-wide intelligent traffic management systems to proactively ensure pedestrian safety at signalized intersections.
Ongoing project
Motivation: In smart intersections, accurately assessing vehicle-to-vehicle safety is crucial. This requires estimating safety surrogate measures such as relative distance. However, limited CCTV viewpoints, typically front-facing, lack depth information, making it infeasible to compute inter-vehicle distances directly from monocular footage.
Methodology: We calibrate top-down drone views with CCTV footage to construct vehicle bounding boxes in both the front and bottom perspectives. By matching vehicles captured in the drone view with those in the CCTV footage, we build a cross-view dataset that includes bottom bounding boxes. A model is then trained to estimate both bounding boxes from CCTV-only views, enabling accurate spatial localization without relying on drone footage.
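The cross-view calibration step can be pictured as a homography fit between matched ground-plane points in the two views, as in the sketch below; the point coordinates and the `drone_box_to_cctv` helper are hypothetical stand-ins for the actual calibration procedure:

```python
# Hedged sketch of the cross-view calibration idea (assumed, not the authors' code):
# a homography fit on manually matched ground points maps drone-view (top-down)
# coordinates into the CCTV image, so drone footprints can label "bottom" boxes.
import numpy as np
import cv2

# Corresponding ground-plane points picked in both views (illustrative values; at least four).
drone_pts = np.float32([[100, 200], [400, 210], [390, 500], [110, 480]])
cctv_pts  = np.float32([[320, 610], [780, 590], [820, 900], [280, 930]])

H, _ = cv2.findHomography(drone_pts, cctv_pts, cv2.RANSAC)

def drone_box_to_cctv(box_xy):
    """Project a drone-view footprint polygon (N x 2) into CCTV pixel coordinates."""
    pts = np.float32(box_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

cctv_footprint = drone_box_to_cctv([[150, 250], [180, 250], [180, 300], [150, 300]])
```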
Impact: This work enables real-time vehicle safety monitoring using only existing CCTV infrastructure. By bridging the gap between aerial and CCTV footage, it allows accurate estimation of surrogate safety measures at intersections, contributing to proactive vehicle conflict detection and urban traffic safety management.
Project under the National Science Foundation (NSF) and the Center for Smart Streetscapes (CS3)
Motivation: Vehicles and pedestrians continuously influence each other's movement, particularly through behaviors such as yielding or hesitation. Accurately predicting the future trajectories of both agents is essential to improve road users' safety at intersections. A joint understanding enables proactive conflict detection and enhances traffic safety.
Methodology: We leverage a Graph Convolutional Network (GCN)-based framework to jointly encode the historical trajectories of vehicles and pedestrians into a shared latent feature space. The model captures mutual interactions between the two agent types and learns to predict their future trajectories concurrently.
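A minimal sketch of a single graph-convolution step over an agent interaction graph, assuming each node already carries an encoded trajectory history; the layer and its dimensions are illustrative rather than the published architecture:

```python
# Minimal sketch (assumption, not the published model): one graph-convolution step
# over an interaction graph whose nodes are agents (vehicles and pedestrians) and
# whose edges connect agents close enough to influence each other.
import torch
import torch.nn as nn

class AgentGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (n_agents, in_dim) encoded trajectory histories; adj: (n_agents, n_agents)
        a = adj + torch.eye(adj.size(0))                            # add self-loops
        d_inv_sqrt = a.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]      # symmetric normalization
        return torch.relu(a_norm @ self.lin(x))                     # aggregate neighbor features

h = AgentGCNLayer(32, 64)(torch.randn(5, 32), torch.ones(5, 5))     # 5 interacting agents
```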
Impact: Our approach facilitates accurate multi-agent trajectory forecasting at intersections using observed movement patterns. By modeling pedestrian–vehicle interactions, it enhances the reliability of safety assessment and supports the development of smarter traffic control systems that account for both pedestrians and vehicles.
🚘 Novel Deep Learning Architectures for 3D LiDAR Recognition
Younggun Kim and Soomok Lee*
🎉 Asian Conference on Computer Vision (ACCV), 2024. [BK21(Brain Korea) Distinguished Conference Paper List]
Motivation: LiDAR is one of the crucial sensors for autonomous vehicles (AVs), offering accurate 3D spatial information essential for object recognition and safe navigation. However, the quality of LiDAR point clouds varies significantly with the number of channels (e.g., 32CH, 64CH, 128CH), and high-channel sensors are often too expensive for wide deployment. This creates a challenge, as deep learning models trained on high-resolution data often underperform when applied to lower-resolution inputs. Thus, it is critical to develop recognition models that are robust to LiDAR channel variations.
Methodology: We developed 3D-ASCN, which recognizes objects using 3D point cloud data from LiDAR sensors. To address the challenges caused by variations in LiDAR sensor configurations, our model is designed to maintain reliable performance even when the quality or resolution of the LiDAR data changes. Specifically, we introduce a distance-based kernel and a direction-based kernel that learn the structural feature representations of objects from point clouds. These kernels enable the model to focus on the intrinsic geometry of objects rather than the density of the point cloud itself, allowing for robust classification and object detection across different LiDAR resolutions.
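The intuition behind the distance- and direction-based cues can be sketched as follows: for every point, the local geometry of its k nearest neighbors is split into relative distances and unit direction vectors, quantities that depend on object shape more than on point density. This is only an illustration of the idea, not the 3D-ASCN kernel definition from the paper:

```python
# Illustrative sketch only: split each point's local neighborhood geometry into a
# distance component and a unit-direction component, which are comparatively
# robust to changes in point-cloud density (i.e., LiDAR channel count).
import torch

def local_geometry(points, k=16):
    # points: (n, 3) LiDAR point cloud of a single object
    d2 = torch.cdist(points, points)                      # pairwise distances
    idx = d2.topk(k + 1, largest=False).indices[:, 1:]    # k nearest neighbors (skip self)
    neigh = points[idx]                                    # (n, k, 3)
    offsets = neigh - points[:, None, :]                  # relative vectors to neighbors
    dist = offsets.norm(dim=-1, keepdim=True)             # distance-based cue
    direction = offsets / dist.clamp(min=1e-6)            # direction-based cue (unit vectors)
    return dist, direction

dist, direction = local_geometry(torch.randn(1024, 3))
```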
Impact: The proposed 3D-ASCN can contribute to the real-world deployment of AVs by enabling reliable object recognition regardless of the LiDAR sensor used. By overcoming performance degradation caused by LiDAR channel shifts, our approach reduces hardware dependency and cost, allowing AV systems to be both scalable and economically feasible.
Younggun Kim, Beomsik Cho, Seonghoon Ryoo, and Soomok Lee*
This work has been submitted to Expert Systems with Applications (ESWA). [Impact Factor: 7.5, JCR Quartiles: Q1]
🧠 Multimodal Large Language Models for Visual Understanding and Safety
Younggun Kim, Ahmed Abdelrahman*, and Mohamed Abdel-Aty
🎉 International Conference on Computer Vision Workshop (ICCVW), 2025.
Motivation: Vulnerable Road Users (VRUs), such as pedestrians and cyclists, are disproportionately affected in traffic accidents. Understanding the causes, contexts, and preventability of such accidents is crucial for improving road safety. Multimodal Large Language Models (MLLMs) have emerged as powerful tools for scene understanding and can support applications like accident report summarization and autonomous vehicle (AV) decision-making. However, their real-world utility remains limited due to a lack of high-quality, safety-focused benchmarks that test fine-grained reasoning and description capabilities in accident scenarios.
Benchmark Statistics: The VRU-Accident benchmark consists of 1K real-world dashcam accident videos involving VRUs. It features 6K multiple-choice video question answering (VQA) questions covering six categories: weather & lighting, traffic environment, road configuration, accident type, accident cause, and prevention measure. Each question comes with four answer choices, one correct and three counterfactual distractors, resulting in 24K candidate options, including 3.4K unique answers to encourage diverse and nuanced reasoning. The benchmark also provides 1K densely annotated scene-level captions that narrate the spatiotemporal dynamics of accident scenarios.
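For illustration, a single benchmark item might be represented roughly as below; the field names are assumptions chosen for readability, not the released schema:

```python
# Hedged illustration of what one VRU-Accident VQA record could look like.
sample_item = {
    "video_id": "accident_0001",
    "category": "accident cause",             # one of the six question categories
    "question": "What primarily caused the collision with the cyclist?",
    "options": [                               # one correct answer, three counterfactual distractors
        "The vehicle turned right without checking the bike lane.",
        "The cyclist ran a red light at high speed.",
        "Heavy fog hid the cyclist from view.",
        "A parked truck blocked the crosswalk.",
    ],
    "answer_index": 0,
    "dense_caption": "A cyclist enters the intersection as a sedan begins a right turn...",
}
```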
Impact: VRU-Accident serves as the first large-scale benchmark to systematically evaluate MLLMs' understanding of VRU-related accident scenes through both VQA and dense captioning tasks. By providing diverse and semantically rich annotations grounded in real-world scenarios, the benchmark enables quantitative evaluation of MLLMs' reasoning, grounding, and narrative capabilities in high-risk traffic contexts. It lays the foundation for developing safer and more interpretable AI systems for AVs and transportation safety research.
This work has been submitted to Conference on Neural Information Processing Systems (NeurIPS).
Motivation: MLLMs have shown impressive capabilities across various domains. However, they pose serious privacy risks due to their tendency to leak biometric information such as race, gender, and age from visual data. Despite existing regulations like GDPR and SPCD, MLLMs often violate privacy constraints, which is particularly problematic in high-stakes applications such as government projects. This issue stems primarily from the leakage of biometric attributes in the training datasets used to build these models.
Methodology: We developed a novel pipeline to clean biometric information from open datasets used to train MLLMs. This process ensures that explicit biometric attributes are removed while preserving semantic richness. We also introduce a benchmark to evaluate both explicit leakage (where models are directly asked about biometric details) and implicit leakage (where personal information is inferred from open-ended questions). As illustrated in the figure, LLaVA-v1.5 trained on existing datasets reveals both types of leakage, whereas Safe-LLaVA refuses to answer biometric questions while maintaining rich, non-biometric descriptions in open-ended prompts.
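As a toy illustration of the cleaning idea (the actual pipeline is more involved and likely relies on model-based rewriting rather than keyword matching), a first pass could simply flag captions that mention explicit biometric attributes:

```python
# Toy sketch, not the Safe-LLaVA pipeline: a crude keyword screen that flags
# captions leaking explicit biometric attributes so they can be rewritten or dropped.
import re

BIOMETRIC_TERMS = re.compile(
    r"\b(man|woman|male|female|boy|girl|elderly|teenager|\d{1,3}[- ]year[- ]old)\b",
    re.IGNORECASE,
)

def needs_rewrite(caption: str) -> bool:
    """Return True if the caption contains an explicit biometric attribute."""
    return bool(BIOMETRIC_TERMS.search(caption))

print(needs_rewrite("A young woman in a red coat crosses the street."))  # True
print(needs_rewrite("A pedestrian in a red coat crosses the street."))   # False
```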
Impact: Safe-LLaVA offers the first comprehensive solution for training and evaluating privacy-aware MLLMs. It enables researchers and practitioners to develop models that are both biometrically safe and informative, which is critical for real-world deployment in domains requiring strict privacy compliance. By preventing both explicit and implicit biometric inference, Safe-LLaVA represents a key step toward building responsible and ethically aligned multimodal AI systems.
🤖 Robotics Experience
Younggun Kim, Yooseong Lee, and Uikyum Kim*
🎉 Best Paper Award at the 17th Korean Robotics Society Annual Conference (KROS)
Motivation: Robotic grippers are widely used in industrial and research settings for object grasping. Conventional soft grippers, such as Fin Ray Effect (FRE) grippers, can achieve stable adaptive grasping without complex control, but they are limited in their ability to manipulate the pose or orientation of grasped objects.
Methodology: To address the limitations of soft grippers, we designed a novel gripper mechanism by adding an additional degree of freedom (DOF) to a standard FRE-inspired structure, enabling controlled deformation for manipulation tasks. We performed mathematical modeling and parametric analysis to determine the optimal force application points and directions for effective object manipulation. Then, we conducted Finite Element Analysis (FEA) simulations to validate deformation behavior under different load conditions.
Impact: The proposed gripper design enables simultaneous stable grasping and in-hand manipulation, overcoming the main limitations of conventional adaptive grippers. The gripper also provides a foundation for further development of adaptive robotic end-effectors capable of handling complex manipulation tasks in real-world environments.
Capstone project
Overview: This project was carried out as part of my undergraduate capstone experience, with the goal of integrating the fundamental principles of mechanical engineering, including kinematics, dynamics, and motor design, into a comprehensive robotic system. I was particularly inspired by the Omega Gripper, published in IEEE/ASME Transactions on Mechatronics, which was specifically designed to grasp challenging objects such as thin cards. Recognizing this as an excellent example of how core engineering knowledge could be combined into a solution, I adopted the Omega Gripper as the foundation of my work. To deepen my understanding, I visited the POSTECH March Lab and consulted directly with the first author of the original paper, which helped resolve technical questions and further motivated me to pursue this project.
Gripper Kinematic Design: The gripper mechanism features an underactuated structure implemented using a parallel four-bar mechanism. I derived and analyzed both the closed-loop forward and inverse kinematic equations to model the motion mathematically and ensure precise control over the gripper’s configuration during operation.
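For reference, the closed-loop (vector loop-closure) equations for a generic four-bar linkage take the form below; the link labeling is generic and may differ from the project's actual parameterization:

```latex
% Loop closure of a four-bar linkage with ground link l_1, input crank l_2,
% coupler l_3, and output link l_4, angles measured from the ground link:
\begin{aligned}
l_2\cos\theta_2 + l_3\cos\theta_3 - l_4\cos\theta_4 - l_1 &= 0,\\
l_2\sin\theta_2 + l_3\sin\theta_3 - l_4\sin\theta_4 &= 0.
\end{aligned}
% Forward kinematics: given the input angle \theta_2, solve for \theta_3, \theta_4.
% Inverse kinematics: given a desired output pose, solve for \theta_2.
```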
Kinematic Simulation: Using MATLAB, I simulated the kinematics of the mechanism to optimize key parameters such as link lengths and ranges of motion, ensuring that the gripper could reliably adapt to a variety of object shapes and sizes.
Dynamic Simulation and Prototyping: Dynamic simulations were conducted to estimate the loads applied during grasping and to select appropriate material properties for the gripper body. Based on these analyses, I designed the gripper structure and fabricated the components via 3D printing. I also selected motors considering factors such as required grasping force and system responsiveness.
Demonstration and Real-time Tracking: For demonstration, I integrated an encoder at the lower part of the gripper and an IMU sensor at the center. The gripper was operated based on the inverse kinematic equations to achieve adaptive grasping of objects. Real-time tracking and visualization of the gripper’s state were performed using forward kinematics, allowing continuous monitoring of its configuration during operation.
Younggun Kim, Minjoung Sim, Hojun Lee, Wonjun Choi, and Hanbin Choi
🎉 Granted Patent (Patent No. 10-2506732, KR)
Motivation: With the increasing adoption of IoT technologies in healthcare and daily life, there is a growing need for systems that can continuously monitor and assist patients or individuals with limited mobility in real time. Maintaining critical devices such as cameras or displays in the correct ergonomic position relative to the user's face is essential to improve comfort, safety, and the effectiveness of monitoring. To address this need, we developed an intelligent cradle mechanism capable of automatically tracking and adjusting its orientation based on head movements.
Methodology: We designed a body-shaped ergonomic cradle that actively follows the user’s head movements to keep the device consistently positioned in front of the face. We adopted a real-time face detection system to capture head pose and movement directions, and implemented motor control algorithms to dynamically adjust the cradle’s orientation based on the detected motion. For deployment, we calculated the maximum load and structural requirements caused by body movements, and selected appropriate materials and actuators to ensure reliable performance.
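A hedged sketch of the tracking-and-control loop, using an off-the-shelf OpenCV face detector and a simple proportional correction; the gain, detector choice, and motor interface are assumptions, not the deployed system:

```python
# Sketch only: detect the face, compute its offset from the frame center, and turn
# that offset into a proportional pan/tilt correction for the cradle motors.
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)
K_P = 0.05                                              # proportional gain (tuning assumption)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces):
        x, y, w, h = faces[0]
        err_x = (x + w / 2) - frame.shape[1] / 2        # horizontal offset from image center
        err_y = (y + h / 2) - frame.shape[0] / 2        # vertical offset
        pan_cmd, tilt_cmd = K_P * err_x, K_P * err_y
        # send pan_cmd / tilt_cmd to the cradle's motor controller here
    cv2.imshow("cradle view", frame)
    if cv2.waitKey(1) & 0xFF == 27:                     # Esc to stop
        break
cap.release()
```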
Impact: This device enables hands-free operation and continuous monitoring, enhancing the usability of devices for patients, the elderly, or users with mobility constraints. This work also demonstrates the practical integration of real-time vision-based tracking with adaptive mechanical systems, contributing to the development of smart IoT assistive devices.