An overview of the projects I have worked on.

ObedienceBench: A Causal Intervention Benchmark for Safety Alignment in Language Models

TriGuard: Testing Model Safety with Attribution Entropy, Verification, and Drift

Abstract: Deep neural networks often achieve high accuracy, but ensuring their reliability under adversarial and distributional shifts remains a pressing challenge. We propose TriGuard, a unified safety evaluation framework that combines (1) formal robustness verification, (2) attribution entropy to quantify saliency concentration, and (3) a novel Attribution Drift Score measuring explanation stability. TriGuard reveals critical mismatches between model accuracy and interpretability: verified models can still exhibit unstable reasoning, and attribution-based signals provide complementary safety insights beyond adversarial accuracy. Extensive experiments across three datasets and five architectures show how TriGuard uncovers subtle fragilities in neural reasoning. We further demonstrate that entropy-regularized training reduces explanation drift without sacrificing performance. TriGuard advances the frontier in robust, interpretable model evaluation.
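The two attribution-based signals can be sketched in a few lines. This is an illustrative reading of the abstract, not TriGuard's actual implementation: the entropy is plain Shannon entropy over a normalized saliency map, and the drift score is assumed here to be an L1 distance between normalized maps from clean and perturbed inputs.

```python
import numpy as np

def attribution_entropy(attributions: np.ndarray) -> float:
    """Shannon entropy of a normalized saliency map.

    Low entropy means attribution mass is concentrated on few features;
    high entropy means a diffuse, spread-out explanation.
    """
    p = np.abs(attributions).ravel()
    p = p / p.sum()
    p = p[p > 0]  # drop zero-mass entries to avoid log(0)
    return float(-(p * np.log(p)).sum())

def attribution_drift(attr_clean: np.ndarray, attr_perturbed: np.ndarray) -> float:
    """Hypothetical drift score: L1 distance between normalized maps
    computed before and after an input perturbation."""
    a = np.abs(attr_clean).ravel(); a = a / a.sum()
    b = np.abs(attr_perturbed).ravel(); b = b / b.sum()
    return float(np.abs(a - b).sum())
```

A uniform map attains the maximum entropy log(n), while a one-hot map has entropy 0, which is what makes the entropy a usable concentration measure.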

Scalable Adversarial Filtering for Enhanced Generative AI Risk Detection

A Python-based framework for detecting harmful outputs from large language models (LLMs). It combines rule-based heuristics, semantic analysis, and fine-tuned classifiers to flag toxic or unsafe content with 40% fewer false negatives than baseline approaches.
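A minimal sketch of how rule-based heuristics and a classifier score might be combined, assuming the framework takes the maximum of the two signals to cut false negatives. The patterns and the threshold are placeholder assumptions, not the project's actual rules.

```python
import re

# Hypothetical heuristic patterns; a real deployment would use a curated,
# much larger rule set.
HEURISTIC_PATTERNS = [re.compile(p, re.IGNORECASE)
                      for p in (r"\bexploit\b", r"\bransomware\b")]

def heuristic_score(text: str) -> float:
    """1.0 if any rule fires, else 0.0."""
    return 1.0 if any(p.search(text) for p in HEURISTIC_PATTERNS) else 0.0

def is_unsafe(text: str, classifier_prob: float, threshold: float = 0.5) -> bool:
    """Combine rule hits with a fine-tuned classifier's probability.

    Taking the max of the two signals lowers false negatives (either
    detector alone can flag the text) at some cost in false positives.
    """
    return max(heuristic_score(text), classifier_prob) >= threshold
```

Taking the max is the simplest ensemble that matches the stated goal of reducing false negatives; a weighted combination would trade precision and recall differently.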

Beyond RAG: Enabling Explicit Memory in Pretrained LLMs for Enhanced Reasoning and Efficiency

Abstract: Explicit memory refers to highly sparse attention key-value pairs derived from the reference text. As a form of knowledge that offers lower decoding costs compared to plain-text retrieval-augmented generation (RAG) and lower encoding costs compared to model parameters, explicit memory can help large language models (LLMs) reduce the cost of acquiring new knowledge. Previous work has explored training models from scratch to reason with explicit memory, achieving significant capability improvements while reducing training and inference costs. However, many existing pretrained models lack the ability to utilize explicit memory. In this work, we aim to study how to enable pretrained models to learn to use explicit memory through supervised fine-tuning (SFT), without forgetting their pretrained knowledge. We curated 120,000 training examples and designed a training strategy to teach the model to use explicit memory to solve math and coding tasks. Experiment results show that the use of explicit memory can help models solve mathematical problems; however, the improvements are limited and come at the cost of degraded performance on coding tasks. Furthermore, the results reveal that pretrained models are prone to catastrophic forgetting during memory-involving fine-tuning. This work serves as an initial attempt to explore how pretrained models can be equipped with explicit memory, and we plan to continue investigating better training and data strategies in future research. All of our work can be found on our GitHub homepage: https://github.com/szjiozi/Explicit-Memory
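One way to picture "highly sparse attention key-value pairs" is keeping only the top-k most-attended positions of a reference passage. The sketch below is an assumption for illustration only (the paper's sparsification criterion may differ); it shows why decoding against k retained pairs is cheaper than re-attending over the full passage.

```python
import numpy as np

def build_explicit_memory(keys: np.ndarray, values: np.ndarray,
                          attn_weights: np.ndarray, k: int):
    """Keep the k most-attended key-value pairs from a reference passage.

    keys, values: (seq_len, d) arrays from encoding the reference text.
    attn_weights: (seq_len,) aggregate attention mass per position.
    Returns a sparse (k, d) memory that later decoding steps can attend
    to directly, instead of re-encoding the full plain text as in RAG.
    """
    idx = np.argsort(attn_weights)[-k:]  # indices of the top-k positions
    return keys[idx], values[idx]
```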

Hybrid Risk Models for Small Businesses

Abstract: We propose a novel hybrid quantum-classical model for small business credit risk assessment that integrates simulated quantum annealing for feature selection and business segmentation with classical XGBoost classification. Our method achieves a 98.6% AUC and a 94.5% F1-score while reducing the feature set from 32 to 10, enhancing model interpretability and training efficiency. Using a QUBO-based feature selection mechanism, our model emphasizes economically meaningful attributes and integrates SHAP for explainability. This approach demonstrates real-world applicability for rapid, transparent credit risk decisions in SME finance. Our contributions include: (1) a quantum-inspired pipeline for SME lending, (2) integration of interpretable machine learning via SHAP, (3) extensive comparative benchmarks against classical models, and (4) rigorous feature attribution analysis using SHAP.
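A toy version of QUBO-based feature selection solved by simulated annealing: binary variables x_i mark selected features, linear terms reward relevance, and quadratic terms penalize redundant pairs. The energy function and schedule here are illustrative assumptions, not the paper's exact formulation or annealer.

```python
import numpy as np

def qubo_feature_selection(relevance: np.ndarray, redundancy: np.ndarray,
                           penalty: float = 0.5, steps: int = 2000,
                           seed: int = 0) -> np.ndarray:
    """Select features by minimizing a QUBO energy with simulated annealing:

        E(x) = -sum_i relevance_i * x_i + penalty * sum_ij redundancy_ij x_i x_j

    Returns the indices of selected features (x_i = 1)."""
    rng = np.random.default_rng(seed)
    n = len(relevance)
    x = rng.integers(0, 2, n)

    def energy(bits):
        return -relevance @ bits + penalty * bits @ redundancy @ bits

    e = energy(x)
    for t in range(steps):
        temp = max(1e-3, 1.0 - t / steps)   # linear cooling schedule
        i = rng.integers(n)
        candidate = x.copy()
        candidate[i] ^= 1                   # flip one bit
        e_new = energy(candidate)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new < e or rng.random() < np.exp((e - e_new) / temp):
            x, e = candidate, e_new
    return np.flatnonzero(x)
```

On a quantum annealer the same QUBO matrix would be handed to the hardware instead of this classical loop; the formulation, not the solver, is what carries over.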

Quantifying Risk Aversion and Risk-Seeking Behavior Through Utility-Based Policy Learning in Blackjack

Abstract: Modeling human decision-making under risk and uncertainty remains a significant challenge, with implications in fields like economics and cognitive science. Many decision-making models fail to fully account for individual risk preferences and utility functions. This work investigates the relationship between payoff and utility across diverse risk profiles, using Blackjack as a simulation to quantify rewards and outcomes. By modeling the game as a deterministic environment, we apply dynamic programming to compute action values for each state and subsequent subgame encountered during gameplay based on different utility functions, and select the highest-value action, weighted by probabilistic outcomes. Drawing on cognitive theories such as Prospect Theory and Expected Utility Theory, our findings show that variations in utility functions significantly influence decision-making and financial outcomes, offering new insights into decision-making under uncertainty.
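The core selection step described above, expected utility weighted by outcome probabilities, can be sketched as follows. The subgame and its probabilities are hypothetical toy numbers, not values from the actual Blackjack solver.

```python
import math

def expected_utility(outcomes, utility):
    """Probability-weighted utility over (probability, payoff) pairs."""
    return sum(p * utility(w) for p, w in outcomes)

def best_action(actions, utility):
    """Pick the action that maximizes expected utility under the
    given utility function."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))

# Hypothetical subgame: 'hit' is risky, 'stand' is safe.
actions = {
    "hit":   [(0.5, 4.0), (0.5, 0.0)],   # expected payoff 2.0
    "stand": [(1.0, 1.5)],               # expected payoff 1.5
}

risk_neutral = lambda w: w               # linear utility
risk_averse = lambda w: math.sqrt(w)     # concave utility
```

With linear utility the risky 'hit' wins on expected payoff, while the concave square-root utility flips the choice to 'stand'; this is exactly the mechanism by which different utility functions drive different financial outcomes in the simulation.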