An overview of the projects I have worked on.
Alignment-Aware Tokenization with Few Safety Labels
Abstract: Tokenizer design is usually treated as a fixed preprocessing choice, yet it can couple hazardous morphemes to benign neighbors through subword spillover. This creates representation-space interference that increases false positives and destabilizes safety behavior. We propose AAT, a small-label pipeline that first learns a mid-layer hazard concept direction from a few hundred labeled anchors and matched neutral examples, then uses this direction to regularize LoRA fine-tuning by penalizing hazard activation in neutral contexts, and finally edits tokenization itself to reduce spillover through drift-aware BPE merge pruning for BPE models and hazard-aware SentencePiece priors for Unigram models. Across five backbones (Pythia 410M and 1.4B, LLaMA 3 8B, Mistral 7B, and Qwen2 7B), all trained on unlabeled C4 with the same small labeled safety set, we evaluate perplexity, tokens per character, drift statistics, segmentation stability, and jailbreak and benign-refusal proxy metrics. Our results support a practical but narrower claim: tokenization is a meaningful safety-relevant control knob in the few-label regime. Tokenizer edits improve segmentation stability and change hazard-sensitive internal behavior in measurable ways. When paired with lightweight adaptation, they often improve robustness proxies at modest quality and efficiency cost. We also find a clear constraint: tokenizer-only retokenization for Unigram models creates severe tokenizer-model mismatch without adaptation. Finally, frozen-feature hazard probes are highly label efficient and saturate quickly, by roughly 300 labels. Taken together, these results suggest that alignment objectives should be designed jointly with tokenization rather than placed on top of a fixed input basis.
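The two learning-side pieces of the pipeline can be sketched briefly. This is a minimal illustration, not the paper's implementation: the difference-of-means estimator and the squared-projection penalty are assumed forms, and the function names are hypothetical.

```python
import numpy as np

def hazard_direction(hazard_acts, neutral_acts):
    """Estimate a mid-layer hazard concept direction as the (normalized)
    difference of mean activations between labeled hazard anchors and
    matched neutral examples. Arrays are (n_examples, hidden_dim)."""
    d = hazard_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def neutral_spillover_penalty(neutral_acts, direction):
    """Regularization term penalizing hazard-direction activation in
    neutral contexts: mean squared projection onto the direction."""
    proj = neutral_acts @ direction
    return float(np.mean(proj ** 2))
```

During LoRA fine-tuning, a term like `neutral_spillover_penalty` would be added to the language-modeling loss on neutral batches, discouraging benign neighbors from lighting up the hazard direction.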
Minimizing Targeted Activations to Reduce Prompt Leakage in LLMs
Abstract: Prompt phrasing can activate internal “meta” features that bias model behavior (e.g., evaluation-awareness cues such as “this will be graded”) or shift computation (e.g., orthographic casing). We study prompt-side suppression of such features as a complement to inference-time activation steering. Building on Evolutionary Prompt Optimization (EPO) (Thompson et al., 2024), we introduce inverted EPO (EPO-MIN), which minimizes model-internal quantities while maintaining fluency, measured by self cross-entropy (self-XE). We target (i) output-token logit margins, (ii) individual MLP neuron activations, and (iii) residual-stream directional projections learned from prompt contrasts in microsoft/phi-2 (Javaheripi et al., 2023). Milestone 1 localizes sensitive layers for two contrasts (evaluation-awareness vs. neutral; UPPERCASE vs. lowercase). Milestone 2 performs minimization under fluency constraints, revealing distinct trade-off geometry: logit objectives yield extended Pareto fronts, while many neuron and residual-direction targets quickly enter near-zero basins, compressing Pareto sets. Milestone 3 compares EPO-MIN jointly against two prompt-only baselines, BASELINE 1 (RANDOM) and BASELINE 2 (MINSCAN), summarized in Figure 5 and Table 1, and reports stability across seeds. Overall, prompt-side optimization can reliably suppress targeted internal features without degrading fluency, offering a lightweight alternative when model-side interventions are unavailable.
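The core fitness function for EPO-MIN can be sketched as a scalarized objective: minimize the targeted internal quantity subject to a fluency budget on self-XE. The soft-penalty form and the parameter names here are assumptions for illustration; the paper's exact constraint handling may differ.

```python
def epo_min_score(target_value, self_xe, xe_budget, penalty=10.0):
    """Fitness for inverted EPO (lower is better): the targeted
    model-internal quantity (logit margin, neuron activation, or
    residual-stream projection) plus a soft penalty whenever the
    prompt's self cross-entropy exceeds the fluency budget."""
    fluency_violation = max(0.0, self_xe - xe_budget)
    return target_value + penalty * fluency_violation
```

An evolutionary search over prompt mutations would then keep candidates with the lowest `epo_min_score`, tracing out the Pareto front between internal-feature suppression and fluency.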
TriGuard: Testing Model Safety with Attribution Entropy, Verification, and Drift
Abstract: Deep neural networks often achieve high accuracy, but ensuring their reliability under adversarial and distributional shifts remains a pressing challenge. We propose TriGuard, a unified safety evaluation framework that combines (1) formal robustness verification, (2) attribution entropy to quantify saliency concentration, and (3) a novel Attribution Drift Score measuring explanation stability. TriGuard reveals critical mismatches between model accuracy and interpretability: verified models can still exhibit unstable reasoning, and attribution-based signals provide complementary safety insights beyond adversarial accuracy. Extensive experiments across three datasets and five architectures show how TriGuard uncovers subtle fragilities in neural reasoning. We further demonstrate that entropy-regularized training reduces explanation drift without sacrificing performance. TriGuard advances the frontier in robust, interpretable model evaluation.
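The two attribution-side signals can be sketched in a few lines. The entropy follows the standard Shannon form over a normalized saliency map; the drift score below uses cosine distance between attributions before and after a perturbation, which is one plausible instantiation rather than the paper's exact definition.

```python
import numpy as np

def attribution_entropy(attr, eps=1e-12):
    """Shannon entropy of the normalized absolute attribution map.
    Low entropy means saliency is concentrated on few features."""
    p = np.abs(attr).ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def attribution_drift(attr_clean, attr_perturbed, eps=1e-12):
    """Explanation-stability score: 1 - cosine similarity between the
    attribution maps of a clean input and its perturbed version."""
    a, b = attr_clean.ravel(), attr_perturbed.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - cos)
```

A model whose explanations barely move under small input perturbations yields drift near zero; large drift flags the "unstable reasoning" that accuracy alone does not reveal.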
Beyond RAG: Enabling Explicit Memory in Pretrained LLMs for Enhanced Reasoning and Efficiency
Abstract: Explicit memory refers to highly sparse attention key-value pairs derived from the reference text. As a form of knowledge that offers lower decoding costs compared to plain-text retrieval-augmented generation (RAG) and lower encoding costs compared to model parameters, explicit memory can help large language models (LLMs) reduce the cost of acquiring new knowledge. Previous work has explored training models from scratch to reason with explicit memory, achieving significant capability improvements while reducing training and inference costs. However, many existing pretrained models lack the ability to utilize explicit memory. In this work, we aim to study how to enable pretrained models to learn to use explicit memory through supervised fine-tuning (SFT), without forgetting their pretrained knowledge. We curated 120,000 training examples and designed a training strategy to teach the model to use explicit memory to solve math and coding tasks. Experiment results show that the use of explicit memory can help models solve mathematical problems; however, the improvements are limited and come at the cost of degraded performance on coding tasks. Furthermore, the results reveal that pretrained models are prone to catastrophic forgetting during memory-involving fine-tuning. This work serves as an initial attempt to explore how pretrained models can be equipped with explicit memory, and we plan to continue investigating better training and data strategies in future research. All of our work can be found on our GitHub homepage: https://github.com/szjiozi/Explicit-Memory
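The idea of "highly sparse key-value pairs" can be sketched as keeping only the top-scoring fraction of a reference text's KV cache. This is a simplified illustration: the scoring rule, the keep ratio, and the flat top-k selection are assumptions, not the actual memory-construction recipe.

```python
import numpy as np

def sparsify_kv(keys, values, scores, keep_ratio=0.05):
    """Build a sparse 'explicit memory' from a reference text's
    attention KV pairs: keep only the top-scoring fraction.
    keys/values are (n_tokens, dim); scores are per-token importances."""
    k = max(1, int(len(scores) * keep_ratio))
    idx = np.argsort(scores)[-k:]  # indices of the k highest scores
    return keys[idx], values[idx]
```

At inference time the retained pairs would be concatenated into the model's attention context, which is cheaper to decode over than re-reading the full plain text as in RAG.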
Abstract: Modeling human decision-making under risk and uncertainty remains a significant challenge, with implications in fields like economics and cognitive science. Many decision-making models fail to fully account for individual risk preferences and utility functions. This work investigates the relationship between payoff and utility across diverse risk profiles, using Blackjack as a simulation to quantify rewards and outcomes. By modeling the game as a deterministic environment, we apply dynamic programming to compute action values for each state and subsequent subgame encountered during gameplay based on different utility functions, and select the highest-value action, weighted by probabilistic outcomes. Drawing on cognitive theories such as Prospect Theory and Expected Utility Theory, our findings show that variations in utility functions significantly influence decision-making and financial outcomes, offering new insights into decision-making under uncertainty.
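The action-selection step can be sketched as expected-utility maximization over probabilistic outcomes. The data layout and function names here are illustrative assumptions; the actual project additionally runs dynamic programming over Blackjack subgames to produce the outcome distributions.

```python
def best_action(action_outcomes, utility):
    """Select the action maximizing expected utility.
    action_outcomes: dict mapping action -> list of (probability, payoff)
    utility: payoff -> utility (e.g., concave for risk aversion, as in
    Expected Utility Theory)."""
    def expected_utility(outcomes):
        return sum(p * utility(payoff) for p, payoff in outcomes)
    return max(action_outcomes, key=lambda a: expected_utility(action_outcomes[a]))
```

Swapping the utility function changes the chosen action: a risk-neutral agent takes the higher-expected-value gamble, while a risk-averse (concave-utility) agent prefers the safe payoff, which is exactly the behavioral variation the study measures.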