Abstract:

The success of neural networks and the advent of specialized hardware such as GPUs have led to ever-larger models trained on increasingly large unstructured datasets in machine learning. Curating and assembling a large, high-quality dataset is a time-consuming process, and training models on these datasets requires expensive computing resources. Paradigms such as self-supervised and transfer learning alleviate some of these issues. However, when data drift over time, models must be retrained periodically to keep up.

Graph structures, both implicit and explicit, are ubiquitous in Natural Language Processing. Implicit structures can be derived from language morphology, syntax, and semantics and expressed as attributed tree graphs. Explicit, external structures capture world knowledge and semantics using knowledge graphs and ontologies. Additionally, textual data may have associated metadata in external graphs, such as the network structure of social media interactions. In this dissertation, we posit that this abundance of associated structural information should be exploited for scaling and adaptation: the prevalence of these structures invites us to use them to overcome the challenges posed by scale and data drift. Our work focuses on three broad directions for using these structures: augmenting existing text models with structure, exploring the role of structure in creating adversarial testing samples, and structure-enhanced monitoring of model performance over time.

The first direction we explore is the impact of incorporating structure into text representation learning pipelines. In our first contribution, we study how the implicit structure of text data (here, URLs) can be used to design domain-specific losses and adversarial attacks to build a state-of-the-art system for phishing URL detection. This work comprehensively analyzes transformer models on the phishing URL detection task. We consider the standard masked language modeling objective alongside additional domain-specific pre-training tasks and compare the resulting models to fine-tuned transformers. Our model outperforms the best baseline across a range of low false-positive rates. Using a domain-informed attack scenario, we then demonstrate how these models can be made more robust using adversaries constructed from benign URLs. In both fine-tuning and adversarial attacks, the underlying syntax of URLs serves as the structure that enables us to build a robust model.

Our second area of research is the role of intrinsic structure in visualizing and analyzing the fairness of machine learning models. Specifically, we study the syntax of commonly used fairness metrics. Our contribution improves the probabilistic guarantees for such grammars in an interactive and online setting. We construct a novel visualization mechanism that can be used to investigate the context of reported fairness violations and guide users toward meaningful and compliant fairness specifications. Our framework requires certain assumptions about the data-generating process at runtime. Following this work, we investigate techniques that can help extend probabilistic guarantees under weaker assumptions. In particular, we are interested in a setting where dependencies between data points are represented through a (predefined) network structure. We critically analyze the design choices and trade-offs of existing work in this domain.
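To make the flavor of such probabilistic guarantees concrete, the sketch below shows standard split conformal prediction, whose coverage guarantee rests on exchangeability between calibration and test points; it is precisely this assumption that network-structured dependencies weaken. This is a minimal illustration only, with names and synthetic residuals of our own choosing, not an implementation from the dissertation.

```python
import numpy as np

def split_conformal_quantile(cal_residuals, alpha=0.1):
    """Compute the split-conformal quantile from calibration residuals.

    Assumes calibration and test points are exchangeable -- the
    assumption that dependencies along a network structure break.
    """
    n = len(cal_residuals)
    # Finite-sample-corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_residuals, min(q_level, 1.0), method="higher")

# Toy usage: residuals |y - y_hat| on a held-out calibration set.
rng = np.random.default_rng(0)
cal_residuals = np.abs(rng.normal(size=500))
q = split_conformal_quantile(cal_residuals, alpha=0.1)
y_hat_test = 2.3  # a hypothetical point prediction
print(f"90% prediction interval: [{y_hat_test - q:.2f}, {y_hat_test + q:.2f}]")
```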
Finally, in our third broad direction, we study the problem of author identification. Our work demonstrates that graph representation learning can be appropriately intermingled with textual representations to exploit the orthogonal signals from each and improve author identification across time-disjoint task settings. We first develop a novel stylometry-based multitask learning approach for natural language and model interactions using graph embeddings, constructing low-dimensional representations of short episodes of user activity for authorship attribution. We comprehensively evaluate our methods across four darkweb forums, demonstrating their efficacy over the state of the art, with a lift of up to 2.5X on Mean Retrieval Rank and 2X on Recall@10. Next, we focus on the textual component of the author identification models and demonstrate that models trained on large, clear-web datasets can improve author identification on darkweb forums. We conclude this direction with a study of the limitations of text-based models in generalizing across time and demographics.

The latter two directions admit potential extensions, which we discuss in a concluding chapter on future work. First, we empirically demonstrate that structure can improve author identification even with large-scale datasets, and we provide concrete architectural suggestions for training models that utilize both the structure and the content of large datasets. Second, we discuss extensions of the ideas above in our work on fairness monitoring: we expand our theoretical framework for conformal prediction on graphs to propose mechanisms for runtime fairness monitoring on graph-structured data. Finally, based on the observed limitations of author identification models, we propose extensions of the ideas explored in our work on temporal robustness that may be used to bound the generalization capabilities of these models.
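As one hedged illustration of an architecture that fuses structure and content for author identification, the sketch below combines a text encoder's pooled output with a graph embedding of the same user's interactions. The dimensions, names, and concatenation-based fusion are assumptions made for this sketch; it is not the model proposed in the dissertation.

```python
import torch
import torch.nn as nn

class TextGraphAuthorEncoder(nn.Module):
    """Illustrative fusion of text and graph signals for author identification.

    All dimensions and the fusion-by-concatenation choice are assumptions
    for this sketch, not the dissertation's architecture.
    """
    def __init__(self, text_dim=768, graph_dim=128, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim + graph_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, text_emb, node_emb):
        # Concatenate the orthogonal signals: stylometric text features and
        # interaction-graph embeddings of the same episode of user activity.
        return self.proj(torch.cat([text_emb, node_emb], dim=-1))

# Toy usage with random stand-ins for pooled transformer outputs and
# node2vec-style graph embeddings of an episode of user activity.
enc = TextGraphAuthorEncoder()
text_emb = torch.randn(4, 768)
node_emb = torch.randn(4, 128)
episode_repr = enc(text_emb, node_emb)
print(episode_repr.shape)  # torch.Size([4, 256])
```

Concatenation is the simplest fusion choice; gated or attention-based fusion are natural alternatives when one signal should dominate per episode.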