Data Science in Libraries: Infrastructures, Epistemologies, and the Politics of Evidence

Introduction

Over the past decade, the phrase *data science in libraries* has circulated widely in professional literature, policy documents, grant narratives, and conference panels. Yet the meaning of the term remains unstable. For some authors, data science refers to a technical repertoire—machine learning, natural language processing, statistical modeling, and large-scale data visualization—and the question for libraries is simply whether they can “adopt” these methods. Others understand data science as a shift in **organizational epistemology**, in which libraries increasingly orient decision-making, scholarly support, and public programming around evidence derived from computational analysis. Still others regard it as an **arena of political struggle**, given that data-intensive practices—however promising they appear—intersect uncomfortably with core professional commitments to privacy, autonomy, and equitable access.

This paper argues that to understand the emergence of data science in libraries, one must resist technological determinism. Libraries did not arrive at this moment simply because “big data is everywhere.” Rather, data science expanded within libraries through the interaction of three forces: (1) the digitization of scholarship and reading, (2) the consolidation of commercial library analytics into a set of platform vendors with unprecedented power over bibliographic infrastructure, and (3) the global movement toward openness and reproducibility in science. The result is not a simple “modernization” of the library but a reconfiguration of what libraries are expected to *know* about their collections and their users, and what forms of evidence are considered legitimate in professional reasoning.

Historical Genealogy: From Quantitative Librarianship to Algorithmic Infrastructure

The idea that librarianship has suddenly become “data-driven” is historically inaccurate. Quantitative analysis has been embedded in library administration since at least the 1960s, when operations research and cost–benefit analysis shaped collection development and staffing models. The influential works of Buckland, Kantor, and Lancaster treated library services as systems whose inputs, outputs, and efficiencies could be measured and optimized. In the 1990s, large-scale ILS implementations introduced usage logging and circulation analytics. The 2000s brought COUNTER and SUSHI, which standardized electronic resource usage reporting.

What distinguishes the 2010s and 2020s is not the presence of metrics but the **granularity, volume, and computability** of data. Discovery platforms record every search query and click path. Institutional repositories log download geographies and citation networks. Learning management systems track reading sequences and assignment interactions. Some libraries—typically large research institutions—now maintain terabytes of log-level behavioral data. The problem is no longer the absence of information but the overabundance of it.

Scholars in LIS have disagreed on whether this marks a fundamental epistemic shift. Tenopir and King argue that the abundance of user data enables “evidence-based librarianship,” where professional judgment is enhanced by systematic analysis. By contrast, Buschman cautions that the valorization of quantification risks recasting the library primarily in managerial logic—where success becomes a function of what can be counted rather than what matters.

These disagreements form the intellectual backdrop against which data science is unfolding.

Defining “Data Science” in the Library Context

Outside LIS, data science typically refers to an interdisciplinary synthesis of applied statistics, computer science, and domain expertise. Within libraries, the term has been used to describe at least five distinct—sometimes contradictory—practices:

  1. **Computational research support**: librarians enabling text and data mining, topic modeling, geospatial analysis, and digital humanities.
  2. **Business intelligence for collection management**: predictive modeling to optimize acquisitions, cancellations, and license negotiations.
  3. **Algorithmic discovery and recommendation systems**: machine learning embedded in search platforms.
  4. **Learning analytics**: using student data to evaluate or redesign instructional support.
  5. **Data literacy pedagogy**: teaching patrons to analyze data critically and ethically.

The lack of a single definition is not a weakness but evidence of a field negotiating what kinds of expertise libraries should legitimately claim.

Practical Manifestations: Where Data Science Is Actually Happening

Research Data Services

Research data management has matured beyond compliance consulting. Many academic libraries now:

  1. mint DOIs and support the adoption of ORCID identifiers,
  2. maintain Dataverse, OSF, or Figshare repositories,
  3. run reproducibility workshops on Jupyter, Git, and Zenodo,
  4. advise on disclosure risk mitigation and differential privacy.

At institutions such as the University of Illinois, the University of Michigan, and the University of Edinburgh, librarians are listed as co-authors on computational research projects, not just facilitators.

Digital Scholarship and Text Mining

Digital humanities centers housed in libraries have supported:

  • OCR + NLP analysis of early modern print archives,
  • topic modeling of congressional transcripts,
  • sentiment analysis of historical newspapers,
  • network mapping of Indigenous authorship.

In these cases, librarians do not simply provide access; they contribute methodological expertise.

Collection Analytics and Budget Strategy

Data science has reshaped collection development strategy in ways that would have been politically unthinkable twenty years ago. Machine learning models analyzing citation half-lives, interdisciplinarity, and publisher bundle overlap are now routine inputs in subscription cancellation decisions. While some faculty view this as efficiency, others experience it as algorithmic governance of scholarly communication.
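Two of the inputs named here can be made concrete with a minimal Python sketch. It rests on simplifying assumptions: citation half-life is read as the median age of cited items, and bundle overlap as the Jaccard similarity between two title lists. The function names, journal identifiers, and figures are hypothetical illustrations, not taken from any vendor product or production model.

```python
from statistics import median


def citation_half_life(citation_ages):
    """Median age in years of the cited items (a simple proxy for half-life)."""
    if not citation_ages:
        raise ValueError("no citations to summarize")
    return median(citation_ages)


def bundle_overlap(bundle_a, bundle_b):
    """Jaccard similarity between two collections of journal identifiers."""
    a, b = set(bundle_a), set(bundle_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical data: ages (in years) of articles cited from one journal,
# and the title lists of two publisher bundles.
ages = [1, 2, 2, 3, 5, 8, 13]
bundle_x = {"J1", "J2", "J3", "J4"}
bundle_y = {"J3", "J4", "J5"}

half_life = citation_half_life(ages)
overlap = bundle_overlap(bundle_x, bundle_y)

print(f"half-life: {half_life} years, overlap: {overlap:.2f}")
```

Even in this toy form, each signal embeds a modeling choice (what counts as a citation, which titles define a bundle) that can be inspected and contested.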

Algorithmic Discovery and Knowledge Graphs

Vendors such as Ex Libris, EBSCO, and Clarivate now embed proprietary ranking algorithms into discovery services. A corpus of work by Asher, Wilson, and Gross demonstrates that these systems profoundly shape which scholarship becomes visible, who is cited, and what counts as “core literature” in a field. The editorial point here is crucial: data science in discovery is not neutral infrastructure, but bibliographic power.

Space and UX Analytics

Libraries using heat-mapping and sensor systems (ETH Zürich, North Carolina State University, and others) have documented how patron movement, noise zones, and seating pressure reveal usage patterns that contradict professional intuition. Evidence frequently shows strong student preference for semi-isolated hybrid spaces, rather than the open-collaboration environments architects assume they want.

The Politics and Ethics of Data Science in Libraries

Any serious account of data science in libraries must acknowledge **controversy and boundary conditions** rather than treat ethical concerns as decorative afterthoughts.

Privacy and the Fantasy of “Anonymous Data”

Libraries historically protect patron confidentiality with near-absolute rigor. Data science undermines this commitment, not maliciously but structurally. Even when datasets are anonymized, re-identification is possible when they are combined with external sources. One cannot meaningfully guarantee anonymity in a world of high-dimensional behavioral data. The real ethical question becomes: Should the analysis take place at all?
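The structural nature of this risk can be illustrated with a small sketch. Assume a hypothetical de-identified circulation extract in which names are removed but department, patron status, and branch remain. The Python code below performs a basic k-anonymity check by counting how many records share each combination of those attributes. The field names and records are invented for illustration only.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group; k = 1 means someone is uniquely exposed."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]


# Hypothetical "anonymized" extract: no names, but attributes remain.
records = [
    {"dept": "History", "status": "faculty", "branch": "Main"},
    {"dept": "History", "status": "grad", "branch": "Main"},
    {"dept": "History", "status": "grad", "branch": "Main"},
    {"dept": "Physics", "status": "undergrad", "branch": "Science"},
    {"dept": "Physics", "status": "undergrad", "branch": "Science"},
    {"dept": "Classics", "status": "faculty", "branch": "Main"},
]
quasi = ["dept", "status", "branch"]

k = k_anonymity(records, quasi)
exposed = unique_records(records, quasi)

print(f"k = {k}; {len(exposed)} of {len(records)} records are unique")
```

In this toy extract two records are unique on just three attributes, so anyone holding a staff directory could plausibly link them back to individuals. Real behavioral logs carry far more attributes per record, which is why the guarantee becomes harder, not easier, at scale.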

Vendor Concentration and Algorithmic Opacity

The consolidation of library technology into a handful of multinational vendors has produced an environment in which:

  • discovery algorithms cannot be audited,
  • user data flows to third parties without true informed consent,
  • ranking signals replicate market incentives rather than scholarly value.

The tension is unavoidable: institutions committed to openness are increasingly dependent on closed algorithmic systems.

Algorithmic Bias in Bibliographic Infrastructure

Subject headings, descriptive vocabularies, and historical cataloging practices encode systemic bias. Machine learning models trained on them amplify these distortions. Examples are not theoretical:

  • bias against Indigenous authors in subject authority assignment,
  • misgendering in name authority control,
  • underrepresentation of Black scholarship in recommendation engines.

The question is not whether library algorithms are biased (they are) but what institutions are willing to do about it.

Labor and Skill Inequities

Large research libraries can hire data scientists, UX analysts, and digital scholarship librarians; most public libraries cannot. This asymmetry raises questions about:

  • the future distribution of digital research infrastructure,
  • whether data science becomes a mechanism for institutional stratification within LIS,
  • and whether “innovation” narratives disguise widening inequality.

Future Directions: Not Predictions, but Contested Trajectories

This section does not attempt to predict the future; it maps *possibilities and tensions*. Several trajectories already underway illustrate the stakes of the next decade.

Libraries as civic data stewards

The idea that libraries may hold municipal, cultural, or community-generated datasets under ethical governance models is powerful but under-theorized. The risk is that libraries inherit responsibility without resources.

Open, auditable discovery algorithms

Some librarians advocate for community-governed search ranking and recommendation systems. Whether this is economically feasible in a vendor-dominated market remains open.

Privacy-preserving computational analytics

Differential privacy and federated learning offer symbolic alignment with library ethics. But they are not magic solutions; they require expertise, infrastructure, and long-term funding.
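To indicate what differential privacy involves in practice, the following minimal Python sketch applies the Laplace mechanism to a single count. It assumes a counting query with sensitivity 1 (one patron changes the count by at most one) and an illustrative privacy budget `epsilon`. The checkout figure and function names are hypothetical, and a real deployment would rely on a vetted library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # Each exponential draw has mean `scale` (rate = 1 / scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical weekly checkout count for a small branch collection.
weekly_checkouts = 128
rng = random.Random(42)  # seeded only so the illustration is repeatable

released = noisy_count(weekly_checkouts, epsilon=0.5, rng=rng)
print(f"true count: {weekly_checkouts}, released: {released:.1f}")
```

A smaller `epsilon` adds more noise and stronger protection; choosing it, and tracking how much budget repeated queries consume, is exactly the kind of expertise and governance the paragraph above says cannot be assumed.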

Pedagogy as strategic positioning

Data literacy instruction, not just technical but **critical** data literacy, may become the most durable and mission-consistent form of data science in libraries.

Conclusion

Data science has not swept libraries because librarians are technophiles or because analytics represents modernity. It has taken hold because **the epistemic expectations placed upon libraries have changed**. Universities, funding agencies, and the public increasingly expect libraries to justify decisions empirically, to actively participate in computational research, and to provide patrons with training relevant to a data-saturated world.

The danger is not that data science undermines librarianship, but that, without critical intervention, it may recast librarianship in ways misaligned with its foundational values. The opportunity is that libraries can articulate a model of data stewardship grounded not in extraction but in autonomy, transparency, and community benefit. Whether this opportunity becomes reality depends not on technology but on governance, labor investment, and political will.

The central insight is therefore neither celebratory nor cynical: data science in libraries is a site of **negotiation**, not arrival. It represents not a settled identity but an ongoing struggle over what it means for an institution built on intellectual freedom to operate in a world built on behavioral data. The outcome of that struggle will determine not only what libraries do, but, for the first time in decades, what libraries are.
