1) Topic modeling project: I would like to explore topic modeling for library search and discovery as an initial prototype project for the Graduate Center’s new institutional repository, Academic Works. Academic Works is a digital repository of the scholarly and creative works of CUNY Graduate Center faculty, students, and research centers. The Graduate Center Library administers this open access repository to preserve, showcase, and facilitate access to these works, which currently include articles, dissertations, and theses.
Academic Works allows users to browse across collections and disciplines and to run keyword searches across a particular series of texts or across the full repository, but it offers no more granular, structured search by topic, and it does not provide traditional subject headings or any other classification scheme. The problem is not limited to Academic Works; it is common in many digital repositories. We need better metadata!
In the first phase of the project, I plan to use MALLET to analyze a test corpus of full-text GC dissertations deposited in Academic Works over a five-year period from 2009 to 2014. The prototype could then be used to launch a longer-term project, based in the CUNY Graduate Center Library, to implement topic modeling as an access tool for all materials in Academic Works.
2) Text mining project with special collections materials and archival finding aids. This project could use a tool such as the Stanford Named Entity Recognizer (NER) to search metadata and any available full text for names of individuals and organizations to be added to library authority databases (Library of Congress, Virtual International Authority File) and to more general resources such as Wikipedia/DBpedia. This project would improve user search experiences (fixing problems such as searching through several different versions of the same name) and also increase public knowledge about neglected individuals and organizations. (This would be one way to address Wikipedia’s gender gap!)
3) A project that would build and improve ways for users to find open access content (or open educational resources) through the library. Directing users (especially students) to high-quality free resources would serve CUNY’s educational mission and also lessen the library’s reliance on high-priced content from commercial vendors.
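For the first project idea, the corpus preparation step might look something like the sketch below. It is only an illustration, not a description of how Academic Works stores or exports its data: the Dissertation record, its field names, and the sample documents are all hypothetical. The sketch filters records to the 2009 to 2014 window and writes each one as a single "name label text" line, which is the default layout MALLET's import-file command reads.

```python
import re
from dataclasses import dataclass


@dataclass
class Dissertation:
    """Hypothetical record; the real repository export may look quite different."""
    doc_id: str  # assumed to contain no whitespace
    year: int
    text: str


def to_mallet_line(doc: Dissertation) -> str:
    """Format one document as a 'name<TAB>label<TAB>text' line.

    The year is used as the label here purely as an assumption; topic
    training does not need a meaningful label.
    """
    # Collapse newlines and runs of spaces so each document stays on one line.
    flat_text = re.sub(r"\s+", " ", doc.text).strip()
    return f"{doc.doc_id}\t{doc.year}\t{flat_text}"


def build_corpus(docs, start_year=2009, end_year=2014):
    """Keep only documents inside the test window and format them for MALLET."""
    return [to_mallet_line(d) for d in docs if start_year <= d.year <= end_year]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Dissertation("diss_001", 2010, "Urban housing\npolicy in New York"),
        Dissertation("diss_002", 2008, "Outside the test window"),
    ]
    with open("corpus.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_corpus(sample)) + "\n")

    # The file could then be passed to MALLET from the command line, e.g.:
    #   bin/mallet import-file --input corpus.txt --output corpus.mallet \
    #       --keep-sequence --remove-stopwords
    #   bin/mallet train-topics --input corpus.mallet --num-topics 20 \
    #       --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
    # The topic count of 20 is an arbitrary placeholder, not a recommendation.
```

The per-document topic proportions that MALLET writes out could then be explored as candidate browse or search facets, though how well they work as access points is exactly what the prototype would need to test.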

