Project Proposal #1: Topic modeling project
- Introduction
I would like to explore topic modeling for library search and discovery as an initial prototype project for the Graduate Center’s new institutional repository, Academic Works. Academic Works is a digital repository of the scholarly and creative works of CUNY Graduate Center faculty, students, and research centers. The Graduate Center Library administers this open access repository to preserve, showcase, and facilitate access to these works, which currently include articles, dissertations, and theses.
Academic Works allows users to browse across collections and disciplines and to enter keyword searches across a particular series of texts or across the full repository. There is no more granular level of structured search, and the repository does not offer traditional subject headings or any other classification scheme. This problem of limited search and discovery is not unique to Academic Works. Traditional methods of providing structured searching by topic through hierarchical taxonomies have failed to keep pace with the explosion of digitized print publications and born-digital texts. A researcher looking for information by topic in a large digital corpus is often at sea without a compass and cannot be sure that search results offer complete or accurate information. Academic Works needs better metadata!
- Set of personas
Andromeda Dissertation is a graduate student at the CUNY Graduate Center. She is planning her dissertation topic and she wants to consult Academic Works to see what other students at GC have done.
Professor Old Hat is not very excited about digital projects, but he thinks Academic Works sounds more convenient than going to the library to look at the print copies of the dissertations.
Faraway Researcher lives in another country and does not know anything about the CUNY Graduate Center. S/he comes to Academic Works after doing a search in Google Scholar.
- Use case scenario
Andromeda Dissertation is a graduate student in the English department at the CUNY Graduate Center. She is browsing Academic Works to look for recent dissertations on Walt Whitman’s Leaves of Grass. She looks at the discipline lists but discovers that she can go no further than an undifferentiated list of “Works in American Literature.” Annoyed, she types “Leaves of Grass” into the keyword search box. This does pull up one dissertation on Whitman, but it also yields several irrelevant results such as “How Important Is Land-Based Foraging to Polar Bears?” Andromeda laughs bitterly and loathes those idiots at the library, but then she notices an attractive visualization displaying a cloud of words representing topics in all of the English department dissertations. She finds not only Whitman, but several other topics related to her research that she had not thought to seek out through a keyword search. She clicks on the visualization, which takes her directly to the related section of the full text, and begins exploring.
- How to make the full-fledged version
In the first phase of the project, I plan to focus on a test corpus of GC full-text dissertations in Academic Works over a five-year period from 2009 to 2014. The prototype will then be used to launch a longer-term project based in the CUNY Graduate Center Library to implement topic modeling as an access tool for all dissertations in Academic Works. Academic Works will eventually contain the full corpus of past CUNY dissertations from 1965 to the present. It will also become the primary platform for the publication, dissemination, and preservation of all future CUNY dissertations. Creating better subject access to the dissertations through topic modeling will aid researchers and also provide an overview analysis of the history of scholarship at the Graduate Center. Another possible feature of the full-fledged version would be a metadata crosswalk from topic modeling results to standard controlled vocabularies such as Library of Congress Subject Headings or OCLC FAST (Faceted Application of Subject Terminology) Headings.
- Time estimate
Uncertain, but probably at least a year for the test phase with dissertations from 2009-2014. The larger effort covering all dissertations back to 1965 is a long-term project that would take several years.
- Skills estimate
Known skills: XML/XSLT, MarcEdit, Excel, general metadata/controlled vocabulary knowledge
Skills I would need to learn: AntConc, Mallet, Gephi or other visualization tools. I would also need to clear substantial technical and bureaucratic hurdles to integrate a new search tool into the Digital Commons repository platform. Building the metadata crosswalk would require considerable additional exploration, consultation with metadata experts at other libraries, and the acquisition of additional skills.
- How to make the stripped down version
A stripped down version would focus on a smaller test corpus of 377 dissertations and theses from 2010 and compare topic modeling results against three existing sources of subject metadata: author-supplied keywords, subject headings assigned in the ProQuest Dissertations and Theses database, and traditional Library of Congress Subject Headings assigned by GC librarians in the CUNY catalog records. In 2010, the library was still doing manual, original cataloging of each dissertation, but it had also begun the electronic dissertation deposit process, which affords both full-text searching and additional metadata (author keywords and ProQuest headings). This comparison of old and new methods would be a more systematic and traditional way to test topic modeling as a library classification tool, but the resulting project would take the form of a written initial evaluation rather than a functioning tool. This version of the project would be much simpler to do, and it would eliminate the need to integrate a new feature into the Digital Commons repository platform.
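One possible way to run the comparison described above is to score the word overlap between the top terms of a dissertation's dominant topic and each of the three metadata sources. The sketch below uses Jaccard similarity, which is my own choice of measure and not something the proposal commits to. The function names (`normalize`, `jaccard`, `compare_record`) and the sample record are hypothetical illustrations, not real catalog data, and the topic terms are assumed to have already been exported from a topic modeling tool.

```python
import re


def normalize(terms):
    """Lowercase each term and split it into word tokens, returning one flat set."""
    tokens = set()
    for term in terms:
        # Keep alphabetic runs only, so dates and punctuation in headings are dropped.
        tokens.update(re.findall(r"[a-z]+", term.lower()))
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_record(topic_terms, metadata_sources):
    """Score the topic terms against each named metadata source."""
    topic_tokens = normalize(topic_terms)
    return {
        name: jaccard(topic_tokens, normalize(terms))
        for name, terms in metadata_sources.items()
    }


# Hypothetical record, invented for illustration only.
topic_terms = ["whitman", "poetry", "democracy", "body"]
metadata_sources = {
    "author_keywords": ["Walt Whitman", "American poetry"],
    "proquest_subjects": ["American literature"],
    "lcsh": ["Whitman, Walt, 1819-1892", "Poetry"],
}

scores = compare_record(topic_terms, metadata_sources)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Averaging these scores over the test corpus would give a rough indication of which metadata source the topic model agrees with most. Plain word overlap ignores synonyms and word forms, so a low score would not by itself show that the topic model is wrong.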
- Time estimate: 6 months
- Skills estimate
Known skills: XML/XSLT, MarcEdit, Excel, general metadata/controlled vocabulary knowledge
Skills I would need to learn: AntConc, Mallet, Gephi or other visualization tools
Project Proposal #2: Mapping the History of NYC Islands
- Introduction
New York City is the nation’s largest urban archipelago, with dozens of small islands which are now mostly abandoned and off-limits to the public. Many of the islands were once part of New York’s maritime and industrial economy or served to house hospitals, jails, military training centers, and graveyards. Abandoned 19th-century buildings can be seen on some islands, while most are barren except for flocks of nesting birds. Over thirty small islands no longer exist at all because they were joined to the mainland using construction debris as landfill. Some islands played a more dramatic role in history, such as Black Tom Island, which housed a munitions depot destroyed by German saboteurs in 1916.
This project will use maps to tell the history of these islands, uniting information from books, scholarly articles, newspapers, and archives. The project will be suitable for use in New York City history classes as well as appealing to scholars, NYC history buffs, eccentric tourists, and curious kayakers and bird watchers.
- Set of personas
Cassidy Commoncore is a social studies teacher in a NYC public middle school. She is always looking for new ways to teach students about the history and environment of NYC.
Horatio History is a native New Yorker and local history enthusiast. He is not an academic, but he devotes a lot of time to reading and research.
Tina Tourista is a visitor to NYC who is interested in the off-beat and forgotten aspects of the city’s history.
Birding Bob & Kayaking Karen are interested in NYC harbor islands as part of the city’s natural environment.
- Use case scenario
Cassidy Commoncore is a social studies teacher in a NYC public middle school. She is required to teach about the history and environment of NYC as part of the curriculum, but it is a struggle to find ways to keep students interested in the material. The Mapping the History of NYC Islands project is a free online tool that helps her students learn geography and history and explore environmental issues such as protecting nesting harbor herons. Students can explore the islands through an online map of New York City and click on each one to see images, interesting facts, historical summaries, and links to further information.
- How to make the full-fledged version
The project will have an initial research phase to find and assemble information on the history of the islands from a variety of sources (online sources including Wikipedia, NYC history blogs and websites, scholarly and popular journal articles and books, and museum and archive collections). Building the actual interface would involve using CartoDB or a similar platform designed for storytelling with maps.
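As a rough illustration of how the assembled research could be prepared for a mapping platform, each island might be stored as a GeoJSON point feature carrying its name, summary, and links. This is only a sketch under the assumption that the chosen platform can import GeoJSON. The helper names (`island_feature`, `feature_collection`) are hypothetical, and the island name and coordinates below are placeholders, not real data.

```python
import json


def island_feature(name, lon, lat, summary, links):
    """Build one GeoJSON Feature for an island, located by a single point."""
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, "summary": summary, "links": links},
    }


def feature_collection(features):
    """Wrap a sequence of features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Placeholder record: the name and the 0.0, 0.0 coordinates are stand-ins only.
example = island_feature(
    name="Example Island",
    lon=0.0,
    lat=0.0,
    summary="Placeholder historical summary.",
    links=[],
)

geojson_text = json.dumps(feature_collection([example]), indent=2)
print(geojson_text)
```

Keeping the research in a plain structured file like this would also make it possible to switch mapping platforms later without redoing the data entry.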
- Time estimate: 6 months to 1 year.
- Skills estimate
Known skills: Research, writing, and presentation skills. Basic HTML, WordPress skills.
Skills I would need to learn: GIS skills/CartoDB, more advanced WordPress
- How to make the stripped down version
The stripped down version would limit the initial research phase to more readily available resources and save more labor-intensive research (such as visits to museums and archives) for a later phase. This version would focus on building the basic interface first and then expanding it with additional content on an ongoing basis.
- Time estimate: 6 months.
- Skills estimate
Known skills: Research, writing, and presentation skills. Basic HTML, WordPress skills.
Skills I would need to learn: GIS skills/CartoDB, more advanced WordPress

