DSpace and the Institutional Repository: Preservation and the Spatial Data Infrastructure

By Andrew Battista and Stephen Balogh

This is the second in a series of seven posts on the development of GeoBlacklight at New York University and is published on the first day of Geo4Lib Camp at Stanford University. To see the first post and an overview of the schedule, visit here.

The role of an institutional repository in building collections

In the first post, we provided an overview of our collection process and technical infrastructure. Now, we begin our detailed narrative with the first step: preservation within an institutional repository. Although we are not experts on the landscape of library repositories, we recognized early on that many schools are developing multi-purpose, centralized environments that attend to the interconnected needs of storage, preservation, versioning, and discovery. Furthermore, when it comes to collecting born-digital objects, every institution exists somewhere on a continuum of readiness. Some have a systematic, well-orchestrated approach to collecting digital objects and media of many kinds, while others have only a partially developed sense of the degree to which archival collections and other ad hoc born-digital items should exist within traditional library collections.

Stanford University is at the former end, it seems. The Stanford Digital Repository (SDR) is a multifaceted, libraries-based repository that allows for both collections (including archival materials, data, and spatial data) and researcher-submitted contributions (including publications and research data) to be housed together. Their repository attends to preservation, adopts a metadata standard (MODS) for all items, and imagines clear ways for content to be included within a larger discovery environment, SearchWorks. The SDR suggests a unifying process for providing access to many disparate forms of knowledge, some of which are accompanied by complex technological needs, and it facilitates the exposure of such items in ways that are useful to scholars and researchers of all kinds. See also the Purdue University Research Repository (PURR) for another good example of this model.

NYU is near the other end of this continuum of readiness, although this is changing with the advancement of our Digital Repository Services for Research (DRSR) project. The closest thing we have currently is our Faculty Digital Archive (FDA), which is an instance of DSpace. As the name implies, the FDA was conceived as a landing place for individual faculty research, such as dissertations, electronic versions of print publications, and other items. More recently, we have started deploying it as a place to host and mediate purchased data collections for the NYU community, and we are encouraging researchers to submit their data as a way of fulfilling grant data management plan requirements. These uses anticipate the kind of function the Stanford Digital Repository serves, but we haven’t arrived there yet.

Overview of DSpace and other options

Although NYU’s IR status is in flux, we decided to begin with the FDA (DSpace) as we developed our spatial data infrastructure. In this case, it was a good decision to work within a structure that already existed locally. Fortunately, the specific tool used in the preservation component of the collection workflow is the least connected to GeoBlacklight of all our project components, and it can be altered as larger changes take place within an institution (as they inevitably do). There are other options available aside from DSpace, most notably the Hydra/Fedora project, a growing community that develops a robust range of resources for the preservation, presentation, and access of data. We won’t belabor this point, other than to say that if you’re working in a context with no preservation repository in place, it is highly advisable to stand up an instance of DSpace, Hydra, or some other equivalent before beginning collection efforts. Even a persistent URL generator could work.
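To make that last point concrete, here is a minimal sketch of what even a bare-bones persistent identifier service involves. Everything here is hypothetical: a real DSpace instance mints handles through the Handle System, not through a class like this, and the prefix and identifier format are invented for illustration.

```python
import uuid

class HandleMinter:
    """Toy persistent-identifier minter. Hypothetical: a real DSpace
    instance mints handles (e.g., 2451/33910) via the Handle System."""

    def __init__(self, prefix="2451"):
        self.prefix = prefix      # naming-authority prefix (assumed value)
        self.registry = {}        # identifier -> current storage location

    def mint(self, location):
        """Mint a new opaque identifier and bind it to a location."""
        identifier = f"{self.prefix}/{uuid.uuid4().hex[:8]}"
        self.registry[identifier] = location
        return identifier

    def resolve(self, identifier):
        """Return the current location; the location can change later
        without ever changing the identifier itself."""
        return self.registry[identifier]
```

The key property, however it is implemented, is the separation in `resolve`: the identifier stays stable while the storage location behind it is free to move.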

Collection Principles

The concept of a preservation repository spurred our thinking about some important collection development principles at the outset. The first and most obvious is that preservation is vital. Agencies release data periodically, retract old files, and produce new ones. Data comes and goes, and for the sake of stability and posterity, it is incumbent on those creating access to data to be invested in its preservation as well. Thus, we will make an effort to preserve all data that we acquire, including public data.

Second, the IR concept demanded that we make a decision regarding the level of data at which we would collect. In other words, the concept forced us to define what constitutes a “digital object” and plan accordingly. For many reasons, we decided that the individual shapefile or feature class layer is the level at which we would collect. The layer as an object corresponds with the level of operation intrinsic to GIS software, and it’s the way that data is already organized in the field. This means, for instance, that we will not create a single item record that bundles multiple shapefiles from a set or a collection. There is a 1:1 correspondence between each layer and each unique ID.

Adding Non-conventional Documentation or Metadata

Beginning our collection by preserving a “copy of record” in the FDA also gave us the space and flexibility to pair proprietary data with its original documentation, particularly in cases in which documentation comes in haphazard or unwieldy forms. This record of 2010 China Coastlines and Islands is a good example.


A screenshot of an item in the FDA that is also in our spatial data repository.

We purchased this set from the University of Michigan China Data Center, and the files came with a rudimentary Excel spreadsheet and an explanation of the layer titles and meanings. Rather than trying to transform that spreadsheet into a conventional metadata format, we just uploaded the spreadsheet with the item record. Now, when the data layer is discovered, users have a direct pathway back to original documentation. Note that there are other ways to expose original documentation and metadata within GeoBlacklight itself; this is just one that is easy and flexible.

The Chicken-Egg Conundrum 

The final (and most important) role of the repository model is that it generates a unique identifier which marks data in multiple places within the metadata schema and the technology stack behind it. This identifier applies to the item record in the repository, the layer in Geoserver, and the data file itself. However, given the workflow inherent in our process of collecting, the need for a unique identifier creates a classic chicken-egg scenario. Creating a metadata record for an item requires that item’s unique identifier, yet until we have made a “digital space” to store the item, no such identifier exists. Invariably, then, the collection process starts with preserving the item of record, but it involves two separate steps. At the individual level, this is an easy problem to solve: we create an item in DSpace, get the unique identifier (the handle address), and then plug it into the GeoBlacklight metadata record. But as an ontological challenge and a workflow challenge, it’s something to think about.
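The two-step resolution can be sketched as follows. This is an illustrative sketch, not our production code: the function names and handle prefix are assumptions, and the record is reduced to a few fields (in an actual GeoBlacklight record, dct_references_s is stored as a JSON-encoded string).

```python
import itertools

# Step 1 must happen before step 2, because several fields of the
# discovery record embed the identifier minted by the repository.

_counter = itertools.count(10000)

def create_repository_item(title):
    """Step 1: deposit the copy of record; the repository mints the
    unique identifier (in DSpace, a handle)."""
    return f"2451/{next(_counter)}"

def build_geoblacklight_record(handle, title):
    """Step 2: only now can the discovery record be written, because
    the identifier and download-reference fields embed the handle."""
    return {
        "dc_identifier_s": handle,
        "dc_title_s": title,
        "dct_references_s": {
            "http://schema.org/downloadUrl":
                f"http://hdl.handle.net/{handle}",
        },
    }

handle = create_repository_item("2010 China Coastlines and Islands")
record = build_geoblacklight_record(handle, "2010 China Coastlines and Islands")
```

The dependency runs in one direction only: the repository deposit can exist without the discovery record, but not the reverse.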

Brief Snapshot of the Collection Model

At this point, it makes sense to introduce one of our first project diagrams: a hand-drawn overview of our architecture and collection model. The section in the red box is what’s pertinent here.


An overview of NYU’s spatial data infrastructure

The collection process is as follows. The decision to collect happens, and we add an item in the FDA, which generates a unique ID (a handle). Note that we have created two distinct collections within our DSpace: a collection for unrestricted data and a separate collection for private, restricted data. The generated handle, in turn, is integrated into several places in the GeoBlacklight metadata record, including the dct_references field, which triggers the direct file download within the GeoBlacklight application (more on that in future posts). We preserve both the “copy of record” and a re-projected version of the file on the same item record in the institutional repository, a decision we will explain later when we talk about the technology stack behind the project.
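The restricted/unrestricted split amounts to a pair of routing decisions, sketched below. The DSpace collection names are hypothetical; the two values for GeoBlacklight’s dc_rights_s field (“Public” and “Restricted”) follow the metadata schema as we understand it.

```python
def choose_collection(is_restricted):
    """Restricted and unrestricted data live in separate DSpace
    collections (names here are invented for illustration)."""
    return "FDA/restricted-data" if is_restricted else "FDA/public-data"

def rights_field(is_restricted):
    """The discovery record's rights field mirrors the same decision,
    so access control and discovery never disagree."""
    return "Restricted" if is_restricted else "Public"
```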

Batch Imports

Ingesting records at the batch level is a bit more involved than individual additions, of course. Using the FDA at NYU has required us to collaborate with the managers of the system, Digital Library Technology Services (DLTS). Since we don’t administer our repository directly, we’ve had to work with the team that does and provide the .CSV files for batch upload. Because we generate our GeoBlacklight metadata with Omeka, we were making .CSVs regardless, but we got around the “chicken and egg” conundrum by having DLTS do a batch upload, which would “mint” a unique handle ID for each item. Then, we’d apply that handle ID to the item and re-upload or overwrite the existing items to include updated files, documentation, and descriptions. Most repositories have a solution for batch uploading according to a set metadata standard, to which you may be required to conform.
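The batch round-trip can be sketched in a few lines. This is a hedged illustration, not our actual tooling: the column names are hypothetical, real records carry many more fields, and we assume the minted handles come back as a simple title-to-handle report after the first upload.

```python
import csv
import io

def placeholder_csv(titles):
    """First pass: a bare CSV whose only purpose is to get the
    repository to mint a handle per item."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title"])
    writer.writeheader()
    for title in titles:
        writer.writerow({"title": title})
    return buf.getvalue()

def merge_minted_handles(titles, minted):
    """Second pass: attach each minted handle (reported back as a
    title -> handle mapping) to its record, ready for the re-upload
    with full files, documentation, and descriptions."""
    return [{"title": t, "handle": minted[t]} for t in titles]
```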

On using a Third-Party Unique Identifier

At various points in time during this project, we thought about the benefits of using a third-party persistent link generator in order to facilitate the collection process and mitigate the chicken-egg problematic. Fortunately, GeoBlacklight metadata is capacious enough to allow for several different kinds of unique identifier services. Stanford uses PURL, others use ARK, while still others use DOIs. We’ve chosen to remain with the unique ID minted by DSpace, at least for now.

Up Next: Creating GeoBlacklight Metadata Records

Nothing we’ve discussed regarding preservation within an institutional repository is directly implicated in the GeoBlacklight application itself. However, it should go without saying that these preservation steps are the foundation of our spatial data infrastructure. The application is inseparable from the metadata. In the next post, we will discuss the many ways to generate GeoBlacklight metadata as we collect spatial data.

Apps and Senior Adults

This past week, Lauren Wallis and I had the chance to deliver a session on apps and programming for senior adults at the 2014 meeting of the Alabama Gerontological Society. The theme of the conference was, “Aging: is there an app for that?” We discussed ways to filter through the iTunes app store and integrate iPads into practice settings with senior adults. Here are the slides of our presentation. Thoughts and comments welcome!

Sample GIS projects

I’ve written previously (here and here) on the relationship between GIS projects and information literacy learning. After the Fall 2013 semester concluded, I compiled a list of GIS projects completed by students in GEOG 231 World Regional Geography at the University of Montevallo. Working in groups, the students explored issues like immigration, wage earnings according to gender, educational attainment, and economic growth in the wake of Disney World. The groups were instructed to identify several indicators and represent social changes over time via Google Fusion Tables. Here’s a sample list of projects:

The Impact of the Creation of Disney World

Elderly Voting in the 2008 Election

Between Religion and Poverty: Megachurches and Poverty in California

Immigration: Crossing the Line

Gay Marriage and Population

Gender Wage Gap in the U.S.A.: 1999 vs. 2012

Comparison of Graduation Rates and Median Income Levels in Alabama Counties

Population Change in Arizona After the 2008 Recession

I’m interested to hear feedback! Who else is doing GIS projects with undergraduate students in social sciences disciplines?

Twitter Analytics

Many of you know that Laurel Hitchcock and I have embarked on a project that explores the impact of social media-based pedagogy on professional practice. By now, many social work educators are integrating Twitter and other social media platforms into the classroom, and there is much evidence that social media is an important vehicle of information gathering and advocacy.


However, thus far, one gap in education research is our reluctance to derive empirical measures of how students use social media platforms once they leave the classroom. Do the social media strategies we teach sink in? Furthermore, do we have any evidence that the social media literacy we cultivate in the classroom has a meaningful impact on professional practice?

GIS Projects in Social Work Education: Preparing Resources for Agencies

Should Social Work students learn rudimentary Geographic Information Systems (GIS) skills as they prepare for professional practice?

Geographic Information Systems (GIS) projects are interactive representations of information on maps that illustrate how societies develop and change over time. Fundamentally, GIS projects are interactive constructions of complex social, economic, and cultural phenomena, and they invite students to locate information and deliver it via a spatial medium. Just about any research question in any discipline can be framed and explored on a map. In fact, GIS is a way of thinking and a way to solve problems. GIS products, like embedded Google maps or searchable Esri ArcGIS databases, are also repositories of information that can serve entire communities and populations. GIS is yet another social media platform that social work students can master as they prepare for professional practice.

Although not usually considered social media, Google’s Maps platform allows people to construct rudimentary GIS products that could be valuable sources of information for members of a community. Anyone with minimal knowledge of maps can custom-build information, insert commentary, and add ideas into Google’s highly-recognizable interface. I would go so far as to say that when social work professionals enter the field, they will be expected to create interactive maps to be displayed on agency websites or blogs. These maps can be simple or elaborate, and they could spatially orient people to vital resources in a community.

For example, consider this map designed by the Chicago Coalition for the Homeless in 2012. Agency staff created a Google Map that pins the locations of homeless shelters in the Chicago area that house registered voters and also have programs in place that encourage residents to register and vote.


It’s true that any semi-savvy Internet user could search for “shelters” on a Google Map and retrieve a reasonably complete set of locations. However, the added value of this GIS map by the Chicago Coalition for the Homeless is that its pins contain brief, personalized blurbs about the voting resources each shelter provides for that particular election cycle. Another example is this Community Snapshots map, created by City Heights Support Services of Price Charities in San Diego, CA. The public housing advocacy group has pinned information that lets people know about counseling services, financial education centers, farmers’ markets, and other resources that benefit residents of the downtown San Diego area. Note that this map is embedded directly on the agency website.

Social Media, Pedagogy, and Conflict of Interests

There’s been a great deal of hype recently about the news that Twitter will join the ranks of Facebook and schedule an initial public offering. It’s hard to say at what level TWTR will trade when it is offered later this week or early next week, and it’s harder still to determine whether or not investors will clamor to snap up the stock like they did when Facebook went public in May 2012 (or if there will be any similar malfeasance in the price-setting process).

image via Digitaltrends.com


The Twitter IPO should be a reminder that much of the infrastructure we use in higher education to teach digital literacy is produced by companies who stand to profit when people devote their time and attention to them. Twitter’s speculated market value is as high as $20 billion, but this is only because the company offers a platform that captures people’s attention, the only valuable commodity that remains today.

Rogue profiteering notwithstanding, the learning benefits of networks like Twitter are profound. Twitter allows students and young professionals to synthesize staggering amounts of information, make connections with others, and present themselves to communities of like-minded thinkers. When developed, Twitter literacy can open doors for students and lead to the discovery of information that will allow them to be responsible citizens and scholars. All of this costs no money at all; the price we pay is our time and attention.

As Twitter has evolved over the past seven years, academics have published articles outlining the pedagogical value of microblogging and have argued for Twitter’s essential role in the undergraduate learning experience. For instance, Christine Greenhow and Benjamin Gleason published a piece in which they suggest that higher education needs “better theorization and study of the forms and functions of social media communication” if it is to prepare students for today’s workforce and professional environment. I, along with my colleague Laurel Hitchcock, have also published an article on Twitter and its place within social work education.

Toward an Empirical Study of Twitter Use

Even though this blog has been dormant for a while, I’m going to write a post on it now. Eventually I’ll start writing more. Today, though, I’ve been thinking about ways to study how people use Twitter. When teaching with Laurel Hitchcock, we require students to use Twitter to share information, network with other people in their respective fields, or stay informed in order to better serve as professionals (see our article, recently published in the Journal of Baccalaureate Social Work, for more about our integration of Twitter into the classroom). We’ve taught students in several settings, and in each instance, we have implored students to be active users of Twitter after they leave the classroom. Twitter is particularly well-suited for social work students because they will be entering a field with an express commitment to social justice.

Does #curationculture blog live after course ends?

I found an infographic on how professors use social media. Two points seem worth mentioning here: the social networking gap between students and professors is shrinking and professors in the humanities use social networking the most for pedagogical purposes, which could include blogs, wikis, and podcasts.

I have reproduced the graphic below, but the whole story can be found here.

Is it important that the gap is closing? Do students think it is important, helpful and/or beneficial to have social networking and other web 2.0 experiences integrated into the course?