
The Technology Stack: Amazon Web Services Products

This is the fourth in a series of posts on the development of GeoBlacklight at NYU. Apologies for what has turned into a semester-long hiatus, but we are back now, using the GeoBlacklight community sprint from July 25 – August 5 to finish the documentation narrative and write more about ongoing developments and open questions. For earlier posts and a description of the project, click here.

By Andrew Battista and Stephen Balogh

GeoBlacklight itself is simple to deploy (and there are several richly documented resources on how to do this with Packer, Vagrant, Docker, and other virtual machine tools). However, the dependencies and the entire technology stack behind our Spatial Data Infrastructure are a bit complex, so we hope that some insight into our specifications and installations can help others in their development process. Note that neither the specific cloud-based services nor the way these software platforms relate to each other is necessarily intrinsic to GeoBlacklight. There are many other ways to deploy these tools.

Our most important initial decision was to use a cloud computing provider to host the majority of the components comprising our geospatial data infrastructure. We ended up using Amazon Web Services, but many viable competitors to AWS exist as well. Early on, AWS allowed us to prototype and stand up core elements of the stack to develop a good proof of concept, and using cloud infrastructure provided us with the invaluable opportunity to test interconnections between GeoBlacklight and the various other components of the stack. With AWS, it is simple to spin up temporary servers for testing, and it has proven very straightforward to scale up directly from our initial prototype.

At NYU, it became apparent that AWS could be a solution for running the spatial data services in production, not just development. So far, we have been able to maintain a high-functioning service at a very reasonable cost, and because our university and our team had already had success with AWS products, we were able to get institutional buy-in quickly.

Outlining the Pieces

Let’s begin with our spatial data infrastructure as it currently stands.


NYU Spatial Data Repository architecture diagram by Himanshu Mistry

What started as a series of drawings on a chalkboard ended up as this mostly-correct diagram of our Spatial Data Repository deployment. To explain how we stood up these various pieces, we created a table that indexes each part as it pertains to our collection and service flow. In-depth glosses follow the table.

| SDR Component | Description | Documentation | AWS Product | NYU Deployment |
| --- | --- | --- | --- | --- |
| PostGIS Database | Holds reprojected vector data | Available here | RDS | unavailable |
| Solr | Needed for GeoBlacklight searches | Available here | EC2 | unavailable |
| MySQL | Background database for GeoBlacklight | Available here | RDS | unavailable |
| DSpace | Institutional preservation repository | General documentation | N/A | http://archive.nyu.edu |
| GeoServer | Produces WMS/WFS endpoints & layer previews | http://docs.geoserver.org/ | EC2 | maps-public.geo.nyu.edu |
| GeoBlacklight | Ruby on Rails discovery environment | Available here | EC2 | https://geo.nyu.edu |

PostGIS Database

Our PostGIS database runs on Amazon’s managed relational database service (RDS). PostGIS extensions should be added to a default PostgreSQL database by following these instructions. While configuring your database, make sure to create read-only user accounts for GeoServer to use when retrieving vector geometries. At the moment, we use PostGIS only for vector data; although it would be possible to store rasters in it as well, we have opted to keep raster data as GeoTIFFs.
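To make the setup concrete, here is a minimal sketch (not our exact setup) of enabling PostGIS and creating a read-only role for GeoServer, expressed as a Ruby script using the pg gem. The endpoint, database name, and role name are placeholders.

```ruby
# A minimal sketch: enable PostGIS on an RDS PostgreSQL database and create a
# read-only role for GeoServer. Host, database, and role names are placeholders.
# Requires the 'pg' gem.
require 'pg'

conn = PG.connect(
  host:     'your-rds-endpoint.rds.amazonaws.com', # placeholder RDS endpoint
  dbname:   'vector_data',                         # placeholder database name
  user:     'master_user',                         # RDS master user
  password: ENV['PGPASSWORD']
)

# Enable the PostGIS extension (run once per database).
conn.exec('CREATE EXTENSION IF NOT EXISTS postgis;')

# Create a read-only role that GeoServer will use to fetch vector geometries.
conn.exec("CREATE ROLE geoserver_ro LOGIN PASSWORD '#{ENV['GEOSERVER_RO_PASSWORD']}';")
conn.exec('GRANT CONNECT ON DATABASE vector_data TO geoserver_ro;')
conn.exec('GRANT USAGE ON SCHEMA public TO geoserver_ro;')
conn.exec('GRANT SELECT ON ALL TABLES IN SCHEMA public TO geoserver_ro;')
conn.exec('ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO geoserver_ro;')

conn.close
```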

One PostGIS database contains all vector layers, regardless of whether they are Public or Restricted (that distinction only becomes consequential at the GeoServer level for us). We use Amazon’s security groups to restrict traffic to and from this database, for additional protection. Traffic from outside the Virtual Private Cloud may be limited to NYU IP ranges (or another institutional IP range) if there is a desire to allow direct connections. Otherwise, when establishing the PostGIS database, access should be completely restricted to the GeoServer instance(s), as well as a management server.
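For illustration, here is a hedged sketch, using the aws-sdk-ec2 gem, of the kind of security-group rule this implies: database traffic on the PostgreSQL port is allowed only from the GeoServer instances’ security group and from an institutional IP range. The region, group IDs, and CIDR block are placeholders, not our actual values.

```ruby
# A sketch (placeholders throughout): allow PostGIS traffic into the database's
# security group only from the GeoServer security group and from an example
# institutional IP range. Requires the 'aws-sdk-ec2' gem.
require 'aws-sdk-ec2'

ec2 = Aws::EC2::Client.new(region: 'us-east-1') # placeholder region

ec2.authorize_security_group_ingress(
  group_id: 'sg-0aaa1111bbbb2222c', # placeholder: the RDS/PostGIS security group
  ip_permissions: [
    {
      ip_protocol: 'tcp',
      from_port: 5432,
      to_port: 5432,
      # Allow the GeoServer instances, identified by their security group...
      user_id_group_pairs: [{ group_id: 'sg-0ddd3333eeee4444f' }], # placeholder
      # ...and an example institutional IP range for direct desktop connections.
      ip_ranges: [{ cidr_ip: '192.0.2.0/24', description: 'example campus range' }]
    }
  ]
)
```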

We have also experimented with directly connecting to PostGIS from a desktop GIS client (like QGIS). While this is possible, we have gotten better results by making a desktop connection via GeoServer’s WFS.

MySQL (Database for Blacklight)

We have chosen to use MySQL as the backend database for Blacklight. A database is required for user data within the application, such as bookmarks. If replicating this, make sure to add gem 'mysql2' to your Gemfile and adjust the appropriate database parameters. Our instance of MySQL also runs on Amazon RDS, and it contains two databases (one for production and one for development).
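Concretely, the change amounts to a Gemfile entry plus pointing the Rails database configuration at the RDS instance. The endpoint and database names below are placeholders, not our actual values.

```ruby
# Gemfile — use the MySQL adapter for Blacklight's application database.
gem 'mysql2'

# config/database.yml then needs to point at the RDS instance, roughly:
#
#   production:
#     adapter:  mysql2
#     host:     your-rds-endpoint.rds.amazonaws.com   # placeholder endpoint
#     database: geoblacklight_production              # placeholder name
#     username: geoblacklight
#     password: <%= ENV['GEOBLACKLIGHT_DB_PASSWORD'] %>
```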

Solr

Solr is the “blazing fast” search engine that powers GeoBlacklight. Since Solr has no user-access restrictions by default, access to it is highly restricted in our deployment (and we recommend following suit). GeoBlacklight communicates with it directly via the RSolr client. At no time does a user connect directly to a Solr core via the browser; queries always go through Blacklight, which handles the Solr connection. Solr is deployed on an EC2 instance and is firewalled such that it can only communicate with the Rails server and a deployment server designated for handling metadata ingest. We maintain production, staging, and development Solr cores so that we can preview new records before publishing them to the production instance. This is a good strategy for catching errors in metadata records or seeing the effects of large-scale changes in metadata.
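To illustrate the relationship, here is a rough sketch of the kind of query the Rails application sends through RSolr on a user’s behalf. The host and core name are placeholders, and the query is simplified; it is not the actual request GeoBlacklight constructs.

```ruby
# Only the Rails application (via RSolr) ever queries Solr; users never hit the
# core directly. Host and core name below are placeholders.
require 'rsolr'

solr = RSolr.connect(url: 'http://solr.internal:8983/solr/production-core')

# A simplified keyword query, similar in spirit to what GeoBlacklight issues.
response = solr.get('select', params: { q: 'census tracts', rows: 10 })
response['response']['docs'].each { |doc| puts doc['dc_title_s'] }
```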

GeoServer(s)

GeoServer is an open-source server for sharing geospatial data, and it provides two crucial services for us: the WMS and WFS endpoints that GeoBlacklight needs for layer previews and generated layer downloads, respectively. We run two instances of GeoServer, one for Public data and one for Restricted data.

Both connect directly to the PostGIS database, though the layers enabled by each (and therefore being served) are mutually exclusive, dependent on the Rights status in the metadata for the records. Layers are enabled and disabled by a Ruby script that runs through GeoBlacklight-schema JSON records and then connects to the GeoServer REST API (a rough sketch follows below). We have separate instances of GeoServer for Public and Restricted data so that we can limit access to the Restricted endpoint to NYU IP address ranges. For users trying to access Restricted data from off campus, OCLC EZproxy establishes a proxy connection after users authenticate with NYU Single Sign-on.
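The sketch below is not our production script, just an illustration of the pattern: read each GeoBlacklight-schema JSON record, check its rights status, and toggle the corresponding layer on the appropriate GeoServer instance via the REST API. The restricted host, credentials, and workspace name are placeholders, and we assume the REST layers endpoint accepts a JSON body.

```ruby
# Illustrative only: toggle layer availability from GeoBlacklight-schema JSON
# records via the GeoServer REST API. Hosts, credentials, and the 'sdr'
# workspace are placeholders.
require 'json'
require 'net/http'
require 'uri'

PUBLIC_GEOSERVER     = 'http://maps-public.geo.nyu.edu/geoserver'
RESTRICTED_GEOSERVER = 'http://maps-restricted.internal/geoserver' # placeholder

def set_layer_enabled(geoserver_url, layer_name, enabled)
  uri = URI("#{geoserver_url}/rest/layers/sdr:#{layer_name}")
  request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  request.basic_auth(ENV['GEOSERVER_USER'], ENV['GEOSERVER_PASSWORD'])
  request.body = { layer: { enabled: enabled } }.to_json
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Walk a directory of GeoBlacklight JSON records and enable each layer on the
# instance that matches its rights status, disabling it on the other.
Dir.glob('records/*.json').each do |path|
  record       = JSON.parse(File.read(path))
  layer        = record['layer_slug_s']
  public_layer = record['dc_rights_s'] == 'Public'

  set_layer_enabled(PUBLIC_GEOSERVER,     layer, public_layer)
  set_layer_enabled(RESTRICTED_GEOSERVER, layer, !public_layer)
end
```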

GeoBlacklight

The SDR currently relies on GeoBlacklight to provide a discovery interface. Our local code modifications to GeoBlacklight core are logged on the production code instance here. GeoBlacklight is a Ruby on Rails application, and we have deployed it on an Ubuntu 14.04 server, hosted on an Amazon Web Services EC2 instance. Phusion Passenger, via Apache, is the Rails web server. HTTPS is forced (this is a requirement of the OmniAuth integration). Here is a helpful set of instructions on how to implement Passenger with Apache on Ubuntu 14.04 (albeit from a different cloud provider).

Connecting with SSO

As is the case for many institutions providing access to geospatial data, there are specific license restrictions on much of the data in our collection. The easiest way to mediate access to these protected layers is to set up two instances of GeoServer and gate the restricted instance with NYU’s Single Sign-on service. GeoBlacklight also needs to be aware of user accounts from the institution’s greater SSO environment. Since GeoBlacklight is already built around user models from Devise, we were able to connect to NYU SSO by using an OmniAuth strategy for Devise (a sketch follows below).
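As a hedged sketch of the wiring, the snippet below uses the omniauth-saml strategy purely for illustration; NYU’s actual strategy and settings may differ, and the IdP URLs and fingerprint shown are placeholders.

```ruby
# config/initializers/devise.rb — a sketch of hooking Devise into an
# institutional SSO via an OmniAuth strategy (omniauth-saml used here for
# illustration; all values are placeholders).
Devise.setup do |config|
  config.omniauth :saml,
    issuer:               'https://geo.example.edu/shibboleth',                     # placeholder
    idp_sso_target_url:   'https://sso.example.edu/idp/profile/SAML2/Redirect/SSO', # placeholder
    idp_cert_fingerprint: ENV['IDP_CERT_FINGERPRINT']
end

# app/models/user.rb — make the Devise user model omniauthable.
class User < ActiveRecord::Base
  devise :database_authenticatable, :rememberable, :trackable,
         :omniauthable, omniauth_providers: [:saml]
end
```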

Final Words

We are happy to answer any specific question about the deployment of these platforms. Also note that during the current GeoBlacklight sprint, several minor changes may be made. In the next post, we’ll revisit some of these changes and talk about some of our collection strategies.

DSpace and the Institutional Repository: Preservation and the Spatial Data Infrastructure

By Andrew Battista and Stephen Balogh

This is the second in a series of seven posts on the development of GeoBlacklight at New York University and is published on the first day of Geo4Lib Camp at Stanford University. To see the first post and an overview of the schedule, visit here.

The role of an institutional repository in building collections

In the first post, we provided an overview of our collection process and technical infrastructure. Now, we begin our detailed narrative with the first step: preservation within an institutional repository. Although we are not experts on the landscape of library repositories, we recognized early on that many schools are developing multi-purpose, centralized environments that attend to the interconnected needs of storage, preservation, versioning, and discovery. Furthermore, when it comes to collecting born-digital objects, every institution exists somewhere on a continuum of readiness. Some have a systematic, well-orchestrated approach to collecting digital objects and media of many kinds, while others have only a partially developed sense of the degree to which archival collections and other ad hoc born-digital items should exist within traditional library collections.

Stanford University is at the former end, it seems. The Stanford Digital Repository (SDR) is a multifaceted, libraries-based repository that allows for both collections (including archival materials, data, and spatial data) and researcher-submitted contributions (including publications and research data) to be housed together. Their repository attends to preservation, adopts a metadata standard (MODS) for all items, and imagines clear ways for content to be included within a larger discovery environment, SearchWorks. The SDR suggests a unifying process for providing access to many disparate forms of knowledge, some of which are accompanied by complex technological needs, and it facilitates the exposure of such items in ways that are useful to scholars and researchers of all kinds. See also the Purdue University Research Repository (PURR) for another good example of this model.

NYU is near the other end of this continuum of readiness, although this is changing with the advancement of our Digital Repository Services for Research (DRSR) project. The closest thing we have currently is our Faculty Digital Archive (FDA), which is an instance of DSpace. As the name implies, the FDA was conceived as a landing place for individual faculty research, such as dissertations, electronic versions of print publications, and other items. More recently, we have started deploying it as a place to host and mediate purchased data collections for the NYU community, and we are encouraging researchers to submit their data as a way of fulfilling grant data management plan requirements. These uses anticipate the kind of function the Stanford Digital Repository serves, but we haven’t arrived there yet.

Overview of DSpace and other options

Although NYU’s IR status is in flux, we decided to begin with the FDA (DSpace) as we developed our spatial data infrastructure. In this case, it was a good decision to work within a structure that already existed locally. Fortunately, the specific tool used in the preservation component of the collection workflow is the least connected to GeoBlacklight of all our project components, and it can be altered as larger changes take place within an institution (as they inevitably do). There are other options available aside from DSpace, most notably the Hydra/Fedora project, a growing community that develops a robust range of resources for preserving, presenting, and providing access to data. We won’t belabor this point, other than to say that if you’re working within a context that has no preservation repository in place, it is highly advisable to stand up an instance of DSpace, Hydra, or some other equivalent before beginning collection efforts. Even a persistent URL generator could work.

Collection Principles

The concept of a preservation repository spurred our thinking about some important collection development principles at the outset. The first and most obvious is that preservation is vital. Agencies release data periodically, retract old files, and produce new ones. Data comes and goes, and for the sake of stability and posterity, it is incumbent on those creating access to data to be invested in its preservation as well. Thus, we will make an effort to preserve all data that we acquire, including public data.

Second, the IR concept demanded that we decide on the level of data at which we would collect. In other words, the concept forced us to define what constitutes a “digital object” and plan accordingly. For many reasons, we decided that the individual shapefile or feature class layer is the level at which we would collect. The layer as an object corresponds with the level of operation intrinsic to GIS software, and it’s the way that data is already organized in the field. This means, for instance, that we will not create a single item record that bundles multiple shapefiles from a set or a collection. There is a 1:1 correlation between each layer and each unique ID.

Adding Non-conventional Documentation or Metadata

Beginning our collection by preserving a “copy of record” in the FDA also gave us the space and flexibility to pair proprietary data with its original documentation, particularly in cases in which documentation comes in haphazard or unwieldy forms. This record of 2010 China Coastlines and Islands is a good example.


A screenshot of an item in the FDA that is also in our spatial data repository.

We purchased this set from the University of Michigan China Data Center, and the files came with a rudimentary Excel spreadsheet and an explanation of the layer titles and meanings. Rather than trying to transform that spreadsheet into a conventional metadata format, we just uploaded the spreadsheet with the item record. Now, when the data layer is discovered, users have a direct pathway back to original documentation. Note that there are other ways to expose original documentation and metadata within GeoBlacklight itself; this is just one that is easy and flexible.

The Chicken-Egg Conundrum 

The final (and most important) role of the repository model is that it generates a unique identifier that marks the data in multiple places within the metadata schema and the technology stack behind it. This identifier applies to the item record in the repository, the layer in GeoServer, and the data file itself. However, given our collection workflow, the need for a unique identifier creates a classic chicken-egg scenario. In order to create a metadata record for an item, we need its unique identifier, yet until we have made a “digital space” to store the item, that identifier does not exist. Invariably, then, the collection process starts with preserving the item of record, but it involves two separate steps. At the individual level, this is an easy problem to solve: we create an item in DSpace, get the unique identifier (the handle address), and then plug it into the GeoBlacklight metadata record. But as an ontological challenge and a workflow challenge, it’s something to think about.

Brief Snapshot of the Collection Model

At this point, it makes sense to introduce one of our first project diagrams: a hand-drawn overview of our architecture and collection model. The section in the red box is what’s pertinent here.


An overview of NYU’s spatial data infrastructure

The collection process is as follows. The decision to collect happens, and we add an item in the FDA, which generates a unique ID (a handle). Note that we have created two distinct collections within our DSpace: one for unrestricted data and a separate one for private, restricted data. The generated handle, in turn, is integrated into several places in the GeoBlacklight metadata record, including the dct_references field, which triggers the direct file download within the GeoBlacklight application (more on that in future posts; an abbreviated example follows below). We preserve both the “copy of record” and a re-projected version of the file on the same item record in the institutional repository, a decision we will explain later when we talk about the technology stack behind the project.
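To make that concrete, here is an abbreviated, illustrative GeoBlacklight-schema record expressed as a Ruby hash, showing where the DSpace handle surfaces. The handle number, slug, and URLs are invented for the example, not taken from an actual record.

```ruby
# An abbreviated, illustrative GeoBlacklight-schema record. The handle, slug,
# and URLs are invented for the example.
require 'json'

record = {
  'dc_identifier_s'  => 'http://hdl.handle.net/2451/12345',   # handle minted by DSpace
  'layer_slug_s'     => 'nyu_2451_12345',
  'dc_rights_s'      => 'Public',
  'dct_provenance_s' => 'NYU',
  # dct_references_s is a JSON string mapping reference types to URLs; the
  # downloadUrl entry is what drives the direct file download in GeoBlacklight.
  'dct_references_s' => {
    'http://schema.org/downloadUrl' =>
      'https://archive.nyu.edu/bitstream/2451/12345/1/nyu_2451_12345.zip',
    'http://www.opengis.net/def/serviceType/ogc/wms' =>
      'https://maps-public.geo.nyu.edu/geoserver/wms'
  }.to_json
}

puts record['dct_references_s']
```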

Batch Imports

Ingesting records at the batch level is a bit more involved than individual additions, of course. Using the FDA at NYU has required us to collaborate with the managers of the system, Digital Library Technology Services (DLTS). Since we don’t administer our repository directly, we’ve had to work with the team that does and provide the .CSV files for batch upload. Because we generate our GeoBlacklight metadata with Omeka, we were making .CSVs regardless, but we got around the “chicken and egg” conundrum by having DLTS do an initial batch upload, which “mints” a unique handle ID for each item. We then apply that handle ID to each item and re-upload or overwrite the existing items to include updated files, documentation, and descriptions (a sketch of the merge step follows below). Most repositories should have a solution for batch uploads that conform to a set metadata standard, to which you may be required to map your records.
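The merge step is mundane but worth sketching: once the minted handles come back, fold them into the metadata CSV before the second pass. The file names and column headers below ('filename', 'handle') are assumptions for illustration, not our actual Omeka export format.

```ruby
# A rough sketch of folding minted handles back into a metadata CSV.
# File names and column headers are assumptions, not our actual export format.
require 'csv'

# Map each data file to the handle minted for it on the initial upload.
handles = CSV.read('minted_handles.csv', headers: true)
             .map { |row| [row['filename'], row['handle']] }
             .to_h

# Write the handle into each GeoBlacklight metadata row.
updated = CSV.read('geoblacklight_metadata.csv', headers: true)
updated.each do |row|
  row['dc_identifier_s'] = handles[row['filename']]
end

CSV.open('geoblacklight_metadata_with_handles.csv', 'w') do |csv|
  csv << updated.headers
  updated.each { |row| csv << row }
end
```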

On using a Third-Party Unique Identifier

At various points during this project, we thought about the benefits of using a third-party persistent link generator to facilitate the collection process and mitigate the chicken-egg problem. Fortunately, GeoBlacklight metadata is capacious enough to allow for several different kinds of unique identifier services. Stanford uses PURL, others use ARK, while still others use DOIs. We’ve chosen to stay with the unique ID minted by DSpace, at least for now.

Up Next: Creating GeoBlacklight Metadata Records

None of what we’ve discussed regarding preservation within an institutional repository is directly implicated in the GeoBlacklight application itself. However, it should go without saying that these preservation steps are the foundation of our spatial data infrastructure. The application is inseparable from the metadata. In the next post, we will discuss the many ways to generate GeoBlacklight metadata as we collect spatial data.

Apps and Senior Adults

This past week, Lauren Wallis and I had the chance to deliver a session on apps and programming for senior adults at the 2014 meeting of the Alabama Gerontological Society. The theme of the conference was “Aging: Is there an app for that?” We discussed ways to filter through the iTunes App Store and integrate iPads into practice settings with senior adults. Here are the slides of our presentation. Thoughts and comments welcome!

Sample GIS projects

I’ve written previously (here and here) on the relationship between GIS projects and information literacy learning. After the Fall 2013 semester concluded, I compiled a list of GIS projects completed by students in GEOG 231 World Regional Geography at the University of Montevallo. Working in groups, the students explored issues like immigration, wage earnings according to gender, educational attainment, and economic growth in the wake of Disney World. The groups were instructed to identify several indicators and represent social changes over time via Google Fusion Tables. Here’s a sample list of projects:

The Impact of the Creation of Disney World

Elderly Voting in the 2008 Election

Between Religion and Poverty: Megachurches and Poverty in California

Immigration: Crossing the Line

Gay Marriage and Population

Gender Wage Gap in the U.S.A.: 1999 vs. 2012

Comparison of Graduation Rates and Median Income Levels in Alabama Counties

Population Change in Arizona After the 2008 Recession

I’m interested to hear feedback! Who else is doing GIS projects with undergraduate students in social sciences disciplines?

Twitter Analytics

Many of you know that Laurel Hitchcock and I have embarked on a project that explores the impact of social media-based pedagogy on professional practice. By now, many social work educators are integrating Twitter and other social media platforms into the classroom, and there is much evidence that social media is an important vehicle of information gathering and advocacy.


However, thus far, one gap in education research is our reluctance to derive empirical measures of how students use social media platforms once they leave the classroom. Do the social media strategies we teach sink in? Furthermore, do we have any evidence that the social media literacy we cultivate in the classroom has a meaningful impact on professional practice?

GIS Projects in Social Work Education: Preparing Resources for Agencies

Should Social Work students learn rudimentary Geographic Information Systems (GIS) skills as they prepare for professional practice?

Geographic Information Systems (GIS) projects are interactive representations of information on maps that illustrate how societies develop and change over time. Fundamentally, GIS projects are interactive constructions of complex social, economic, and cultural phenomena, and they invite students to locate information and deliver it via a spatial medium. Just about any research question in any discipline can be framed and explored on a map. In fact, GIS is a way of thinking and a way to solve problems. GIS products, like embedded Google maps or searchable Esri ArcGIS databases, are also repositories of information that can serve entire communities and populations. GIS is yet another social media platform that social work students can master as they prepare for professional practice.

Although not usually considered social media, Google’s Maps platform allows people to construct rudimentary GIS products that could be valuable sources of information for members of a community. Anyone with minimal knowledge of maps can custom-build information, insert commentary, and add ideas into Google’s highly-recognizable interface. I would go so far as to say that when social work professionals enter the field, they will be expected to create interactive maps to be displayed on agency websites or blogs. These maps can be simple or elaborate, and they could spatially orient people to vital resources in a community.

For example, consider this map designed by the Chicago Coalition for the Homeless in 2012. Agency staff created a Google Map that pins the locations of homeless shelters in the Chicago area that house registered voters and also have programs in place that encourage residents to register and vote.


View Homeless Shelters with Registered Voters Map in a larger map

It’s true that any semi-savvy Internet user could search for “shelters” on a Google Map and retrieve a reasonably complete set of locations. However, the added value of this GIS map by the Chicago Coalition for the Homeless is that its pins contain brief, personalized blurbs about the voting resources each shelter provides for that particular election cycle. Another example is this Community Snapshots map, created by City Heights Support Services of Price Charities in San Diego, CA. The public housing advocacy group has pinned information that lets people know about counseling services, financial education centers, farmers’ markets, and other resources that benefit residents of the downtown San Diego area. Note that this map is embedded directly on the agency website.

Social Media, Pedagogy, and Conflict of Interests

There’s been a great deal of hype recently about the news that Twitter will join the ranks of Facebook and schedule an initial public offering. It’s hard to say at what level TWTR will trade when it is offered later this week or early next week, and it’s harder still to determine whether investors will clamor to snap up the stock like they did when Facebook went public in May 2012 (or if there will be any similar malfeasance in the price-setting process).

image via Digitaltrends.com

The Twitter IPO should be a reminder that much of the infrastructure we use in higher education to teach digital literacy is produced by companies who stand to profit when people devote their time and attention to them. Twitter’s speculated market value is as high as $20 billion, but this is only because the company offers a platform that captures people’s attention, the only valuable commodity that remains today.

Rogue profiteering notwithstanding, the learning benefits of networks like Twitter are profound. Twitter allows students and young professionals to synthesize staggering amounts of information, make connections with others, and present themselves to communities of like-minded thinkers. When developed, Twitter literacy can open doors for students and lead to the discovery of information that will allow them to be responsible citizens and scholars. All of this costs no money at all; the price we pay is our time and attention.

As Twitter has evolved over the past seven years, academics have published articles outlining the pedagogical value of microblogging and have argued for Twitter’s essential role in the undergraduate learning experience. For instance, Christine Greenhow and Benjamin Gleason published a piece in which they suggest that higher education needs “better theorization and study of the forms and functions of social media communication” if it is to prepare students for today’s workforce and professional environment. I, along with my colleague Laurel Hitchcock, have also published an article on Twitter and its place within social work education.