Bob Consoli's "placename lookup" critique
Bob Consoli recently had some nice things to say about Pleiades' role in establishing stable, unique identifiers (URIs) for ancient places and information about them (“A Better Place-name Lookup for PLEIADES Data.” Squinches, July 11, 2013). Thanks!
In the same post, he also expressed some frustration in trying to use data downloaded from Pleiades to do placename lookups. I can't say for sure, but it's my impression that much of his trouble stems from a misunderstanding of the Pleiades data model and of the way our data is serialized into the download "dump" files.
This post is an attempt to clear up that confusion and save Bob (and others) a lot of additional work and fiddling.
Here's the rule to live by: if you want all of the details about names, you can't just use the "places" CSV file; you've got to use the "names" CSV file too!
As Bob points out, each Pleiades "place" can have zero-to-many associated "names". Readers familiar with database design will realize, therefore, that a relational database model is called for if one is to make efficient and effective use of the data. Indeed, Bob exhorts us:
PLEIADES has to create a separate table. The table I have in mind has two fields, the first is the name field and the second is the PLEIADES ID.
In fact, the Pleiades "names" CSV dump file already fulfills this purpose, so no new creation is required. As our CSV file documentation says, the "pid" column is the "Unique identifier for the place container within the site", i.e., the final integer component of the corresponding Pleiades URI (for example: the 216706 in http://pleiades.stoa.org/places/216706). The table also contains the original-script form of the name string (if we have it), the corresponding language and script code for that string (in accordance with the Pleiades Language and Script Vocabulary, which uses the standard codes from the IANA Language Subtag Registry), and (crucially) a transliteration of the name string into Latin characters. Note that all Pleiades dump files use UTF-8 encoding, so when editing them in Excel or other programs you'll want to make sure you're handling encoding properly. If the file gets treated as ASCII or some other encoding automatically, characters other than a-z, A-Z, 0-9, basic punctuation and the like will get borked.
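To make that concrete, here is a minimal sketch in Python of the two-field lookup Bob asks for, built straight from a "names" CSV. Only the "pid" column name comes from the documentation quoted above; the name column headers (nameTransliterated, nameAttested), the file name mentioned in the comments, and the sample rows are assumptions made for illustration, so check them against the header row of the file you actually download.

```python
import csv
import io
from collections import defaultdict


def build_name_index(csv_file,
                     name_columns=("nameTransliterated", "nameAttested"),
                     pid_column="pid"):
    """Map each case-folded name string to the set of place ids that carry it.

    The name column headers are assumptions; only "pid" is taken from the
    Pleiades CSV documentation quoted in this post.
    """
    index = defaultdict(set)
    for row in csv.DictReader(csv_file):
        pid = (row.get(pid_column) or "").strip()
        if not pid:
            continue
        for column in name_columns:
            name = (row.get(column) or "").strip()
            if name:
                index[name.casefold()].add(pid)
    return index


# Invented sample rows, not real Pleiades data. With a real download you
# would open the file explicitly as UTF-8 instead, for example:
#     open("names.csv", encoding="utf-8", newline="")   # hypothetical file name
sample = io.StringIO(
    "pid,nameAttested,nameTransliterated\n"
    "100,Ἄλφα,Alpha\n"
    "100,,Alphaia\n"
    "200,,Beta\n"
    "300,,Alpha\n"
)

name_index = build_name_index(sample)
alpha_pids = sorted(name_index["alpha"])  # every place id that has the name "Alpha"
```

Because one name can belong to several places, each entry holds a set of ids rather than a single id. Passing the encoding explicitly when opening the real file is what keeps the original-script strings intact.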
Aside: there's also a "locations" CSV file, which contains more detailed geometry for some Pleiades places than is to be found in the representative geometry included in the "places" CSV file. Just like the names, Pleiades "places" have a zero-to-many relationship with "locations".
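The zero-to-many relationship can be sketched as a simple join on the place identifier. This is only an illustration: the "id" and "title" headers for the places file and the "nameTransliterated" header for the names file are assumed, and the rows are invented. The same grouping would apply to the "locations" file.

```python
import csv
import io
from collections import defaultdict


def group_by_place(rows, pid_column="pid"):
    """Collect child rows (names or locations) under their parent place id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[pid_column]].append(row)
    return grouped


# Invented sample data; column headers other than "pid" are assumptions.
places_csv = io.StringIO(
    "id,title\n"
    "100,Alpha/Alphaia\n"
    "200,Beta Ins.\n"
    "300,Unnamed feature\n"
)
names_csv = io.StringIO(
    "pid,nameTransliterated\n"
    "100,Alpha\n"
    "100,Alphaia\n"
    "200,Beta\n"
)

names_by_place = group_by_place(csv.DictReader(names_csv))

# A place with no rows in the names file simply gets an empty list.
joined = {
    place["id"]: {
        "title": place["title"],
        "names": [n["nameTransliterated"]
                  for n in names_by_place.get(place["id"], [])],
    }
    for place in csv.DictReader(places_csv)
}
```

Note that the names come from their own rows, so there is no need to split titles on the slash or to strip abbreviations such as "Ins." by hand.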
Bob also complains about the inclusion in place "titles" of strings like "Ins." and about the concatenation therein of multiple (but not always all) associated placenames using a slash (/) character. The complaint is understandable, given his use case (building clean lookup functionality for names), but he could have avoided the trouble by using the names table as described above instead of trying to roll his own. There's no concatenation of names therein (each name string gets its own row in the table), and many of the most common feature type abbreviations have been stripped out. Moreover, the name strings in the names table do not contain editorial indicia that are found in some of our titles (e.g., a prefixed asterisk, inverted commas), about which Bob is silent. All of this stuff in Pleiades place titles is a legacy of the labeling practices of the Barrington Atlas (which were driven principally by cartographic concerns, of course). The problems they present for digital discovery and data reuse, so well illustrated in Bob's post, were evident to us from the beginning of the project. Consequently, we devoted a significant amount of thought and data-munging work (both programmatic and by hand) to producing the cleaned names data that is now presented in Pleiades name resources and the corresponding "names" CSV dump file. It's one of the reasons it took so long for us to bring the full dataset online (a complaint we used to hear a lot from prospective users).
I implied above that the Barrington didn't always include all associated names in its labels, and Pleiades data derived from the Atlas inherited this behavior in its place titles (though these are editable by registered users). This too got in Bob's way when using just the "places" CSV file. What's the background? It was Atlas policy not to include the modern name of a place in the map label if the ancient name was known. Moreover, Atlas compilers and editors sometimes suppressed some "minor" name variants when a place had many names in antiquity in order to keep the length of labels sensible for map presentation. Sometimes numbers were used on maps where sites were too closely clustered to permit labeling with names. All of this unlabeled toponymy was published in the two-volume Map-by-Map Directory that accompanied the Atlas. All of that directory data was incorporated into Pleiades and reunited with the cartographic and label data from the maps themselves. Thus, on the Pleiades website, users see every name associated with each place, under the prominent "names" heading to the right of the map (see, for example, the Pleiades page for Rome: http://pleiades.stoa.org/places/423025/). Every one of those names is searchable via our LiveSearch and Advanced Search website functions. Every one of those names is also in the "names" CSV file.
Is Pleiades name-complete? No, especially with regard to modern names and ancient names that have been brought into modern names. You can help.
If you're trying to use the Pleiades dump files (CSV or otherwise), we'd love to hear about it. We're happy to answer questions posted as comments on this blog post, emailed to pleiades.admin@nyu.edu, posted to the Pleiades Community List, or asked live in the Pleiades IRC channel.