About these charts

Introduction

There should probably be a long bit here about my background and motivations for building this. Unfortunately, I've left writing this introduction until the very end, and I'm now at the point where I'm sick of looking at all this, and just want to put it out there and be done with it :-(

Suffice to say, I don't have any particular agenda regarding gender in SF&F awards, and I certainly wouldn't profess any expertise in this area. I am however of a fairly pedantic mind, and get a bit annoyed when I see people shouting about this sort of thing, often without presenting anything close to facts to back up their arguments.

Whether the data assembled here might help things, I don't know. There are certainly several areas where things are imperfect to say the least. The use of open code and data sources at least gives the hope that future iterations of this can improve on the accuracy and coverage of the data presented.

One note about terminology in this page: the text will often refer to "authors", as this is the most common activity for the award finalists covered here, and is also the name of the database table(s) in ISFDB that store(s) the biographical data. However, dependent on the award category context, it could just as easily refer to editors, fan writers, artists, etc.

Prior work in this area

The following are articles that I've either read recently, or discovered via some cursory Googling. I'm sure there has been much prior research done that I don't know about, please let me know of any relevant links, and I'll happily add them. (That said, right now, I'm not really looking to add links to that aren't about both awards and gender.)

Author/publisher	Date	Awards/categories covered	Level covered	Notes
James Davis Nicoll/ tor.com	2019-09-10	Hugo Best Novel / Novella / Novelette / Short Story	Finalists	Discussion thread at /r/Fantasy
The Fantasy Inn	2019-09-10	Hugo Best Novel	Finalists
Camestros Felapton	2019-08-10, 2019-09-05	Dragon Award, various categories	Finalists
The Clarke Award	2019-05-09	Clarke Award	Submitted works	See also earlier articles about the shortlists in 2013 and 2014.
Ahasuerus/rec.arts.sf.written	2014	"Top 100 novelists and short fiction writers"		Linked in a 2019 rasfw thread discussing the aforementioned 2019 James Davis Nicoll/tor.com piece.
File 770	2008-05-12	Hugo Award, Best Novel	Finalists

I believe all of the above, with the exception of the Ahasuerus 2014 post, are the product of someone manually gathering and analysing the data. That approach should produce more accurate results than the automated methods described here, but with the drawback of requiring much more human effort, and thus not scaling up well - all of the above restrict themselves to a limited number of award categories and/or period of review.

One benefit of the automated approach used here is that it can be extended relatively easily to datasets other than awards, assuming that dataset exists within ISFDB and can be extracted. For example, I have some rough code that produces similar charts for all the books issued by a publisher, potentially running into hundreds or thousands of titles over periods of decades. This stuff isn't closed to being publishable though, for a number of reasons, not least that the higher-profile of award finalists/nominees compared to "regular" authors means that the latter are less likely to have Wikipedia pages.

Data sources

All the data used to produce these charts is derived from public data sources, and with the exception of Twitter bios, can be improved by contributions from anyone with an internet connection and an account on the hosting site or service.

ISFDB

All the core data about which authors/books won which awards comes from the weekly database dump that ISFDB makes available for anyone to download.

The author entries are also used to obtain any links to biographical pages on Wikipedia or Twitter accounts, as detailed below, and to obtain any variants or aliases the author uses/used.

Wikipedia

Wikipedia is the preferred reference for author gender, specifically the categories that appear at the bottom of the page, such as "American male novelists" or "Women speculative fiction editors".

Twitter bios

If no gender information could be extracted from a Wikipedia page, then the secondary source tried are the biographical details in the bio on the author's Twitter page, if they have one. This is a pretty basic search for any relevant phrases such as - but not limited to - "he/him", "she/her" or "non-binary".

human-names repository

If neither Wikipedia nor Twitter provided gender information, then a check of the author's given name against public lists of male and female names is done. (Any names which are in both male and female lists will not be matched against.)

Methodology

Getting lists of award finalists/nominees/shortlists

Getting the award finalists/nominees/shortlist is a reasonably straightforward SQL query against the database.

Ideally each work that is a finalist will have its own entry in ISFDB (in technical terms, having a record in the titles table.) If so, then then records of author(s) of that work are queried, specifically any hyperlinks to other sites.

If there is no link to a work - as might be the case for a fair proportion of Best Related Work finalists, for example - then the names listed as the creators of the awarded work will be used, and a search will be made for any authors matching those names. If matches are found, the relevant hyperlinks are obtained.

Determining author gender

The algorithm - which in this case is a fancy euphemism for a bunch of hacked-up if-statements and regular expressions - presented here is broadly how things work as of the initial release in September 2019. As described in this page, there are numerous known issues with how things currently work, and who knows how many other issues yet to be uncovered. Any feedback or suggestions - or, dare I hope, pull requests are welcome.

All determination of gender is done in an automated manner based on the data sources available - there is (currently) no facility for manually-entered overrides to fix incorrect values. This means that the results output by this code are subject to the problems described in the articles in this list of falsehoods programmers believe about human identity, albeit with some areas - gender detection based on given name - being more prone to such errors than others.

If an author has a Wikipedia link (or links) recorded in ISFDB, then those Wiki pages will be downloaded, and parsed for categories containing "men", "male", "women", "female", "non-binary" or "genderqueer".

If there was no Wikipedia page, or the Wikipedia page(s) lacked any gendered categories, then instead the author's Twitter homepage will be downloaded, and the brief biographical details extracted. If these contain non-contradictory pronoun declarations - e.g. "he/him" or "she/her" - or "non-binary", then that is used to determine gender. Gendered words such as "husband", "wife", "father" or "mother" are not used, as these may lead to incorrect matches without a better understanding of context e.g. "I love my wife" does not reliably indicate the gender of "I". Pronoun declarations such as "he/they" or "she/her or they/them" are also not considered unambiguous enough to be used here.

If nothing usable can be obtained via these Wikipedia and Twitter methods, then the author's given name will be compared against gendered lists of names in the human-names lists. This is much less reliable than the former methods, as described further down this page.

Lacking any gender information from any of the above three sources, the author will be marked as "unknown". For the "knowns", all the charts and tables will indicate the source of gender information, mainly as an indicator of how reliable it is:

Wikipedia - should be reliable, but may not be available for authors not considered sufficiently "notable".
Twitter bio - should be very reliable, but probably only available for a small percentage of authors.
human-names - relatively unreliable, and poor-to-non-existent coverage for non-Anglophone given names.

An aside: I have grouped non-binary and genderqueer together. To the best of my limited understanding, this is hopefully a reasonable thing to do, even though "not all genderqueer people identify as non-binary" (thanks for that link Lynn!). I mention this because

There are some authors with genderqueer categories on their Wikipedia page, but not any non-binary categories.
There is (at least) one author (with several award nominations to their name) who has both genderqueer and binary categories on their Wikipedia page. This seems contradictory to my limited understanding, but I certainly don't have sufficient knowledge of this area, or the person in question, to progress further.

Feedback from anyone with better understanding of this area would be very welcome - perhaps as a reply in this Twitter thread?

Producing the charts and spreadsheets

For each year of finalists, the genders of the winning works are added up. If a work is (known to be) a collaboration of multiple authors, then the genders of each author will be individually added e.g. a book writen by James S. A. Corey add 2 to the male totals, although - as things stand as of September 2019 - this will segment down to 1 in the male:wikipedia bucket, and 1 in the male:human-names bucket, because Ty Franck doesn't currently have a Wikipedia page of his own.

The data is then formatted in a tabular/matrix form. A Google Sheet is created using the API, sheets for each award/category created, the relevant data inserted into each sheet, and a stepped area chart based on the data is added.

The spreadsheet is then shared publicly, and a script run to download all the chart images, and build some very basic webpages to render them with any accompanying notes. All the pages and chart images are then uploaded to a public webserver.

Implementation

The querying of the database is doing using code written in Python 3, with notable libraries used including SQLAlchemy for database access, and Requests for downloading Wikipedia and Twitter webpages.

The code to generate the basic data is publicly available in a GitHub repository although at time of writing, that public repo does not contain the latest revisions, because ~~of a bunch of last-minute late-night hacks that I'm embarrassed to let anyone else see in their current state~~ ongoing fine tuning of my finely honed algorithms.

Errors and omissions

Potential causes for misgendering

Issue	Impact
Wikipedia related
The author does not have a wiki page	Code will fall back to Twitter bio or human-names
The author has a wiki page, but it is not recorded in ISFDB	Code will fall back to Twitter bio or human-names
The author has a known wiki page, but it lacks any gendered categories	Code will fall back to Twitter bio or human-names
The gendered categorization in Wikipedia is wrong, contradictory or out-of-date	An incorrect gender will be used when calculating totals, even if the correct gender is available via Twitter bio and/or given name.
Twitter bio related
The author does not have a Twitter page	Code will fall back to human-names
The author has a Twitter page, but it is not recorded in ISFDB	Code will fall back to human-names
The author has a known Twitter page, but the bio lacks any recognized pronouns or non-binary status.	Code will fall back to human-names.
The author has a known Twitter page, but the account is locked/private	Code will fall back to human-names. (This is an assumption, as I haven't come across any locked author Twitter accounts, and arguably they probably shouldn't be in ISFDB if they aren't going to be accessible to most users.
The Twitter bio has contradictory pronouns	As mentioned earlier, the likes of "he/they" or "she/her or they/them" are currently not used for gender determination, as it is unclear which should "win". Any thoughts on how these could be used are welcome, although my guess is that the gender of someone with these pronouns in their bio will vary on a case-by-case basis?
human-names related
Author's given name - for any of the variant or pseudonymous names recorded for the author - is not in the (English language) lists	Author will be marked as unknown gender
Given name - for all of the variant or pseudonymous names recorded for the author - is in both male and female (English language) lists	Author will be marked as unknown gender
Given name - for all of the variant or pseudonymous names recorded for the author - is only in a single gender list, but is actually applicable to both male and female genders	The wrong gender may be used. (Example: "Pat" is only listed in the male list, but off the top of my head, I think of more women authors called "Pat" than men.)
Author uses a pseudonym with a given name associated with another gender	Depending on the order the variant names are processed, the incorrect gender will be used.

Omitted awards and categories

Graphic/comic and media/dramatic categories, most notably in the Hugos, are currently omitted. The former probably will get added at some point, as a cursory check of ISFDB shows that most nominated people have proper ISFDB records (example). Another reason for avoiding them until now is that I don't have the knowledge - or to be brutally honest, personal interest - in that area to be able to do any appropriate category updates in Wikipedia with confidence that I'm not adding incorrect information.

On the other hand, media/dramatic categories are relatively poorly covered in ISFDB in terms of proper author records (example), which is perfectly reasonable, as films/TV/etc are well outside of ISFDB's core remit, at least as I understand it. However, this means that attempts to analyse the gender of these finalists will not be able to use Wikipedia pages or Twitter bios, and thus will be at risk of much less accurate stats.

There are SF&F awards which aren't recorded in ISFDB - e.g. The Kitschies - so they consequently cannot be included here.

Currently these charts don't cover regional awards and categories - by which I mean, those where eligibility is restricted by the nationality or regional affiliation of the author, or for works first published in a language other than English. This may well change in the future, but I suspect that results could be poor as finalists are less likely to be featured in ISFDB, (English) Wikipedia, and given names may not be covered by the current name lists.

"International" categories within regional awards will be covered at some point, but not right now, in part because author names may be modified from their original form, as in the Seiun Best Translated Long Story category.

Potential improvements

Algorithm improvements

Wikipedia-related improvements

If no gendered category, we could parse the main body of text, and count the occurrences of pronouns, and if the total count for male, female or or non-binary pronouns clearly exceeds the others, declare that the "winner"?

I recently noticed that the Infobox can include gender and pronoun information. However, I've only noticed a single instance of this as-yet, and that currently mis-spells "pronoun", which makes me slightly dubious of relying on this at the present time.

There are a number of award finalists who don't have a "proper" entry in ISFDB, but do have Wikipedia pages. This typically affects non-authors who were finalists in mixed-media categories, e.g. Janelle Monáe. In such cases, it might be possible to instead do a Wikipedia search to find the relevant page. Care would have to be taken to avoid picking up any false positive matches returned in the search results.

Twitter-related improvements

Neopronouns in Twitter bios are not currently detected, nor are pronoun.is links. Both of these should in theory be reasonably straightforward to pick up, although at time of writing the only Twitter accounts for award finalists that I've seen using them already had Wikipedia pages with appropriate gendered category.

Given that Twitter bios are "straight from the horse's mouth", whereas Wikipedia if anything tries to avoid that. As such, it might make sense to consider Twitter bios ahead of Wikipedia categories when determining gender?

Given name-related improvements

We could additionally use the non-English/Anglophone name lists. I have avoided doing this so far due to the increased likelihood of names that occurs on both male and females lists - e.g. "Jean" is a female Anglophone name, but a male French or Dutch name. The downside with the current "monolingual" approach is that it will potentially misgender authors from a non-Anglophone background with a name like "Jean".

A possible mitigation for this might be to use the birthplace information that exists for many ISFDB entries to prioritize the language lists to check against. This is still flawed though - if for example a "Quebec, Canada" or "Louisiana, USA" birthplace is encountered, knowing that it might be better to use the French lists ahead of the English lists is getting massively overcomplicated for the scope of this project.

Data improvements

As part of the development of this project, I must have made several hundred edits to ISFDB and Wikipedia, generally of the form:

Add Wikipedia or Twitter links to the author entry on ISFDB, where they haven't already been added.
Add gendered-categories to the Wikipedia entry if none already exists. NB: with the exception of "(Male|weomen) speculative fiction editors", I didn't create any new categories, just used any existing relevant ones. This means that critics, fan writers etc. are more likely to fall back to using given name recognition, or being marked as unknown, as they don't (currently) have gendered categories on Wikipedia.

There are many areas that I don't have the expertise or interest to look into and fix missing or wrong information e.g. artists, comics writers, horror writers, etc. If anyone wants to volunteer to pick up on those areas, I can provide a list of affected authors/artists/etc.

I think there are alternative sources for given name/gender mapping than the human-names repo - which seems to be somewhat abandoned, perhaps they are more complete or accurate?

Something which would be technically feasible would be to have a local manually researched and maintained list of authors' gender, to cover over errors or omissions in the data sources. This has not - and almost certainly will not - be done, for the following reasons:

Me not wanting the hassle of manually maintaining data - although this could potentially be mitigated by having user contributions e.g. GitHub pull requests.
It's surely better to have this information in an existing public resource, where it can easily be used by other projects, moderated for errors, etc. The downside is that many (most?) authors are likely to fall afoul of Wikipedia notability rules.
I don't want to put myself at risk of having to deal with data protection and GDPR laws and penalties. (Of course, a month from now, I might have much more important things to worry about than GDPR infractions, such as where my next meal is coming from.)

Other improvements

It's a bit crap that the colour key on the charts doesn't match the colours used for the areas on the chart. My excuse is that the Google Sheets API only allows you to set the colour of the lines, and automatically applies an area opacity of 30%. This can be manually changed in the Google Sheets web application, but seemingly not via the API - and I'm afraid there's no way I'm manually changing 9 colours on 64 charts!

(In fact, I'd like to get rid of the lines on the chart, and just have the filled areas, but is another thing that doesn't seem to be possible via the Google Sheets API, at least for the stepped area chart type.)

Pre-emptively answered questions

Can I get the data and/or code?

Broadly, yes:

The ISFDB downloads are here.
The Wikipedia and Twitter pages are downloaded dynamically as needed - it is the regular HTML that is downloaded and parsed, rather than any marked up text or API response. I believe that Wikipedia make dump archives available on a regular basis, but I don't know the details. (Using such a dump might be a better way of doing things than downloading ad hoc pages, but that's something to consider for the future.
The human-names files are available as JSON files - with a small bit of supporting node.js code which I've never personally used - here.
My code to get data out of ISFDB and produce the basic tables is on GitHub. At time of writing, the code to upload the data to Google Sheets and create these webpages is not available, but there's nothing particularly interesting about it - all the Google Sheets related stuff is a mix of sample code and Stack Overflow answers that has been hacked about until it does roughly what I need.
You can access the Google Sheet that contains the charts and the tables used to construct them by clicking on any of the chart images, which should take you to the relevant sheet. I've deliberately buried this information at the bottom of this document, as - I'm told - there is a limit (of 50?) for how many people can have the sheet open simultaneously. Plus, I'd prefer people to view my site rather than the Google doc :-) Something worth noting is as-and-when the data and charts are regenerated, the output will be to a brand-new spreadsheet with a different URL. These webpages will always use and link to the latest version of the spreadsheet, and out-of-date spreadsheets will get removed sooner or later, breaking those URLs.