The Programmer Experience: Redundancy Edition

Blog by Sumana Harihareswara, Changeset founder

03 Dec 2017, 12:23 p.m.

Programming

Hi, reader. I wrote this in 2017 and it's now more than five years old. So it may be very out of date; the world, and I, have changed a lot since I wrote it! I'm keeping this up for historical archive purposes, but the me of today may 100% disagree with what I said then. I rarely edit posts after publishing them, but if I do, I usually leave a note in italics to mark the edit and the reason. If this post is particularly offensive or breaches someone's privacy, please contact me.

Sunday morning, one thought led to another, and I realized I'd like to read a US nonprofit's Form 990 (annual report on assets, revenue, etc.). I started looking for 990s, and stumbled upon a big set of machine-readable 990s from 2011 through sometime in 2016. Is the one I seek in there? How recently was the dataset updated?

So I broke out bpython, the json module, and wget and started figuring it out.

Some things I learned/realized/remembered along the way:

json.load() does pretty much exactly what I hoped it would even though I had forgotten the exact wording -- yay.
Gar -- why can't datetime Just Do The Right Thing regarding turning very datestamp-looking strings into datetime objects? Aha, Arrow is my answer (Arrow : datetime :: requests : urllib2).
Hey, if I've started up a virtualenv (being careful to create one whose Python is my installation of Python 3), how come I can't invoke bpython? Ohhhh, I have to pip install bpython inside the venv (or -- just to be safe -- I used pip3 install bpython which worked).
I am fine with iteratively searching through a huge dataset in an incredibly unoptimized and cut-and-paste way at first, and if I get irritated by waiting then I can fix things up with refactoring and optimization and so on.
According to the index of available filings, the most recent update reflected in those filings is from January 11, 2017.
I'm one step closer to reflexively doing this kind of stuff in a Jupyter notebook and publishing it but I'm not there yet, despite encouragement from Julia Evans and how much easier the infrastructure's gotten. Next time, I think.
I remembered that Bradley Kuhn would like for far more people to pay attention to 990 filings, and has co-maintained a repository of FLOSS foundations' filings for some time. I started searching around in case there was a more recent version of the repo* -- and found, via an actually relevant textual ad served by a search engine, ProPublica's Nonprofit Explorer.

Use this database to view summaries of 3 million tax returns from tax-exempt organizations and see financial details such as their executive compensation and revenue and expenses. You can browse IRS data released since 2013 and access over 9.6 million tax filing documents going back as far as 2001.

For example, attractively presented data, and full 990s, for the Organization for Transformative Works.

So, uh, turns out I didn't need to actually get all "make a new GitLab repo! how does Arrow work? gotta refactor this!" and so on, but ah well. Now I will have a fresh new project to use for testing as I delve into Python packaging and test Warehouse, the new Python Package Index.

* Update: Mike Linksvayer, on Identi.ca, edifies: "The archived gitorious floss-foundations filings repo is now actively maintained by Martin Michlmayr at https://gitlab.com/floss-foundations/npo-public-filings".

Video of Our PyGotham Play

A Theme in Newitz, Sugar, Wells, & Leckie