RankAndFiled.com is like the SEC’s EDGAR database, but for humans

A new website, Rank and Filed, gathers data from the Securities and Exchange Commission’s EDGAR database, indexes it, and publishes it online in open formats that investors can use to research and discover companies. I’ve included a screenshot of Tesla’s SEC filings below.

[Image: Tesla’s SEC filings on Rank and Filed]

The site currently has over 25 million files indexed.

I heard about the new website directly from its creator, Maris Jensen, a former SEC analyst who built the site independently. According to Maris, she proposed the project internally in March 2013 but was immediately turned down.

A month later, after she was terminated for threatening the Commission’s mission with a “lack of respect for senior management” (an issue she maintains was unrelated to the proposal), Maris decided to make the idea real on her own and started building. She has since offered to give the site and its code to the SEC but has not yet heard back.

Our interview, lightly edited for content and clarity, follows.


Where did the idea for this originate?

The breaking point was realizing that the guy in the cubicle across from me had spent a week writing the same parser as me — a Python program to parse the EDGAR FTP index for specific filings. This is nearly two decades after Carl Malamud set everything up; the FTP index is exactly as he left it. We were in the division responsible for the SEC’s data analytics and interactive data initiatives. The division literally rewrites this program each time they need SEC filings data. There’s no version control. There’s just no excuse!  Hilariously, that guy also left the SEC and built an SEC filings website, though his is for-profit: http://legalai.com/

What does this do that the SEC needed?

In 2008, the SEC set up a task force (the ‘21st Century Disclosure Initiative’) to rethink the way they were making data available to the public. A year later, they published this report, with their conclusion and proposal for a new, modernized disclosure system. I basically just tried to build the system they described. I also did lots of googling (‘SEC EDGAR tool terrible’, ‘how to find SEC data’, etc.) and then tried to address the problems people were having.

The problems have been the same for decades. In 1994, people wanted an SEC CIK-to-ticker mapping. Twenty years later, this question still pops up on forums monthly.

There are over 600 different forms on EDGAR but the SEC’s form lists are basically no help at all. I went through and googled each form individually. I tried to group them into understandable categories.

The comment at the bottom of this post describes the SEC’s current problem better than I ever could:

Has anyone out there ever tried to use SEC.GOV to search for information about a company? The problem is very easy to articulate. If you search for something, you get 5000 results. At about 10 results per page, you have 500 pages to sift through to find what you want. Once you find what you want, there is ZERO ability to navigate from what you found into related documents!

What if you want to research a particular company’s board of directors? What other companies is each director associated with? Have there been any problems in any of those companies? You can’t investigate these types of things using the technology sec.gov has fielded. You want a needle. The SEC gives you a haystack.

Why not allow for better discovery of all of the SEC data and let investors perform their own investigations of markets & companies?

So instead of focusing on this obvious improvement to the public service the SEC provides, the emphasis apparently is on improving investigative actions. Great. Why not just shut off the sec.gov website completely and let the SEC do all of the investigating and researching of SEC data?

How does RankAndFiled.com compare to other sources of SEC data online?

I unfortunately haven’t added that much ‘value’ yet. I’m a total amateur. I’m just trying to make the data available and understandable! The website doesn’t do any analysis: it just collects, links and presents data from different SEC filings.

Looks like you got some great help from the folks you thanked. Did you build this all yourself with these tools?

Yes, open source tools these days are amazing!!  I started this project with no web or software development experience at all.

I actually feel really lucky to have fallen into all of this. Everything I know I learned on google, mostly through tutorials written by the developers listed there.

I also didn’t know anyone in the dataviz or open source community, so I reached out to some of them with stuff like etiquette questions. Their response and support were just incredible, especially the D3 community; they’re just wonderful.

Can you tell me more about where the data on this site comes from and what you’ve done to it?

Basically, the system watches the SEC’s RSS feeds. It reads and indexes data from SEC filings as they come in. Not all the filings show up on the feeds — I’m not sure why — so it also scans the FTP index for any missed filings.
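To illustrate the reconciliation step Jensen describes, here is a minimal sketch in Python. It is not her code. It assumes a pipe-delimited index with five columns (CIK, company name, form type, filing date, file path), modeled on EDGAR's master index files, and it stands in for the RSS side with a plain set of accession numbers. The sample rows, company names and function names are all made up for the example.

```python
from typing import Iterable, List, NamedTuple, Set


class IndexEntry(NamedTuple):
    cik: str
    company: str
    form_type: str
    date_filed: str
    filename: str


def parse_index(lines: Iterable[str]) -> List[IndexEntry]:
    """Parse pipe-delimited index rows, skipping header and separator lines."""
    entries = []
    for line in lines:
        parts = [p.strip() for p in line.rstrip("\n").split("|")]
        # Assumed layout: data rows have five fields and a numeric CIK first.
        if len(parts) == 5 and parts[0].isdigit():
            entries.append(IndexEntry(*parts))
    return entries


def accession_number(filename: str) -> str:
    """Take the accession number from a path ending in <accession>.txt."""
    name = filename.rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".txt") else name


def find_missed(index_lines: Iterable[str], seen: Set[str]) -> List[IndexEntry]:
    """Return index entries whose accession number never showed up on the feed."""
    return [
        entry
        for entry in parse_index(index_lines)
        if accession_number(entry.filename) not in seen
    ]


# Hypothetical sample data, not real filings.
SAMPLE_INDEX = [
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "--------------------------------------------------",
    "1234567|Example Corp|10-K|2014-02-18|edgar/data/1234567/0001234567-14-000001.txt",
    "1234567|Example Corp|8-K|2014-02-19|edgar/data/1234567/0001234567-14-000002.txt",
    "7654321|Sample Holdings LLC|4|2014-02-19|edgar/data/7654321/0007654321-14-000010.txt",
]
seen_on_feed = {"0001234567-14-000001", "0007654321-14-000010"}

missed = find_missed(SAMPLE_INDEX, seen_on_feed)
print([entry.form_type for entry in missed])
```

Keying on the accession number rather than the company or date means the comparison does not depend on how either source spells a registrant's name.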

About 25 million SEC documents have been parsed and incorporated so far, which is everything that’s publicly available on EDGAR.  So companies and people are tracked and connected over time — who’s raising money where, who owns whom, who moved companies or got promoted, who sold a ton of shares.  I also realign all the financial data from quarterly and annual reports so you can see a company’s financial history and so the data is comparable between companies.

It actually feels silly even talking about it, because it’s just so basic. This is stuff the SEC should have been doing years and years and years ago.

But it’s not a perfect science because, one, only a few SEC forms are machine-readable and, two, the SEC doesn’t even try to standardize names. SEC registrants are given distinct identifiers but anything goes when companies or names are listed inside a filing. Middle names, middle initials, nicknames, suffixes, titles…
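The name problem can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the method the site uses: the title and suffix lists are hypothetical and incomplete, and the rule of keeping only the first and last token is a deliberately crude heuristic. It does nothing for nicknames, which would need a lookup table, and it will wrongly merge different people who share a first and last name.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical, incomplete lists chosen for the example.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "md", "phd", "esq"}


def normalize_name(raw: str) -> str:
    """Reduce a personal name to a rough 'first last' matching key."""
    # Treat periods and commas as separators, then lowercase and tokenize.
    tokens = re.sub(r"[.,]", " ", raw).lower().split()
    # Drop titles and suffixes.
    tokens = [t for t in tokens if t not in TITLES and t not in SUFFIXES]
    # Drop middle names and initials by keeping only first and last tokens.
    if len(tokens) > 2:
        tokens = [tokens[0], tokens[-1]]
    return " ".join(tokens)


def group_variants(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw spellings under their normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


variants = ["Dr. Jane Q. Doe, Jr.", "JANE QUINN DOE", "Jane Doe", "John Smith III"]
for key, spellings in group_variants(variants).items():
    print(key, spellings)
```

A key like this is only a first pass for suggesting candidate matches. Where a filing carries the registrant's distinct identifier, that identifier is the safer thing to join on.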

What’s next?

I spent November and December trying to give all my code to the SEC. I received no response, not even a polite no. That’s still the goal — I want them to take over and open source it, or at the very least host the underlying API.  It’s their job to make this data available and accessible. They NEED a team over there doing hands-on work with SEC filings, a team struggling to make sense of this data with just the tools available to retail investors, especially now that they’re talking about disclosure reform.  Right now, they have almost no incentive to change things over to structured data — they buy all the structured EDGAR data they need.

The SEC keeps saying that it’s the private sector’s job to build tools like this, not theirs, but in the past 20 years nobody has come up with a really great, really affordable option.  It doesn’t make sense for any of us to even try — I’ve heard that Bloomberg and Thomson Reuters hire legions of Indian professionals to go through each SEC filing by hand.  We just can’t compete.

The SEC will have to make a lot more of their data machine-readable before any ‘disruptive’ innovation can happen, but they won’t do that until they’re forced to (by Congress), unless they have people there who realize how unfair the situation has become.

There are actually a heartbreaking number of SEC employees who also want this to happen, self-described worker bees who’ve reached out to me from personal email to say they’ve been trying to convince their bosses to give this thing a chance.  So far, no luck! I would open source it myself, but unfortunately I can’t afford to host the project indefinitely.

Opening IRS e-file data would add innovation and transparency to $1.6 trillion U.S. nonprofit sector

One of the most important open government data efforts in United States history came into being in 1993, when citizen archivist Carl Malamud used a small planning grant from the National Science Foundation to license data from the Securities and Exchange Commission, published the SEC data on the Internet and then operated the service for two years. At the end of the grant, the SEC decided to make the EDGAR data available itself, albeit not without some significant prodding, and has continued to do so ever since. You can read the history behind putting periodic reports of public corporations online at Malamud’s website, public.resource.org.

[Image: Meals on Wheels reports]

Two decades later, Malamud is working to make the law public, reform copyright, and free up government data again, buying, processing and publishing millions of public tax filings that nonprofits submit to the Internal Revenue Service. He has made the bulk data from these efforts available to anyone who wants to use it.

“This is exactly analogous to the SEC and the EDGAR database,” Malamud told me, in a phone interview last year. The trouble is that the data has been deliberately dumbed down, he said. “If you make the data available, you will get innovation.”

Making millions of Form 990 returns free online is not a minor public service. Although many nonprofits file their Form 990s electronically, the IRS does not publish the data. Rather, the government agency releases images of millions of returns, formatted as .TIFF files on multiple DVDs, to people and companies willing and able to pay thousands of dollars for them. Services like Guidestar, for instance, acquire the data, convert it to PDFs and use it to provide information about nonprofits. (Registered users view the returns on their website.)

As Sam Roudman reported at TechPresident, Luke Rosiak, a senior watchdog reporter for the Washington Examiner, took the files Malamud published and made them more useful. Specifically, he used credits for processing that Amazon donated to participants in the 2013 National Day of Civic Hacking to make the .TIFF files text-searchable. Rosiak then set up CitizenAudit.org, a new website that makes nonprofit transparency easy.

“This is useful information to track lobbying,” Malamud told me. “A state attorney general could just search for all nonprofits that received funds from a donor.”

Malamud estimates nearly 9% of jobs in the U.S. are in this sector. “This is an issue of capital allocation and market efficiency,” he said. “Who are the most efficient players? This is more than a CEO making too much money; it’s about ensuring that investments in nonprofits get a return.”

Malamud’s open data is acting as a platform for innovation, much as legislation.gov.uk is in the United Kingdom. The difference is that it’s the effort of a citizen that’s providing the open data, not the agency: Form 990 data is not on Data.gov.

Opening Form 990 data should be a no-brainer for an Obama administration that has taken historic steps to open government data. Liberating nonprofit sector data would provide useful transparency into a $1.6 trillion sector of the U.S. economy.

After many letters to the White House and discussions with the IRS, however, Malamud filed suit against the IRS to release Form 990 data online this summer.

“I think inertia is behind the delay,” he told me, in our interview. “These are not the expense accounts of government employees. This is something much more fundamental about a $1.6 trillion dollar marketplace. It’s not about who gave money to a politician.”

When asked for comment, a spokesperson for the White House Office of Management and Budget said that the IRS “has been engaging on this topic with interested stakeholders” and that “the Administration’s Fiscal Year 2014 revenue proposals would let the IRS receive all Form 990 information electronically, allowing us to make all such data available in machine readable format.”

Today, Malamud sent a letter of complaint to Howard Shelanski, administrator of the Office of Information and Regulatory Affairs in the White House Office of Management and Budget, asking for a review of the pricing policies of the IRS after a significant increase year-over-year. Specifically, Malamud wrote that the IRS is violating the requirements of President Obama’s executive order on open data:

The current method of distribution is a clear violation of the President’s instructions to move towards more open data formats, including the requirements of the May 9, 2013 Executive Order making “open and machine readable the new default for government information.”

I believe the current pricing policies do not make any sense for a government information dissemination service in this century, hence my request for your review. There are also significant additional issues that the IRS refuses to address, including substantial privacy problems with their database and a flat-out refusal to even consider release of the Form 990 E-File data, a format that would greatly increase the transparency and effectiveness of our non-profit marketplace and is required by law.

It’s not clear at all whether the continued pressure from Malamud, the obvious utility of CitizenAudit.org or the bipartisan budget deal that President Obama signed in December will push the IRS to freely release open government data about the nonprofit sector.

The furor last summer over the IRS investigating the status of conservative groups that claimed tax-exempt status, however, could carry over into political pressure to reform. If political groups were tax-exempt and nonprofit e-file data were published about them, it would be possible for auditors, journalists and Congressional investigators to detect patterns. The IRS would need to be careful about scrubbing the data of personal information: last year, the IRS mistakenly exposed thousands of Social Security numbers when it posted 527 forms online, an issue that Malamud, as it turns out, discovered in an audit.

“This data is up there with EDGAR, in terms of its potential,” said Malamud. “There are lots of databases. Few are as vital to government at large. This is not just about jobs. It’s like not releasing patent data.”

If the IRS were to modernize its audit system, inspectors general could use automated predictive data analysis to find aberrations to flag for a human to examine, enabling government watchdogs and investigative journalists to potentially detect similar issues much earlier.

That level of data-driven transparency remains in the future. In the meantime, CitizenAudit.org is currently running on a server in Rosiak’s apartment.

Whether the IRS adopts it as the SEC did EDGAR remains to be seen.

[Image Credit: Meals on Wheels]