PCAST report on big data and privacy emphasizes value of encryption, need for policy

[Image: April 4, 2014 meeting of PCAST at the National Academy of Sciences]

This week, the President’s Council of Advisors on Science and Technology (PCAST) met to discuss and vote to approve a new report on big data and privacy.

UPDATE: The White House published the findings of its review on big data today, including the PCAST review of technologies underpinning big data (PDF), discussed below.

As White House special advisor John Podesta noted in January, the PCAST has been conducting a study “to explore in-depth the technological dimensions of the intersection of big data and privacy.” Earlier this week, the Associated Press interviewed Podesta about the results of the review, reporting that the White House had learned of the potential for discrimination through the use of data aggregation and analysis. These are precisely the privacy concerns that stem from data collection that I wrote about earlier this spring. Here’s the PCAST’s list of “things happening today or very soon” that provide examples of technologies that can have benefits but pose privacy risks:

• Pioneered more than a decade ago, devices mounted on utility poles are able to sense the radio stations being listened to by passing drivers, with the results sold to advertisers. [26]
• In 2011, automatic license-plate readers were in use by three quarters of local police departments surveyed. Within 5 years, 25% of departments expect to have them installed on all patrol cars, alerting police when a vehicle associated with an outstanding warrant is in view. [27] Meanwhile, civilian uses of license-plate readers are emerging, leveraging cloud platforms and promising multiple ways of using the information collected. [28]
• Experts at the Massachusetts Institute of Technology and the Cambridge Police Department have used a machine-learning algorithm to identify which burglaries likely were committed by the same offender, thus aiding police investigators. [29]
• Differential pricing (offering different prices to different customers for essentially the same goods) has become familiar in domains such as airline tickets and college costs. Big data may increase the power and prevalence of this practice and may also decrease even further its transparency. [30]
• reSpace offers machine-learning algorithms to the gaming industry that may detect early signs of gambling addiction or other aberrant behavior among online players. [31]
• Retailers like CVS and AutoZone analyze their customers’ shopping patterns to improve the layout of their stores and stock the products their customers want in a particular location. [32] By tracking cell phones, RetailNext offers bricks-and-mortar retailers the chance to recognize returning customers, just as cookies allow them to be recognized by on-line merchants. [33] Similar WiFi tracking technology could detect how many people are in a closed room (and in some cases their identities).
• The retailer Target inferred that a teenage customer was pregnant and, by mailing her coupons intended to be useful, unintentionally disclosed this fact to her father. [34]
• The author of an anonymous book, magazine article, or web posting is frequently “outed” by informal crowd sourcing, fueled by the natural curiosity of many unrelated individuals. [35]
• Social media and public sources of records make it easy for anyone to infer the network of friends and associates of most people who are active on the web, and many who are not. [36]
• Marist College in Poughkeepsie, New York, uses predictive modeling to identify college students who are at risk of dropping out, allowing it to target additional support to those in need. [37]
• The Durkheim Project, funded by the U.S. Department of Defense, analyzes social-media behavior to detect early signs of suicidal thoughts among veterans. [38]
• LendUp, a California-based startup, sought to use nontraditional data sources such as social media to provide credit to underserved individuals. Because of the challenges in ensuring accuracy and fairness, however, they have been unable to proceed.

The PCAST meeting was open to the public through a teleconference line. I called in and took rough notes on the discussion of the forthcoming report as it progressed. Given the ongoing national debate over data collection, analysis, privacy and surveillance, my notes on the comments of professors Susan Graham and Bill Press offer enough insight into the forthcoming report that I thought there was public value in publishing them today. The following should not be considered verbatim or an official transcript; for that, look for the PCAST to make a recording and transcript available online in the future at its archive of past meetings. The emphases below are mine, as are the words in [brackets].


 

Susan Graham: Our charge was to look at the confluence of big data and privacy, to summarize current technology and the way technology is moving in the foreseeable future, including its influence on the way we think about privacy.

The first thing that’s very very obvious is that personal data in electronic form is pervasive. Traditional data that was in health and financial [paper] records is now electronic and online. Users provide info about themselves in exchange for various services. They use Web browsers and share their interests. They provide information via social media, Facebook, LinkedIn, Twitter. There is [also] data collected that is invisible, from public cameras, microphones, and sensors.

What is unusual about this environment and big data is the ability to do analysis on huge corpora of that data. We can learn things from the data that allow us to provide a lot of societal benefits. There is an enormous amount of patient data, data about disease, and data about genetics. By putting it together, we can learn about treatment. With enough data, we can look at rare diseases and learn what has been effective. We could not have done this otherwise.

We can analyze more online information about education and learning, not only MOOCs but lots of learning environments. [Analysis] can tell teachers how to present material effectively, to do comparisons about whether one presentation of information works better than another, or analyze how well assessments work with learning styles.
Certain visual information is comprehensible, while certain verbal information is hard to understand. Understanding different learning styles [can enable us to] develop customized teaching.

The reason this all works is the profound nature of the analysis. This is the idea of data fusion, where you take multiple sources of information and combine them, which provides a much richer picture of some phenomenon. If you look at patterns of human movement on public transport, or pollution measures, or weather, maybe we can predict dynamics caused by human context.

We can use statistics to do statistics-based pattern recognition on large amounts of data. One of the things that we understand about this statistics-based approach is that it might not be 100% accurate if mapped down to the individuals providing the data in these patterns. We have to be very careful not to make mistakes about individuals because we make [an inference] about a population.
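[To make Graham's caveat concrete, here is a minimal sketch of my own, not from the PCAST report: a pattern that is accurate at the population level can still mislabel thousands of the individuals within it.]

```python
import random

random.seed(42)

# Hypothetical population: 70% of its members actually have some trait.
population = [random.random() < 0.70 for _ in range(10_000)]

# A model that has only learned the population-level rate predicts
# the trait for every individual.
predictions = [True] * len(population)

correct = sum(pred == actual for pred, actual in zip(predictions, population))
print(f"Aggregate accuracy: {correct / len(population):.0%}")   # roughly 70%
print(f"Individuals mislabeled: {len(population) - correct}")   # roughly 3,000 people
```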

How do we think about privacy? We looked at it from the point of view of harms. There are a variety of ways in which results of big data can create harm, including inappropriate disclosures [of personal information], potential discrimination against groups, classes, or individuals, and embarrassment to individuals or groups.

We turned to what technology has to offer in helping to reduce harms. We looked at a number of technologies in use now and at a bunch coming down the pike. Some of those in use have become less effective because of the pervasiveness [of data] and the depth of analytics.

We traditionally have controlled [data] collection. We have seen some data collection from cameras and sensors that people don’t know about. If you don’t know, it’s hard to control.

Tech creates many concerns. We have looked at methods coming down the pike. Some are more robust and responsive. We have a number of draft recommendations that we are still working out.

Part of privacy is protecting the data using security methods. That needs to continue. It needs to be used routinely. Security is not the same as privacy, though security helps to protect privacy. There are a number of approaches that are now applied by hand that, with sufficient research, could be automated and used more reliably, so they scale.

There needs to be more research on, and education about, privacy. Professionals need to understand how to treat privacy concerns anytime they deal with personal data. We need to create a large group of professionals who understand privacy, and privacy concerns, in tech.

Technology alone cannot reduce privacy risks. There has to be policy as well. It was not our role to say what that policy should be. We need to lead by example by using good privacy-protecting practices in what the government does and, increasingly, in what the private sector does.

Bill Press: We tried throughout to think of scenarios and examples. There’s a whole chapter [in the report] devoted explicitly to that.

They range from things being done today with present technology, even though they are not all known to people, to our extrapolations to the outer limits of what might well happen in the next ten years. We tried to balance the examples by showing both sides: the benefits, which are great, and the challenges, which raise the possibility of new privacy issues.

In another aspect, in Chapter 3, we tried to survey technologies from both sides: both the technologies that are going to bring benefits and protect [people], and also those that will raise concerns.

In our technology survey, we were very much helped by the team at the National Science Foundation. They provided a very clear, detailed outline of where they thought that technology was going.

This was part of our outreach to a large number of experts and members of the public. That doesn’t mean that they agree with our conclusions.

Eric Lander: Can you take everybody through analysis of encryption? Are people using much more? What are the limits?

Graham: The idea behind classical encryption is that when data is stored, when it’s sitting around in a database, let’s say, encryption entangles the representation of the data so that it can’t be read without using a mathematical algorithm and a key to convert a seemingly meaningless set of bits into something reasonable.

The same technology, where you convert and change meaningless bits, is used when you send data from one place to another. So, if someone is scanning traffic on the internet, they can’t read it. Over the years, we’ve developed pretty robust ways of doing encryption.

The weak link is that to use data, you have to read it, and it becomes unencrypted. Security technologists worry about it being read during that short window.

Encryption technology is vulnerable. The key that unlocks the data is itself vulnerable to theft, or to tricking the wrong user into decrypting.

Both problems of encryption are active topics of research, [including] how to use data without being able to read it. There is research on increasing the robustness of encryption, so that if a key is disclosed, you haven’t lost everything and you can protect some of the data or future encryption of new data. This reduces risk a great deal and is important to use. Encryption alone doesn’t protect, however.
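[To ground Graham's explanation in something concrete, here is a minimal sketch of my own of encrypting a record "at rest," using the open source Python `cryptography` package's Fernet recipe. It is illustrative only, not anything prescribed by the PCAST report, and the record contents are made up.]

```python
# Minimal sketch of protecting a stored record with symmetric encryption,
# using the third-party `cryptography` package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # the key itself is the "weak link" Graham describes
cipher = Fernet(key)

record = b"patient_id=123; diagnosis=..."        # made-up record
token = cipher.encrypt(record)   # stored form: a seemingly meaningless set of bits

# To *use* the data, it has to be decrypted, which is exactly the short
# window of exposure that security technologists worry about.
assert cipher.decrypt(token) == record
```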

Unknown Speaker: People read of breaches derived from security [failures]. I see a different set of privacy issues from big data versus those in security. Can you distinguish them?

Bill Press: Privacy and security are different issues. Security is necessary to have good privacy in the technological sense: if communications are insecure, they clearly can’t be private. But privacy goes beyond that, to [cases] where parties are authorized, in a security sense, to see the information. Privacy is much closer to values; security is much closer to protocols.

The interesting thing is that this is less about purely technical elements; everyone can agree on the right protocol, eventually. These are things that go beyond that and have to do with values.

On data journalism, accountability and society in the Second Machine Age

On Monday, I delivered a short talk on data journalism, networked transparency, algorithmic transparency and the public interest at the Data & Society Research Institute’s workshop on the social, cultural & ethical dimensions of “big data.” The workshop was convened by the Data & Society Research Institute in collaboration with New York University’s Information Law Institute and the White House Office of Science and Technology Policy, as part of an ongoing review of big data and privacy ordered by President Barack Obama.

Video of the talk is below, along with the slides I used. You can view all of the videos from the workshop, along with the public plenary on Monday evening, on YouTube or at the workshop page.

Here’s the presentation, with embedded hyperlinks to the organizations, projects and examples discussed:

For more on the “Second Machine Age” referenced in the title, read the new book by Erik Brynjolfsson and Andrew McAfee.

Digging in open data dirt, Climate Corporation finds nearly $1 billion in acquisition

“Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product.” - Quentin Hardy, writing in 2011 in a blog post about “big data in the dirt.”


The Climate Corporation, acquired by Monsanto for $930 million on Wednesday, was founded using 30 years of government data from the National Weather Service, 60 years of crop yield data, and 14 terabytes of information on soil types for every two square miles of the United States from the Department of Agriculture, per David Friedberg, chief executive of the Climate Corporation.

Howsabout that for government “data as infrastructure” and platform for businesses?

As it happens, not everyone is thrilled to hear about that angle or the acquirer. At VentureBeat, Rebecca Grant takes the opportunity to knock “the world’s most evil corporation” for the effects of Monsanto’s genetically modified crops, and, writing for Salon, Andrew Leonard takes the position that the Climate Corporation’s use of government data constitutes a huge “taxpayer ripoff.”

Most observers, however, are more bullish. Hamish MacKenzie hails the acquisition as confirmation that “software is eating the world,” signifying an inflection point in data analysis transforming industries far away from Silicon Valley. Liz Gannes also highlighted the application of data-driven analysis to an offline industry. Ashlee Vance focused on the value of Climate Corporation’s approach to scoring risk for insurance in agribusiness. Stacey Higginbotham posits that the acquisition could be a boon to startups that specialize in creating data on soil and climate through sensors.

[Image Credit: Climate Corporation, via NPR]

Hedge fund use of government data for business intelligence shows where the value is

This week, I read and shared a notable Wall Street Journal article on the value of government data that bears highlighting: hedge funds are paying for market intelligence, using open government laws as an acquisition vehicle.

Here’s a key excerpt from the story: “Finance professionals have been pulling every lever they can these days to extract information from the government. Many have discovered that the biggest lever of all is the one available to everyone—the Freedom of Information Act—conceived by advocates of open government to shine light on how officials make decisions. FOIA is part of an array of techniques sophisticated investors are using to try to obtain potentially market-moving information about products, legislation, regulation and government economic statistics.”

What’s left unclear by the reporting here is this: if there’s 1) strong interest in the data and 2) deep-pocketed hedge funds or well-financed startups paying for it, why aren’t agencies releasing it proactively?

Notably, the relevant law provides for this, as the WSJ reported:

“The only way investors can get most reports is to send an open-records request to the FDA. Under a 1996 law, when the agency gets frequent requests for the same records—generally more than three—it has to make them public on its website. But there isn’t any specific deadline for doing so, says Christopher Kelly, an FDA spokesman. That means first requesters can get records days or even months before they are posted online.”

Tracking inbound FOIA requests from industry and responding to this market indicator as a means of valuing  “high value data” is a strategy that has been glaringly obvious for years. Unfortunately, it’s an area in which the Obama administration’s open data policies look to have failed over the last 4 years, at least as viewed through the prism of this article.

If data sets that are requested multiple times are not being proactively posted on Data.gov and tracked there, there’s a disconnect between the actual market for government data and officials’ perception of it. As the Obama administration and agencies prepare to roll out enterprise data inventories later this fall as part of the open data policies, here’s hoping agency CIOs are also taking steps to track who’s paying for data and which data sets are requested.
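To make the idea concrete, here's a hypothetical sketch of what such tracking could look like: count inbound FOIA requests per data set and flag anything requested more than three times, the threshold in the 1996 law cited above, as a candidate for proactive posting. The data set names are invented for illustration.

```python
from collections import Counter

# Invented request log; in practice this would come from an agency's
# FOIA tracking system.
foia_requests = [
    "drug-adverse-event-reports",
    "facility-inspection-records",
    "drug-adverse-event-reports",
    "drug-adverse-event-reports",
    "drug-adverse-event-reports",
]

counts = Counter(foia_requests)

# The 1996 law cited above kicks in at "generally more than three" requests.
candidates = [name for name, total in counts.items() if total > 3]
print("Post proactively:", candidates)   # ['drug-adverse-event-reports']
```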

If one of the express goals of the federal government is to find an economic return on investment from data releases, agencies should focus on open data with business value. It’s just common sense.

[Image Credit: Open Knowledge Foundation, Follow the Money]

Election 2012: A #SocialElection Driven By The Data

Social media was a bigger part of the election season of 2012 than ever before, from the enormous volume of Facebook updates and tweets to memes during the Presidential debates to public awareness in popular culture of what the campaigns were doing there. Facebook may even have boosted President Obama’s vote tally.

While it’s too early to say if any of the plethora of platforms played any sort of determinative role in 2012, strong interest in what social media meant in this election season led me to participate in two panels in the past two weeks: one during DC Week 2012 and another at the National Press Club, earlier today. Storifies of the online conversations during each one are embedded below.

http://storify.com/digiphile/social-media-and-election-2012-at-dc-week-2012.js[View the story “Social media and Election 2012 at DC Week 2012” on Storify]

http://storify.com/digiphile/election-2012-the-socialelection.js[View the story “Election 2012: The #SocialElection?” on Storify]

The big tech story of this campaign, however, was not social media. As Micah Sifry presciently observed last year, it wasn’t (just) about Facebook: “it’s the data, stupid.” And when it came to building the re-election campaign like an Internet company, the digital infrastructure that the Obama campaign’s team of engineers built helped deliver the 2012 election.

Do newspapers need to adopt data science to cover campaigns?

Last October, New York Times elections developer Derek Willis was worried about what we don’t know about elections:

While campaigns have a public presence that is mostly recorded and observed, the stuff that goes on behind the scenes is so much more sophisticated than it has been. In 2008 we were fascinated by the Obama campaign’s use of iPhones for data collection; now we’re entering an age where campaigns don’t just collect information by hand, but harvest it and learn from it. An “information arms race,” as GOP consultant Alex Gage puts it.

For most news organizations, the standard approach to campaign coverage is tantamount to bringing a knife to a gun fight. How many data scientists work for news organizations? We are falling behind, and we risk not being able to explain to our readers and users how their representatives get elected or defeated.

Writing for the New York Times today, Slate columnist Sasha Issenberg revisited that theme, arguing that campaign reporters are behind the curve in understanding, analyzing or being able to capably replicate what political campaigns are now doing with data. Whether you’re new to the reality of the role of big data in this campaign or fascinated by it, a recent online conference on the data-driven politics of 2012 will be of interest. I’ve embedded it below:

Issenberg’s post has stirred online debate amongst journalists, academics and at least one open government technologist. I’ve embedded a Storify of the discussion below.

http://storify.com/digiphile/should-newspapers-adopt-data-science-to-cover-poli.js[View the story “Should newspapers adopt data science to cover political campaigns?” on Storify]

U.S. cities form working group to share predictive data analytics skills

Yesterday, I published an interview with Michael Flowers, New York City’s director of analytics for the Office of Policy and Strategic Planning in Mayor Bloomberg’s office. In the interview, “Predictive data analytics is saving lives and taxpayer dollars in New York City,” Flowers talks about how his team of 5 is applying data analysis on behalf of citizens to improve the efficiency of processes and more effectively detect crimes, from financial fraud to cigarette bootlegging.

After our interview, Flowers followed up over email to tell me about a new working group on data analytics between New York City, Boston, Chicago and Philadelphia. The working group, which recently launched a website at www.g-analytics.org, is sharing methodologies, ideas and strategies.

“Ultimately we want the group to grow and support as many cities interested in pursuing this approach as possible,” wrote Flowers. “It can get pretty lonely when you pursue something asymmetrical or untraditional in the government space, so we felt it was important to make it as simple as possible for like-minded cities to get started. There’s a great guy I work closely with out in Chicago on this effort – [Chicago chief data officer] Brett Goldstein; we talk at least twice a week.”

White House announces $200 million in funding for big data research and development, hosts forum at AAAS

In 2012, making sense of big data, particularly unstructured data, through narrative and context is a strategic imperative for leaders around the world, whether they serve in Washington, run media companies or trading floors in New York City, or guide tech titans in Silicon Valley.

While big data carries the baggage of huge hype, the institutions of the federal government are getting serious about its genuine promise. On Thursday morning, the Obama Administration announced a “Big Data Research and Development Initiative,” with more than $200 million in new commitments. (See the fact sheet provided by the White House Office of Science and Technology Policy at the bottom of this post.)

“In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security,” said Dr. John P. Holdren, Assistant to the President and Director of the White House Office of Science and Technology Policy, in a prepared statement.

The research and development effort will focus on advancing the “state-of-the-art core technologies” needed for big data, harnessing said technologies “to accelerate the pace of discovery in science and engineering, strengthen our national security, and transform teaching and learning,” and “expand the workforce needed to develop and use Big Data technologies.”

In other words, the nation’s major research institutions will focus on improving available technology to collect and use big data, apply them to science and national security, and look for ways to train more data scientists.

“IBM views Big Data as organizations’ most valuable natural resource, and the ability to use technology to understand it holds enormous promise for society at large,” said David McQueeney, vice president of software, IBM Research, in a statement. “The Administration’s work to advance research and funding of big data projects, in partnership with the private sector, will help federal agencies accelerate innovations in science, engineering, education, business and government.”

While $200 million is a relatively small amount of funding, particularly in the context of the federal budget or as compared to the investments that are (probably) being made by Google or other major tech players, specific support for training and the subsequent application of big data within the federal government is important and sorely needed. The job market for data scientists in the private sector is so hot that government may well need to build up its own internal expertise, much in the same way Living Social is training coders at the Hungry Academy.

“Big data is a big deal,” blogged Tom Kalil, deputy director for policy at White House OSTP, at the White House blog this morning.

We also want to challenge industry, research universities, and non-profits to join with the Administration to make the most of the opportunities created by Big Data. Clearly, the government can’t do this on its own. We need what the President calls an “all hands on deck” effort.

Some companies are already sponsoring Big Data-related competitions, and providing funding for university research. Universities are beginning to create new courses—and entire courses of study—to prepare the next generation of “data scientists.” Organizations like Data Without Borders are helping non-profits by providing pro bono data collection, analysis, and visualization. OSTP would be very interested in supporting the creation of a forum to highlight new public-private partnerships related to Big Data.

The White House is hosting a forum today in Washington to explore the challenges and opportunities of big data and discuss the investment. The event will be streamed online in live webcast from the headquarters of the AAAS in Washington, DC. I’ll be in attendance and sharing what I learn.

“Researchers in a growing number of fields are generating extremely large and complicated data sets, commonly referred to as ‘big data,'” reads the invitation to the event from the White House Office of Science and Technology Policy. “A wealth of information may be found within these sets, with enormous potential to shed light on some of the toughest and most pressing challenges facing the nation. To capitalize on this unprecedented opportunity — to extract insights, discover new patterns and make new connections across disciplines — we need better tools to access, store, search, visualize, and analyze these data.”

Speakers:

  • John Holdren, Assistant to the President and Director, White House Office of Science and Technology Policy
  • Subra Suresh, Director, National Science Foundation
  • Francis Collins, Director, National Institutes of Health
  • William Brinkman, Director, Department of Energy Office of Science

Panel discussion:

  • Moderator: Steve Lohr, New York Times, author of “Big Data’s Impact in the World”
  • Alex Szalay, Johns Hopkins University
  • Lucila Ohno-Machado, UC San Diego
  • Daphne Koller, Stanford
  • James Manyika, McKinsey

What is big data?

Anyone planning to use big data for the public good, or for profit, through applied data science must first understand what big data is.

On that count, turn to my colleague Edd Dumbill, who posted a useful definition last year on the O’Reilly Radar in his introduction to the big data landscape:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible for even the small garage startups, who can cheaply rent server time in the cloud.

Teams of data scientists are increasingly leveraging a powerful, growing set of common tools, whether they’re employed by government technologists opening cities, developers driving a revolution in healthcare or hacks and hackers defining the practice of data journalism.

To learn more about the growing ecosystem of big data tools, watch my interview with Cloudera architect Doug Cutting, embedded below. Cutting created Lucene and led the Hadoop project at Yahoo before he joined Cloudera. Apache Hadoop is an open source framework that allows distributed applications based upon the MapReduce paradigm to run on immense clusters of commodity hardware, which in turn enables the processing of massive amounts of data.
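For readers new to the paradigm, here's a toy sketch of the MapReduce idea in plain Python: a word count expressed as map, group and reduce steps. It is illustrative only and does not use Hadoop's actual API; in Hadoop, these phases run in parallel across a cluster rather than in a single process.

```python
from collections import defaultdict

documents = ["big data is big", "data beats opinion"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key, as Hadoop does between phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: collapse each group to a total.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinion': 1}
```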

Details on the administration’s big data investments

A fact sheet released by the White House OSTP follows, verbatim:

National Science Foundation and the National Institutes of Health – Core Techniques and Technologies for Advancing Big Data Science & Engineering

“Big Data” is a new joint solicitation supported by the National Science Foundation (NSF) and the National Institutes of Health (NIH) that will advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large and diverse data sets. This will accelerate scientific discovery and lead to new fields of inquiry that would otherwise not be possible. NIH is particularly interested in imaging, molecular, cellular, electrophysiological, chemical, behavioral, epidemiological, clinical, and other data sets related to health and disease.

National Science Foundation: In addition to funding the Big Data solicitation, and keeping with its focus on basic research, NSF is implementing a comprehensive, long-term strategy that includes new methods to derive knowledge from data; infrastructure to manage, curate, and serve data to communities; and new approaches to education and workforce development. Specifically, NSF is:

· Encouraging research universities to develop interdisciplinary graduate programs to prepare the next generation of data scientists and engineers;
· Funding a $10 million Expeditions in Computing project based at the University of California, Berkeley, that will integrate three powerful approaches for turning data into information – machine learning, cloud computing, and crowd sourcing;
· Providing the first round of grants to support “EarthCube” – a system that will allow geoscientists to access, analyze and share information about our planet;
· Issuing a $2 million award for a research training group to support training for undergraduates to use graphical and visualization techniques for complex data;
· Providing $1.4 million in support for a focused research group of statisticians and biologists to determine protein structures and biological pathways;
· Convening researchers across disciplines to determine how Big Data can transform teaching and learning.

Department of Defense – Data to Decisions: The Department of Defense (DoD) is “placing a big bet on big data” investing approximately $250 million annually (with $60 million available for new research projects) across the Military Departments in a series of programs that will:

· Harness and utilize massive data in new ways and bring together sensing, perception and decision support to make truly autonomous systems that can maneuver and make decisions on their own.
· Improve situational awareness to help warfighters and analysts and provide increased support to operations. The Department is seeking a 100-fold increase in the ability of analysts to extract information from texts in any language, and a similar increase in the number of objects, activities, and events that an analyst can observe.

To accelerate innovation in Big Data that meets these and other requirements, DoD will announce a series of open prize competitions over the next several months.

In addition, the Defense Advanced Research Projects Agency (DARPA) is beginning the XDATA program, which intends to invest approximately $25 million annually for four years to develop computational techniques and software tools for analyzing large volumes of data, both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic). Central challenges to be addressed include:

· Developing scalable algorithms for processing imperfect data in distributed data stores; and
· Creating effective human-computer interaction tools for facilitating rapidly customizable visual reasoning for diverse missions.

The XDATA program will support open source software toolkits to enable flexible software development for users to process large volumes of data in timelines commensurate with mission workflows of targeted defense applications.

National Institutes of Health – 1000 Genomes Project Data Available on Cloud: The National Institutes of Health is announcing that the world’s largest set of data on human genetic variation – produced by the international 1000 Genomes Project – is now freely available on the Amazon Web Services (AWS) cloud. At 200 terabytes – the equivalent of 16 million file cabinets filled with text, or more than 30,000 standard DVDs – the current 1000 Genomes Project data set is a prime example of big data, where data sets become so massive that few researchers have the computing power to make best use of them. AWS is storing the 1000 Genomes Project as a publically available data set for free and researchers only will pay for the computing services that they use.

Department of Energy – Scientific Discovery Through Advanced Computing: The Department of Energy will provide $25 million in funding to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute. Led by the Energy Department’s Lawrence Berkeley National Laboratory, the SDAV Institute will bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department’s supercomputers, which will further streamline the processes that lead to discoveries made by scientists using the Department’s research facilities. The need for these new tools has grown as the simulations running on the Department’s supercomputers have increased in size and complexity.

US Geological Survey – Big Data for Earth System Science: USGS is announcing the latest awardees for grants it issues through its John Wesley Powell Center for Analysis and Synthesis. The Center catalyzes innovative thinking in Earth system science by providing scientists a place and time for in-depth analysis, state-of-the-art computing capabilities, and collaborative tools invaluable for making sense of huge data sets. These Big Data projects will improve our understanding of issues such as species response to climate change, earthquake recurrence rates, and the next generation of ecological indicators.

Further details about each department’s or agency’s commitments can be found at the following websites by 2 pm today:

NSF: http://www.nsf.gov/news/news_summ.jsp?cntn_id=123607
HHS/NIH: http://www.nih.gov/news/health/mar2012/nhgri-29.htm
DOE: http://science.energy.gov/news/
DOD: www.DefenseInnovationMarketplace.mil
DARPA: http://www.darpa.mil/NewsEvents/Releases/2012/03/29.aspx
USGS: http://powellcenter.usgs.gov

IBM infographic on big data

Big Data: The New Natural Resource

This post and headline have been updated as more information on the big data R&D initiative became available.

A future of cities fueled by citizens, open data and collaborative consumption

The future of cities was a hot topic this year at the SXSW Interactive Festival in Austin, Texas, with two different panels devoted to thinking about what’s next. I moderated one of them, on “shaping cities with mobile data.” Megan Schumann, a consultant at Deloitte, was present at both sessions and storified them. Her curation should give you a sense of the zeitgeist of the ideas shared.

http://storify.com/Schumenu/sxsw-future-cities-fueled-by-citizens-and-collabo.js[View the story “#SXSW Future Cities Fueled by Citizens and Collaborative Consumption” on Storify]

The importance of being earnest about big data

We are deluged with big data. We have become more adept, however, at collecting it than at making sense of it. The companies, individuals and governments that become the most adept at data analysis are doing more than finding the signal in the noise: they are creating a strategic capability. Why?

“After Eisenhower, you couldn’t win an election without radio.

After JFK, you couldn’t win an election without television.

After Obama, you couldn’t win an election without social networking.

I predict that in 2012, you won’t be able to win an election without big data.”

Alistair Croll, founder of bitcurrent.

In November 2012, we’ll know if his prediction came true.

All this week, I’ll be reporting from Santa Clara at the so-called “data Woodstock” that is the Strata Conference. Croll is its co-chair. You can tune in to the O’Reilly Media livestream for the conference keynotes.

For some perspective on big data and analytics in government, watch IBM’s Dave McQueeney at last year’s Gov 2.0 Summit:

Or watch how Hans Rosling makes big data dance in this TED Talk:
