PCAST report on big data and privacy emphasizes value of encryption, need for policy

April 4, 2014 meeting of PCAST at National Academy of Sciences

This week, the President’s Council of Advisors on Science and Technology (PCAST) met to discuss and vote to approve a new report on big data and privacy.

UPDATE: The White House published the findings of its review on big data today, including the PCAST review of technologies underpinning big data (PDF), discussed below.

As White House special advisor John Podesta noted in January, the PCAST has been conducting a study “to explore in-depth the technological dimensions of the intersection of big data and privacy.” Earlier this week, the Associated Press interviewed Podesta about the results of the review, reporting that the White House had learned of the potential for discrimination through the use of data aggregation and analysis. These are precisely the privacy concerns that stem from data collection that I wrote about earlier this spring. Here’s the PCAST’s list of “things happening today or very soon” that provide examples of technologies that can have benefits but pose privacy risks:

• Pioneered more than a decade ago, devices mounted on utility poles are able to sense the radio stations being listened to by passing drivers, with the results sold to advertisers.[26]
• In 2011, automatic license-plate readers were in use by three quarters of local police departments surveyed. Within 5 years, 25% of departments expect to have them installed on all patrol cars, alerting police when a vehicle associated with an outstanding warrant is in view.[27] Meanwhile, civilian uses of license-plate readers are emerging, leveraging cloud platforms and promising multiple ways of using the information collected.[28]
• Experts at the Massachusetts Institute of Technology and the Cambridge Police Department have used a machine-learning algorithm to identify which burglaries likely were committed by the same offender, thus aiding police investigators.[29]
• Differential pricing (offering different prices to different customers for essentially the same goods) has become familiar in domains such as airline tickets and college costs. Big data may increase the power and prevalence of this practice and may also decrease even further its transparency.[30]
• FeatureSpace offers machine-learning algorithms to the gaming industry that may detect early signs of gambling addiction or other aberrant behavior among online players.[31]
• Retailers like CVS and AutoZone analyze their customers’ shopping patterns to improve the layout of their stores and stock the products their customers want in a particular location.[32] By tracking cell phones, RetailNext offers bricks-and-mortar retailers the chance to recognize returning customers, just as cookies allow them to be recognized by on-line merchants.[33] Similar WiFi tracking technology could detect how many people are in a closed room (and in some cases their identities).
• The retailer Target inferred that a teenage customer was pregnant and, by mailing her coupons intended to be useful, unintentionally disclosed this fact to her father.[34]
• The author of an anonymous book, magazine article, or web posting is frequently “outed” by informal crowd sourcing, fueled by the natural curiosity of many unrelated individuals.[35]
• Social media and public sources of records make it easy for anyone to infer the network of friends and associates of most people who are active on the web, and many who are not.[36]
• Marist College in Poughkeepsie, New York, uses predictive modeling to identify college students who are at risk of dropping out, allowing it to target additional support to those in need.[37]
• The Durkheim Project, funded by the U.S. Department of Defense, analyzes social-media behavior to detect early signs of suicidal thoughts among veterans.[38]
• LendUp, a California-based startup, sought to use nontraditional data sources such as social media to provide credit to underserved individuals. Because of the challenges in ensuring accuracy and fairness, however, they have been unable to proceed.

The PCAST meeting was open to the public through a teleconference line. I called in and took rough notes on the discussion of the forthcoming report as it progressed. My notes on the comments of professors Susan Graham and Bill Press offer enough insight into the forthcoming report that I thought publishing them today was warranted, given the ongoing national debate regarding data collection, analysis, privacy and surveillance. The following should not be considered verbatim or an official transcript. For that, look for the PCAST to make a recording and transcript available online in the future, at its archive of past meetings. The emphases below are mine, as are the words in [brackets].


 

Susan Graham: Our charge was to look at the confluence of big data and privacy, to summarize current tech and the way technology is moving in the foreseeable future, including its influence on the way we think about privacy.

The first thing that’s very very obvious is that personal data in electronic form is pervasive. Traditional data that was in health and financial [paper] records is now electronic and online. Users provide info about themselves in exchange for various services. They use Web browsers and share their interests. They provide information via social media, Facebook, LinkedIn, Twitter. There is [also] data collected that is invisible, from public cameras, microphones, and sensors.

What is unusual about this environment and big data is the ability to do analysis in huge corpuses of that data. We can learn things from the data that allow us to provide a lot of societal benefits. There is an enormous amount of patient data, data about disease, and data about genetics. By putting it together, we can learn about treatment. With enough data, we can look at rare diseases, and learn what has been effective. We could not have done this otherwise.

We can analyze more online information about education and learning, not only MOOCs but lots of learning environments. [Analysis] can tell teachers how to present material effectively, to do comparisons about whether one presentation of information works better than another, or analyze how well assessments work with learning styles.
Certain visual information is comprehensible, certain verbal information is hard to understand. Understanding different learning styles [can enable] customized teaching.

The reason this all works is the profound nature of analysis. This is the idea of data fusion, where you take multiple sources of information and combine them, which provides a much richer picture of some phenomenon. If you look at patterns of human movements on public transport, or pollution measures, or weather, maybe we can predict dynamics caused by human context.

We can use statistics to do statistics-based pattern recognition on large amounts of data. One of the things that we understand about this statistics-based approach is that it might not be 100% accurate if mapped down to the individual providing data in these patterns. We have to be very careful not to make mistakes about individuals because we make [an inference] about a population.
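
An aside on that caution, mine and not from the meeting: a rough sketch of why a pattern that holds across a population can still be wrong about most of the individuals it flags. The numbers are hypothetical, chosen only for illustration. They assume a trait that 1% of people have and a pattern detector that is right 90% of the time in both directions.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of flagged individuals who truly have the trait (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs, not figures from the PCAST report.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)
print(f"Chance a flagged individual truly has the trait: {ppv:.1%}")
```

Under those assumptions, only about 1 in 12 flagged individuals actually has the trait, even though the detector looks accurate in aggregate.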

How do we think about privacy? We looked at it from the point of view of harms. There are a variety of ways in which results of big data can create harm, including inappropriate disclosures [of personal information], potential discrimination against groups, classes, or individuals, and embarrassment to individuals or groups.

We turned to what tech has to offer in helping to reduce harms. We looked at a number of technologies in use now. We looked at a bunch coming down the pike. We looked at several technologies in use, some of which become less effective because of pervasiveness [of data] and depth of analytics.

We traditionally have controlled [data] collection. We have seen some data collection from cameras and sensors that people don’t know about. If you don’t know, it’s hard to control.

Tech creates many concerns. We have looked at methods coming down the pike. Some are more robust and responsive. We have a number of draft recommendations that we are still working out.

Part of privacy is protecting the data using security methods. That needs to continue. It needs to be used routinely. Security is not the same as privacy, though security helps to protect privacy. There are a number of approaches that are now used by hand that, with sufficient research, could be automated and used more reliably, so they scale.

There needs to be more research and education about privacy. Professionals need to understand how to treat privacy concerns anytime they deal with personal data. We need to create a large group of professionals who understand privacy, and privacy concerns, in tech.

Technology alone cannot reduce privacy risks. There has to be a policy as well. It was not our role to say what that policy should be. We need to lead by example by using good privacy protecting practices in what the government does and increasingly what the private sector does.

Bill Press: We tried throughout to think of scenarios and examples. There’s a whole chapter [in the report] devoted explicitly to that.

They range from things being done today, present technology, even though they are not all known to people, to our extrapolations to the outer limits, of what might well happen in next ten years. We tried to balance examples by showing both benefits, they’re great, and they raise challenges, they raise the possibility of new privacy issues.

In another aspect, in Chapter 3, we tried to survey technologies from both sides: the tech that is going to bring benefits, those that will protect [people], and also those that will raise concerns.

In our technology survey, we were very much helped by the team at the National Science Foundation. They provided a very clear, detailed outline of where they thought that technology was going.

This was part of our outreach to a large number of experts and members of the public. That doesn’t mean that they agree with our conclusions.

Eric Lander: Can you take everybody through analysis of encryption? Are people using much more? What are the limits?

Graham: The idea behind classical encryption is that when data is stored, when it’s sitting around in a database, let’s say, encryption entangles the representation of the data so that it can’t be read without using a mathematical algorithm and a key to convert a seemingly meaningless set of bits into something reasonable.

The same technology, where you convert and change meaningless bits, is used when you send data from one place to another. So, if someone is scanning traffic on internet, you can’t read it. Over the years, we’ve developed pretty robust ways of doing encryption.

The weak link is that to use data, you have to read it, and it becomes unencrypted. Security technologists worry about it being read in that short time.

Encryption technology is vulnerable. The key that unlocks the data is itself vulnerable to theft or getting the wrong user to decrypt.

Both problems of encryption are active topics of research on how to use data without being able to read it. There is research on increasing the robustness of encryption, so if a key is disclosed, you haven’t lost everything and you can protect some of the data or future encryption of new data. This reduces risk a great deal and is important to use. Encryption alone doesn’t protect.
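
To make Graham's description concrete, here is a toy sketch of my own, not anything presented at the meeting. It uses a one-time pad (XOR with a random key as long as the data), which is the simplest form of symmetric encryption. The record text is made up, and real systems would use a vetted algorithm from a maintained cryptography library rather than this. The point is only to show the two weak links she names: the data must be decrypted to be used, and whoever holds the key can read it.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte (one-time pad)."""
    if len(key) != len(data):
        raise ValueError("one-time pad key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


# Hypothetical record, for illustration only.
record = b"patient 1234: diagnosis on file"

key = secrets.token_bytes(len(record))   # the secret that must be protected
ciphertext = xor_bytes(record, key)      # what sits in the database or on the wire

# To use the data, someone with the key turns it back into readable form.
recovered = xor_bytes(ciphertext, key)

# A different key yields only meaningless bytes.
wrong_key = secrets.token_bytes(len(record))
garbage = xor_bytes(ciphertext, wrong_key)

print(recovered.decode())
```

If the key is stolen, the stored ciphertext is as readable as the original record, which is why key handling gets so much attention.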

Unknown Speaker: People read of breaches derived from security. I see a different set of issues of privacy from big data vs those in security. Can you distinguish them?

Bill Press: Privacy and security are different issues. Security is necessary to have good privacy in the technological sense: if communications are insecure, they clearly can’t be private. This goes beyond, to where parties that are authorized, in a security sense, to see the information. Privacy is much closer to values. Security is much closer to protocols.

Interesting thing is that this is less about purely tech elements; everyone can agree on the right protocol, eventually. These are things that go beyond and have to do with values.

Map of open government communities generated by social network analysis of Twitter


The map above was created on November 20 by researcher Marc Smith using a dataset of tweets that contained “opengov” over the past month. You can explore an interactive version of it here.

The social network analysis is, by its nature, a representation of only the data used to create it. It’s not a complete picture of open government communities offline, or even the totality of the communities online: it’s just the people who tweeted about open gov.

That said, there are some interesting insights to be gleaned.

1) The biggest network is the one for the Open Government Partnership (OGP), on the upper left (G1), which had its annual summit during the time period in question. That likely affected the data set.

2) I’m at the center of the U.S. open government community on the bottom left (G2) (I’m doing something right!) and am connected throughout these communities, though I need to work on my Spanish. This quadrant is strongly interconnected and includes many nodes linked up to OGP and around the world. (Those are represented by the green lines.)

3) Other communities include regional networks, like Spain (G4) and Spanish-speaking (G11) open government organizations, Germany (G3), Italy (G12), Canada (G7), Greece (G5) and Australia (G9), and ideological networks, like the White House @OpenGov initiative (G8) and U.S. House Majority Leader (G6). These networks have many links to one another, although Mexico looks relatively isolated. Given that Indonesia has a relatively high Twitter penetration, its relative absence from the map likely reflects users there not tweeting with “opengov.”

4) The relative sparseness of connections between the Republican open government network and other open government communities strongly suggests that, despite the overwhelming bipartisan support for the DATA Act in the House, the GOP isn’t engaging and linked up to the broader global conversation yet, an absence that should concern both its leaders and advocates in the United States who would like to see effective government rise above partisan politics. This community is also only tweeting links to its own (laudable) open government initiatives and bills in the House, as opposed to what’s happening outside of DC.

5) You can gain some insight into the events and issues that matter in these communities by looking at the top links shared. Below, I’ve shared the top links from Smith’s NodeXL analysis:
Top URLs in Tweet in Entire Graph:

https://healthcare.gov/
http://www.opengovguide.com/
https://www.gov.uk/government/topical-events/open-government-partnership-summit-2013
http://www.opengovpartnership.org/get-involved/london-summit-2013
https://govmakerday.eventbrite.com/
http://blogs.worldbank.org/youthink/can-young-people-make-your-government-more-accountable
http://www.opengovpartnership.org/london-summit-2013
http://paper.li/DGateway/1350366870
https://www.thunderclap.it/projects/5907-more-open-government-ogp13
http://www.youtube.com/playlist?feature=edit_ok&list=PLMDgGB-pYxdFNupM0kiHFPjwv8by2alxY

Top URLs in Tweet in G1 (Open Government Partnership):

https://www.gov.uk/government/topical-events/open-government-partnership-summit-2013
https://www.thunderclap.it/projects/5907-more-open-government-ogp13
http://www.opengovpartnership.org/london-summit-2013
http://www.thunderclap.it/tipped/5907/twitter
http://www.opengovpartnership.org/get-involved/london-summit-2013
http://www.youtube.com/playlist?feature=edit_ok&list=PLMDgGB-pYxdFNupM0kiHFPjwv8by2alxY
https://www.thunderclap.it/projects/5907-more-open-government-ogp13?locale=en
http://www.opengovguide.com/
https://www.gov.uk/government/consultations/open-government-partnership-uk-national-action-plan-2013
http://www.opengovpartnership.org/open-government-awards-launched-reward-transparent-accountable-and-effective-public-programs#sthash.xl5Bwn5D.dpuf

Top URLs in Tweet in G2 (US OpenGov Community):

http://www.usgovernmentmanual.gov/
http://www.huffingtonpost.com/danielle-brian/in-wake-of-snowden-us-mus_b_4192804.html?utm_hp_ref=tw
http://e-pluribusunum.com/2013/11/05/farm-bill-foia-open-government-epa/
http://www.knightfoundation.org/blogs/knightblog/2013/10/28/new-project-aims-connect-dots-open-data/
http://www.whitehouse.gov/lWV7k
http://sunlightfoundation.com/blog/2013/11/15/opengov-voices-pdf-liberation-hackathon-at-sunlight-in-dc-and-around-the-world-january-17-19-2014/
http://sunlightfoundation.com/blog/2013/11/14/recent-developments-show-desire-for-trade-talk-transparency/
http://sunlightfoundation.com/blog/2013/10/22/how-much-did-healthcare-gov-actually-cost/
http://www.consumerfinance.gov/blog/making-regulations-easier-to-use/
http://sunlightfoundation.com/blog/2013/11/19/house-keeps-data-act-momentum-moving/

Top URLs in Tweet in G3 (Germany):

http://paper.li/DGateway/1350366870
http://oknrw.de/
http://www.opengovpartnership.org/blog/christian-heise/2013/11/18/german-grand-coalition-might-agree-joining-ogp
http://dati.comune.bologna.it/node/962
http://www.globalhealthhub.org/thehive/
https://www.facebook.com/events/1431657647046016
http://aiddata.org/blog/this-week-open-data-for-open-hearts-and-open-minds
http://www.opengovpartnership.org/get-involved/london-summit-2013
http://www.freiburg.de/pb/,Lde/541381.html
http://digitaliser.dk/resource/2534864

Top URLs in Tweet in G4 (Spain):

http://www.opengovguide.com/
http://www.cepal.org/cgi-bin/getprod.asp?xml=/ilpes/capacitacion/0/50840/P50840.xml&xsl=/ilpes/tpl/p15f.xsl&base=/ilpes/tpl/top-bottom.xsl
https://vine.co/v/hpZErXPd6rq
http://thepowerofopengov.tumblr.com/
http://es.scribd.com/collections/4376877/Case-Studies
https://vine.co/v/hpZiw7TXanI
https://vine.co/v/hpZIV002zar
http://aga.org.mx/SitePages/DefinicionGob.aspx
http://www.opengovpartnership.org
http://inicio.ifai.org.mx/Publicaciones/La%20promesa%20del%20Gobierno%20Abierto.pdf

Top URLs in Tweet in G5 (Greece):

https://healthcare.gov/
http://venturebeat.com/2013/10/23/so-much-for-opengov-quantcast-traffic-on-healthcare-gov-hidden-by-the-owner/
http://elegilegi.org/
http://opengov.seoul.go.kr/
http://venturebeat.com/2013/10/23/so-much-for-opengov-quantcast-traffic-on-healthcare-gov-hidden-by-the-owner/?utm_source=twitterfeed&utm_medium=twitter
http://www.aspeninstitute.org/about/blog/biases-open-government-blind-us?utm_source=as.pn&utm_medium=urlshortener
http://www.opengov.gr/consultations/?p=1744
http://www.aspeninstitute.org/about/blog/biases-open-government-blind-us
http://www.opengov.gr/minfin/?p=4076
http://OpenGov.com

Top URLs in Tweet in G6 (GOP):

http://houselive.gov/
https://www.cosponsor.gov/details/hr2061
http://oversight.house.gov/release/oversight-leaders-introduce-bipartisan-data-act/
http://instagram.com/p/g3uvs_sYYr/
http://cpsc.gov/live
https://cosponsor.gov/details/hr2061-113
http://www.youtube.com/watch?v=Bnn3IsOhulE&feature=youtu.be
http://www.cpsc.gov/en/Regulations-Laws--Standards/Rulemaking/Final-and-Proposed-Rules/Hand-Held-Infant-Carriers/
http://www.speaker.gov/press-release/opengov-house-representatives-makes-us-code-available-bulk-xml
http://docs.house.gov/billsthisweek/20131118/BILLS-113hr2061XML.xml

Top URLs in Tweet in G7 (Canada):

https://govmakerday.eventbrite.com/
http://www.ontario.ca/government/open-government-initial-survey
http://govmakerday.eventbrite.com
http://www.thestar.com/opinion/commentary/2013/10/29/the_promise_and_challenges_of_open_government.html
http://www.marsdd.com/2013/10/31/open-government-three-stages-for-codeveloping-solutions/
http://www.thestar.com/opinion/commentary/2013/11/09/rob_ford_and_the_emerging_crisis_of_legitimacy.html
https://www.ontario.ca/government/open-government-initial-survey
http://govmakerday-estw.eventbrite.com
http://www.ontario.ca/open
http://gov20radio.com/2013/10/gtec2013/

Top URLs in Tweet in G8 (@OpenGov):

https://healthcare.gov/
http://www.whitehouse.gov/lWV7k
http://www.consumerfinance.gov/blog/making-regulations-easier-to-use/
http://aseyeseesit.blogspot.com/2012/01/economy-hasnt-stalled-for-members-of.html
http://www.commonblog.com/2013/10/23/eagle-tribune-editorial-public-records-need-to-be-available-to-its-citizenry/
http://mobile.twitter.com/OpenGov
http://blogs.state.gov/stories/2013/10/31/making-governments-more-open-effective-and-accountable
http://open.dc.gov/
http://www.sielocal.com/SieLocal/informe/1025/Ingresos-por-el-concepto-de-multas
http://mei-ks.net/?page=1,5,787

Top URLs in Tweet in G9 (Australia):

http://paper.li/cortado/1291646564
http://icma.org/en/icma/knowledge_network/documents/kn/Document/305680/Transparency_20_The_Fundamentals_of_Online_Open_Government
http://www.oaic.gov.au/about-us/corporate-information/annual-reports/oaic-annual-report-201213/
https://controllerdata.lacity.org/
http://cogovsnapshot.cofluence.co/
https://info.granicus.com/Online-Open-Gov-October-29-2013.html?page=Home-Page
http://www.oaic.gov.au/news-and-events/subscribe
https://oaic.govspace.gov.au/2013/10/30/community-attitudes-to-privacy-survey-results/
http://www.cebit.com.au/cebit-news/2013/towards-open-government-esnapshot-australia-2013
http://journalistsresource.org/studies/politics/digital-democracy/government-transparency-conflicts-public-trust-privacy-recent-research-ideas

Top URLs in Tweet in G10:

http://on.undp.org/pUdJj
http://europeandcis.undp.org/blog/2013/10/17/a-template-for-developing-a-gov20-opengov-project/?utm_source=%40OurTweets
http://www.govloop.com/profiles/blogs/12-favorite-quotes-from-code-for-america-summit
http://www.accessinitiative.org/blog/2013/10/east-kalimantan-community%E2%80%99s-struggles-underscore-need-proactive-transparency-indonesia
http://www.scribd.com/doc/178983441/Montenegro-Inspiring-Story-open-government
http://www.scribd.com/doc/178988676/Indonesia-case-study-open-government
http://www.setimes.com/cocoon/setimes/xhtml/en_GB/features/setimes/features/2013/10/18/feature-01?utm_source=%40OurTweets
http://www.opengovpartnership.org/summary-london-summit-commitments
http://feedly.com/k/1arGpdC
http://slid.es/kendall/open-records
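
For readers curious how per-group rankings like the ones above can be produced, here is a minimal sketch. It is not Smith's NodeXL workflow. It assumes each tweet has already been assigned to a group by the clustering step and reduced to a (group, URL) pair. The function name and the sample pairs are my own illustrations, reusing a few URLs from the lists above with made-up counts.

```python
from collections import Counter, defaultdict


def top_urls_by_group(shared_links, n=10):
    """Return the n most-shared URLs for each group, most frequent first."""
    counts = defaultdict(Counter)
    for group, url in shared_links:
        counts[group][url] += 1
    return {
        group: [url for url, _ in counter.most_common(n)]
        for group, counter in counts.items()
    }


# Hypothetical (group, URL) pairs; the counts are invented for illustration.
sample = [
    ("G1", "http://www.opengovpartnership.org/london-summit-2013"),
    ("G1", "http://www.opengovguide.com/"),
    ("G1", "http://www.opengovpartnership.org/london-summit-2013"),
    ("G6", "http://houselive.gov/"),
    ("G6", "http://houselive.gov/"),
    ("G6", "https://www.cosponsor.gov/details/hr2061"),
]

top = top_urls_by_group(sample, n=2)
for group, urls in top.items():
    print(group, urls)
```

Note that near-duplicate links, such as the same page with and without tracking parameters, are counted separately here, which is also visible in several of the lists above.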

Digging in open data dirt, Climate Corporation finds nearly $1 billion in acquisition

“Like the weather information, the data on soils was free for the taking. The hard and expensive part is turning the data into a product.”-Quentin Hardy, in 2011, in a blog post about “big data in the dirt.”


The Climate Corporation, acquired by Monsanto for $930 million on Wednesday, was founded using 30 years of government data from the National Weather Service, 60 years of crop yield data and 14 terabytes of information on soil types for every two square miles for the United States from the Department of Agriculture, per David Friedberg, chief executive of the Climate Corporation.

Howsabout that for government “data as infrastructure” and platform for businesses?

As it happens, not everyone is thrilled to hear about that angle or the acquirer. At VentureBeat, Rebecca Grant takes the opportunity to knock “the world’s most evil corporation” for the effects of Monsanto’s genetically modified crops, and, writing for Salon, Andrew Leonard takes the position that the Climate Corporation’s use of government data constitutes a huge “taxpayer ripoff.”

Most observers, however, are more bullish. Hamish McKenzie hails the acquisition as confirmation that “software is eating the world,” signifying an inflection point in data analysis transforming industries far away from Silicon Valley. Liz Gannes also highlighted the application of data-driven analysis to an offline industry. Ashlee Vance focused on the value of Climate Corporation’s approach to scoring risk for insurance in agribusiness. Stacey Higginbotham posits that the acquisition could be a boon to startups that specialize in creating data on soil and climate through sensors.

[Image Credit: Climate Corporation, via NPR]