Bill Howe – UW News

Wide-Open accelerates release of scientific data by automatically identifying overdue datasets

Jennifer Langston — Thu, 08 Jun 2017 18:03:54 +0000

Wide-Open is a new open-source tool to help advance open science by automatically detecting datasets that are overdue for publication. Its use on the Gene Expression Omnibus (GEO) led to a dramatic drop in overdue datasets, with 400 datasets released within the first week. Photo: PLOS Biology

Advances in genetic sequencing and other technologies have led to an explosion of biological data, and decades of openness — both spontaneous and enforced — mean that scientists routinely deposit data in online repositories. But researchers are only human and may forget to tell a repository to release the data when a paper is published.

A new tool called Wide-Open, developed by University of Washington and Microsoft researchers and described in an article publishing June 8 in PLOS Biology, aims to get around this problem and help advance open science by automatically detecting datasets that are overdue for publication.

Open data is a vital pillar of open science, enabling other researchers to reproduce results and use the same datasets to produce novel discoveries. While many scientific journals now require published authors to make the data underlying their findings publicly available, these policies often go unenforced. The challenge is substantial – the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus repository (GEO) alone contains 80,985 public datasets, spanning hundreds of tissue types in thousands of organisms – and the rapid growth in data makes it difficult for journals or data repositories to “police” whether datasets that should be made publicly available actually are.

The Wide-Open system is available under an open-source license; it uses text mining to identify references, in published scientific articles, to datasets that should be publicly accessible, and then parses query results from repositories to determine if those datasets remain private.
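The two steps described above can be sketched in a few lines of Python. This is only an illustration, not the Wide-Open code: the regular expression assumes GEO series accessions of the form GSE plus digits and SRA accessions of the form SRP, SRR, SRX or SRS plus digits, and `is_public` is a hypothetical callback standing in for whatever repository query the real system performs.

```python
import re

# Assumed accession formats: GEO series (GSE123) and SRA (SRP/SRR/SRX/SRS123).
ACCESSION_PATTERN = re.compile(r"\b(GSE\d+|SR[PRXS]\d+)\b")


def find_accessions(article_text):
    """Return dataset accessions mentioned in the text, in first-seen order."""
    seen = []
    for match in ACCESSION_PATTERN.finditer(article_text):
        accession = match.group(1)
        if accession not in seen:
            seen.append(accession)
    return seen


def find_overdue(article_text, is_public):
    """Return accessions cited in a published article that are still private.

    is_public is a hypothetical function (accession -> bool) that would wrap
    a query to the repository; it is passed in so the sketch runs offline.
    """
    return [acc for acc in find_accessions(article_text) if not is_public(acc)]
```

For example, given an article that cites two accessions and a lookup that reports only one of them as released, `find_overdue` returns the other as a candidate overdue dataset.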

Maxim Grechkin and his team tested their tool on two popular data repositories maintained by the NCBI – GEO and the Sequence Read Archive (SRA). Wide-Open identified a large number of overdue datasets, which spurred repository administrators to respond by releasing 400 datasets in one week.

“We developed a simple yet effective system that has already helped make hundreds of datasets public,” said lead author Grechkin, a doctoral student in the Allen School of Computer Science & Engineering. “Having an impartial and automated system enforce open data policies can help level the playing field among scientists and generate new opportunities for discovery.”

For more information, contact Maxim Grechkin at grechkin@cs.washington.edu.

This article was adapted from a PLOS Biology press release.

UW anthropologist: Why researchers should share computer code

Kim Eckart — Thu, 25 May 2017 15:17:41 +0000

For years, scientists have discussed whether and how to share data from painstaking research and costly experiments. Some are further along in their efforts toward “open science” than others: Fields such as astronomy and oceanography, for example, involve such expensive and large-scale equipment and logistical challenges to data collection that collaboration among institutions has become the norm.

Meanwhile, a variety of academic journals, including several in the Nature Research family, are turning their attention to another aspect of the research process: computer programming code. Code is becoming increasingly important in research because scientists are often writing their own computer programs to interpret their data, rather than using commercial software packages. Some journals now include scientific data and code as part of the peer-review process.

And now, with the May 25 online publication of a paper by Ben Marwick, a University of Washington associate professor of anthropology, and 13 colleagues at universities across the United States and Europe, there are conventions and tools that researchers can use to make code sharing easier and more efficient. The team’s paper advocating the sharing of code appears in Nature Neuroscience, while the journal announces in an editorial a pilot project to ask future authors to make their code available for review.

Making the programs behind the research accessible allows other scientists to test the code and reproduce the computations in an experiment — in other words, to reproduce results and solidify findings. It’s the “how the sausage is made” part of research, Marwick said. It also allows the code to be used by other researchers in new studies, making it easier for scientists to build on the work of their colleagues.

“What we’re missing is the convention of sharing code or the tools for turning data into useful discoveries or information,” Marwick said. “Researchers say it’s great to have the data available in a paper — increasingly raw data are available in supplementary files or specialized online repositories — but the code for performing the clever analyses in between the raw data and the published figures and tables are still inaccessible.”

Other Nature Research journals also provide for code review as part of the article evaluation process. Since 2014, the company has encouraged writers to make their code available upon request.

The Nature Neuroscience pilot focuses on three elements: whether the code supporting an author’s main claims is publicly accessible; whether the code functions without mistakes; and whether it produces the results cited.

“This is a commitment from a high-impact journal to raise software to the status of a regular research product, that it’s not just a tool that gets discarded along the way, or hidden on a researcher’s computer where no-one else can benefit from it,” Marwick said. “In the future, scientific disciplines will be shifting to a position where you need to share your code as well as your data. It will be easier to reproduce someone’s new discovery, and incorporate their discoveries into your own work.”

Imagine this scenario, Marwick said: A neuroscientist is trying to find new ways to identify early-stage tumors using 3-D brain imagery. She comes up with an algorithm that can pick out specific pixel values in an image, which helps lead to early tumor detection. By sharing the computer code and its mathematical algorithm, the scientist could facilitate a breakthrough.

The Nature Neuroscience paper resulted from a two-day workshop held in 2014 in the United Kingdom, to which Marwick, an archaeologist, was invited because of his efforts in using code and promoting open science in archaeology. A Senior Data Science Fellow at the UW eScience Institute, Marwick is active in the institute’s Reproducibility and Open Science Group, which works on tools and practices to enhance data sharing, preservation and reproducibility.

Bill Howe, associate director of the eScience Institute, said code sharing is part of the future. “Reproducibility is literally the definition of science, and as science moves from the lab to the computer, code sharing must be at the core of how we conduct research and train students.”

An open science approach to sharing code is not without its critics, as well as scientists who raise legal and ethical questions about the repercussions. How do researchers get proper credit for the code they share? How should code be cited in the scholarly literature? How will it count toward tenure and promotion applications? How is sharing code compatible with patents and commercialization of software technology?

Marwick, who specializes in prehistoric human evolutionary ecology in Southeast Asia and Australia, has been advocating for code sharing and related open science initiatives in archaeology through the Society for American Archaeology.

“I’m just trying to shift the needle in my discipline to a practice that benefits everyone — researchers and the public,” he said.

###

UW students put data science skills to use for social good

Mon, 31 Aug 2015 15:53:30 +0000

The Data Science for Social Good team that worked on the family homelessness project. Photo: Craig Young / University of Washington

They could easily spend their days poring over statistical methods for a genetic study or sorting through data about consumer behavior on the other side of the globe.

But this summer, data scientists at the University of Washington’s eScience Institute took a break from their typical work helping researchers and professors incorporate cutting-edge technologies and data-based methods into their academic pursuits. Instead, they harnessed their expertise to address pressing urban issues closer to home.

In June, the institute launched the UW’s Data Science for Social Good program, an initiative that paired data scientists with students and local nonprofit and government partners. These interdisciplinary teams worked on projects to reduce family homelessness, improve paratransit service, foster community well-being and map better sidewalk routes for people with mobility challenges.

The initiative, modeled after similar programs elsewhere, fits with the eScience Institute’s mission to advance data-driven science in all fields, said Bill Howe, the institute’s associate director.

“Interdisciplinary is part of our brand, and the social good aspect is a powerful extension of that,” he said. “People want to have an impact. This was a chance for students to apply their skills to projects that have some relevance.”

One student team worked with the Gates Foundation and a Seattle-based nonprofit to determine the best combination of services and programs to lead homeless families to permanent housing. The two organizations are working with King, Pierce and Snohomish counties on a multiyear initiative to halve family homelessness in the region by 2020.

Program fellows attended tutorials on a variety of topics. Photo: Craig Young / University of Washington

One major, data-centered challenge to these efforts is keeping track of homeless families as they move through government and nonprofit services. The UW team took on the task of linking records representing homeless families across different services in the three counties, a challenging undertaking since counties don’t necessarily track or define households the same way.

“The first problem we encountered in this pipeline was defining households,” UW graduate student and team member Chris Suberlak said during a recent presentation of the team’s findings. “It would be ideal if the household ID that was provided was consistent. In at least one county, it wasn’t.”

After grouping individuals together into the correct family units, the team developed criteria for defining a single instance of homelessness. They aggregated information about the programs and services families used during that single episode — for example, staying in an emergency shelter and then moving into transitional housing.
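One way to picture the aggregation step is the short Python sketch below. It is not the team’s method: the rule used here (consecutive stays belong to the same episode when the gap between them is at most `max_gap_days`) and the 30-day default are assumptions made purely for illustration, as are the record layout and function name.

```python
from datetime import date


def group_episodes(stays, max_gap_days=30):
    """Group one family's service stays into episodes of homelessness.

    stays: list of (program, start_date, end_date) tuples.
    Assumed rule: a stay that starts within max_gap_days of the end of the
    current episode is folded into it; otherwise a new episode begins.
    """
    episodes = []
    for program, start, end in sorted(stays, key=lambda stay: stay[1]):
        if episodes and (start - episodes[-1]["end"]).days <= max_gap_days:
            current = episodes[-1]
            current["programs"].append(program)
            current["end"] = max(current["end"], end)
        else:
            episodes.append({"start": start, "end": end, "programs": [program]})
    return episodes
```

Under these assumptions, a shelter stay followed nine days later by transitional housing would count as one episode that used two programs, while a shelter stay the following year would start a second episode.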

The team also developed an interactive diagram showing the different paths families took through the system of government and charitable services. They created a Sankey diagram, typically used to show the flow of energy or costs between two points, to help providers identify programs and trajectories that may help homeless families secure permanent housing.
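A diagram of this kind is driven by counts of how many families moved from one program to the next. The Python sketch below shows that counting step only; the function name and the sample program labels are illustrative assumptions, and the drawing itself would be left to a plotting library.

```python
from collections import Counter


def count_flows(paths):
    """Count transitions between consecutive programs across family paths.

    paths: list of lists, each an ordered sequence of program names.
    Returns a Counter keyed by (source, target) pairs, which is the
    link data a flow diagram needs.
    """
    flows = Counter()
    for path in paths:
        for source, target in zip(path, path[1:]):
            flows[(source, target)] += 1
    return flows
```

Each (source, target) count becomes the width of one band in the diagram, so an unexpectedly wide or narrow band can point to a data problem or to a trajectory worth a closer look.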

Gates Foundation spokesperson Anne Martens said the team’s work has already made a difference, since it analyzed data at a level beyond what the counties are equipped to handle. After looking at the Sankey diagram, she said, a county worker recognized that something was amiss in the data from a provider and was able to fix the problem.

“It’s already been a boon to the counties’ decision-making,” Martens said. “They would not have had the time or the capacity to do this without the support of the data scientists.”

Teams worked out of the eScience Institute’s UW headquarters in the Washington Research Foundation Data Science Studio. Photo: Anissa Tanweer / University of Washington

Another team worked with King County’s paratransit program, which provides door-to-door bus service for people with disabilities. The team analyzed the highest-cost rides, developed more precise usage predictions and created a web-based tool that immediately locates alternate buses when a bus breaks down.

The third team mined data to develop a community well-being framework, measuring indicators such as neighborhood diversity, socioeconomic status and places to connect. They created an interactive online map that shows well-being indicators for Seattle’s neighborhoods.

The fourth team built on the success of an app developed last spring. The app shows maps of Seattle’s sidewalks, highlighting obstacles and elevation changes for people with mobility challenges, particularly those in wheelchairs. The data science team used computational geometry and routing algorithms to construct a graph that connected the city’s fractured network of sidewalks across street crossings, then created an algorithm for the app to map customized, accessible sidewalk and crosswalk-based routes around the city that avoided curbs, construction sites and steep grades.
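The general idea of routing over a sidewalk graph while avoiding barriers can be sketched with a standard shortest-path search. The Python below is not the app’s algorithm: it uses plain Dijkstra search, and the edge layout, the `has_barrier` flag and the 8 percent `max_grade` default are all hypothetical choices made for the example.

```python
import heapq


def accessible_route(graph, start, goal, max_grade=0.08):
    """Shortest route that skips barrier edges and edges steeper than max_grade.

    graph: {node: [(neighbor, length_m, grade, has_barrier), ...]}
    Returns (path, total_length), or None if no accessible route exists.
    """
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            # Walk the predecessor links back to the start.
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return path[::-1], dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length, grade, has_barrier in graph.get(node, []):
            if has_barrier or abs(grade) > max_grade:
                continue  # edge is not usable for this traveler
            new_dist = dist + length
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                previous[neighbor] = node
                heapq.heappush(queue, (new_dist, neighbor))
    return None
```

Because the grade limit is a parameter, the same graph can yield different routes for different travelers: a shorter but steeper path for one person, a longer and flatter one for another.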

The four projects were chosen from among 11 proposals submitted. More than 140 students from UW departments ranging from political science to mechanical engineering applied to participate. Sixteen undergraduate and graduate students were selected, along with six high school students from a UW program.

The 10-week internship also included tutorials for students on programming languages, presentations from local tech companies and readings on various topics. About half of the interns came in with little prior programming experience, said Micaela Parker, an eScience Institute program manager.

“They learned enough to be able to advance these projects and make them happen,” she said. “That’s pretty phenomenal.”

Ed Lazowska, the Bill & Melinda Gates Chair in Computer Science & Engineering, said the initiative demonstrates the utility of data science in tackling a host of societal challenges that students are eager to work on.

“I think people are energized by the ability to work on something that is both technically challenging and makes the world a better place,” he said. “That’s what Data Science for Social Good is about. It’s technically challenging and it leaves the world — or in our case, the city — a better place.”

AAAS symposium looks at how to bring big-data skills to academia

Michelle Ma — Fri, 13 Feb 2015 21:24:01 +0000

There’s a new kind of researcher on campus, one who doesn’t fit into the usual nooks and crannies at a university.

They are data scientists — students, faculty members and staff — who are building the tools and crafting the methods to help researchers analyze vast amounts of data now abundant in every field, from the physical and social sciences to the humanities, natural sciences and engineering. The very nature of their skill set is interdisciplinary, but the university system doesn’t always reward them for the time they spend developing techniques and software to advance science.

Big-data in academia symposium
1:30-4:30 p.m. Sunday, Feb. 15
Room LL21F

These data scientists are sought after by industry — to mine customers’ preferences for more targeted advertising or to analyze traffic patterns to build more sensible roadways — and also are needed in academia to process gene sequences or astronomical amounts of star data. But traditional university career paths can be a poor fit for these experts.

This dilemma, and what universities can do to change it, is the topic of a symposium Feb. 15 at the American Association for the Advancement of Science meeting in San Jose, California. The session is led by University of Washington faculty members and brings together experts from the University of California, Berkeley, and New York University.

At the UW, an interdisciplinary organization called the eScience Institute, which recently was awarded several prestigious grants, is advancing the research and practice of data-intensive discovery across campus, in part by attracting data scientists to explore new career paths that blend independent research, interdisciplinary consulting and teaching, and development of new software and methods.

Bill Howe, associate director of the eScience Institute and co-organizer of the conference symposium, will talk about how the UW’s programs are designed to help researchers interact with industry partners, particularly to make big-data analysis techniques and methods easier for everyone to use.

The newly opened data science studio at the UW is a space on campus open to anyone who needs help or wants to exchange ideas about big data. Photo: U of Washington

“We are trying to centralize these data-scientist roles at universities and give them the prestige and autonomy they would receive in similar industry jobs,” Howe said. “This could ultimately attract more early career researchers and practitioners to the field.”

The eScience Institute also has established a new postdoctoral fellow program to explicitly identify and reward young researchers who operate at the intersection of their own domain and data science. By building a community of these rising stars and helping to position them for prestigious faculty positions, UW eScience aims to promote a model of interdisciplinary data-intensive science as the norm rather than the exception, Howe added.

The UW’s presenters will talk about their early successes in bringing data science to campus, including:

A new data science studio: A physical space on campus open to anyone who needs help with big data or wants to exchange ideas and techniques for working with large datasets. The UW’s studio opened in January and has been busy, Howe said, citing the in-person “water-cooler effect” as important for the collaborations that are happening.
A side-by-side training program: Research labs from across campus send one person to work side by side with data scientists two days a week for the academic quarter. The goal is to train researchers to tackle their big-data projects, then bring those skills back to their respective labs. The studio also hosts more informal office hours for researchers to ask for guidance on smaller projects.
A data science seminar series: Brings together thought leaders from universities and industry to talk about topics related to data analysis, visualization and applications to other fields.
A new doctoral track in big data: Graduate students in a number of participating departments take courses and focus a portion of their research on methods in data-intensive science.

The symposium’s speakers are:

Ed Lazowska, University of Washington

Cecilia Aragon, University of Washington

Joshua Bloom, University of California, Berkeley

Juliana Freire, New York University

Fernando Perez, University of California, Berkeley

Bill Howe, University of Washington

###

For more information, contact Howe at billhowe@uw.edu.

Grant will support interdisciplinary, data-intensive research at UW

Michelle Ma — Tue, 12 Nov 2013 19:01:29 +0000

Researchers across the University of Washington campus soon will be able to collaborate in an unprecedented way with a new team of data scientists to advance data-intensive research.

The UW, along with the University of California, Berkeley, and New York University, is a partner in a new five-year, $37.8 million grant from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation that aims to accelerate the growth of data-intensive discovery across many fields.

“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data,” said UW project lead Ed Lazowska, a professor of computer science and engineering and director of the eScience Institute.

“To remain at the forefront, the UW must be a leader in advancing the methodologies of data science and putting them to work in the broadest imaginable range of fields.”

The new initiative was announced Tuesday (Nov. 12) in a featured talk at an event highlighting public-private partnerships that support big data research.

The UW’s award with UC Berkeley and NYU builds upon existing investments in the eScience Institute – created in 2008 to focus on data-intensive discovery across campus – and in an older campus effort, now almost 15 years old. More than a dozen faculty members are working to implement the initiative at the UW.

UW faculty members are seen during a proposal working session last summer. Photo: University of Washington

At the UW, the grant will mainly fund salaries for new research positions, including five data scientists who specialize in software and will work with researchers across campus, four postdoctoral data science fellows pursuing interdisciplinary research and four partially funded research scientists stationed in other departments and centers. A dedicated “data science studio” on campus will have meeting areas and drop-in workspaces to encourage collaboration across the UW’s colleges and schools.

These new resources will allow faculty members to submit short-term project proposals that require data science expertise, which could include analyzing a large dataset, accessing cloud resources or scaling up a statistical method, said Bill Howe, co-lead of the new effort and a UW affiliate assistant professor of computer science and engineering. A social scientist could, for example, learn how to mine data from social media channels to help with a research project. Or a geographer might want to know how weather data affect a landscape in real time.

Faculty participants in the program would send a graduate student or research staff member to physically relocate for a period to work directly with the data scientists. The idea behind this embedded approach is to learn techniques, collaborate and then bring that knowledge back to individual labs and departments.

“We see enormous potential in the cross-pollination that happens by having participants co-locate in the data science studio,” Howe said. “These projects will help expose common problems and enable collaboration as we continue to scale up our investment in data science expertise.”

The UW also has received a $2.8 million Integrative Graduate Education and Research Traineeship grant from the National Science Foundation. Together, the two grants will fund several dozen graduate students from a variety of departments to learn how to tackle big data in their research fields. The need to analyze vast amounts of data now touches nearly every department and discipline, and both grants will boost the university’s ability to prepare students.

Faculty members see this initiative as advancing the capacity for data-intensive scientific research and boosting Seattle’s leadership in data science, while attracting more top talent back to universities at a time when big data is more pervasive than ever before.

“These data scientists are coveted in industry as well as in academia,” Howe said. “One of the missions we have in this effort is to provide competitive career paths and roles that allow these experts the freedom to apply their skills to the most important problems in science.”

###

For more information, contact Lazowska at lazowska@cs.washington.edu or 206-543-4755 and Howe at billhowe@cs.washington.edu or 206-221-9261.