In a briefing on Monday, research leaders across tech, academia and the government joined the White House to announce an open data set full of scientific literature on the novel coronavirus. The COVID-19 Open Research Dataset, known as CORD-19, will also add relevant new research moving forward, compiling it into one centralized hub. The new data set is machine readable, making it easily parsed for machine learning purposes — a key advantage according to researchers involved in the ambitious project.
In a press conference, U.S. CTO Michael Kratsios called the new data set the “most extensive collection of machine readable coronavirus literature to date.” Kratsios characterized the project as a “call to action” for the AI community, which can employ machine learning techniques to surface unique insights in the body of data. To come up with guidance for researchers combing through the data, the National Academies of Sciences, Engineering, and Medicine collaborated with the World Health Organization to come up with “high priority” questions about the coronavirus related to genetics, incubation, treatment, symptoms and prevention.
The partnership, announced today by the White House Office of Science and Technology Policy, brings together the Chan Zuckerberg Initiative, Microsoft Research, the Allen Institute for Artificial Intelligence, the National Institutes of Health’s National Library of Medicine, Georgetown University’s Center for Security and Emerging Technology, Cold Spring Harbor Laboratory and the Kaggle AI platform, owned by Google.
The database brings together nearly 30,000 scientific articles about the virus known as SARS-CoV-2. as well as related viruses in the broader coronavirus group. Around half of those articles make the full text available. Critically, the database will include pre-publication research from resources like medRxiv and bioRxiv, open access archives for pre-print health sciences and biology research.
“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” Chan Zuckerberg Initiative Head of Science Cori Bargmann said of the project.
The Chan Zuckerberg Initiative hopes that the global machine learning community will be able to help the science community connect the dots on some of the enduring mysteries about the novel coronavirus as scientists pursue knowledge around prevention, treatment and a vaccine.
The CORD-19 data set announcement is certain to roll out more smoothly than the White House’s last attempt at a coronavirus-related partnership with the tech industry. The White House came under criticism last week for President Trump’s announcement that Google would build a dedicated website for COVID-19 screening. In fact, the site was in development by Verily, Alphabet’s life science research group, and intended to serve California residents, beginning with San Mateo and Santa Clara County. (Alphabet is the parent company of Google.)
The site, now live, offers risk screening through an online questionnaire to direct high-risk individuals toward local mobile testing sites. At this time, the project has no plans for a nationwide rollout.
Google later clarified that the company is undertaking its own efforts to bring crucial COVID-19 information to users across its products, but that may have become conflated with Verily’s much more limited screening site rollout. On Twitter, Google’s comms team noted that Google is indeed working with the government on a website, but not one intended to screen potential COVID-19 patients or refer them to local testing sites.
In a partial clarification over the weekend, Vice President Pence, one of the Trump administration’s designated point people on the pandemic, indicated that the White House is “working with Google” but also “working with many other tech companies.” It’s not clear if that means a central site will indeed launch soon out of a White House collaboration with Silicon Valley, but Pence hinted that might be the case. If that centralized site will handle screening and testing location referral is not clear.
“Our best estimate… is that some point early in the week we will have a website that goes up,” Pence said.