Where Data Lives

Friday 2 February 2018

Scientific Analysts Need Robust Big Data Architectures To Solve The Challenges Of Mining And Analyzing Genomic Data


Scientific analysts require robust big data architectures to meet the challenges of mining and analyzing genomic data, and some say the Apache Spark processing engine is well suited to the job.
When you think of businesses that face enormous big data analytics challenges, web companies like Facebook, Netflix and Google usually come to mind. You might also point to online retailers, which have access to huge stores of clickstream and customer data. Scientific research labs doing genomic data analysis aren't in the public eye as much, but they're increasingly in the thick of things with big data.
Genomic data, meaning information on human or animal genomes and the DNA they contain, is swelling like a tidal wave. That's pushing researchers looking to mine and analyze all that data to think about new data architectures, and some are finding that the Apache Spark processing engine and other big data technologies are a good fit for their work.
The first human genome took about a decade to sequence at a cost of nearly $3 billion.
But as sequencing methods have improved, both the time and the cost of sequencing a genome have dropped dramatically.
Today, genomic data analysis is a growing focus of scientific research, with much of the work aimed at finding new ways to treat diseases.
Aided by such efforts, a number of treatments tailored to the specific genetic characteristics of patients are becoming available for medical conditions such as cancer, heart disease and diabetes.
But all of this genomics activity is creating a big data crunch. A 2015 research paper published in the journal PLOS Biology estimated that the amount of genomic data produced over the next 10 years would outpace the data volumes generated by astronomy-related organizations and by both YouTube and Twitter.
A Real Need For Big Data Analytics Speed
With so much data flooding in, “it’s going to require innovations in computing to maintain our current pace in biomedicine,” said Cotton Seed, a senior principal software engineer at Broad Institute, a collaborative research center in Cambridge, Mass., that was set up by MIT and Harvard in 2004.
For Seed, a lot of that innovation is happening in Spark. Speaking at Spark Summit East 2017 in Boston, he said he and his team built a genomic research platform on Spark that leverages the technology's SQL querying functionality and its library of machine learning algorithms to speed up the data mining and analytics process.
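Seed didn't walk through code on stage, and the sketch below is not Broad Institute's platform. It's a minimal, hypothetical Scala example of the pattern he described: invented per-sample variant summaries loaded into a Spark DataFrame, filtered with Spark SQL, then clustered with MLlib. Every column name and value here is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

object GenomicSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GenomicSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical per-sample variant summaries (not real data)
    val variants = Seq(
      ("s1", "chr1", 1042, 38.2),
      ("s2", "chr1", 987, 41.5),
      ("s3", "chr2", 1130, 36.9),
      ("s4", "chr2", 1205, 39.8)
    ).toDF("sampleId", "chrom", "snpCount", "avgQuality")

    // SQL front end: filter out low-quality samples
    variants.createOrReplaceTempView("variants")
    val highQuality = spark.sql(
      "SELECT sampleId, snpCount, avgQuality FROM variants WHERE avgQuality > 37")

    // MLlib front end: cluster the remaining samples on two features
    val features = new VectorAssembler()
      .setInputCols(Array("snpCount", "avgQuality"))
      .setOutputCol("features")
      .transform(highQuality)

    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.transform(features).select("sampleId", "prediction").show()

    spark.stop()
  }
}
```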
Broad Institute is currently working on projects to map out genetic traits that tend to be associated with certain types of cancer and the genetic makeup of microorganisms that live in the human body, among other initiatives.
Seed said Spark is useful in those efforts because it can connect to different data stores and lets analysts interact with it in different query languages — SQL, Python or Scala, whichever most closely fits their work. When they’re writing queries, “it’s important that [analysts] be able to ‘speak’ as close as possible to the languages of biology,” he said.
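To illustrate that multi-language point, here is the same hypothetical aggregation written two ways, first in SQL and then in the Scala DataFrame API. Spark compiles both to the same execution plan, so analysts can pick whichever reads most naturally to them. The snippet assumes a spark-shell session, and the table is invented.

```scala
// Paste into spark-shell, which provides `spark` and the implicits
import org.apache.spark.sql.functions.{count, lit}

// Invented variant summaries, as in the sketch above
val variants = Seq(("s1", "chr1", 38.2), ("s2", "chr1", 41.5), ("s3", "chr2", 36.9))
  .toDF("sampleId", "chrom", "avgQuality")
variants.createOrReplaceTempView("variants")

// 1) SQL, for analysts who think in queries
val bySql = spark.sql(
  "SELECT chrom, COUNT(*) AS n FROM variants WHERE avgQuality > 37 GROUP BY chrom")

// 2) The DataFrame API, yielding the same execution plan
val byApi = variants
  .filter($"avgQuality" > 37)
  .groupBy("chrom")
  .agg(count(lit(1)).as("n"))

bySql.show()
byApi.show()
```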
The speed at which Spark handles data volumes and its scalability also make the platform attractive for genomic data analysis and data mining uses, said Zhong Wang, a computational biologist and genomics researcher at Lawrence Berkeley National Laboratory in Berkeley, Calif., during another presentation at the Spark conference.
Wang heads a research team that studies the genetic-level interactions between microorganisms in the guts of animals.
The studies produce far too much data for anyone to mine and interpret manually in a spreadsheet, so the team uses Spark and machine learning algorithms to parse the data and identify meaningful correlations.
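The article doesn't describe Wang's actual pipeline, but the core of such a correlation screen is easy to sketch. In this hypothetical spark-shell snippet, with invented column names and values, Spark computes a Pearson correlation across a DataFrame in parallel, the kind of arithmetic that stops being workable in a spreadsheet once rows number in the billions.

```scala
// Hypothetical paired measurements; a real job would read billions of
// rows from distributed storage instead of a small local sequence
val samples = Seq((0.12, 3.4), (0.33, 4.1), (0.54, 5.0), (0.71, 5.8), (0.90, 6.3))
  .toDF("microbeAbundance", "hostMetabolite")

// Pearson correlation between the two columns, computed in parallel
val r = samples.stat.corr("microbeAbundance", "hostMetabolite")
println(f"Pearson r = $r%.3f")
```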
Spark Adds Processing Power
Prior to adopting Spark, Wang and his colleagues deployed a six-server Hadoop cluster in 2009 to run their data analyses, using the Apache Pig scripting platform for big data analysis. But processing times were slow, he said.
In addition, the analysts were trying to build graph-based algorithms, which weren't a good fit for a MapReduce-based programming environment like Pig.
A few years later, the team switched to Spark running against data stored in Amazon EMR, a cloud-based Hadoop distribution from Amazon Web Services that was formerly known as Elastic MapReduce. Wang said the Spark system has improved processing times, even as the amount of data moving through the platform continues to grow.
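To make the graph point concrete: algorithms like connected components are iterative, and each iteration in a MapReduce stack such as Pig pays a full write-to-disk round trip, while Spark keeps the working set in memory between steps. Below is a hypothetical GraphX sketch, not Wang's code, that links invented sequence reads by their overlaps and finds the connected groups.

```scala
// spark-shell snippet; `spark` is provided by the shell
import org.apache.spark.graphx.{Edge, Graph}

// Invented sequence reads as vertices, detected overlaps as edges
val vertices = spark.sparkContext.parallelize(Seq(
  (1L, "read_a"), (2L, "read_b"), (3L, "read_c"), (4L, "read_d")))
val edges = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "overlap"), Edge(2L, 3L, "overlap"))) // read_d overlaps nothing

// Connected components groups reads that chain together via overlaps;
// the iterative message passing stays in memory instead of hitting disk
val graph = Graph(vertices, edges)
graph.connectedComponents().vertices.collect().sorted
  .foreach { case (id, comp) => println(s"read $id -> component $comp") }
```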
Like Seed, Wang said the ability to write applications for Spark in a variety of fairly easy-to-learn languages is another plus.
It means data analysts like him can do most of the development work needed for genomic data analysis projects, rather than having to rely on data engineers or data scientists.
“I’m not trained as a computer scientist, but I can write Scala and Python Spark applications,” Wang said. “It’s not possible to hire an expensive engineer just to do this for us.”
