Cutting-Edge Tools Used in Mining Big Data Analytics Architecture

Predictive modeling, machine learning and other advanced analytics applications help uncover the business value hidden in big data systems, but for many users, doing so requires a lot of tools and effort.

Before it deployed a Hadoop cluster five years ago, retailer Macy’s Inc. had big problems analyzing all of the sales and marketing data its systems were generating. And the problems were only getting bigger as Macy’s pushed aggressively to increase its online business, further ramping up the data volumes it was hoping to explore.

The organization’s traditional data warehouse architecture had severe processing limitations and couldn’t handle unstructured data, such as text. Historical data was also largely inaccessible, typically having been archived on tapes that were shipped to off-site storage facilities.

Data scientists and other analysts “could only run so many queries at particular times of the day,” said Seetha Chakrapany, director of marketing analytics and customer relationship management (CRM) systems at Macy’s. “They were pretty much shackled. They couldn’t do their jobs.”

The Hadoop system has alleviated the situation, providing a big data analytics architecture that also supports basic business intelligence (BI) and reporting processes.

Going forward, the cluster “could truly be an enterprise data analytics platform” for Macy’s, Chakrapany said. Already, along with the data analytics teams using it, thousands of business users in marketing, merchandising, product management and other departments are accessing hundreds of BI dashboards that are fed to them by the system.

But there’s a lot more to the Macy’s big data environment than the Hadoop cluster alone. At the front end, for example, Macy’s has deployed a variety of analytics tools to meet different application needs. For statistical analysis, the Cincinnati-based retailer uses SAS and Microsoft’s R Server, which is based on the open source R statistical programming language.

Several other tools provide predictive analytics, data mining and machine learning capabilities.

That includes H2O, Salford Predictive Modeler, the Apache Mahout open source machine learning platform and KXEN, the latter an analytics technology that SAP bought three years ago and has since folded into its SAP BusinessObjects Predictive Analytics software.

Also in the picture at Macy’s are Tableau Software’s data visualization tools and AtScale’s BI on Hadoop technology.

A Superior Approach To Analyzing Big Data

All the different tools are key elements in making effective use of the big data analytics architecture, Chakrapany said in a presentation and follow-up interview at Hadoop Summit 2016 in San Jose, Calif. Automating the advanced analytics process through statistical routines and machine learning is a must, he noted.

“We’re constantly in a state of experimentation. And because of the volume of data, there’s just no humanly possible way to analyze it [manually],” Chakrapany said. “So, we apply all the statistical algorithms to help us see what’s happening with the business.” That includes analysis of customer, order, product and marketing data, plus clickstream activity records captured from the Macys.com website.

Similar scenarios are increasingly playing out at other organizations, too. As big data platforms such as Hadoop, NoSQL databases and the Spark processing engine become more widely adopted, the number of companies deploying advanced analytics tools that can help them take advantage of the data flowing into those systems is also on the rise.

In an ongoing survey on the use of BI and analytics software conducted by SearchBusinessAnalytics owner TechTarget Inc., 26.7% of some 7,000 respondents, as of November 2016, said their organizations had installed predictive analytics tools.

Looking ahead, predictive analytics also topped the list of technologies for planned investments over the next 12 months. It was cited by 39.5% of the respondents, putting it above data visualization, self-service BI and enterprise reporting, all of which are more mainstream BI technologies.

A TDWI survey conducted in the second half of 2015 also found increasing plans to use predictive analytics software to bolster business operations. In that case, 87% of 309 BI, analytics and data management professionals said their organizations were already active users of the technology or expected to implement it within three years.

Other forms of advanced analytics — what-if simulations and prescriptive analytics, for example — are similarly in line for increased usage, according to a report on the survey, which was published last December.

Data Algorithm Tools Used To Find Meaning In Data Sets

Machine learning tools and other types of artificial intelligence technologies, deep learning and cognitive computing among them, are also getting increased attention from technology users and vendors as analytics teams look to automated algorithms to help them make sense of data sets that are getting bigger and bigger.

Progressive Casualty Insurance Co. is another company that’s already there. The Mayfield Village, Ohio-based insurer uses a Hadoop cluster partly to power its Snapshot program, which awards policy discounts to safe drivers based on operational data collected from their vehicles through a device that plugs into the on-board diagnostics port.

The cluster is based on the Hortonworks distribution of Hadoop, as is the one at Macy’s. About 60 compute nodes are dedicated to the Snapshot initiative, and Progressive’s big data analytics architecture includes tools such as SAS, R and H2O, which the company’s data scientists use to analyze the driving data processed in the Hadoop system.

The data scientists run predictive algorithms backed up by heavy-duty data visualizations to help score customers participating in the program on their driving safety. They also look for bad driving habits and possible mechanical problems in vehicles, such as alternator issues signaled by abnormal voltage fluctuations captured as part of the incoming data.
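
To make the idea of spotting abnormal voltage fluctuations concrete, here is a minimal Python sketch. It is not Progressive's method, which the article does not describe. It assumes a simple trailing-window z-score rule, and the function name, window size, threshold and sample readings are all hypothetical values chosen for illustration.

```python
from statistics import mean, stdev


def flag_voltage_anomalies(readings, window=5, threshold=3.0, min_sigma=0.05):
    """Return the indices of readings that deviate sharply from recent history.

    Hypothetical rule (an assumption, not taken from the article): a reading
    is flagged when it lies more than `threshold` standard deviations from
    the mean of the `window` readings immediately before it.
    """
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        # Floor the spread so a perfectly flat history cannot divide by zero
        sigma = max(stdev(history), min_sigma)
        if abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up voltage samples with one sudden drop at index 6
sample = [14.1, 14.2, 14.1, 14.0, 14.2, 14.1, 11.5, 14.1, 14.2, 14.0, 14.1, 14.2]
print(flag_voltage_anomalies(sample))  # [6]
```

A trailing window keeps the check cheap enough to run over streaming data, but a single outlier inflates the spread of the next few windows, so a production system would likely use something more robust.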

The predictive analytics and machine learning capabilities are “enormous,” said Pawan Divakarla, Progressive’s data and analytics business leader. “You have so much data, and you have fancier and fancier models [for analyzing it]. You need something to assist you, to see what works.”

Going Further On Big Data Analytics

Yahoo was the first production user of Hadoop in 2006, when the technology’s co-creator, Doug Cutting, was working at the web search and internet services company, and it claims to be the largest Hadoop user today.

Yahoo’s big data analytics architecture includes more than 40,000 nodes running 300-plus applications across 40 clusters that mix Hadoop with its companion Apache HBase database, the Apache Storm real-time processing engine and other big data technologies. But the Sunnyvale, Calif., company’s use of those technologies continues to expand into new areas.

“Even after 10 years, we’re still uncovering benefits,” said Andy Feng, vice president in charge of Yahoo’s big data and machine learning architecture. Feng estimated that, over the past three years, he has spent about 95% of his time at work focusing on machine learning tools and applications.

In the past, the automated algorithms that could be built and run with existing machine learning technologies “weren’t capable of leveraging huge data sets on Hadoop clusters,” Feng said. “The accuracy wasn’t that good.”

“We always did machine learning, but we did it in a constrained fashion, so the results were limited,” added Sumeet Singh, senior director of product development for cloud and big data platforms at Yahoo.

However, he and Feng said things have changed for the better in recent years, and in a big way. “We’ve seen an amazing resurgence in artificial intelligence and machine learning, and one of the reasons is all the data,” Singh noted.

For example, Yahoo is now running a machine learning algorithm that uses a semantic analysis process to better match paid ads on search results pages to the search terms entered by web users; it has led to a 9% increase in revenue per search, according to Feng.

Another machine learning application lets users of Yahoo’s Flickr online photo and video service organize images based on their visual content instead of the date on which they were taken. The algorithm can also flag photos as not suitable for viewing at work to help users avoid potentially embarrassing situations in the office, Feng said.

These new applications were made possible partly through the addition of graphics processing units to Hadoop cluster nodes — Feng said the GPUs do image processing that conventional CPUs can’t handle. Yahoo also added Spark to the big data analytics architecture to take over some of the processing work.

In addition, it deployed MLlib, Spark’s built-in library of machine learning algorithms. However, those algorithms turned out to be too basic, Singh said. That prompted the big data team to develop CaffeOnSpark, a library of deep learning algorithms that Yahoo has made available as an open source technology on the GitHub website.