The Future of Big Data – and How Liberal Arts Fit In

“What is Big Data and how could it impact my career?” In my current work, I often get asked this question.  Here’s a quick primer on Big Data, its present and potential benefits, and why some of its most groundbreaking future users will possess a liberal arts education.

Big Data is not static – At the start of this decade, most large data sets were stored in organized data warehouses where older data was routinely deleted to make room for new data. With the recent advent of tools like Hadoop, data warehouses have become “data lakes,” large-scale repositories that hold vast amounts of raw data. With that transformation, the orderly has become disorderly, and new tools like Apache Hive and Pig have created a Wild West environment in which structured data (think Excel spreadsheets) and unstructured data (such as PDF files, photos and audio files) can be joined together in meaningful ways to find new insights and relevance.

Data is proliferating – Sensor and storage costs are declining so rapidly that sensors can be added to products just in case there is a future need. One startup I worked with added sensors to a consumer product even though they had no idea how or whether they would use the data generated. The cost was so low that adding the sensor was an easy decision. Sensors might be easy to install, but existing data is precious. Large companies are investing large sums in acquisitions simply to get data, and machine learning applied to the acquired data can reap many rewards.

Tension exists between privacy and utility – Data collection and analysis can result in more effective, customized products and services (consider the potential for personalized medicine). But nefarious use in the wrong hands is a frightening prospect. The NFL manages the tension by keeping a registry of injuries at a third-party provider. Players, the players union and the NFL all have an interest in the data, but those interests are not always aligned and access needs to be tailored. Geo-fencing data is a new trend in which GPS or RFID tags are used to define a geographical boundary for data; coupled with access control, that boundary helps the third party manage different interests.

Digital divide still exists – When I ask colleagues about the power of data to manage the complexity and lower the cost of healthcare, the short answer is “Not yet.” Until tools and apps are more widely available, systemic benefits will be slow to develop. Today, health apps available to consumers (patients) fall into three groups: adherence, engagement and remote monitoring. Opportunities exist to merge the benefits of all three types of apps into a platform that will transform healthcare from an episodic experience (when you are at the doctor’s office or in an emergency room) to a streaming experience focused on healthy lifestyle, prevention and early intervention. These benefits are costly and will initially accrue only to those who can pay.

Skills shortage and obsolescence are real – Everyone I speak with sees a growing need for data scientists and data-literate employees. Indeed, the employment landscape is changing so quickly that some employers’ best new hires come straight from school, where they have been exposed to the newest thinking. In an industry where the tools of the trade are no more than 18 months old, few employees have experience using them. One colleague noted that “Java and Bayesian stats are not new, but most of the other tools we use in our company did not exist three years ago.”

Data literacy alone is not enough – The liberal arts are highly valued even in a data-driven world. Employers I talk with hire for flexibility and critical-thinking skills because the field is changing so quickly. The consensus seems to be: “Math and science skills are easier to identify because you can test for them; problem-solving skills and risk taking are harder to identify.” In many STEM fields, students are used to problem solving by finding “the one” correct answer. In a commercial setting, however, employers are looking not for just one answer to a problem but also for three alternatives that could work, including the one that is optimal when considering other factors (policy, regulatory, financial, etc.). I recently mentored a computer science and history double major who had multiple opportunities because her educational experience suggested someone who was comfortable using both sides of her brain and would likely bring broader, more critical and more durable thinking to problems.

Project-based learning is crucial for applicants who want to distinguish themselves in the job search process. Anyone can say they know Python or Scala, but the applicant who can say “I used skill X on this project for this organization and created these results” will have a more compelling story. Find a project – size and scale do not matter as long as it has a beginning, middle and end – with conclusions you can blog about or talk about in an interview. In addition to producing demonstrable results, project-based learning hones collaboration and communication, soft skills that are valued in the marketplace. The smartest person is not necessarily the best employee; those with data fluency who are passionate about their work, can tell a story and collaborate are the most valuable.

Text analytics loom large – If I were studying computer science today at Williams, I would focus on text analysis. Analyzing unstructured text like Twitter feeds for sentiment is challenging (e.g., spotting double entendres and sarcasm) but has powerful potential for political campaign strategy and for understanding consumer opinions about products.
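
To make the sarcasm problem concrete, here is a minimal Python sketch of the simplest kind of sentiment scoring: counting positive and negative words. The word lists, the function name and the two sample sentences are hypothetical and invented for illustration; they do not come from any real product or data set, and production systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical, hand-picked word lists for illustration only.
POSITIVE = {"great", "love", "good", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "awful", "broken"}


def lexicon_sentiment(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


# Two invented example posts: one sincere, one sarcastic.
sincere = "I love this phone, the camera is excellent"
sarcastic = "Oh great, my phone died again. I just love that."

print(lexicon_sentiment(sincere))    # 2
print(lexicon_sentiment(sarcastic))  # 2
```

Both sentences receive the same positive score, even though a human reader immediately hears the second as a complaint. Word counting sees “great” and “love” and nothing else, which is why sarcasm and double entendres remain hard for automated text analysis.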

Shannon McKeen ’85 is an adjunct faculty member at the UNC Kenan-Flagler Business School and a consultant with the National Consortium for Data Science. He can be reached at [email protected].