The work of data scientists is much admired but often misunderstood. High-profile data leaks and questionable outcomes of machine learning have affected public perception, but data and those working with it are only going to become more important. BDJ spoke to Karen Hao, Massachusetts Institute of Technology engineer and contributor to Quartz, on the challenges facing data scientists like herself.
When you think of a ‘tech company’, major players like Apple and Facebook immediately spring to mind. Both invest heavily in the development of emerging technologies for their infrastructure and products. But they are not alone in occupying the tech sphere. Today, companies across every industry dedicate a significant amount of their budget to digital development.
One of the key drivers has been the vast amount of data generated by businesses in their daily operations, particularly with regard to their customers. Booking a restaurant through an online reservation platform gives that company all kinds of data to analyse - ranging from bounce rates on their web page to click-through rates in subsequent newsletter emails.
And this is just one example of a company that generates data. The data that companies generate, and what they want to do with it, may vary, but all of it still needs to be processed and interpreted.
Data scientists are in huge demand and can command handsome salaries. However, high demand does not exempt their profession from the challenges of today’s marketplace.
In High Demand, yet Misunderstood
Despite being an asset to an organisation, the work of data scientists is largely misunderstood. “I think people outside the field have a loose understanding of data science, but can't say precisely what kinds of problems data scientists can work on,” Hao says.
This is not entirely unexpected, as the technical nature and flexibility of the role are rarely explained to a non-technical audience. Developers who work on the coding of individual platforms are often more visible to the consumer market, as they are the ones building the tangible product the user base consumes.
And the confusion extends to the problem of misinformation surrounding the harvesting of user data. “I think people get confused about what data a typical website collects,” says Hao.
However, the lack of understanding goes beyond the consumer sphere. “I think even people within the field can struggle to articulate the boundaries of data science,” Hao adds. This becomes a particular problem when data scientists have to work across departments within a company and explain their work to colleagues there, as well as to shareholders.
Shareholders, in particular, are unlikely to want to hear about the underlying intricacies of the projects, but rather the larger findings from the data, and how this can be actioned into the strategy of the company going forward. One of the challenges faced by a data scientist is condensing complex insights into manageable information for these key people, and being able to justify how what they do meshes with their needs.
Commodification of Data
It is here that perhaps one of the most important challenges for data scientists becomes fully clear. The potential monetary value of user data is immense, and the aforementioned department heads and shareholders are acutely aware of this.
“I think the challenge facing the field – though not necessarily the people in the field – is the commodification of data science,” says Hao. “There are tons of boot camps now at which people can learn data science and pivot their career paths. I attended one myself. The problem is, many of these boot camps focus on the practice of data science without going into the theory.”
This could be seen as a natural byproduct of the increased demand for data scientists in recent years. As with any profession or skill set, if the demand is high, there will be an abundance of training resources to cater to that demand. Not all of these will be of sufficient quality, however, and some offer inferior qualifications.
The Importance of Data Science in Machine Learning
The danger of this sub-par training is especially concerning when connected to the emergence of machine learning. Companies are pouring vast resources into its development, and data scientists are an integral part of this.
The potential applications of machine learning are already being felt in fields including healthcare, finance and even the criminal justice system. Highly scrutinised development at the foundations of the technology is paramount.
It has been well publicised already how bias is able to creep into machine learning in a variety of ways. Intentional corruption of data sources is an easily identifiable source of bias, but it can also occur through completely unintended and often misunderstood prejudices in the groups that produce, collect and process the data.
An example of this arises in ‘word embedding’, a technique whereby machine learning algorithms learn which words tend to appear together in data sets and, in turn, begin to associate these words with each other. Apple was widely criticised when users noticed that iOS autocorrect offered the male businessman emoji by default when the word ‘CEO’ was typed. Although Apple closely guards its machine learning algorithms, it has been widely theorised that this gender-based assumption for CEO was due to the biased datasets used in the auto-complete algorithm – ones in which CEOs were primarily referred to as male.
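To make the mechanism concrete, here is a minimal sketch of how skewed text can produce a skewed association. It does not reproduce Apple's system or a real embedding model; it only counts how often words share a sentence, which is the kind of signal embedding models learn from. The corpus and the function name `cooccurrence_counts` are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus, deliberately skewed so that "ceo"
# appears alongside "he" more often than alongside "she".
corpus = [
    "he is the ceo of the company",
    "the ceo said he would resign",
    "he became ceo last year",
    "she is the ceo of the startup",
    "she is a nurse at the hospital",
    "the nurse said she would help",
]

def cooccurrence_counts(sentences):
    """Count the number of sentences in which each ordered word pair appears together."""
    counts = Counter()
    for sentence in sentences:
        unique_words = set(sentence.lower().split())
        for first, second in permutations(unique_words, 2):
            counts[(first, second)] += 1
    return counts

counts = cooccurrence_counts(corpus)
ceo_with_he = counts[("ceo", "he")]
ceo_with_she = counts[("ceo", "she")]

# Share of gendered co-occurrences with "ceo" that involve "he".
he_share = ceo_with_he / (ceo_with_he + ceo_with_she)
print(f"'ceo' with 'he': {ceo_with_he}, with 'she': {ceo_with_she}, share for 'he': {he_share:.2f}")
```

Nothing in the code mentions gender as a rule. The imbalance comes entirely from the sentences it was given, which is why auditing the data matters as much as auditing the algorithm.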
Such bias is often not deliberate, but merely reflective of established societal patterns. Data scientists need to be aware of this, nonetheless, and be trained not only in the skillset to process and spot data patterns, but also in the ethical considerations involved in these projects.
“We are just beginning to understand how much poorly implemented machine learning can impact society, so it's concerning to think that data scientists entering the field might not be adequately prepared to tackle those challenges,” says Hao.
How Do Data Scientists Address These Problems?
Data scientists are able to take advantage of the fact they are in high demand, and they therefore have a wide selection of industries to move into. This is an attractive perk of the profession, but it also gives data scientists an opportunity to address the aforementioned machine learning issue found in their field.
“The act of implementing a machine learning model probably doesn't differ much from industry to industry,” says Hao. “But if you want to be responsible in your practice of data science, you should have knowledge of the industry you are applying it to and be thoughtful about the trade-offs you're making in your models.”
Public Perception of Data Use
Major news stories about hacks and mishandling of user data have undoubtedly affected public perception of how the private sector, in particular, treats its databanks. The Cambridge Analytica scandal is perhaps the most prominent of these in recent years.
Because of the increasing diversification of businesses, and the partner relationships that exist across industries, user data leaks leave more and more sensitive information exposed. These can include email addresses, physical addresses, and credit card information.
The EU's General Data Protection Regulation (GDPR) has been rolled out in an attempt to protect user data, but its emergence has generated a whole new set of challenges for data scientists.
“I’m a data scientist working in the media and GDPR eliminates data on a whole swath of our readership, which prevents us serving that population better,” Hao says. “Unlike social-media platforms, we don't see any identifiable information, so we can't triangulate specific readers.
“What we track is aggregate behaviour – the total number of page views we've received in a month, the average depth that readers scroll in an article, the average number of site visits per user, and so on. We track these things to guide our product and editorial decisions. Did adding that feature make our website easier or harder to use? What topics do readers like reading the most? Without data on our EU readers, we can no longer factor their behaviours into our decisions.”
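The aggregate measures Hao describes can be sketched in a few lines. The example below is an illustration only, not Quartz's actual pipeline: the record fields, the anonymous `visitor` key and the `region` label are all assumptions made for the sketch. It also shows how the figures shift once EU records are excluded.

```python
from statistics import mean

# Hypothetical visit records. "visitor" is an assumed anonymous key,
# not identifiable information; scroll depth is a percentage of the article.
visits = [
    {"visitor": "a", "page_views": 3, "scroll_pct": 80, "region": "US"},
    {"visitor": "a", "page_views": 1, "scroll_pct": 40, "region": "US"},
    {"visitor": "b", "page_views": 2, "scroll_pct": 60, "region": "EU"},
    {"visitor": "c", "page_views": 4, "scroll_pct": 20, "region": "EU"},
]

def aggregate(records):
    """Summarise visits without reporting on any individual reader."""
    visitors = {record["visitor"] for record in records}
    return {
        "total_page_views": sum(record["page_views"] for record in records),
        "avg_scroll_pct": mean(record["scroll_pct"] for record in records),
        "visits_per_visitor": len(records) / len(visitors),
    }

all_readers = aggregate(visits)
# The same summary when EU records are unavailable.
non_eu_readers = aggregate([r for r in visits if r["region"] != "EU"])

print("All readers:", all_readers)
print("Without EU readers:", non_eu_readers)
```

In this toy data the average scroll depth rises once the EU records are dropped, so a decision based on the remaining figures would reflect only part of the readership.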
This loops back to perhaps the primary hurdle that data scientists face: a lack of knowledge among the public, and maybe even legislators, about their jobs. The perception of nefarious practices surrounding the analysis of user data is being fuelled by data scandals such as the Facebook/Cambridge Analytica controversy, when these are actually extraordinary cases of systemic failings. In reality, data scientists are perfectly placed to improve UX for consumers, and scalability within a business.
The Diverse Nature of Data
Due to the diverse nature of data science, those in the profession are not limited to traditional corporate ventures. Hao gives examples of data science projects that have impressed her: “I really like companies that are working with cities to tackle urban problems, like Microsoft with its CityNext program and Alibaba with its City Brain project.”
Publicising projects such as these, which show data science to be a powerful, versatile area of the digital economy, could go a long way towards clearing up the confusion and suspicion dogging the field.
Despite the growing public focus on corporate data use, even the largest companies in the world are still struggling to get it right. In October, hackers exploited a gap in Facebook’s cybersecurity to gain access to the personal information of 50 million user accounts, the biggest data breach in the company’s history. The breach also gave the hackers access to the apps that users accessed through their Facebook accounts, such as Tinder, Instagram and Spotify.
Facebook has fixed the bug and apologised, but the leak will do nothing to help the company’s reputation. Technically, the breach could be in violation of GDPR, if Facebook is found to have not taken adequate steps to protect its users’ data, making it potentially liable to a $1.63 billion fine. If it came to a fine, it would be the first major example of GDPR legislation punishing large businesses.
Illustrations by Kseniya Forbender
To contact the editor responsible for this story:
Margarita Khartanovich at [email protected]