How regulatory pressure is reshaping big data as we know it
Interpretability and explainability
The issue of interpretability, and by extension, explainability, is the foremost challenge to the statistical variety of AI popular today. It’s also the most demonstrable means by which the intensified regulatory environment can constrict big data deployments, especially in verticals mandating model risk management such as financial services and insurance. According to David Wallace, global financial services marketing manager at SAS, the federally implemented SR 11-7 requires financial organizations “to show that they are managing and governing the use of the models, the creation of the models, [and] what data is used to feed the model.” Paradoxically, the cognitive statistical models with the greatest accuracy—involving deep learning and multifaceted, deep neural nets—are also the most difficult to explain, or “to know what’s going on inside their model,” said Ilknur Kabul, SAS senior manager of AI and machine learning research and development. For predictive models, interpretability involves understanding what input came in, what was weighted, how it was weighted, and how that affected the output, said Mary Beth Moore, SAS AI and language analytics strategist.
It’s possible to bridge the divide between transparent machine learning models and more accurate, yet opaque, advanced ones with the following interpretability and explainability measures:
♦ Leave one covariate out: This technique uses trial and error to exclude single variables, one at a time, from complex machine learning models and analyze the effect on their scores. Ideally, notable scoring differences reveal the nature and importance of the excluded variable. “Machine learning tells you something, but if you don’t know the context around it, that number doesn’t mean anything,” stated Jans Aasman, CEO of Franz.
♦ Surrogate modeling: By training a simpler model on the inputs and predictions of a complex model, data scientists can analyze the importance of variables, trends, and coefficients in the surrogate to infer how they affected the original. Local Interpretable Model-Agnostic Explanations (LIME) is one such surrogate technique. “You generate a regularized regression model in a local [region]: You can generate it in a transformed space or an untransformed space, and you try to generate explanations using that,” Kabul said.
♦ Partial dependence plots: This approach shows the relationship between one or two selected model inputs and the model’s prediction while averaging out the effects of all the other input variables, Kabul explained. By plotting the average model prediction across the values of each selected variable, modelers determine that variable’s influence on the model’s results.
♦ Individual conditional expectation (ICE): ICE plots are similar to partial dependence plots but offer detailed, per-observation drilldowns and “help us to find the interactions and interesting [facets] of the dataset,” Kabul mentioned. With ICE, modelers replicate the observation they’re seeking to explain once for each unique value of a variable, score each copy, and plot those results.
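The leave-one-covariate-out idea can be sketched in a few lines. The synthetic data, least-squares model, and R²-style score below are illustrative assumptions, not anything specific to the tools discussed above; the point is only that dropping an important variable produces a notable score difference while dropping an irrelevant one does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

def fit_and_score(X, y):
    """Least-squares fit; return an R^2-style score on the training data."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

baseline = fit_and_score(X, y)

# Leave one covariate out: refit without each column in turn and
# compare scores; a large drop marks an important variable.
for j in range(X.shape[1]):
    score = fit_and_score(np.delete(X, j, axis=1), y)
    print(f"column {j}: score drop = {baseline - score:.3f}")
```

In this setup the score collapses when column 0 is excluded and barely moves when column 2 is, which is exactly the contextual signal the technique is after.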
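Surrogate modeling can likewise be sketched without any particular toolkit. Here a nonlinear function stands in for the opaque model, and a plain linear regression is fit to that model’s own predictions; the surrogate’s coefficients then give a rough, global reading of each input’s importance to the original. (LIME differs in fitting its regularized surrogate locally, around one prediction at a time.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 3))

# Stand-in for an opaque "black box" model: output depends strongly on
# feature 0, mildly on feature 1, and not at all on feature 2.
def black_box(X):
    return np.tanh(2.0 * X[:, 0]) + 0.3 * X[:, 1] ** 3

# Surrogate modeling: fit a simple linear model to the black box's
# own predictions, then read the surrogate's coefficients as a rough
# measure of each input's influence on the original model.
preds = black_box(X)
A = np.column_stack([X, np.ones(len(X))])   # add an intercept term
coef, *_ = np.linalg.lstsq(A, preds, rcond=None)
for j, c in enumerate(coef[:-1]):
    print(f"feature {j}: surrogate coefficient = {c:+.3f}")
```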
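The relationship between ICE and partial dependence described above can be made concrete: each ICE curve holds one observation fixed and sweeps a chosen variable across a grid, and the partial dependence curve is simply the average of those per-observation curves. The prediction function here is a hypothetical stand-in for a fitted model.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))

# Stand-in for a fitted model's prediction function.
def predict(X):
    return X[:, 0] ** 2 + 0.5 * X[:, 1]

grid = np.linspace(-1, 1, 21)

# ICE: one curve per observation -- replicate the row once per grid
# value of feature 0, score each copy, and keep the resulting curve.
ice = np.empty((len(X), len(grid)))
for i, row in enumerate(X):
    copies = np.tile(row, (len(grid), 1))
    copies[:, 0] = grid
    ice[i] = predict(copies)

# Partial dependence: the average of the ICE curves at each grid value.
pdp = ice.mean(axis=0)
print(pdp.round(2))
```

Plotting the individual rows of `ice` would expose interactions that the averaged `pdp` curve smooths away, which is the drilldown Kabul describes.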
Interpretability and explainability are horizontal concerns, limiting the use of deep learning in certain verticals such as healthcare, according to Aasman. Still, if an organization is in a highly regulated industry and transparency is important, it can simply select a different algorithm, Moore advised.
PII and privacy
Privacy complications for big data stem not only from the tide of regulations but also from the diverse classifications that compliance involves. Although PII is the most ubiquitous classification, others include payment card information, federal tax information, classifications tied to the National Institute of Standards and Technology’s various standards, and vertical classifications such as protected health information. Often, satisfying the classifications for one regulation does not satisfy them for others (PII under GDPR versus Arizona’s or California’s standards, for example). A classification “may not even be at a data element level; it can be at a group level,” in which certain data combinations subject the data to regulatory requirements not applicable to any individual data element, observed David Marco, president and CEO of EWSolutions. The situation is worse for global organizations because they “operate across multiple [regulatory] jurisdictions and are not dealing with one regulation, but tons of different regulations,” said Swamy Viswanathan, CPO of ASG.
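Marco’s group-level point can be illustrated with a small rule-based classifier: no single field below is sensitive on its own, but certain combinations of fields are treated as PII. The field names and rules are hypothetical illustrations, not drawn from any regulation discussed here.

```python
# Group-level data classification sketch. Each rule is a set of field
# names whose combined presence makes a dataset PII, even though no
# single field triggers the classification by itself.
# (Field names and rules are hypothetical.)
QUASI_IDENTIFIER_RULES = [
    {"last_name", "birth_date", "zip_code"},
    {"last_name", "ssn_last4"},
]

def classify(columns):
    """Return 'PII' if the set of columns matches any group-level rule."""
    present = set(columns)
    for rule in QUASI_IDENTIFIER_RULES:
        if rule <= present:   # every field in the rule is present
            return "PII"
    return "unclassified"

print(classify(["zip_code"]))                             # prints "unclassified"
print(classify(["last_name", "birth_date", "zip_code"]))  # prints "PII"
```

In practice such rules would live in a data catalog and be evaluated during tagging, which is why accurate classification and cataloging come up again below as the basis for compliance.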
The object of this unprecedented regulatory pressure is trust—which is bifurcated to reflect ethics for consumers and pragmatism for organizations. “You can’t [overlook] the complications for the questions of ethics and transparency,” noted Emad Taliep, MassMutual data analyst. This is especially true for decisions such as pricing the premium on a policy issued to somebody based on health factors, Taliep said. Considering regulations and personal data in this regard fosters consumer trust in organizations. Conversely, the basis for complying with regulations and facilitating consumer confidence is the ability to trust one’s own data through accurate classification, tagging, and cataloging. “We have to make that data reliable. We have to make sure that there’s some curation on top of it, that the data makes sense, the metrics are common; that it’s not a Tower of Babel,” said Frank Bien, CEO of Looker.