Free-form text is one of the most challenging data sources for companies. You can’t ignore this growing data source’s ability to yield new understanding of your customers and give you an edge over competitors.
Let’s walk through an example of a high-profile bank whose publicly available consumer complaint data held a gold mine of information. We wonder whether this bank could have avoided the heavy fines if it had “listened” to its data.
After attending this webinar, you will be able to:
> Understand how to work with text mining data sources
> Apply advanced analytical methods to easily extract patterns
> Use your insights to develop actionable results
Just think, what opportunities could be hiding in your stores of untapped text?
Free-form text is one of the most challenging data sources. Call center transcripts, medical records, social media posts, survey responses: really, any free-form response field holds valuable insights for your business, but you need a way to extract those insights from the data. Our speaker today will show you how he used SAS Visual Text Analytics to extract a gold mine of information from a public database. Reid Baughman is a member of the Zencos data science team. He applies text mining to help clients gain insights into their most perplexing business issues.
Hi everyone. Thanks for joining. I’m really excited to show you some examples of text mining and how you could possibly apply it to your business.
My name is Jamie D’Agord and I will be your host today. But before we get started, I’d like to take care of some quick housekeeping while we wait for a few others to join. We just want to let you guys know that this is not a sales presentation. We’re just sharing what we’ve learned. We’re sharing our expertise and experience with you guys.
We’ve muted everybody so that there aren’t any distractions for our presenter. Please feel free to ask any questions throughout the presentation today, and Reid will answer them at the end. You’ll see arrows pointing to where you have the option to ask questions. We have a few polls for you guys today. Each poll will be displayed on your screen, and you’ll have about 30 seconds to answer it. Let’s go ahead and try one now.
Jaime D’Agord: 02:01
So the question is: how have you used text analytics in your organization? You have three choices: “no,” “sorta, but still learning,” and “yes, but want to go to the next level.”
We’ll give you guys a few more seconds to answer here and we’ll share the poll with you. So it looks like some of you have explored it and some of you have not. I’ll pass it back over to Reid now, who will take us through the presentation for today.
Wonderful. Thank you so much, Jamie. All right. So, we want to walk you through a concrete example of applied text mining. We’ll talk about some of the details of how it works, but mostly we want to walk you through an application of how text mining could have been used in the scenario we’ll talk about here. To briefly overview what we’re going to talk about today: we’ll first give you some background on the problem that we’re trying to solve, as well as the data we’ll be using to solve it. We’ll talk about data handling and parsing and the kind of unique way you do that with text mining. And then finally we’ll apply a couple of different techniques, one unsupervised and one supervised method, to be able to extract insights from our text data.
I’ll start off with some of the different use cases of text mining and how it can be useful. We’ve worked with a lot of financial institutions and banks that have large internal compliance divisions that have to monitor employee email or employee chat, as well as a variety of other types of text data. And so that’s one.
Another application for text mining: companies oftentimes will send out a survey asking how the company is doing, and all of the employees fill it out. Many times the surveys will include a free-form response at the bottom where employees get to say whatever they want. Text mining gives you the ability to go through thousands of those responses and quickly figure out what the trends are. Similarly with customer satisfaction, if you have a product you’re selling and you want to see what your stakeholders and customers are saying about you online, on Yelp or on Google reviews or a variety of other platforms, this is also a great way to aggregate that information and to distill it into common complaint topics.
On the other hand, if you want to look at your competitors products and see what people are saying about those, whether good or bad, this can also give you an insight and give you some competitive intelligence.
So, today we’ll be talking about a data set that came from the Consumer Financial Protection Bureau (CFPB), which is a regulatory agency. They hold a database where customers of various banks can come in and complain about different banking products. But first, before we get into that any more, let’s take a quick poll. I’ll turn it back over to Jamie for this one.
Alright, so our poll number two. Just a second here. [Now I am sharing the poll on the screen] Have you ever left an online review or provided feedback to a company in an online form? Your choices are yes or no.
[Display poll and waits 20 seconds] All right, I’ll give you guys just a few more seconds here and we’ll go ahead and end this poll. And it looks like everybody has had an opportunity to share their feedback in an online form. I’ll pass it back over to Reid.
Yeah, so the answer is overwhelmingly yes. So what we’re going to see next will be very familiar to many of you. The data we’re looking at here, and the problem we’re trying to look at, is from this CFPB database. Wells Fargo ended up getting a $185 million fine, and many of you have heard about this case. The reason they were fined was that many of their employees had been opening accounts without the authorization of the account holders. They were doing this to get bonuses and different perks that the company [offered]. What we wanted to try and do today was to pretend we were Wells Fargo pre-2016: if we were looking at this complaint data, could we have noticed this emerging pattern or emerging trend of people complaining about unauthorized account opening?
So that’s sort of the investigation we’re gonna do with our text mining analysis. Now, after this happened, Wells Fargo quickly went into damage control mode and has tried its best to do good by its customers and to fix things since then. But at the time, CFPB director Richard Cordray had this to say. And the point I wanted to really underline here was the importance of carefully monitoring anything your stakeholders are saying about you. If you’re not listening to your stakeholders, it may not cost you a $185 million fine, but it could cost you in the form of a missed opportunity or other types of misses. So, let’s now pivot over to the tool we’re gonna use, SAS Visual Text Analytics. The SAS Model Studio is a nice interface that allows you to do drag-and-drop analytics. It has these things called pipelines, and the pipelines allow you to connect a sequence of different nodes.
We’ll talk about all these nodes in sequence and how they relate to our project. The first one I’ll talk about is the data node and what our data looks like. The data we collected was about two years of complaint data from the CFPB. Over that time, there were 50,000 complaints logged about Wells Fargo, 5,000 of which contained a narrative. Now, this complaint narrative actually looks like this. This is a screenshot from the CFPB website, and this white box here is where the customer could go in and type whatever they wanted to about the complaint. Additionally, the banks were allowed to respond to the complaint, in some cases to sort of defend themselves or give some explanation as to whether or not they compensated the complainant. And so there’s a field that contains information on whether the complaint was closed with or without relief. So, the customer complaint narrative, that raw text field: we’re going to perform some unsupervised techniques on this field. And then later on we’ll use some supervised techniques on a combination of data from both the narrative and the company response. We’ll explain more later about what the difference is between unsupervised and supervised [text]. But for now, these are the two fields that we’ll be focusing on.
Okay. So let’s go ahead and do one more quick poll. Alright, so for poll number three, our question for you is: have you used any tools such as SAS Enterprise Miner or SAS Visual Text Analytics before? Your options are “no, completely green” or “yes, and I love them.”
We’ll give you guys just five more seconds here. All right. We’ll end this poll and we’ll go ahead and share the results with you.
Okay. Alright. It looks like there’s about a 50/50 split. So, for those who responded yes, some of you may have used SAS Enterprise Miner and some of you may have used SAS Visual Text Analytics. For those of you who haven’t used SAS to build text analytics or are still kind of new to this, let’s proceed and go through more of this analysis.
Okay. So the next thing we’re going to do is the text preparation, and we’re going to use the text parsing node for that. I’m gonna demonstrate it to you by using a few fictitious complaints that aren’t too far from what I actually saw in my data. So these are three little short snippets. The first thing you do when you’re cleaning your text data to get it ready is you drop out any words that don’t have any semantic value or any meaning to your analysis. So, you know, prepositions, pronouns, things like that typically aren’t useful. You want to get rid of those so that they don’t clutter your analysis. The next thing we do is something called stemming. In stemming, you want to take words that mean the same thing but are just different variations, and you want to boil them down to the root form so they can be considered the same for analysis.
So in this case, fee and fees represent the same root word. So we want to alter fees to be the same as fee, so they can be treated the same. And the last thing I want to point out here is that the way topic modeling works is by looking at whether certain terms appear together frequently throughout complaints. So in this case, overdraft and fee appear together in two of our three fictitious complaints here, or about 66%, so those most likely represent a topic that comes out of our data. And it can be more than just two terms. It can be a lot of terms, but anytime they appear together frequently they’re going to naturally arise as a topic out of our data. Okay. Another quick poll really quickly.
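The three parsing ideas just described (dropping low-value words, stemming, and counting how often terms appear together) can be sketched in a few lines of plain Python. This is only an illustration, not how the SAS text parsing node is implemented: the stop list, the one-rule stemmer, and the three sample complaints are all invented for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop list: words assumed to carry no meaning for the analysis.
STOP_WORDS = {"i", "was", "a", "an", "the", "my", "on", "me", "to", "of"}

def naive_stem(word):
    """Very rough stemmer (an assumption): strip a trailing 's' so 'fees' -> 'fee'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def parse(text):
    """Lowercase, tokenize, drop stop words, and stem; returns the set of terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {naive_stem(t) for t in tokens if t not in STOP_WORDS}

# Fictitious complaints, made up for this sketch.
complaints = [
    "I was charged an overdraft fee on my account",
    "The bank charged me overdraft fees twice",
    "My mortgage payment was lost",
]

parsed = [parse(c) for c in complaints]

# Count how many complaints contain each pair of terms.
pair_counts = Counter()
for terms in parsed:
    pair_counts.update(combinations(sorted(terms), 2))

# Share of complaints in which 'overdraft' and 'fee' appear together.
share = pair_counts[("fee", "overdraft")] / len(complaints)
print(f"overdraft + fee co-occur in {share:.0%} of complaints")
```

A real topic model weighs many terms at once rather than counting single pairs, but pairs that co-occur this often are the kind of signal it picks up.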
All right. So our next question, and this is our last poll for you before we let Reid take it into the second half of this webinar today: pick the words you think the tool would not catch.
And to clarify: which of these words would most likely not get stemmed together with the others?
Alright, I’ll give you just a few more seconds. I’m locking in those responses. All right. Okay. Most of you got the answer right there. This question may have been a little bit poorly worded, so apologies if it was confusing, but words like bank and banking would all get combined together because they’re variations of the same word. Whereas with financial institution, the tool wouldn’t be smart enough to figure out that it actually is the same thing as a bank or banking. Now, one caveat to that: if you felt you wanted those to be the same word, you can actually go into SAS Visual Text Analytics and manually tell it that you want financial institution to be treated the same as bank. So it is sort of a gotcha, I guess. You actually can go in there manually and do it yourself if you want to change how it’s considering stemmed words. But the default would not pick that up.
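The manual override mentioned here amounts to a lookup from a phrase to the term it should be treated as. A minimal sketch of that idea in plain Python follows; the `SYNONYMS` table and the `apply_synonyms` helper are hypothetical names for this example and say nothing about how SAS Visual Text Analytics stores or applies its own synonym lists.

```python
import re

# Hypothetical synonym table: phrase -> term it should be treated as.
SYNONYMS = {"financial institution": "bank"}

def apply_synonyms(text, synonyms=SYNONYMS):
    """Lowercase the text and replace each listed phrase with its preferred term."""
    text = text.lower()
    for phrase, replacement in synonyms.items():
        # Word boundaries keep the phrase from matching inside longer words.
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", replacement, text)
    return text

print(apply_synonyms("My Financial Institution closed my account"))
```

Running the replacement before tokenizing matters here, because the phrase spans two words and would be split apart otherwise.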
We’re going to cruise through now till the end. We won’t stop you for more polling till the very end. Okay, so we just did our text cleanup and parsing. In VTA, or Visual Text Analytics, you can click on a term. In this case I clicked on the term account to see what words show as being related to it. One thing I wanted to point out here is the word check and the word open. Both appear as fairly sizable bubbles, meaning they appear frequently throughout our complaints. And the line between them is fairly thick, meaning that they appear together fairly frequently. This is important because, again, we’re looking to try to find any complaints related to the unauthorized opening of accounts, like checking accounts. So right here we can see we’re starting to kind of sniff down the right path, but we need to do some more analysis to see if we can uncover the actual body of complaints that are related to the unauthorized account opening.
So let’s go ahead now and dive into supervised and unsupervised learning techniques. First, we’ll start with unsupervised learning. As mentioned prior, essentially we just want to understand what themes naturally arise from our data. And then we can also use that knowledge and the results of that process to classify future documents. So for instance, say I text mined all the emails in your inbox and created topics, and said that these hundred emails refer to this project and these hundred refer to this other project. What we could do after creating those topics is tag or classify any new emails coming in and say, just based on the text in them, this email belongs in this category or this group, if that makes sense.
And the way that we do that in Visual Text Analytics is through the topics node. So this table you’re looking at here shows, in order of most frequent, the topics that came from our complaint data. The most dominant topic, which appears in 777 complaints, contains the terms loan, money, try, house, and help. So most likely this is referring to people who maybe have become underwater on their mortgage and are complaining or have some issue that they’re trying to resolve. The next one includes the terms payment, late, and late fee. This is one that many of you have likely dealt with at some point in your life. For me, in college I had to call them a lot to complain about late fees I would get when I overdrafted my account. And so that’s the fourth most dominant complaint, and that topic appears in about 655 of our complaints.
But the one I wanted to really underscore here is the sixth most dominant complaint, which contains the terms open, account, check, close account. So this right here looks like exactly what we’re looking for. It contains a lot of the key terms that we would expect to find in a complaint about the unauthorized opening of accounts. One more thing to mention here really quickly is the plus that appears in front of a term here means that several words were stemmed in order to create that term. So, for the first one open, it could also be the word opened or opening. Any variation would have been stemmed down into open. Okay. So if you wanted to try to look or dig more into this group of complaints to see more what they look like you can actually double click on the topic here and get some, examples of the raw complaint text.
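To make the earlier inbox idea concrete (tagging new documents against topics that have already been found), here is a rough sketch that scores a new complaint by how many of each topic’s terms it contains. The topic names and the scoring rule are assumptions for illustration; the term lists borrow from the topics shown in the talk, and no stemming is applied, so only exact word forms match. The topics node in SAS Visual Text Analytics scores documents in its own way.

```python
import re

# Topic term lists borrowed from the talk; the topic names are made up here.
TOPICS = {
    "unauthorized account opening": {"open", "account", "check", "close"},
    "late fees": {"payment", "late", "fee"},
}

def tokenize(text):
    """Lowercase and split into a set of words (no stemming in this sketch)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text, topics=TOPICS):
    """Return (best_topic_or_None, scores), scoring by the share of topic terms present."""
    tokens = tokenize(text)
    scores = {name: len(tokens & terms) / len(terms) for name, terms in topics.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

label, scores = classify("Please close the account I did not open")
print(label, scores)
```

A document that shares no terms with any topic comes back with no label rather than being forced into the first topic.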
So right here, the first one is a smoking gun. “Wells Fargo opened a checking account in my name without my knowledge.” That is a verbatim complaint from one of the customers. The third one there: “Wells Fargo opened a credit card account that I said I did not want when I opened my checking account.” So we’ve nailed it. We found the topic that contains all the complaints about unauthorized account opening. So, now we kind of sit back and pause for a second. If we were in Wells Fargo’s shoes, you know, had they been actively monitoring this complaint database using text mining and text analytics, they would have been able to pick up on this emerging trend, and it’s possible that they could have nipped this in the bud before it became a huge problem. So, that wraps up the unsupervised part. Let me just quickly demonstrate how you might do a supervised approach to text mining and then we’ll go ahead and wrap up. So thank you for staying with me. Just a couple more minutes and then we’ll be done.
Now, supervised learning. In the case of our analysis here, if you remember back to when I was explaining the data, we have another column besides the complaints that talks about whether or not the bank provided relief to the complainant. So we want to understand which of these topics could be used to predict relief, or in other words, which complaints were most likely to receive relief later on. The business use there could be, you know, if certain complaints end up receiving relief anyway, maybe we could find a way as a business to preempt these complaints and stop them before they get logged into this public database. And there could be other ways you could use this analysis as well. For that we use the final node in Visual Text Analytics, the categories node, and this just prepares the data to move it over into SAS Visual Analytics, which is another component of SAS Viya.
And now we’ve stepped over to Visual Analytics, and here we’ve been able to build a decision tree. I won’t get into too much detail about how the decision tree works right now, but essentially it’s going to use the topics to try to sort which complaints did and didn’t receive relief, and create some criteria for us to know which complaint topics were most indicative of receiving relief. So this bar at the bottom, this rectangle, shows which topics did the best job of sorting the complaints into those two categories. There’s a bunch of numbers on here that probably look confusing, but I wanted to draw your attention to the third line, this box here where it says score for fee, late fee, late, bank, waive. What it’s saying here is that if the complaint scores higher along this topic, meaning it’s more related to it, and scores lower on the topic below it, which is modification, bank, mortgage, and loan modification, then that complaint most likely will receive relief from the bank.
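The branch of the tree just described reduces to a two-condition rule on topic scores. The sketch below writes that rule out by hand; the score names and the 0.5 thresholds are hypothetical, since the talk does not give the actual split values, and a real decision tree learns its thresholds from the data.

```python
def predict_relief(topic_scores, fee_threshold=0.5, modification_threshold=0.5):
    """Hand-written version of one tree branch (thresholds are assumptions).

    Predicts relief when the complaint scores high on the late-fee topic
    and low on the loan-modification topic.
    """
    fee = topic_scores.get("late_fee", 0.0)
    modification = topic_scores.get("loan_modification", 0.0)
    return fee >= fee_threshold and modification < modification_threshold

# A complaint mostly about a late fee.
print(predict_relief({"late_fee": 0.8, "loan_modification": 0.1}))
# A complaint that also involves a loan modification.
print(predict_relief({"late_fee": 0.8, "loan_modification": 0.7}))
```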
So, in other words, if I call in and complain about a late fee, but don’t complain about a mortgage or a loan modification, then I’ll most likely receive relief from the bank. That’s kind of the nutshell of what this model is doing here. So, you know, Wells Fargo executives could look at analysis like this and say, it looks like we’re getting lots of complaints here in this public database regarding late fees, and maybe we should consider a different approach because we’re having to pay them back the fee anyway later on. So anyway, that’s one way you can do it. There are a lot of different ways you could structure this type of analysis based on your business and based on the needs and questions you have of your data. So that’s it. I’m going to go ahead and just wrap up with a couple of thoughts here. Text mining and Visual Text Analytics allow you to quickly find themes in unstructured data, and you can leverage both unsupervised and supervised methods to find unique solutions to business problems and also to extract pretty powerful insights. So thanks so much for your listening and attention, and I’ll turn it back to Jamie really quickly.
All right. And before we get into the Q and A, I’d just like to thank Reid for his interesting look at how quick SAS Visual Text Analytics made that process.
We just wanted to share with you guys that on September 18th we’re going to have a [Building Tasty Dashboards that Users Love: Our Easy Peasy 5-Step Process ] webinar led by Tricia Aanderud, our senior director of visual analytics and data science practice here at Zencos. We’ll be sending you guys an invitation to that. So please feel free to join us if, if that topic interests you. And finally, I’d just like to remind everyone that we share a lot of content on our blog and also on social media. So please go out there and follow us and check out some of our videos and some of our insights on the Zencos website.
Let’s go ahead and jump into our Q and A session. It does look like we did get a few questions along the way here. I’d like to pull those up really quick.
Okay. So the first question I see here [from an attendee]. His question is “how do you gather or aggregate all of your customer review data?” This is a great question. In this instance, for my example, I was able to pull it directly down from the CFPB. It was a fairly straightforward and easy process. However, for data that might be stored online, like on Yelp or on Google reviews, or maybe if you wanted to look at what people were saying on Twitter, you’d have to use a different approach like web scraping to aggregate all that data, pull it down, and then store it so that you could analyze it.
So, oftentimes just getting the data can be a fair amount of work, and SAS Visual Text Analytics does have some APIs that allow you to connect to that data in some cases. Other times you have to kind of do some dirty work to get it available. So, great question.
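For a sense of what working with a downloaded complaint file might look like, here is a small sketch that keeps one company’s complaints that include a narrative and records whether each was closed with relief. Everything specific is an assumption for illustration: the column names, the response values, and the three sample rows are made up and may not match the actual CFPB export.

```python
import csv
import io

# Made-up sample standing in for a downloaded file; column names are assumptions.
SAMPLE = """Company,Consumer complaint narrative,Company response to consumer
Wells Fargo,An account was opened without my knowledge,Closed with relief
Wells Fargo,,Closed without relief
Other Bank,Late fee was charged twice,Closed with relief
"""

def load_narratives(handle, company):
    """Keep rows for one company that include a narrative, with a relief flag."""
    rows = []
    for row in csv.DictReader(handle):
        narrative = row["Consumer complaint narrative"].strip()
        if row["Company"] == company and narrative:
            rows.append({
                "narrative": narrative,
                "relief": row["Company response to consumer"] == "Closed with relief",
            })
    return rows

rows = load_narratives(io.StringIO(SAMPLE), "Wells Fargo")
print(len(rows), rows[0]["relief"])
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle.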
Another attendee here said, “can you build supervised and unsupervised models with text data in SAS?” The answer to that is yes. Both of the models I showed you, the one creating topics and the one trying to predict which complaints received relief, were done using SAS Viya Visual Text Analytics, and it’s also available in some of their other products as well. So the answer to that is definitely yes.
The next question here: “does SAS provide a list of stop words that are industry specific?” Great question. So to clarify, stop words are words that you would want to exclude from your analysis. I know that they do give you a standard stop word list out of the box that is not business specific. But I know they have a lot of customized solutions, and I’d actually be surprised if they didn’t have some industry-specific lists. The other thing too is you can actually create and customize your own stop lists. So if you know your business well enough to know that you don’t care about certain words, you can create your own Excel file. Just type them in there or copy and paste them from some other source, and you can load that up into SAS and then it’ll automatically exclude those terms.
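The custom stop list idea is simple enough to show in plain Python. This sketch assumes a plain-text list with one term per line instead of the Excel file mentioned above, and the `load_stop_list` and `remove_stop_words` helpers are hypothetical names; it does not reflect how SAS loads or applies a stop list.

```python
import io

def load_stop_list(lines):
    """Read one stop word per line, ignoring blank lines and letter case."""
    return {line.strip().lower() for line in lines if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Drop any token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

# Stand-in for a file of business-specific words you do not care about.
custom_file = io.StringIO("wells\nfargo\nbank\n")
stop_words = load_stop_list(custom_file)

tokens = "Wells Fargo bank charged a late fee".split()
kept = remove_stop_words(tokens, stop_words)
print(kept)
```

In a complaint database about a single bank, the bank’s own name is a natural candidate for such a list, since it appears almost everywhere and separates nothing.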
Two more questions real quickly. One person said, “where can I find out more about text analytics with SAS?” You can chat with us. We are a [SAS gold] partner. We do both installs and deployments and we also do consulting around services like data science. If you look at our website or if you want to talk with us, you know, those are both solid options.
The last question: “Does Zencos offer text analytics services?” And if it wasn’t already clear, yes we do! We are very enthusiastic and love doing text mining. We would be happy to talk with you about use cases and see if you have data that’s suitable for it and a problem that can be solved with text mining. So feel free to reach out to us if you have any questions about your particular problem. And we could definitely start with something simple and informal if you wanted to start exploring that avenue with your business. So that’s it for the questions. I’ll turn it back to Jamie.
All right so like Reid just said, if you have some text or a use case, you want to run past us, please share them, we’ll follow up with you after the Webinar. And with that, we’re here at the end of today’s session and we’d like to thank you guys for joining. I hope you guys have a wonderful rest of your day.