Hacking the Language of Loan Defaults
- Graduate students at Smith School of Business are using data from Kiva, the crowdfunding platform, to design algorithms that can predict whether or not loans will be repaid based solely on what loan applicants write about their projects.
Imagine you’re in financial services and responsible for assessing loan applications. What if you could tell, from the application itself, whether or not the loan will go into default? Would you be willing to trust natural language processing, powered by artificial intelligence, to make that call for you?
Improving tools to predict the best investments can mitigate a lot of risk for lenders. Though business loan defaults in Canada hovered around one percent in recent years, they’ve climbed to nearly seven percent during recessions.
Students in the Queen’s Master of Management in Artificial Intelligence (MMAI) and Master of Management Analytics (MMA) programs, offered by Smith School of Business, are putting that idea to the test. They are using data from Kiva, the crowdfunding platform, to design algorithms that can predict whether or not loans will be repaid based solely on what loan applicants write about their projects.
The biggest reason to examine the Kiva loan dataset is that it’s public and large, says Stephen Thomas, director of the Smith School of Business’s MMAI and MMA programs. Thomas uses Kiva’s historical dataset to give students practice with real-world data addressing current business issues.
“No one has gone the route of our course, to build a model to predict whether someone is going to pay back their loan or not,” says Thomas. “This is a great dataset because this is exactly what we’re trying to teach in our course, things like risk assessment of a loan.”
Analysing Kiva Stories
Kiva loans have a higher repayment record than one might expect for unsecured loans to people or businesses that, in many cases, have been written off by traditional lending agencies. Kiva lenders have little more to go on than a story about the person or business. Yet typically under two percent of recipients default on their Kiva loan.
In the loan application, people offer a brief description (what Kiva calls a “story”) of how they plan to use the funds. Information on all Kiva loans, including their history and what they are used for, is available online.
“They’ve scrubbed it of personal information,” says Thomas, “so we don’t have their name or social security number or anything like that. But they’ve put it all up there for research purposes. Some [stories] are very short and direct and some are very long and rambling and contain their whole life story.”
Kiva stories often read like this:
Sonia is 41 years old and divorced. She has a higher education. Sonia runs a grocery store in a rented space. This is her main source of income. Having started in 2015, Sonia has $2612 of inventory which brings her an average monthly income of around $480. With the goal of further developing her business, Sonia applied for a loan of $2883 to buy food products to increase sales in her grocery store. Sonia plans to reinvest the earnings from the loan into further expansion of her store.
The graduate students use several methods of natural language processing to analyze stories and fine-tune the machine-learning models. The simplest methods are a straight tally of the occurrence of words (bag of words) or of pairs of adjacent words (a bag of n-grams with n set to two). Cleaning techniques are applied as well, such as stemming (which reduces words to a common root, so that ‘walk’ and ‘walking’ are counted together) and the more advanced lemmatization (which maps inflected forms such as ‘is’, ‘was’, and ‘been’ to a single dictionary form, ‘be’). More complex approaches include topic modelling, word and document embeddings, and sentiment/tone analysis.
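The simplest of these steps can be sketched in a few lines of Python. This is a minimal illustration rather than the students' actual pipeline: the function names are hypothetical, the stemmer is a crude suffix stripper standing in for a real one such as the Porter stemmer, and the input is an abridged version of the sample story above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a few common suffixes (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(tokens):
    """Tally how often each token occurs, ignoring order."""
    return Counter(tokens)


def bag_of_ngrams(tokens, n=2):
    """Tally runs of n adjacent tokens (n=2 gives word pairs)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Abridged from the sample Kiva story quoted in the article.
story = (
    "Sonia runs a grocery store in a rented space. "
    "Sonia applied for a loan of $2883 to buy food products "
    "to increase sales in her grocery store."
)

stems = [crude_stem(token) for token in tokenize(story)]
word_counts = bag_of_words(stems)
bigram_counts = bag_of_ngrams(stems, n=2)

print(word_counts.most_common(5))
print(bigram_counts[("grocery", "store")])
```

The resulting counts are what a model would receive as input features. Real projects typically rely on an established library for tokenizing and stemming instead of hand-written rules like these.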
Finding Linguistic Patterns
Can the algorithms distinguish between what’s background and what’s part of the proposal in a typical Kiva loan application? “In a roundabout way, yes,” Thomas says. “We’re applying lots of different natural language processing techniques on this story. The simplest kind is just to count which words are in the description and ignore order. Is the word ‘cow’ in there, yes or no? Is the word ‘shirt’ in there, yes or no? A more sophisticated way is to run a topic model that will automatically learn themes of related words that are present in all the descriptions. For example, there might be ‘cow, dairy, farm’ or ‘shoes, clothes, shirt’.”
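The yes-or-no word check Thomas describes can be sketched as follows. The function name and the small vocabulary are assumptions made for illustration; a real model would use a vocabulary learned from the whole dataset. The topic model he mentions is not shown here, since it would normally come from a dedicated library.

```python
import re


def presence_features(story, vocabulary):
    """Map each vocabulary word to 1 if it occurs in the story, else 0."""
    words = set(re.findall(r"[a-z]+", story.lower()))
    return {term: int(term in words) for term in vocabulary}


# Hypothetical vocabulary, chosen only to illustrate the idea.
vocabulary = ["cow", "shirt", "grocery", "loan"]

# One sentence from the sample Kiva story quoted in the article.
story = "Sonia runs a grocery store in a rented space."

features = presence_features(story, vocabulary)
print(features)
```

Because the words are collected into a set, their order in the story has no effect on the result, which is the "ignore order" property described in the quote.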
‘Repay’ is a term that frequently appears in applications but correlates highly with loan defaults. “In this kind of weird way, if your friend keeps telling you that they’ll repay you, trying to reassure you, maybe that’s a sign that they’re not as trustworthy,” Thomas says.
Other terms predictive of default include ‘cows’; ‘budget’; ‘religious’; and ‘Catholic’. By contrast, individual words and sets of words that correlate with loan repayment include ‘law’; ‘rainy’; ‘drought’; and ‘rice, farmer, ranch’.
“I would not have guessed any of those things,” says Thomas. “But the machine learning algorithm will just learn that’s what the dataset says.”
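One simple way to see how a term such as ‘repay’ relates to default is to compare default rates for stories that contain it against stories that do not. The sketch below is an assumption about how that comparison could be made, not the method used in the course, and the six records are invented for illustration. They are not Kiva data and say nothing about real default rates.

```python
import re


def default_rate_by_term(loans, term):
    """Return (default rate with term, default rate without term).

    `loans` is a list of (story, defaulted) pairs, where defaulted is a bool.
    A rate is None when no story falls into that group.
    """
    with_term, without_term = [], []
    for story, defaulted in loans:
        words = set(re.findall(r"[a-z]+", story.lower()))
        group = with_term if term in words else without_term
        group.append(defaulted)

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(with_term), rate(without_term)


# Invented toy records for illustration only; NOT taken from Kiva.
toy_loans = [
    ("I will repay the loan on time", True),
    ("I promise to repay every month", True),
    ("I will repay after the harvest", False),
    ("Funds will buy rice seed for the rainy season", False),
    ("Loan to stock my grocery store", False),
    ("Loan to buy a sewing machine", True),
]

with_rate, without_rate = default_rate_by_term(toy_loans, "repay")
print(with_rate, without_rate)
```

A machine learning model does something broadly similar across thousands of terms at once, which is how unexpected associations surface without anyone guessing them in advance.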
One would expect financial services firms such as banks to leap on such technology, but Thomas says it may be a harder sell than expected. So far, banks generally have resisted using advanced analytics to green-light or deny a loan, trusting more in human judgment.
“They think of this as a black box,” says Thomas. “Or they think of science-fiction movies where the AI turns evil. Luckily that’s decreasing over time as more and more people understand how these models actually work.”
— Adrienne Montgomerie