The Labor Question
Klarna, a Swedish tech company, was an early partner of OpenAI, developing a customer-facing chatbot powered by ChatGPT. The company boasted that the AI assistant could do the work of 700 customer service agents, and it has reportedly handled two-thirds of Klarna’s customer service chats – millions of interactions – since it was deployed.
The announcement caught attention because, not long before deploying the chatbot, Klarna had announced it was laying off 700 of its workers.
It’s becoming a more common story: tech layoffs coinciding with the adoption of and experimentation with generative AI. Reading these articles, I often wonder: could companies really be getting such benefits from generative AI that they are replacing hundreds of workers?
I’m particularly surprised because in our dozens of conversations with members of our working group, even the organizations that have made the most progress in scaling generative AI have not yet seen significant effects on their headcount in one direction or another (and some companies have openly committed not to lay off people as a result of the technology).
In Klarna’s case, it’s pretty clear that the layoffs and the AI adoption were unrelated. And the claim that generative AI can do the work of 700 agents requires more context.
First off, Klarna doesn’t seem to employ any customer service agents itself. A spokesperson says “the company’s customer service is supported by four to five large third parties that collectively have over 650,000 employees, and that it offers customers the option to speak with human agents if that’s what they prefer.”
So what did Klarna mean when it claimed that its AI bot could do the work of 700 agents? The company said, “We chose to share the figure of 700 to indicate the more long-term consequences of AI technology, where we believe it is important to be transparent in order to create an understanding in society.”
It seems like the number could be wishful thinking, or a way of saying that the bot is capable of doing a lot of work. But merely taking on a lot of chats doesn’t mean the bot will reduce the need for workers – or even make the process better.
One of the companies in our generative AI working group offers a different way of thinking about the impact of these new technology tools.
Unlike Klarna, which emphasized how many workers an AI bot could replace, this company – a large global manufacturer – isn’t measuring the gains from generative AI in terms of existing productivity metrics. Their processes are often too complex to boil down to one metric that introducing AI could push in the right direction. All sorts of factors influence their teams’ productivity, they say. They can’t easily measure the impact of a new technology tool.
Instead, they see generative AI applications as tools that can help them transform processes and make them better overall. Their approach is to train teams in the capabilities of generative AI (they’re planning to train thousands of managers) so they can use these tools to become better at doing their core tasks.
It’s not surprising that a manufacturer would take this approach. Perhaps that’s because they’ve dealt with automation technologies for decades and realize it’s not as simple as just replacing workers, even if new technologies promise productivity gains.
For example, companies that adopt robots on average become more productive and profitable, but they also on average hire more workers – and workers of different skillsets. In research on robotic process automation (RPA), we have found similar patterns: even when organizations try to automate tasks and reduce costs, it’s hard to eliminate entire roles since the bots can often only do a small slice of a given job.
Also relevant: soon after the release of ChatGPT in late 2022, there was a wave of lofty predictions about the impact of generative AI on productivity and labor. Some prominent economists and analysts have since introduced more caution, in a recent report from Goldman Sachs and an essay in The Wall Street Journal.
Notes from Q2 Working Group Meeting
The last working group meeting introduced the idea of “moving slow to move fast,” a line we borrowed from colleagues in MIT’s Leaders for Global Operations program. It’s a stark contrast to the oft-quoted mantra in software: “move fast and break things.”
We haven’t met an organization that wants to move slow when it comes to adopting new technology. What the moving slow idea really signals is learning in preparation to move quickly when the right opportunity presents itself. It’s the need to learn fast (even if it looks like you’re moving slow) so that you can scale up successfully.
As we’ve talked to companies experimenting with generative AI, we often hear two things: the first is that they’re sprinting to develop applications for their top use cases because they want to deploy faster than their competitors. The second is that everything around the technology is changing so fast. How can organizations ensure that today’s sprints aren’t obsolete with the next technology release?
AI Safety Institute
The first conversation of the meeting was an introduction to the newly formed AI Safety Institute, which the Biden Administration established to advance the science of AI safety. Here is a helpful vision statement for getting acquainted with the Institute.
Elizabeth Kelly, executive director of the Institute, said the new body, housed within the National Institute of Standards and Technology, would work with internal experts and third parties to test new AI models, systems, and agents. The Institute’s goal is to test models before they are deployed, with a focus on preventing national security threats; it is also attentive to potential societal harms the models could cause.
The Institute is also prepared to issue voluntary guidance and identify best practices for safety. And it is equipped to conduct ongoing technical research so its methods for evaluating frontier models can evolve as fast as the underlying technology does.
While noting that the Institute is not a regulator and its testing and guidance are voluntary, Kelly made a comparison to aircraft, which are evaluated according to safety standards. And if something goes wrong, the National Transportation Safety Board has the ability to investigate and draw lessons from any incident.
The AI Safety Institute is focused on frontier AI models – it is not focused on the application layer of AI, which is subject to the regulations of existing agencies depending on the industry domain (e.g., healthcare, finance).
Kelly concluded with an emphasis on why it’s important to invest in robust safety practices: if something goes wrong with early applications of AI – or if early cases of safety challenges are mishandled – society could lose out on large potential benefits of AI in terms of, for example, drug discovery, individualized education, and carbon capture technology.
Agile Learning at Randstad
Thomas Jajeh is Chief Digital Officer of Randstad Enterprises, the world’s largest HR services company. Thomas and his colleagues are focused on the ways that LLMs can augment recruiting and the worker experience. Thomas emphasizes that he and his colleagues at Randstad have approached generative AI as a capability. They recognize that the models and the potential use cases will change, and they want to have the skill to adapt and understand where they can use the technology to their advantage.
In the conversation, Thomas said that he is accustomed to developing new technology applications from the bottom up using “the collective brain power of your workforce.”
But deploying generative AI thus far has required both a top-down and a bottom-up strategy. The top-down side has provided the rules of the road – governance and ethics principles – while the experimentation of finding use cases and iterating on them has come from the bottom up, with close ties to users.
He emphasizes that the company’s early use cases were not in areas where the company had a comparative advantage. Randstad sees itself as a people and service company, and its primary focus is not to develop (AI) tools. The early work was about learning what the technology could do and where it could be useful.
His team started by developing an app using LLMs that would draft job descriptions for clients recruiting new hires.
While the MVP was just a very basic version of what Randstad wanted to achieve, every iteration showed significant improvements and increased recruiter NPS (Net Promoter Score) and productivity. Randstad started the iterative improvement process with 50 recruiters, then did one iteration a day for forty days, improving bit by bit and rolling the tool out to more users along the way. Ultimately it reached thousands of recruiters and won widespread adoption – the average time to create a job posting went from 20 minutes to approximately three minutes.
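Randstad hasn’t published the internals of its app, but a minimal sketch of the pattern described here – an LLM drafting a job description that a recruiter then reviews and edits – might look something like the following Python. The prompt wording, parameters, and use of the OpenAI client are our own illustrative assumptions, not Randstad’s implementation.

```python
# Purely illustrative sketch; prompt, fields, and model choice are assumptions.
from openai import OpenAI

client = OpenAI()

def draft_job_description(role, client_name, must_haves, tone="professional"):
    """Ask an LLM for a first draft that a recruiter then reviews and edits."""
    prompt = (
        f"Draft a job description for a {role} position at {client_name}. "
        f"Required qualifications: {', '.join(must_haves)}. "
        f"Use a {tone} tone and keep it under 300 words."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(draft_job_description(
    "data engineer", "Acme Logistics", ["Python", "SQL", "cloud data warehouses"]
))
```

The point of a sketch like this is how little of the value lives in the code itself: the gains Randstad describes came from iterating on the prompt and workflow with recruiters, one release at a time.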
He attributes the eventual success to working closely with users. But he doesn’t want to do a victory lap, either. He says it’s possible that if they had waited a year, they could have simply picked a job-posting bot off the shelf and would not have had to build one themselves. But the process of building what they did will make them more capable of developing AI tools that make use of their comparative advantages in data and expertise.
In describing his company’s general approach to thinking about AI applications, Jajeh said, “We want to develop AI that helps humans become better at what they do.”
He described this in contrast to AI applications that were purpose-built to replace human tasks or cut humans out of the loop.
Part of the motivation for Randstad’s approach is that their surveys on worker satisfaction consistently point to human connections as being essential for workers to enjoy and stick with jobs.
He says that as he looks out 20 or 30 years from now, he imagines empathy and the human touch will continue to be essential in recruiting and HR, since that’s what people consistently want in their work.
Scaling Call Center Automation at Xerox
Shivani Agarwal is VP of AI and Intelligent Automation at Xerox.
Shivani is leading the deployment of generative AI in Xerox’s call centers, an effort that has recently been scaled to handle tens of thousands of messages per month – about 10% of the company’s overall incoming customer email. This is particularly remarkable because client support at Xerox fields more complex questions than the typical call center.
As Shivani describes it, the groundwork for generative AI was laid long before ChatGPT was released.
For years, Shivani and her team have been focused on improving the company’s call center operations, which are core to its business. Their enterprise and mid-market clients all over the world need answers fast when something goes wrong with a machine, so client response requires both speed and expertise to troubleshoot whatever issues arise.
Their first investments were to optimize call and email intake with a rigorous ticketing process that put their customer data into a common format. Then they could use RPA bots to complete the standard tickets – and route the non-standard ones to Xerox experts. They also have a traditional AI tool deployed across their call centers and field operations that guides agents and technicians to the best solutions for the issues clients raise.
Through this process, they had already organized the knowledge articles that document Xerox’s expertise, so they could readily build the vector database that tells the generative AI tools where to look for answers.
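Xerox hasn’t shared implementation details, but a minimal sketch of that retrieval pattern – embedding knowledge articles into a simple vector index that a generative AI tool can search – might look like the Python below. The article contents, model choice, and helper names are our own assumptions, not Xerox’s system.

```python
# Illustrative sketch only; article data, model, and function names are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

# A handful of hypothetical knowledge articles documenting troubleshooting expertise.
articles = [
    {"id": "KB-101", "text": "How to clear a recurring paper jam on the C70 series."},
    {"id": "KB-202", "text": "Resolving streaking and other print-quality issues."},
    {"id": "KB-303", "text": "Re-registering a device on the customer network."},
]

def embed(texts):
    """Embed a list of strings with an off-the-shelf embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Build the "vector database": here, just an in-memory matrix of article embeddings.
article_vectors = embed([a["text"] for a in articles])

def top_articles(query, k=2):
    """Return the k knowledge articles most similar to the query."""
    q = embed([query])[0]
    sims = article_vectors @ q / (
        np.linalg.norm(article_vectors, axis=1) * np.linalg.norm(q)
    )
    return [articles[i] for i in np.argsort(-sims)[:k]]
```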
Now their generative AI deployment reads incoming client emails and extracts the information needed to create tickets and assign them to the appropriate queues, where RPA bots can address them or human agents can take over. The generative AI tools can also point agents to the right knowledge articles so that they can be more versatile. Say a junior agent has experience with only one type of machine: previously they might have referred a ticket to a specialist in that area, but now they can handle it themselves.
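Continuing the sketch above, the triage step described here – pulling ticket fields out of a client email, routing standard issues to the RPA queue and the rest to human agents, and attaching suggested knowledge articles – could look roughly like this. The ticket schema and queue names are hypothetical, not Xerox’s.

```python
# Continues the sketch above; ticket fields and queue names are illustrative assumptions.
import json

TICKET_PROMPT = """Extract a support ticket from the customer email below.
Return JSON with keys: customer, device_model, issue_summary, standard_issue (true/false).

Email:
{email}"""

def email_to_ticket(email_text):
    """Ask the model to extract structured ticket fields from a raw client email."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": TICKET_PROMPT.format(email=email_text)}],
    )
    return json.loads(resp.choices[0].message.content)

def route(ticket):
    """Send standard issues to the RPA queue and everything else to a human agent,
    attaching the most relevant knowledge articles either way."""
    ticket["queue"] = "rpa_bots" if ticket["standard_issue"] else "human_agents"
    ticket["suggested_articles"] = [a["id"] for a in top_articles(ticket["issue_summary"])]
    return ticket
```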
The legacy of introducing RPA also helped build trust. The team had a history of enlisting “citizen developers” from call centers to learn how to use the technology and figure out ways to integrate it into their own business processes – the tasks that they knew well.
Because the RPA deployments had already built trust with some call centers, the move to generative AI wasn’t as dramatic or fear-inducing as it might otherwise have been.
Common Threads
A key theme that tied together the meeting was the human in the loop. For the AI Safety Institute, there’s important human input on how to avoid the risks that models pose. For Randstad, there was a choice to develop models with the human in the loop because the human is core to their value proposition. And at Xerox, the human in the loop is an expert who can become more versatile with generative AI tools – and the human in the loop needs to buy into the process transformation.
September 2024 Events
In addition to our quarterly working group meetings, we’re organizing a new, more interactive kind of virtual gathering for working group members — a series of office hours where working group members can chat with their peers who are building generative AI tools or programs with wide relevance to industry. Save the date for our first two sessions:
September 3, 12:30-1:30p ET: “AI Essentials” with Lisa Gevelber, Google
September 20, noon-1p ET: “AI tools for frontline workers” with Meredith Jordan, Craig Lenzen, and Brad Thompson, Target
What We’re Reading
“Let me ask ChatGPT.” We’ve heard it a lot in the last year, at work and at home. With GPT-4o’s multimodal capabilities, this may evolve to “let me show GPT.” The faster, cheaper model could enable the chatbot to interact more naturally with the world. At first glance, GPT-4o’s performance in text, reasoning, and coding is not very different from that of GPT-4 Turbo and GPT-4. However, the model marks a significant jump in audio, vision, and multilingual capabilities. This expanding frontier, coupled with human-like response latency and the ability to interrupt, means generative AI will soon be more usable for customer service applications, medicine, and more. The added support for non-English languages is one of the most significant leaps from GPT-4, but a recent review of GPT-4o’s tokenizer vocabulary reveals that some languages may be misrepresented. In particular, the Chinese corpus underlying GPT-4o’s tokenizer is overrun by spam and inappropriate content. It serves as a reminder that good data is the first step to a robust model.
“Let Google do the Googling for You,” says Liz Reid, VP and Head of Google Search. Within days of its launch, the new feature, AI Overviews, was everywhere in the news for answers such as recommending glue on pizza. Some of these early results originated from parody and satirical content from sources such as Reddit and The Onion. Since then, Google seems to have scaled back the scope of AI Overviews, both in what kinds of searches trigger them and in what source materials they use. At the same time, publishers and digital marketers are concerned that the feature will undermine the SEO traffic they rely on to get paid.
The growing use of LLM-based chatbots for search raises the question of information diversity: are chatbots truly a replacement for traditional web-based information retrieval and arbitration? A new study looks at the effects of LLM-based conversational search systems on diverse information seeking. The authors find that LLM-powered chatbots could lead to more of an echo-chamber effect than traditional web search. Furthermore, if an LLM already shares a user’s biases – through its training data or RAG techniques, for example – it can exacerbate that echo-chamber effect. With more and more companies electing to use LLMs to conduct initial research, it is important to be cognizant of the sycophantic tendencies of LLMs, which can encourage confirmation bias.
A recent paper looks at sycophancy and many more metrics for assessing the trustworthiness of LLMs, establishing benchmarks for truthfulness, safety, fairness, robustness, privacy preservation, and machine ethics. Undoubtedly, increasing trustworthiness is an important corequisite to building successful generative AI–based applications. Another approach to increasing trustworthiness and safety is figuring out how LLMs actually arrive at a particular response. Sam Altman says, “we certainly have not solved interpretability.” Still, a recent paper from Anthropic is a major step in this direction. The authors set out to see whether interpretable features exist in large language models, using two criteria: 1) a feature’s direction is active when its corresponding concept is represented in the model’s input, and 2) artificially steering that direction changes the output accordingly. Through this process, they found features corresponding to physical entities such as the Golden Gate Bridge and more abstract ones such as conversations about keeping secrets. Interestingly, they found a feature associated with sycophantic praise which, when activated, yields noticeably different content. These findings provide hope that we may soon have greater control over, and understanding of, our models.
Ask MIT
Here’s a question from one of our members that touches on a theme we often hear discussed:
“What are some takeaways from interactions with members of the working group regarding generative AI–based decision support, particularly when quantitative data is involved? What types of data are provided and what types of questions are asked? What are the strategies for validation?”
Within the working group, we have seen few successful, scaled use cases in quantitative-centric environments. Part of the reason is that most commercially available LLMs excel at comprehension tasks but lack the reasoning and context required to draw meaningful conclusions from quantitative data. As recent findings note, present-day technology struggles with statistical and causal reasoning, particularly in settings requiring heavy data parsing and analysis (e.g., reading in an Excel spreadsheet). While models may exhibit causal reasoning abilities when given general scenarios, there is a noticeable drop-off in performance when they are asked to derive relationships from specific data.
On the other hand, we have seen some working group members have success with comprehension-centric tasks that have small mathematical components. For these use cases, quantitative data is introduced through RAG and prompt engineering to ensure the accuracy of calculations. At this stage, these systems keep a human in the loop who can use their judgment to determine whether the LLM outputs are reasonable and consistent with their knowledge of the industry. Some companies are also using monitoring systems that compare LLM outputs against previous human-only iterations of the task to flag possible discrepancies, as in the sketch below.
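As a hedged illustration of that monitoring pattern, a simple tolerance check comparing LLM-derived figures against a prior human-produced baseline might look like the following. The field names and threshold are hypothetical, not any member company’s actual system.

```python
# Minimal sketch of a monitoring check; field names and threshold are assumptions.

def flag_discrepancies(llm_results, human_baseline, rel_tolerance=0.05):
    """Compare LLM-derived figures against the human-produced baseline for the
    same task and return the fields that differ by more than the tolerance,
    so a human reviewer can take a closer look."""
    flagged = {}
    for field, baseline_value in human_baseline.items():
        llm_value = llm_results.get(field)
        if llm_value is None:
            flagged[field] = "missing from LLM output"
            continue
        denom = abs(baseline_value) or 1.0  # avoid dividing by zero
        if abs(llm_value - baseline_value) / denom > rel_tolerance:
            flagged[field] = {"llm": llm_value, "baseline": baseline_value}
    return flagged

# Example: figures the LLM extracted vs. the prior human-only run of the same task.
print(flag_discrepancies(
    {"total_spend": 1_020_000, "headcount": 118},
    {"total_spend": 1_000_000, "headcount": 120},
))
```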
If you have had success integrating LLM-based tools into your quantitative tasks, please let us know – it would be great to learn from you! Please contact workofthefuture@mit.edu with any questions, comments, or contributions.