Top-down vs. bottom-up
Across the dozens of companies we’ve interviewed about their experiments with Generative AI, we’ve noticed a common tension. There are forces within organizations – often from the top – advocating for planning and patience. Typically led by a task force of senior representatives across an organization, they start by setting policy and recognizing how the new technology can align with the company’s overarching strategy. The vision for how new technologies might transform the company is clear, but they are often deliberate in rolling out new tools across the organization. For example, a large company in a tightly regulated industry established a committee of senior leaders dedicated to ensuring that all employee interactions with their LLM align with their AI governance principles and contribute value to the organization.
And then there are forces pushing for speed and agility, the type of startup mentality associated with “failing fast.” They recognize that the technology is evolving quickly, and there is no dominant design. But their goal is not to build an application that lasts for years. It’s to figure out how these tools might be useful now – and adjust as they change and improve. One company in our working group achieved this by empowering their software development teams to experiment with different LLMs while collaborating closely with their colleagues in business units who will become the end users of the generative AI tools they develop.
Both forces exist within every company we’ve encountered. Even the most top-down, central task force driven corporations have individual teams experimenting with how LLMs can transform their work. And even the most decentralized, entrepreneurial organizations have some efforts to establish governing principles and provide training on how to use these tools responsibly. The common challenge is how to balance these competing forces in a way that makes companies more productive and better places to work – all while advancing their core mission.
There seem to be three different models for doing this.
Letting 1,000 flowers bloom (and monitoring their health). Many companies have built sandboxes for employees to use an LLM for their own purposes. With a little training and some guidelines on what types of use cases are permitted, companies can trust their frontline workers to identify the best ways to use Generative AI tools in their own jobs. The challenge is identifying the best ideas to scale. That’s where learning by monitoring comes in – this approach requires a central team or mechanism to understand what’s paying off so that participants can learn from one another.
Targeting roles to reconfigure. When some companies began learning about LLMs, they immediately knew the roles they could transform – call center workers, marketing copywriters, or HR associates. In these cases, technical teams partnered with process experts to build applications to make those jobs more productive, freeing the people in those roles to do more value-added work. A new article suggests this pattern resembles past digital transformations. Recent research suggests that novice and lower-skilled workers in particular stand to gain tremendous productivity and soft benefits.
Developing a center of excellence. Companies with high concentrations of tech talent – whether in software development or data science organizations – can create centers of excellence focused on learning how to develop and iterate on LLM-based applications. Often working in small teams, these centers build prototype after prototype of new applications integrating LLMs. Some of the prototypes are niche or even trivial. But at this stage, the centers are not tied to an application that has outlived its usefulness. Instead, their goal is to develop a capability to build new applications that meet their needs as the technology changes.
ICYMI: MIT News article, “MIT Launches Working Group on Generative AI”
In the coming months, we will release case studies that expand on these ideas and share more about what we’ve been learning from members of our Working Group on Generative AI and the Work of the Future.
We invite you to engage in the working group by responding to the newsletter’s open questions, participating in our quarterly meetings, and sharing your experience working with Generative AI at workofthefuture@mit.edu.
February 2024 Working Group Meeting: Key Takeaways
The 2024 MIT AI Conference, hosted by the MIT Industrial Liaison Program in collaboration with the MIT Working Group, took place February 28-29. We hosted panel discussions with industry leaders and professors on Generative AI investment and implementation, as well as brief talks highlighting perspectives from financial services, healthcare, consulting, and transportation. Here is the agenda along with some top takeaways from organizations across industries:
One Size Doesn’t Fit All
Organizations feel an urgency to bring generative AI into their businesses. A common challenge has been determining the best use cases and developing robust solutions. In some cases, the risk of hallucinations in high-stakes scenarios has led organizations to develop custom guardrails. Jonathan DeBusk, Director of AI, Automation, and Workforce Science at IBM, wanted to mitigate misinformation from the company’s HR chatbot regarding IBM’s travel expense policies; his team was able to dramatically reduce hallucinations by partnering with IBM’s research teams.
However, other use cases have been more challenging to address. During testing, when asked “What roles are good for women at IBM?”, the chatbot responded with “HR”, propagating biases present in the original training dataset. Designing guardrails to prevent such behaviors is a difficult engineering and organizational challenge as it requires threading company and human values into the tools as they are being built, and being comfortable adjusting LLMs when they go awry.
Leaders at the Mass General Brigham health network are piloting an ambient documentation tool – an application that captures and summarizes doctor-patient conversations – to help clinicians keep Electronic Health Records up to date. The initial qualitative feedback has been very positive; it’s allowed clinicians to better connect with patients during the visit rather than focus on taking notes. However, the tool is not equally beneficial for doctors in all specialties.
For instance, Adam Landman, MD, Chief Information Officer and SVP, Digital, at Mass General Brigham notes that ophthalmologists may spend most of the visit on a non-verbalized examination, so the new tool doesn’t add much value. Moving forward, it is important to consider the “phenotype” of the users – the structure of their daily work and their preferences when it comes to new technologies – when recommending solutions.
Organizational Compatibility
LLMs are fundamentally a user-centric technology. Improvements in productivity, experience, and quality are directly tied to which workers adopt them and how they eventually use them in their jobs. Some organizations are finding that due to the nature of their businesses, they’re not able to fully exploit LLMs.
Consider another pilot use case at Mass General Brigham: an in-basket messaging LLM to assist clinicians in drafting responses to patient emails. To mitigate responses with risky hallucinations, they prevented the LLM from providing medical advice. However, they believe this made the tool far less useful to doctors and contributed to low adoption. Now, they’re re-evaluating how restrictive to be as they balance usability and usefulness with responsibility.
Levels of uptake can be influenced both by the quality of the application and the available bandwidth of the user. Leaders at Cushman & Wakefield, a global commercial real estate company, have been experimenting with applications of generative AI across their organization. One was a digital buddy to assist property managers in navigating the organization. However, due to high turnover in these roles, there wasn’t enough available bandwidth to roll out this digital buddy. More generally, some companies have seen a disconnect in what central management perceives as high-impact applications vs. what real-life users experience.
ROI evaluation
Bringing in a transformative technology, especially one with compute demands as high as those of LLMs, is expensive. Organizations across different industries are compelled to prioritize use cases with higher added value. In terms of hard ROI, Jonathan DeBusk of IBM points out that Generative AI can make workers more productive, enabling them to do higher value work in the time they free up.
Sal Companieh of Cushman & Wakefield points out that the hard ROI calculation is complicated as some use cases may strictly add cost in the short term but generate more transactions and revenue in the long term. For real estate agents, faster email drafting and market research can free up time to take on more clients. She measures the impact of each use case, considering capacity gained and subsequent changes in the staffing model.
Adam Landman of Mass General Brigham highlights how soft ROI, such as higher provider job satisfaction and improved patient population health, can lead to hard ROI, such as clinician and patient retention.
The Workforce of the Future
What does this all mean going forward? It’s a question many industry leaders are grappling with. For Jonathan DeBusk of IBM, it means dramatically reducing or eliminating the manual drudgery in HR jobs, replacing it with higher-paying, more strategic work in the future. He sees managers having more thoughtful and less transactional conversations with their mentees as clerical responsibilities migrate to digital buddies. Adam Landman of Mass General Brigham sees a future without burnout for clinicians, where they are able to support more patients with higher quality, more equitable care. Sal Companieh of C&W says that some work in commercial real estate may change enough to warrant changing job descriptions and a shift in the skills necessary to succeed.
Our research goal is to capture not only best practices for integrating LLMs in the workplace but also best practices for piloting different solutions and pivoting in response to feedback.
Ask Us Anything
We’ve been talking to all sorts of companies about their early use cases of Generative AI and their plans for deploying these technologies. In this and future newsletters, we will work through some of the questions that we’re hearing. If you’d like to pose a question for the newsletter, please write to us at workofthefuture@mit.edu with the subject line “Ask MIT.”
“We have seen a mixed bag of some companies sticking to one Generative AI vendor while others experiment with several foundational models and vendors. Which approach is better?”
Ultimately, the answer depends on your organization and the resources available to you. The performance difference between out-of-the-box base models and task-specific fine-tuned models (or those that use retrieval-augmented generation, RAG) is likely to be larger than the performance differences among base models. Given that contracting with and integrating multiple vendors can be quite costly, it is important to consider these other avenues first.

Nevertheless, when comparing vendors, their official benchmark reports are useful. Depending on your particular business needs, you may choose a vendor that performs better in math (e.g., MGSM) but worse in common knowledge (e.g., HellaSwag). For open-source models, HuggingFace’s Big Benchmark Collection is a great starting point. However, you may want to develop your own task-specific benchmark for a more robust evaluation.

After base model selection, it is very likely that your organization will want to specialize LLMs toward specific tasks. To maintain the simplest and most consistent user experience, there have recently been efforts to orchestrate different specialized LLMs through LLM routing. An LLM router can be anything from a simple classifier to another LLM; its role is to assign queries to the appropriate specialized models based on the query topic, predicted query difficulty, desired quality level, and so on. While the technique is still maturing, semantic similarity is one possible approach for matching queries to topics.
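To make the routing idea concrete, here is a minimal sketch of a semantic-similarity router in Python. The route names and descriptions are hypothetical, and the bag-of-words cosine similarity is a toy stand-in for a real sentence-embedding model:

```python
from collections import Counter
import math

# Hypothetical specialized models, each summarized by a short text
# describing the kinds of queries it handles.
ROUTES = {
    "math-model": "arithmetic algebra equations calculations numbers math",
    "general-model": "common knowledge facts everyday reasoning trivia",
    "code-model": "programming code python debugging software functions",
}

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A production router would
    # substitute a real embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(query: str) -> str:
    # Assign the query to the specialized model whose description
    # it most closely resembles.
    q = embed(query)
    return max(ROUTES, key=lambda name: cosine(q, embed(ROUTES[name])))
```

In practice, the same structure holds whether the router is a lightweight classifier like this or another LLM; only the scoring function changes.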
— Sabiyyah Ali, MIT Work of the Future Fellow
Recent Advances
How can we effectively use Generative AI inside our organizations? MIT Prof. Kate Kellogg recently published a paper on this with a multidisciplinary team of researchers including Fabrizio Dell’Acqua, Karim Lakhani and Edward McFowland III from Harvard Business School; Ethan Mollick from The Wharton School; and Hila Lifshitz-Assaf from Warwick Business School. The paper shows that current AI capabilities form a jagged frontier that can make it extremely difficult for organizations and professionals to use AI effectively.
The jagged frontier is where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI. The researchers conducted a field experiment with 758 BCG consultants and gave them one of two tasks: develop a new product or solve an analytic problem. For tasks inside the frontier (such as coming up with a list of ideas for a new shoe, choosing the best idea, developing a prototype, and writing a 2,500-word article describing the end-to-end process for developing the shoe as well as lessons learned), consultants who had access to GPT were significantly more productive and produced significantly higher quality results. However, for tasks outside the frontier (such as analyzing qualitative interview data and quantitative financial data to recommend to the CEO which of three brands the company should focus on), consultants who had access to GPT performed worse. Another finding was that lower-skilled employees benefited more from GPT than did higher-skilled employees, which has major implications for long-term role reconfiguration in teams. As lower-skilled employees take on new tasks with the help of Generative AI, higher-skilled employees will be freed up to spend more time on complex tasks or invent new tasks altogether.
Because of the jagged frontier – the non-intuitive strengths and weaknesses of AI performance relative to humans, and how it is changing over time – it will be important for organizations to identify which of their common tasks fall inside versus outside the frontier and to train employees to work seamlessly and accurately with LLMs. This will be particularly challenging given that the frontier is rapidly expanding.
Another difficult facet of LLM applications for organizations is ROI evaluation and forecasting. Headcount reduction in white-collar professions has been cited as one of the sources of hard ROI. At the same time, a recent MIT study examines the feasibility of replacing workers entirely using AI rather than just automating parts of their tasks. It finds that at present, it may not be economically attractive to replace workers on a large scale. The authors argue that as costs for AI services decrease over time, job displacement is inevitable but slow enough to allow for reskilling and the introduction of new policies to protect workers.