Software that enhances AI development workflows without being embedded in your application code: IDE extensions, CLI utilities, testing frameworks and observability solutions.

Adopt

These tools represent mature, well-supported technologies that are ready for production use. They offer excellent productivity gains and proven track records in real-world development workflows.

Software engineering copilots

AI-augmented development represents the most significant shift in software engineering since the introduction of compilers. This transition is permanent, and teams that are not actively building capability in this area are already falling behind. We’ve placed Software Engineering Copilots firmly in the Adopt ring.

The tooling landscape offers two broad categories. Model-agnostic interfaces let teams choose and switch between LLM providers: OpenCode stands out for its polished terminal experience and breadth of integration, supporting 75+ model providers including local models for privacy-sensitive teams. Cursor, Windsurf and Zed are standalone editors supporting multiple models. CLI tools such as Aider and Cline work with various providers, Warp reimagines the terminal with AI-enhanced command suggestions, and Cody focuses on enterprise-scale codebase understanding. Provider-specific tools such as Claude Code, Gemini CLI and OpenAI Codex are optimised for their respective models. Claude Code has become extremely popular, reflecting the underlying model’s capability; its interface has inspired several imitations. GitHub Copilot and Tabnine offer traditional IDE integrations with their own model stacks.

Two distinct approaches have emerged: free-form “vibe coding” and structured development methodologies. Kiro exemplifies this choice by offering both: a conversational coding mode for rapid iteration and a dedicated specs mode where AI assists developers in drafting requirements and design decisions through specification files before code generation. Traycer similarly emphasises upfront planning for complex tasks. Cursor enables teams to codify standards through .cursorrules, embedding architectural patterns and guidelines directly into AI assistance.

Senior engineers derive the greatest value, leveraging AI for routine tasks whilst maintaining quality oversight. Junior developers often struggle to evaluate AI suggestions, occasionally accepting flawed implementations or overlooking edge cases. This points to a clear organisational priority: intentional training around effective AI collaboration. Success correlates with careful workflow integration and a “trust but verify” mindset. The gap between teams that embrace these tools effectively and those that don’t will only widen as the tooling continues to improve.

Provider-agnostic LLM facades

The LLM landscape evolves rapidly, making today’s optimal choice potentially outdated within months. We recommend implementing a facade pattern between your application and LLM providers, rather than building directly against specific APIs. This approach reduces vendor lock-in and enables easier testing of alternative models as they emerge. Before writing your own, evaluate existing tools such as the lightweight AISuite, Simon Willison’s LLM library and CLI tool, or heavyweight alternatives such as LangChain and LlamaIndex.
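The shape of such a facade can be very small. Here is a minimal stdlib-only sketch of the idea, with illustrative provider names (the stub classes stand in for adapters that would wrap real vendor SDKs):

```python
from typing import Protocol

class ChatModel(Protocol):
    """Facade interface: application code targets this, not a vendor SDK."""
    def complete(self, prompt: str) -> str: ...

class EchoProvider:
    """Stand-in for one provider adapter (e.g. wrapping a vendor client)."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

class ShoutProvider:
    """A second adapter, showing providers swap without application changes."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()

def summarise(model: ChatModel, text: str) -> str:
    # Application code depends only on the facade, so switching providers
    # becomes a configuration change rather than a rewrite.
    return model.complete(f"Summarise: {text}")

print(summarise(EchoProvider(), "quarterly results"))
print(summarise(ShoutProvider(), "quarterly results"))
```

The same structure is what libraries such as AISuite provide out of the box, with real adapters behind a uniform interface.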

This recommendation reflects our team’s experience seeing projects hampered by tight coupling to specific LLM providers, and the subsequent maintenance burden when transitioning to newer, more capable models.

Notebooks

We’ve placed Notebooks in the Adopt ring because they have become the de facto standard for data science and machine learning experimentation and prototyping. The interactive nature of notebooks, combining code execution with rich text explanations and visualisations, makes them particularly valuable for AI/ML workflows where iterative exploration and clear documentation of model development are essential.

Widespread adoption across both industry and academia, plus an extensive plugin ecosystem and integration with popular AI frameworks, demonstrates their maturity as a method of interacting with code. We especially value how notebooks facilitate collaboration between technical and non-technical team members, as they can serve as living documents that combine business requirements and technical implementation in a single, shareable format.

Jupyter notebooks are the most widely used, supporting multiple languages including Python and Julia. The cloud platforms provide their own implementations: Google Colab, AWS SageMaker Notebooks, Azure Notebooks and Databricks Notebooks. And there are language-specific notebooks, such as Pluto.jl for Julia, Clerk for Clojure and Polynote for Scala.

Trial

These tools show promising potential with growing adoption and active development. While they may not yet have the same maturity as Adopt tools, they offer innovative approaches and capabilities that make them worth exploring for forward-thinking teams.

MLflow

We have placed MLflow in the Trial ring due to its potential as a lightweight and modular option for teams seeking to manage the machine learning lifecycle. Its open-source nature makes it an attractive alternative to the more monolithic cloud-based MLOps platforms provided by vendors such as AWS, Microsoft and Google. A key advantage of MLflow is that it avoids vendor lock-in, giving teams the flexibility to maintain control of their infrastructure and adapt workflows as their needs evolve.

That said, realising the benefits of MLflow requires a certain level of technical expertise to configure and integrate it into existing systems effectively. Unlike cloud-native behemoths such as SageMaker or Vertex AI, MLflow does not provide an all-in-one, plug-and-play experience. Instead, it offers modular components that must be tailored to specific use cases. We recommend assessing MLflow if your organisation values flexibility, has the technical proficiency to manage integrations, and prefers avoiding dependency on proprietary platforms early in your MLOps journey.

Vector databases

Vector databases have emerged as specialised tools for managing the high-dimensional data representations (embeddings) required by AI models. They enable efficient similarity search across text and images. Prominent solutions include Pinecone, Qdrant, Milvus and Weaviate.
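The core operation behind all of these products is nearest-neighbour search over embedding vectors. A brute-force sketch of that operation (using cosine similarity and toy three-dimensional embeddings; real embeddings have hundreds or thousands of dimensions) helps clarify what a vector database actually does:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query: list[float], corpus: dict[str, list[float]]) -> str:
    # Brute-force linear scan. Vector databases replace this with
    # approximate nearest-neighbour (ANN) indexes such as HNSW,
    # which scale the same lookup to millions of vectors.
    return max(corpus, key=lambda key: cosine_similarity(query, corpus[key]))

embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(nearest([0.85, 0.15, 0.05], embeddings))  # "cat"
```

The index structures, filtering, and replication around this lookup are what differentiate the products listed above.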

We’ve generally placed vector databases in the Trial ring, as they have proven valuable for specific use cases such as semantic search and recommendation systems. However, their adoption should be carefully evaluated against individual requirements. Traditional databases may be sufficient for simpler operations, and they avoid the consistency challenge of keeping embeddings synchronised with underlying content changes across separate databases. Alternative approaches, such as Timescale’s PGAI vectorizer, bring vector embedding search directly into the Postgres database, ensuring embeddings remain synchronised as the underlying content changes.

If a vector database is required for your use case, the choice of provider often depends on factors such as scale requirements and whether a managed or self-hosted solution is preferred. Pinecone leads in production readiness but comes with the costs of a managed service, while open-source alternatives such as Qdrant and Milvus offer greater control but demand more operational expertise.

For teams prioritising rapid prototyping and developer experience, Chroma has emerged as a popular choice. Its Python-first approach, minimal configuration requirements, and intuitive API make it accessible for developers without extensive database expertise. A 2025 Rust rewrite delivered significant performance improvements, and a new cloud offering extends its reach. However, Chroma remains best suited for prototyping and small-to-medium scale applications rather than enterprise workloads requiring SLAs and role-based access control.

LanceDB takes a different approach as an embedded vector database, similar in philosophy to SQLite. Rather than running a separate server, LanceDB operates as a library within your application, using Apache Arrow’s columnar format to query vectors directly from disk at near-memory speeds. This makes it particularly compelling for local AI assistants, edge deployments, and scenarios where data must remain on-device. In the Node.js ecosystem it is notable as one of the few embedded vector database options. The trade-off is that embedded architectures have inherent limits for high-concurrency workloads and can suffer cold-start latency in serverless environments.

Local model execution environments

Tools such as Ollama, LM Studio and AnythingLLM provide accessible ways to run open-weight models on local hardware. These environments enable rapid experimentation with open-weight models from providers including Meta (Llama), Mistral, DeepSeek, Alibaba (Qwen), and OpenAI (gpt-oss) without API costs or sending data to external services. Many now support advanced capabilities including web search, tool calling via Model Context Protocol (MCP), and connections to commercial APIs for hybrid workflows.

These tools serve various evaluation needs: developers testing AI features during development, teams comparing model responses for specific use cases, and organisations exploring AI capabilities with sensitive data that cannot leave their infrastructure. The range spans from command-line interfaces such as Ollama to graphical applications such as LM Studio, accommodating different technical backgrounds and preferences.

We’ve placed these in Trial as they offer a valuable alternative approach to model evaluation alongside cloud-based testing. They’re particularly useful for privacy-sensitive prototyping and scenarios where extensive experimentation would be cost-prohibitive via APIs. Teams should consider these tools as one option among many for model evaluation, weighing their benefits against the overhead of local setup and maintenance.

LLM observability tools

The increasing complexity of agentic systems has created a need for development-time observability that goes beyond traditional production monitoring. When LLM applications involved simple API calls, understanding system behaviour was straightforward. Modern agentic builds involve multi-step reasoning, tool orchestration, RAG retrieval and chains of LLM calls where a single user request might trigger dozens of internal operations. Debugging why an agent produced an unexpected result requires visibility into every step of that chain.
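The essential mechanism these tools provide is span capture: recording each operation together with its parent, so a single request can be reconstructed as a tree of calls. A minimal stdlib-only sketch of that idea (the `retrieve`/`call_llm` functions are illustrative stand-ins, not any library's API):

```python
import functools
import time

TRACE: list[dict] = []   # completed spans, appended as calls finish
_stack: list[str] = []   # currently open spans, innermost last

def traced(fn):
    """Record a span per call, capturing the parent to rebuild the chain."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"name": fn.__name__, "parent": _stack[-1] if _stack else None}
        _stack.append(fn.__name__)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            span["ms"] = (time.perf_counter() - start) * 1000
            _stack.pop()
            TRACE.append(span)
    return wrapper

@traced
def retrieve(query):   # stands in for a RAG retrieval step
    return ["doc-1", "doc-2"]

@traced
def call_llm(prompt):  # stands in for a model call
    return f"answer to: {prompt}"

@traced
def answer(question):  # one user request fans out into several operations
    docs = retrieve(question)
    return call_llm(f"{question} given {docs}")

answer("why did the agent fail?")
for span in TRACE:
    print(span["name"], "<-", span["parent"])
```

Standards such as OpenTelemetry formalise exactly this parent/child span model, which is why tools built on it can interoperate.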

We’ve created this section as distinct from Production AI monitoring platforms, which focus on drift detection and performance degradation in deployed systems. LLM observability tools address a different need: understanding what happened inside your application during development and debugging. As agentic architectures become more prevalent, this visibility becomes essential rather than optional.

Phoenix, from Arize AI, has emerged as a leading open-source option in this space. Built on OpenTelemetry, it avoids vendor lock-in whilst providing tracing and evaluation capabilities. Phoenix offers auto-instrumentation for popular frameworks including LangChain, LlamaIndex and DSPy, as well as direct integrations with OpenAI, Anthropic and AWS Bedrock. Teams can self-host for free or use Arize’s cloud offering.

Langfuse is the most popular fully open-source alternative, available under MIT licence with no restrictions on self-hosting. It combines tracing and evaluation capabilities with strong support for multi-turn conversations. Langfuse integrates well with existing workflows and offers a generous free cloud tier for teams not ready to self-host.

For teams already committed to the LangChain ecosystem, LangSmith provides native integration that understands LangChain’s internals and surfaces them in debugging views designed for that framework. Helicone takes a different approach as a lightweight proxy: route your API calls through Helicone’s endpoint and gain observability without SDK changes, whilst also benefiting from gateway features such as caching and rate limiting.

Assess

These tools represent emerging or specialized technologies that may be worth considering for specific use cases. While they offer interesting capabilities, they require careful evaluation due to limited adoption or uncertain long-term viability.

AI application bootstrappers

AI application bootstrappers generate complete applications from prompts or designs. The market has matured rapidly, with Lovable (formerly GPT Engineer) emerging as a leader alongside established tools such as V0, Bolt.new and Replit Agent. Google entered the space in 2025 with Firebase Studio. These tools can dramatically accelerate the creation of demos and prototypes, taking projects from concept to working application in hours rather than days.

The capabilities are improving at a remarkable pace. Lovable’s visual editor now allows Figma-like manipulation with automatic code updates. V0 excels at generating production-ready React components. Bolt.new runs full-stack development entirely in the browser. Enterprise adoption is growing, with major companies using these tools for internal tooling and rapid prototyping.

However, we remain cautious about production use. Success with these tools still correlates strongly with existing software engineering expertise. Senior developers can effectively use them as accelerators, understanding how to refactor generated code and establish proper architectural boundaries. Teams without this expertise risk shipping code they cannot maintain, debug, or evolve. The gap between “working demo” and “production-ready system” remains substantial, and we’re particularly concerned about organisations building on bootstrapped foundations without the capability to evaluate what they’ve built.

We’re watching this space with great interest. The pace of improvement suggests these tools may move toward Trial in future radars. For now, we recommend them primarily for prototyping and proof-of-concept work, with clear separation from production codebases unless your team has the engineering depth to take full ownership of the generated code.

Visual computer use agents

AI agents that interact with computers through visual understanding, controlling screens via mouse clicks and keyboard inputs as a human would, have matured but remain risky. Claude Computer Use allows Claude to control desktops and browsers by seeing the screen and reasoning about interface elements. OpenAI Operator, powered by their Computer-Using Agent (CUA) model, focuses on web browser automation through a managed environment. Browser Use offers an open-source alternative with flexibility across multiple model providers.

Reliability for bounded tasks has improved significantly, with standard office workflows now seeing success rates in the high 80s. However, serious security concerns temper any enthusiasm. Prompt injection attacks, where malicious instructions hidden on web pages hijack agent behaviour, represent a systemic vulnerability across all visual browser agents. OpenAI has publicly acknowledged that this problem “may never be fully solved,” and security researchers warn that these agents “don’t yet deliver enough value to justify their current risk profile” given their access to sensitive data like email and payment information.

We’ve kept visual computer use in the Assess ring. For many automation needs, programmatic approaches via APIs and workflow automation platforms remain both more reliable and more secure than visual interaction. Visual computer use is best suited to isolated environments where the agent cannot access sensitive data or navigate to untrusted websites. Teams considering these tools should grant minimal permissions, avoid broad instructions like “do whatever is needed,” and maintain human oversight for any high-stakes actions.

This section was previously titled “Agentic computer use”.

Lakera

Lakera is an AI safety and robustness platform designed to detect and mitigate risks in machine learning systems. It provides mechanisms for testing and analysis to help developers identify weaknesses or vulnerabilities in AI/ML models prior to deployment. This makes it particularly appealing in contexts where reliability and safety are paramount, such as finance, healthcare, or any domain subject to compliance constraints.

We have placed Lakera in the Assess ring because, while it addresses an important need for AI safety, the platform has several practical limitations that require careful evaluation. Currently, Lakera supports only text-based scanning; teams using multimodal AI systems with images, audio or video will find gaps in coverage. Custom scanning capabilities for business-specific terms or PII detection rely on regex patterns rather than context-aware analysis, which can quickly hit limitations in complex scenarios.

Performance considerations vary significantly between deployment options. The SaaS offering may provide adequate performance for many use cases, but has text size limitations that require applications to handle chunking. Self-hosted deployments offer more control but require substantial GPU resources for acceptable performance. Additionally, Lakera’s scanning is non-stateful: each prompt and response is scanned in isolation, without awareness of the broader conversation context, and only ‘user’ and ‘assistant’ message types are recognised.
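Handling that chunking yourself is straightforward but worth getting right. A naive stdlib-only sketch, splitting on whitespace so words are not cut mid-token (the character limit here is an arbitrary illustration, not Lakera's actual limit):

```python
def chunk_text(text: str, limit: int) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking on
    whitespace where possible so words are not cut mid-token."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Caveat: a single word longer than `limit` is truncated here;
            # a production version should split it rather than drop text.
            current = word[:limit]
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be sent to the scanning endpoint separately.
for chunk in chunk_text("a long prompt that exceeds the service limit", 16):
    print(chunk)
```

Note that chunking interacts badly with the non-stateful scanning described above: an attack split across chunk boundaries may evade detection, so overlap between chunks is worth considering.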

Given these constraints, Lakera may provide valuable safety assurance for straightforward text-based AI applications, but organisations should carefully assess whether its current capabilities align with their specific AI architectures and safety requirements. We recommend conducting thorough proof-of-concept testing that includes your specific modalities and performance expectations before determining if Lakera fits your use case.

Structured output libraries

We’ve placed structured output libraries in the Assess ring as increasingly important tooling for production AI applications that need reliable, typed responses from LLMs.

Libraries such as Instructor, Outlines and Marvin address a common challenge: LLMs naturally produce freeform text, but applications typically need structured data, such as JSON matching a schema or selections from valid options. These libraries constrain LLM outputs to match specified structures through clever prompting, logit manipulation or grammar-based generation.

The practical value is significant. Instead of hoping an LLM produces valid JSON and writing brittle parsing code, developers can specify Pydantic models and receive guaranteed-valid objects. This reduces error handling complexity and makes LLM outputs composable with traditional software. For agentic systems, structured outputs are essential. Agents need to produce function calls and decision objects that downstream code can reliably process.
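The validate-or-fail-loudly pattern at the heart of these libraries can be shown with the standard library alone (libraries like Instructor add the Pydantic integration and automatic re-prompting on failure; the `Ticket` schema here is purely illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class Ticket:
    title: str
    priority: str  # expected: "low" | "medium" | "high"

def parse_ticket(raw: str) -> Ticket:
    """Validate an LLM's JSON reply against the expected shape."""
    data = json.loads(raw)
    ticket = Ticket(title=str(data["title"]), priority=str(data["priority"]))
    if ticket.priority not in {"low", "medium", "high"}:
        raise ValueError(f"invalid priority: {ticket.priority}")
    return ticket

# A well-formed reply parses into a typed object downstream code can trust:
ok = parse_ticket('{"title": "Fix login", "priority": "high"}')

# A plausible but invalid reply fails loudly instead of corrupting state:
try:
    parse_ticket('{"title": "Fix login", "priority": "urgent"}')
except ValueError as e:
    print("rejected:", e)
```

The structured output libraries go further by feeding the validation error back to the model for a corrected attempt, or by constraining generation so invalid output cannot be produced at all.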

We’ve placed these in Assess rather than Trial because the space is rapidly evolving and best practices are still emerging. Instructor has gained significant traction for its simplicity and Pydantic integration, while Outlines offers more sophisticated constrained generation for teams needing fine-grained control. Teams should evaluate which approach matches their reliability requirements and performance constraints. The native structured output features increasingly offered by model providers (OpenAI’s JSON mode, Anthropic’s tool use) may reduce the need for external libraries in some scenarios.

Hold

These tools are not recommended for new projects due to better alternatives or limited long-term viability. While some may still have niche applications, they generally represent technologies that have been superseded by more effective solutions.

Conversational data analysis

Tools such as PandasAI, TableGPT, PromptQL and Julius enable natural language querying of databases and datasets, offering significant productivity benefits for knowledgeable data analysts. Modern database-specific Model Context Protocol (MCP) servers can provide substantial context to models, including schema understanding and data contents. Our experience with JUXT’s own XTDB database revealed remarkable moments where models navigated complex table structures with apparent ease, demonstrating genuine potential for accelerating data analysis workflows.

For experienced analysts, these tools represent a meaningful productivity boost, rapidly converting natural language requests into draft queries that can be refined and optimised. However, our experience also reveals challenges: generated queries can be inefficient or occasionally incorrect despite appearing plausible. The technology sometimes struggles with nuanced requirements and may produce suboptimal approaches that experienced analysts would avoid. Uber’s experience with their internal QueryGPT tool demonstrates both the potential and the complexity, highlighting the significant number of example queries and guardrails required to achieve reliable results.
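The review step that makes this workflow safe can itself be partly mechanised. A stdlib-only sketch using SQLite: inspect the plan of a generated query before trusting it, then sanity-check the result against known data (the table and the "generated" SQL are illustrative, not output from any of the tools above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    CREATE INDEX idx_orders_customer ON orders (customer);
    INSERT INTO orders VALUES
        (1, 'acme', 120.0), (2, 'acme', 80.0), (3, 'beta', 50.0);
""")

# Pretend this arrived from a natural-language-to-SQL tool.
generated_sql = "SELECT SUM(total) FROM orders WHERE customer = 'acme'"

# Step 1: inspect the plan before trusting the query, e.g. to spot a
# full table scan where an index lookup was expected.
plan = conn.execute(f"EXPLAIN QUERY PLAN {generated_sql}").fetchall()
for row in plan:
    print(row[-1])

# Step 2: run it and sanity-check the result against data you know.
(result,) = conn.execute(generated_sql).fetchone()
print(result)  # 200.0
```

Plan inspection catches inefficiency; it does not catch a query that is efficient but answers the wrong question, which is why expert review remains essential.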

We’ve placed conversational data analysis in the Hold ring not because the technology lacks value, but because successful deployment requires users capable of understanding and validating generated queries. These tools offer substantial benefits for data teams with appropriate expertise, but should be approached cautiously by those unable to review and debug AI-generated database queries.

For teams with strong analytical capabilities, these tools can meaningfully accelerate exploratory data analysis and routine query generation, treating AI output as sophisticated first drafts requiring expert review.
