Introducing Cerebras Inference - AI at Instant Speed

Cerebras has unveiled its new AI inference solution, which it claims is the fastest in the world: 20 times faster than NVIDIA GPU-based clouds, with industry-leading cost efficiency.

Powered by the third-generation Wafer Scale Engine (WSE-3), the solution processes 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, all while maintaining high accuracy with native 16-bit weights. The chip's exceptional memory bandwidth and unique wafer-scale design eliminate traditional bottlenecks, enabling real-time AI responses. With open API access and competitive pricing, Cerebras Inference aims to revolutionize how large language models (LLMs) are developed and deployed across industries.
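To illustrate what that API access could look like in practice, here is a minimal sketch of a chat completion request. It assumes the service exposes an OpenAI-compatible endpoint; the base URL, model identifier, and environment variable name below are illustrative assumptions, not details from the announcement.

```python
# Minimal sketch of a request against Cerebras Inference, assuming an
# OpenAI-compatible endpoint. The base URL, model name, and environment
# variable are illustrative, not confirmed by the announcement.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",        # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],       # hypothetical env var
)

response = client.chat.completions.create(
    model="llama3.1-8b",                          # illustrative model ID
    messages=[
        {"role": "user", "content": "Summarize wafer-scale inference in one sentence."}
    ],
)

print(response.choices[0].message.content)
```

At 1,800 tokens per second, a response of a few hundred tokens would return in a fraction of a second, which is what makes the real-time workflows described below practical.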

This breakthrough enables more sophisticated AI workflows, such as real-time intelligence and complex tasks like code generation, that previously demanded extensive processing power and time. As Cerebras extends support to even larger models, the platform is set to open new possibilities in AI innovation.

To learn more about Cerebras Inference, read the full announcement here.