
LLMs for Filipino: Can They Really Understand It?

  • 🧠 GPT-4, Claude Opus, and Mixtral consistently led at understanding and generating Filipino-language content.
  • ⚠️ Popular multilingual models like mT5 and ByT5 struggled badly with Filipino syntax and coherence.
  • 💬 Code-switching and rich morphology make Tagalog and Cebuano especially hard for standard LLMs.
  • 📊 Open-source models such as Yi 34B and DeepSeek performed well on Filipino summarization tasks.
  • 🌍 The shortage of high-quality Filipino and Cebuano datasets remains a major obstacle to improving language models.

AI is going local, and that is a boon for businesses and creators in the Philippines. As chatbots, voice AI, and copywriting tools become commonplace, one question keeps coming up: can today's large language models (LLMs) really handle Filipino languages like Tagalog and Cebuano? That is exactly what the new FilBench benchmark set out to answer. Testing more than 20 popular models, FilBench reveals how well, and how poorly, today's AI understands and generates Philippine languages. The results matter for platforms like Bot-Engine that depend on accurate, multilingual automation.

What Is FilBench, and Why It Matters

FilBench is a new multilingual benchmark suite. It measures how well large language models (LLMs) understand and generate text in Philippine languages, especially Filipino, Tagalog, and Cebuano. Most existing benchmarks focus on English or other high-resource languages, so FilBench fills a significant gap in the AI world: localized evaluation tooling for Southeast Asian languages.

FilBench goes beyond checking whether translations are correct. It probes a broader set of language skills across these areas:

  • Instruction-following: Can models correctly interpret and carry out user commands written in a native language?
  • Summarization: How well can the model condense Filipino content into clear summaries?
  • Reading comprehension: Does the LLM grasp the meaning and context of local content?
  • Reasoning: Can it solve logic tasks in Tagalog, Cebuano, or mixed-language text?
  • Open-ended generation: Can it write fluent free-form content that respects cultural and grammatical conventions?

These tasks are all designed to mirror how real users interact with AI systems in the Philippines. That matters greatly for platforms like Bot-Engine that deploy sales bots, power virtual assistants, and build learning tools for local businesses.

Evaluation Methods: How Performance Is Measured

FilBench uses a rigorous methodology that assesses both the depth and the breadth of what LLMs can do in Philippine languages. The evaluation framework has two main parts:

Zero-shot Evaluation

In zero-shot tests, the model performs tasks without any prior examples. This reveals whether the model can generalize its knowledge to languages it was not explicitly trained on. Zero-shot performance often indicates whether a model's training data contained enough Filipino input or related language patterns.

Few-shot Evaluation

Few-shot learning gives the model a handful of examples before it attempts a task, measuring how well it adapts its behavior when shown in-context demonstrations. For Filipino, just a few well-chosen examples can substantially improve both comprehension and fluency, which makes few-shot learning especially valuable for low-resource languages.
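To make the idea concrete, here is a minimal sketch of how a few-shot prompt for a Filipino summarization task might be assembled. The example pairs and the prompt wording are illustrative placeholders, not FilBench data:

```python
# Minimal sketch: assembling a few-shot prompt for a Filipino summarization task.
# The example pairs below are illustrative placeholders, not FilBench data.

FEW_SHOT_EXAMPLES = [
    {
        "text": "Nagdaos ng pagpupulong ang konseho tungkol sa bagong ordinansa sa trapiko.",
        "summary": "Pinag-usapan ng konseho ang bagong ordinansa sa trapiko.",
    },
    {
        "text": "Inanunsyo ng kumpanya ang pagbubukas ng bagong sangay sa Cebu sa susunod na buwan.",
        "summary": "Magbubukas ang kumpanya ng sangay sa Cebu.",
    },
]

def build_few_shot_prompt(task_text: str) -> str:
    """Prepend worked Tagalog examples so the model can infer the task format."""
    parts = ["Ibuod ang sumusunod na teksto sa isang pangungusap."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Teksto: {ex['text']}\nBuod: {ex['summary']}")
    parts.append(f"Teksto: {task_text}\nBuod:")
    return "\n\n".join(parts)
```

The prompt ends with an unfinished "Buod:" so the model completes the pattern established by the examples, which is the core mechanic few-shot evaluation measures.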

Skill-Based Task Types

  • Instruction-following: Models must act on prompts written in a native language (e.g., “Isulat mo ang buod ng balita.”).
  • Summarization: This involves condensing long Filipino or Cebuano texts into concise statements while preserving the main points and tone.
  • Reasoning: This reveals how logically a model handles local frames of thinking and culture.
  • Comprehension: This involves reading a local-language passage and answering questions or summarizing its meaning.
  • Open-ended generation: This shows the model's creativity, fluency, and command of local idioms and speech patterns in its own writing.

By combining these varied tasks, FilBench mirrors how a real user would interact with a model, giving a complete picture of each model's strengths and weaknesses in Filipino conversations.

How LLMs Scored on FilBench

FilBench's first findings made one thing clear: not all LLMs are equal when it comes to Philippine languages. Performance diverged sharply between proprietary and open-source models, reflecting differences in training data, model size, and architecture.

Top Performers

The models that performed most consistently across all task types were:

  • GPT-4 (OpenAI)
  • Claude Opus (Anthropic)
  • Mixtral (Mistral)

These models produced fluent output, recognized code-switching (Taglish) reliably, and adapted well to different prompting styles. For summarization in Filipino or Cebuano, GPT-4 and Claude Opus scored highest, thanks to a well-balanced combination of language understanding and accurate generation.

  • 🧠 Claude Opus handled complex instruction-following tasks with Filipino commands at near-human quality.
  • 📖 Mixtral showed nuanced reading comprehension, especially on news articles and opinion pieces written in Filipino.

Outstanding Open-Source Models

Several open-source models surprised researchers with what they could do:

  • Yi 34B: Excelled at Filipino summarization and creative generation tasks.
  • DeepSeek: Strong at both reasoning and instruction-following.
  • Zephyr: Shone in fast-response settings, with better-than-average command of Filipino sentence structure.

Statistically, models like Yi and DeepSeek even outperformed older multilingual models such as mT5 and ByT5 when tested on real Filipino content (FilBench, 2024).

Underperformers

Models that did poorly included:

  • mT5
  • ByT5
  • Older regional LLMs without Southeast Asian token training

These models are billed as multilingual, but they often failed to produce coherent, complete sentences or preserve meaning. Their problems stemmed from:

  • Overly broad pretraining data
  • Poor tokenization of Tagalog morphemes
  • Weak handling of code-switching

This confirms what many developers suspected: a multilingual label does not guarantee that an LLM genuinely understands smaller or regional languages.

Why Filipino & Cebuano Are Hard for LLMs

Philippine languages, especially Tagalog and Cebuano, present linguistic challenges that push the limits of standard LLM designs.

1. Code-Switching Prevalence

Philippine users often switch between English and Tagalog, sometimes mid-sentence, a pattern known as Taglish. It complicates both tokenization and context resolution. For example:

“Kailangan ko ng report, mas maganda kung naka-format siya in PowerPoint.”

Only the strongest models can simultaneously parse the task, the user's intent, and the formatting instructions in such mixed-language input.

2. Morphological Richness

Tagalog verbs rely on a rich system of affixes that mark aspect, focus (who performs the action), and direction (e.g., mag-, -in-, -um-, -an). Word forms change substantially depending on sentence structure, and models without a deep grasp of this morphology frequently misparse or mishandle these changes.
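A rough illustration of why this is hard: even a naive affix-stripping sketch has to treat infixes (which sit inside the word, as in s-um-ulat) differently from prefixes and suffixes. Real Tagalog morphology, with reduplication and focus marking, is far more complex than this toy stemmer, which is offered only to show the shape of the problem:

```python
import re

# Rough illustration only: real Tagalog morphology (reduplication, focus
# marking, vowel alternation) is far more complex than this toy stemmer.

PREFIXES = ("nag", "mag", "um", "in", "i")
SUFFIXES = ("an", "in")

def naive_stem(word: str) -> str:
    """Strip one common prefix or suffix; also drop the -um-/-in- infix."""
    w = word.lower()
    # Infixes appear after the first consonant: s-um-ulat -> sulat
    m = re.match(r"^([bcdfghklmnpqrstvwxyz])(um|in)(.+)$", w)
    if m:
        return m.group(1) + m.group(3)
    for p in PREFIXES:
        if w.startswith(p) and len(w) > len(p) + 2:
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) > len(s) + 2:
            w = w[: -len(s)]
            break
    return w
```

Subword tokenizers trained mostly on English tend to split forms like "sumulat" and "magluto" into fragments that carry no morphological signal, which is one reason poorly tokenized models stumble on Tagalog.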

3. Sparse Training Data

English, Chinese, and Spanish dominate language datasets, while few high-quality open datasets exist for Filipino or Cebuano. Tagalog Wikipedia counts only tens of thousands of articles; English Wikipedia has millions.

This imbalance leaves LLMs struggling unless they are fine-tuned on targeted datasets or adapted through transfer learning.

Real-World Implications: Business, Bots, and Brands

This is not just an academic question. FilBench's results carry significant business implications, especially for automation. For companies building on platforms like Bot-Engine, language understanding directly affects revenue.

Common Use Cases

  • Customer Support: Bots that misunderstand local phrasing lose user trust and escalate issues to human agents.
  • Sales Funnels: Off-key Taglish messaging lowers conversion rates and erodes brand trust.
  • Microlearning & EdTech: AI tutors that misread Cebuano questions miss opportunities in rural student markets.
  • E-commerce NLP: Product descriptions written in awkward Filipino drive users away, even when technically accurate.

Impactful Insight

Models that score well on FilBench deliver real improvements for users. Bots can now:

  • Interpret user intent more precisely, even in mixed-language messages.
  • Generate smooth, human-sounding Filipino responses.
  • Build rapport by reflecting local nuances, a major advantage in earning user trust.

The Role of Prompt Engineering

When even strong models fall short, prompt engineering becomes the primary lever for quality.

Techniques That Help

  • Few-shot Prompting: Show models how natural local content is structured using Tagalog examples.
  • Translation Then Rewrite: Let the model draft in English, then instruct it to rewrite in idiomatic Filipino.
  • Contextual Priming: Embed local cultural references in the system prompt to keep meanings aligned.

In human-rated evaluations run by developers using Bot-Engine, these methods have improved results by more than 25%.
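The translation-then-rewrite technique above can be sketched as a two-step pipeline. Here `call_model` is a placeholder for whichever LLM client you actually use; it is stubbed so the control flow can be traced end to end, and the prompt wording is illustrative:

```python
# Sketch of the translate-then-rewrite pattern. `call_model` is a placeholder
# for a real LLM API client; it is stubbed here so the flow is runnable.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"[model output for: {prompt[:40]}...]"

def draft_then_localize(task: str) -> str:
    """Step 1: draft in English. Step 2: rewrite the draft in natural Filipino."""
    english_draft = call_model(f"Write a short response in English: {task}")
    return call_model(
        "Rewrite the following in natural, conversational Filipino, "
        f"keeping the meaning intact:\n\n{english_draft}"
    )
```

Splitting the work this way plays to a model's strength (English drafting) while confining the harder localization step to a focused rewrite instruction.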

Implementation Tooling

Platforms like Bot-Engine now let you:

  • Create custom prompt sections per language
  • Configure language-detection triggers
  • Route language-aware voice, text, and UI interactions through multimodal pathways
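A language-detection trigger of the kind listed above can be approximated with a simple heuristic. The marker wordlist, thresholds, and model names below are illustrative assumptions; a production system would use a trained language-ID model rather than this sketch:

```python
# Heuristic sketch of language routing. The wordlist, thresholds, and model
# names are illustrative; production systems would use a trained language-ID
# model instead.

TAGALOG_MARKERS = {"ang", "ng", "mga", "ko", "mo", "po", "siya", "hindi", "naman"}

def detect_language(message: str) -> str:
    """Classify a message as English, Taglish, or Filipino by marker density."""
    tokens = message.lower().split()
    if not tokens:
        return "en"
    hits = sum(1 for t in tokens if t in TAGALOG_MARKERS)
    ratio = hits / len(tokens)
    if ratio == 0:
        return "en"
    return "taglish" if ratio < 0.5 else "fil"

# Hypothetical route table mapping detected language to a model endpoint.
ROUTES = {"en": "general-model", "taglish": "code-switch-model", "fil": "filipino-tuned-model"}

def route(message: str) -> str:
    return ROUTES[detect_language(message)]
```

Routing Taglish traffic to a model that handles code-switching well, rather than to a monolingual default, is exactly the kind of decision the FilBench scores can inform.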

The Ethics and Inclusion Perspective

Beyond the technical details, there is a more important conversation about linguistic inclusion in the AI age. If models handle Philippine languages poorly:

  • Billions of interactions get funneled through poorly translated English.
  • Millions of people are excluded from voice assistants, edtech tools, and chatbots.
  • Cultural nuances are lost, flattening online experiences into sameness.

Ethical Questions to Ask

  • Whose data is most important in language models?
  • Why are African, Southeast Asian, and Indigenous languages still on the edges of LLM development?
  • Are businesses reinforcing linguistic hierarchies by offering only “English-first” experiences?

FilBench serves as a diagnostic tool, exposing not just technical gaps but equity gaps in today's AI landscape.

How Businesses Can Choose the Right LLM for Filipino

Choosing the best model means weighing your application's needs against your technical constraints. Use this guide to decide:

| Use Case | Prefer These Models | Why |
| --- | --- | --- |
| Chatbots & Support Agents | GPT-4, Claude Opus, Yi 34B | Strong instruction-following and sentiment handling |
| Content Writing & Blogs | Mixtral, DeepSeek, GPT-4 | Most accurate summarization |
| On-Premise or Cloud Control | Falcon 2, Zephyr, Yi | Open-source and light on infrastructure |
| Multilingual Routing | GPT-4, Claude Opus, Mixtral | Strongest cross-lingual reasoning |

Don't compare model size alone; also weigh licensing costs, server load, and latency. For local businesses, the best model is not the biggest one but the one suited to how your users actually speak.

Bot-Engine’s Perspective on FilBench Insights

At Bot-Engine, we have begun applying FilBench's findings to product development and customer solution planning.

What That Looks Like:

  • Prioritizing models that perform well in Filipino for ready-to-use bot tasks.
  • Automating Taglish detection and prompt routing for smoother language handling.
  • Building prompt libraries with culturally aware default phrasing and Filipino fallbacks.

These moves substantially boost bot adoption, especially in sectors such as offshore education, fintech, and utilities, where local language use varies widely.

The Road Ahead: Towards Better Southeast Asian Language Models

FilBench is only the start. A more inclusive, higher-performing AI ecosystem will need:

  • Regular Benchmarks: Not just annual runs; evaluations should happen monthly and per task type.
  • More Native-Speaker Datasets: Crowdsourced, openly licensed corpora in dialects like Hiligaynon, Ilocano, and Waray.
  • Cross-border Collaboration: Among universities, SMEs, and public-sector leaders across Southeast Asia.
  • Interoperable Open Tooling: To benchmark new models against local standards quickly.

We need to aim higher than bots that merely "work". The goal is natural, equitable AI for everyone.

Smarter, Localized AI Is the Key to Scalable Automation

Thanks to tools like FilBench, the gap between global AI models and authentic local language experiences is finally closing. For brand leaders, startups, and public platforms in the Philippines, that means smarter bots, deeper user trust, and automation that scales while genuinely feeling Filipino.

Look at models that FilBench rates highly, and pair them with tools like Bot-Engine to put them to work in your automation. Because in the age of AI, language and culture should never be an afterthought.


Citations

FilBench. (2024). FilBench: Evaluating Language Models on Philippine Languages.
OpenAI. (2024). Welcome GPT OSS: New open-source model family released. Retrieved from https://openai.com/
Technology Innovation Institute. (2024). Falcon 2: Trained on 5000B tokens, supports 11 languages. Retrieved from https://www.tii.ae/
Open Foundation Community. (2024). Cross-linguistic LLM Benchmarks and the Path to Larger Global Inclusion.
