- 🧪 Falcon 40B-Instruct scored only 22.26% on Arabic STEM multiple-choice tasks, highlighting the current ceiling.
- 💻 Arabic code generation, measured with HumanEval-Ar, still lags far behind English-language results.
- 🔄 In technical domains, LLMs translate more accurately from Arabic to English than in the reverse direction.
- 🌐 Multilingual models underperform unless they are explicitly trained on Arabic and STEM content.
- 🧩 The 3LM benchmark is the first to comprehensively evaluate Arabic models on both STEM and code.
The Growing Need for Arabic LLMs in STEM Domains
Arabic-speaking learners, educators, and digital creators have long been at a disadvantage in AI. Language models are trained predominantly on English content, so they perform well on technical tasks, but only in English. Arabic LLMs still lag behind, especially in STEM (Science, Technology, Engineering, Mathematics) knowledge and code generation. As the Middle East and North Africa (MENA) region adopts more AI in education, digital learning, and automation, capable Arabic language models are urgently needed. Platforms like Bot-Engine, which run bots and assistants in many languages, often struggle because capability varies sharply across languages. This is why 3LM matters: it is a benchmark designed to test how Arabic LLMs perform in STEM and coding, and it marks a significant shift.
The Capability Gap in Multilingual AI
Large language models (LLMs) such as GPT, PaLM, and Claude have made real progress with many languages, but that support is often shallow. Just because a language appears in an LLM's training data does not mean the model truly understands it, especially in technical areas like mathematics or software development. Arabic in particular suffers from several long-standing performance problems:
- Lower accuracy on domain-specific questions: In STEM multiple-choice tests, Arabic models score far below their English-trained counterparts.
- Weak handling of code syntax and logic: Even simple coding tasks produce more errors when posed in Arabic, revealing a gap in instruction understanding and fluent code generation.
- Scarce data for Arabic technical terminology and conventions: High-quality, domain-specific Arabic text is rare compared with the abundance of English content.
Prior benchmarks such as XTREME and XGLUE test general language understanding and translation, but they fall short on harder tasks like code generation or working with physics equations. A model trained on general language will not excel at STEM on its own; it needs targeted adaptation. This gap is especially acute for Arabic LLMs.
What is 3LM and Why It Matters
The 3LM benchmark (Arabic LLMs in STEM and Code) aims to close this evaluation gap by providing targeted, curated challenges that measure technical reasoning, coding ability, and nuanced translation for LLMs operating in Arabic.
Key Goals and Features:
- ✅ Real STEM questions drawn from school curricula in Arabic-speaking countries.
- ✅ Synthetic datasets that probe edge cases and fill gaps in public educational data.
- ✅ Bidirectional translation tasks that check how faithfully complex technical ideas are conveyed between Arabic and English.
- ✅ Code generation tasks, including a translated HumanEval benchmark adapted for Arabic prompts.
A benchmark like 3LM means AI developers and tech platforms no longer have to rely on anecdotes or English-language results to judge how a model performs in Arabic settings. 3LM targets Arabic capability directly in precise, logic-driven domains, which is exactly what practical multilingual AI requires.
A Closer Look: How the 3LM Benchmark Is Structured
Rather than relying on a single kind of test, the 3LM benchmark breaks model performance into distinct components, each designed to probe a different skill.
1. STEM Multiple-Choice Questions
This component mixes real educational materials from Arabic schools and universities with synthetic questions generated to broaden coverage across the main science fields: physics, chemistry, biology, and mathematics.
These multiple-choice questions test subject-specific comprehension and reasoning, reflecting the kinds of problems students and teachers encounter every day.
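To make the format concrete, here is a minimal sketch of how such a multiple-choice item might be represented and scored by exact match. The field names and the example item are illustrative assumptions, not the actual 3LM schema.

```python
# Illustrative sketch only: the field names below are assumptions,
# not the actual 3LM dataset schema.
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str        # Arabic question text
    choices: list[str]   # answer options
    answer_index: int    # index of the correct option

def exact_match_accuracy(items: list[MCQItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice index matches exactly."""
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer_index)
    return correct / len(items)

# Example: one physics-style item, answered correctly.
item = MCQItem(
    question="ما وحدة قياس القوة؟",              # "What is the unit of force?"
    choices=["جول", "نيوتن", "واط", "باسكال"],   # Joule, Newton, Watt, Pascal
    answer_index=1,
)
print(exact_match_accuracy([item], [1]))  # 1.0
```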
2. HumanEval-Ar for Arabic Code Generation
Adapted from OpenAI's HumanEval dataset, this component presents Python coding tasks written entirely with Arabic prompts. For example, a model might be asked to "أكتب دالة تحسب عدد الأعداد الأولية في مصفوفة" ("Write a function that counts the prime numbers in an array").
The test measures how well a model can read such a prompt, follow the logic, work out what the task requires, and produce clean, correct code. This reflects how useful the model would be in automated tools such as programming tutors or chatbot debuggers.
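As a concrete illustration, here is the kind of solution such a prompt expects. This is a minimal sketch: the function name, signature, and test values are our own choices, not taken from the HumanEval-Ar dataset.

```python
# A minimal sketch of a solution to the prompt
# "أكتب دالة تحسب عدد الأعداد الأولية في مصفوفة"
# ("Write a function that counts the prime numbers in an array").
# The name and signature are illustrative, not from the dataset.

def count_primes(numbers: list[int]) -> int:
    """Return how many elements of `numbers` are prime."""
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return False
        return True
    return sum(1 for n in numbers if is_prime(n))

assert count_primes([2, 3, 4, 5, 9]) == 3  # 2, 3, and 5 are prime
```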
3. Translation Evaluation Tasks
This part tests the ability to translate technical STEM content between English and Arabic, material that depends on advanced terminology (e.g., "differential equations," "molar mass," "Boolean logic").
- Arabic to English: How well can a model understand an Arabic science explanation and render it in precise technical English?
- English to Arabic: Can the same model reconstruct the ideas in Arabic phrasing without losing meaning?
By testing all three components carefully, 3LM gives a full picture of how a model performs in Arabic STEM.
Performance: How Current Arabic LLMs Compare
As of early 2024, leaderboard rankings from the first 3LM evaluation show clear differences in model capability:
| Model | STEM Task Accuracy | HumanEval-Ar Accuracy |
|---|---|---|
| Falcon 40B-Instruct | 22.26% | 14.23% |
| XGLM-7.5B | 15.33% | Lower (Unpublished) |
| BLOOMZ-7B1 | <15% | N/A |
(Source: Technology Innovation Institute, 2024)
What the numbers tell us:
- 🚀 Falcon 40B-Instruct, a model trained for Arabic, leads on both STEM and code tasks.
- 🚫 Multilingual models such as BLOOMZ and XGLM underperform, often managing little beyond basic translation.
- 📉 Even the best model tops out around 22% on the STEM test, leaving substantial room for improvement compared with English results on similar benchmarks (often 50–70%).
These are zero-shot results: the models were neither fine-tuned nor given task-specific prompting, so the numbers are informative but likely understate peak performance. Even so, the consistent failures point to underlying design and data gaps.
The Distinct Challenge of Arabic Code Generation
Translating programming tasks into Arabic may sound straightforward, but building LLMs that answer them well exposes serious problems:
- 🗣️ Ambiguous prompts: Arabic grammar varies considerably across regions, and many coding terms are borrowed from English or translated inconsistently.
- 🧾 Missing terminology: There is no standard dictionary for programming in Arabic, so models must guess at mappings, which leads to vague or incorrect code.
- 🐍 Mismatched Python logic: Even when the meaning is clear, converting the task into correct control flow, data handling, and output conventions remains hard.
HumanEval-Ar addresses these issues by providing unambiguous prompts with known reference solutions, enabling fair testing of whether a function is functionally correct, not merely syntactically valid.
This style of evaluation is essential for developers building assistant bots for Arabic-speaking programmers, for whom correct output is non-negotiable.
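To illustrate execution-based checking, here is a minimal harness sketch that runs a generated function against reference test cases. This is our own illustration, not the actual 3LM evaluation code; real HumanEval-style harnesses also sandbox execution and enforce timeouts.

```python
# Illustrative execution-based test harness, NOT the actual 3LM/HumanEval-Ar code.
# Real harnesses sandbox the generated code and enforce timeouts.

def passes_tests(generated_code: str, entry_point: str,
                 test_cases: list[tuple[tuple, object]]) -> bool:
    """Exec the candidate code, then check each (args, expected_output) pair."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                     # any crash counts as a failure

# Example: checking a hypothetical model completion for the prime-counting task.
candidate = """
def count_primes(numbers):
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return sum(1 for n in numbers if is_prime(n))
"""
print(passes_tests(candidate, "count_primes",
                   [(([2, 3, 4, 5, 9],), 3), (([],), 0)]))  # True
```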
Translation Direction Matters: EN → AR vs. AR → EN
Translation tasks may appear symmetric, but technical language introduces hidden asymmetries:
- Arabic to English: LLMs usually perform better in this direction, drawing on larger English vocabularies and more consistent STEM patterns in their training data.
- English to Arabic: Performance drops, owing to the scarcity of precise Arabic technical terminology and of in-context coding patterns.
Academic Arabic uses complex sentence structures that shift between subjects (mathematical versus biochemical phrasing, for example). Models trained on general MSA data or news articles often miss the nuances needed to produce correct STEM translations.
3LM's bidirectional testing captures this asymmetry well, helping researchers pinpoint translation failures that block accurate understanding or expression in automated systems.
The Evaluation Methodology Behind 3LM
To evaluate reliably across such varied data types, the creators of 3LM combine several methods:
- ✅ Accuracy on multiple-choice questions, scored by strict exact match.
- ✅ Execution-based code tests that verify whether Arabic-prompted functions actually solve the problem.
- ✅ BLEU and TERP scores on translation tasks, combined with human reviewer judgments of fidelity, style, and semantic coverage (a scoring sketch follows this list).
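As an illustration of the automated translation scoring, here is a minimal corpus-level BLEU computation using the sacrebleu library. The sentence pair is invented, and 3LM's actual scoring pipeline may differ in tokenization and settings.

```python
# Minimal BLEU scoring sketch with sacrebleu (pip install sacrebleu).
# The hypothesis/reference pair is invented, not 3LM data.
import sacrebleu

hypotheses = ["Differential equations describe how quantities change over time."]
references = [["Differential equations describe how a quantity changes over time."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale; higher is better
```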
Crucially, these evaluations are model-agnostic: the benchmarks are fair regardless of how an LLM is built or trained.
Developers can plug in their own models and test them with the same metrics, with no need to change fine-tuning or prompt style.
Real-World Uses for Automation Platforms
The 3LM benchmark is not just academic. For platforms that depend on reliable multilingual AI, it translates into better products.
Examples include:
- 🤖 Bot-Engine assistants that explain STEM concepts in Arabic to students via chat or voice.
- 🔧 Coding support bots, driven by Arabic prompts, that help beginners learn to debug and run Python code.
- 📚 Educational LMS platforms that use AI to auto-generate quizzes or summaries in Modern Standard Arabic and local dialects.
Clear, consistent benchmarks give engineers the quantitative feedback they need to tune their bots for real users, something the Arabic ML community previously lacked.
Opportunities for STEM Teachers and Arabic Content Creators
Arabic LLMs with stronger domain-specific reasoning can accelerate content creation and learner support. Among the top use cases:
- ✍️ Teachers using AI to auto-generate practice STEM questions aligned with local curricula.
- 📓 AI tutors embedded in bilingual learning apps, especially valuable for cross-border online learning.
- 🎥 Video content tools that pair natural-sounding Arabic voiceovers with AI-generated science summaries.
Arabic LLMs trained and measured against benchmarks like 3LM could become essential co-creators, deployed across thousands of classrooms and teaching platforms.
Challenges Still Facing Arabic Technical Models
Arabic LLMs are making progress, but several persistent problems remain:
- Under-represented dialects: Most models stick to Modern Standard Arabic, disadvantaging learners who use Levantine, Gulf, or Egyptian Arabic.
- Data scarcity: The shortage of open-source Arabic datasets in technical domains holds back code-generation training.
- Slower inference: Arabic text tends to tokenize into longer sequences, so tasks demand more compute and throughput suffers.
- Cultural mismatch in source content: Imported datasets may miss regionally relevant terminology and school-level conventions.
Fixing these problems will take many contributors, above all researchers at Arabic-speaking universities.
Call for Open Collaboration
3LM is not a closed project. It invites contributions from the Arabic-speaking and AI development communities:
- 💡 Submit new multiple-choice questions in Arabic STEM domains.
- 🔄 Help validate or improve the translated coding prompts.
- 📢 Collaborate on new evaluation pipelines, especially for newer LLM families such as Llama 3 or Mistral.
This could seed a new generation of Arabic NLP benchmarks, analogous to MMLU or BIG-Bench but focused on regionally relevant skills.
What Comes Next: From Benchmark to Breakthrough
Beyond evaluation, the road ahead for Arabic LLMs in STEM includes:
- 🧠 Retrieval-Augmented Generation (RAG) models that pull in real scientific papers to ground Arabic answers.
- 🧪 Curriculum-strengthening techniques that use few-shot prompting with Arabic examples (see the sketch after this list).
- 🧬 Open collaboration hubs, in the spirit of Hugging Face Spaces, dedicated exclusively to Arabic and bilingual technical content.
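To make the few-shot idea concrete, here is a minimal sketch of assembling an Arabic few-shot prompt for a STEM question. The example question-answer pairs and the prompt layout are illustrative assumptions, not a prescribed 3LM format.

```python
# Minimal sketch of building an Arabic few-shot prompt.
# The example Q/A pairs and layout are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    ("ما وحدة قياس القوة؟", "نيوتن"),  # "What is the unit of force?" -> "Newton"
    ("ما ناتج 7 × 8؟", "56"),          # "What is 7 x 8?" -> "56"
]

def build_few_shot_prompt(question: str) -> str:
    """Prefix the target question with worked Arabic examples."""
    parts = [f"سؤال: {q}\nجواب: {a}" for q, a in FEW_SHOT_EXAMPLES]  # "Question:/Answer:"
    parts.append(f"سؤال: {question}\nجواب:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("ما الرمز الكيميائي للماء؟"))  # "What is the chemical symbol of water?"
```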
Picture an Arabic-speaking engineer asking a local chatbot to explain a machine learning equation and receiving help just as good as an English-speaking engineer would get. That is the goal.
Ready to Try 3LM?
You can explore and use the 3LM benchmark through its open-source home on Hugging Face. Researchers and developers can:
- 🔍 Download the evaluation data from GitHub or Hugging Face Datasets (see the loading sketch below).
- 🚀 Run their own LLMs against the 3LM leaderboards.
- 🛠️ Adapt the prompts or scoring systems for their own experiments.
- 🧠 Help expand the dataset or add dialect support.
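For instance, loading the benchmark with the Hugging Face datasets library might look like the sketch below. The dataset identifier and split name are placeholders, not verified paths; check the 3LM repository for the actual ones.

```python
# Sketch of pulling benchmark data with the Hugging Face `datasets` library.
# "tiiuae/3LM" and split="test" are placeholders, not verified identifiers;
# consult the 3LM repository for the real dataset paths.
from datasets import load_dataset

dataset = load_dataset("tiiuae/3LM", split="test")  # hypothetical ID
for example in dataset.select(range(3)):
    print(example)  # inspect a few benchmark items
```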
As we move toward AI that genuinely works across languages, tools like 3LM do more than expose gaps; they help us close them.
Final Thoughts: Are Arabic LLMs Really Ready?
Arabic LLMs are improving, but the road ahead remains hard. The 3LM benchmark brings much-needed clarity, reproducibility, and comparability to technical LLM evaluation in Arabic, filling a gap that stayed open far too long. From STEM homework help to automated tutoring, the distance between what works today and what is possible is shrinking.
If you build bots, automate workflows, or work with Arabic STEM content, start using 3LM now. Don't just measure; grow.
Citations
Technology Innovation Institute. (2024). 3LM: A Benchmark for Arabic LLMs in STEM and Code. Retrieved from the Hugging Face Blog.
- Falcon 40B-Instruct: 22.26% accuracy on STEM multiple-choice questions.
- Falcon models: 14.23% on HumanEval-Ar Arabic code generation.
- XGLM-7.5B: 15.33% on STEM multiple-choice; multilingual models underperform compared with Arabic-trained ones.


