Pipeline

From raw Japanese text to readable pitch accent.

The system combines established morphological analysis with lexical pitch-token combination and a custom rule engine. The goal sounds straightforward: turn difficult and abstract pitch accent information into something learners can read directly, without hiding the applied rules and grammar decisions that produced the parsing result.

01

Tokenize the sentence

Input text is split into word objects with MeCab and UniDic. Each token carries the information needed for later decisions: surface form, lemma, reading, part of speech, conjugation details, pitch drop candidates and grammar metadata.
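As a rough illustration of what such a token might carry, here is a minimal sketch of a token record. The class name `Token` and every field name are assumptions made for this example, not akusento's actual schema. The sample values for 禁止 follow the walkthrough on this page, where it is treated as Heiban with drop = 0.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One morphological token (hypothetical field names, for illustration only)."""
    surface: str                 # text as it appears in the sentence
    lemma: str                   # dictionary form
    reading: str                 # kana reading
    pos: str                     # part of speech
    conjugation: str = ""        # conjugation details, empty when not applicable
    drop_candidates: list[int] = field(default_factory=list)  # possible pitch drops
    grammar: dict[str, str] = field(default_factory=dict)     # extra grammar metadata


# Sample record; a drop of 0 stands for Heiban (no pitch drop).
token = Token(surface="禁止", lemma="禁止", reading="キンシ", pos="名詞",
              drop_candidates=[0])
```

Keeping several drop candidates on the token, rather than a single value, leaves the final choice to the later context-dependent stages.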

02

Combine pitch tokens

Adjacent morphological tokens are merged into larger pitch units when a dictionary or combiner lookup shows that they behave as one accent phrase.
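One way to picture this step is a greedy longest-match merge over adjacent token surfaces. The sketch below is an assumption about how such a lookup could work, not the project's implementation: the function `combine_tokens`, the `max_span` limit, and the token boundaries in the sample input are hypothetical, while the `total_combiners` name and the drop value 5 come from the walkthrough on this page.

```python
def combine_tokens(surfaces, combiners, max_span=4):
    """Greedily merge adjacent surfaces whose concatenation is a key in `combiners`."""
    units = []
    i = 0
    while i < len(surfaces):
        merged = None
        # Try the longest window first so larger units win over shorter ones.
        for span in range(min(max_span, len(surfaces) - i), 1, -1):
            candidate = "".join(surfaces[i:i + span])
            if candidate in combiners:
                merged = (candidate, combiners[candidate], span)
                break
        if merged:
            surface, drop, span = merged
            units.append({"surface": surface, "drop": drop, "combined": True})
            i += span
        else:
            # No lookup hit: the token passes through unchanged.
            units.append({"surface": surfaces[i], "drop": None, "combined": False})
            i += 1
    return units


total_combiners = {"残念ながら": 5}
morph_tokens = ["残念", "ながら", "、", "この", "街", "は"]  # assumed token boundaries
units = combine_tokens(morph_tokens, total_combiners)
```

Here the first two tokens merge into a single unit with drop = 5, and the remaining tokens are left for the sentence-level rules.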

03

Apply context rules

After lexical pitch-token combination, the sentence-level rule engine handles suffix accenting, particles, auxiliaries, conjugations, compounds, grammar boundaries, and known exceptions before deciding how the accent behaves in context.
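A context rule of this kind can be sketched as a small function that inspects a unit and its neighbours and reports what it changed. The example below models only the だ + から case from the walkthrough on this page (a preceding Heiban word gives だ a drop of 1). The `Unit` class, the rule name, the starting drop values for だ and から, and the engine loop are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    surface: str
    drop: int  # 0 = Heiban (no drop); n = drop after the n-th mora


def rule_da_kara_after_heiban(units, i):
    """If a Heiban word is followed by だ + から, give だ a drop of 1."""
    if i == 0 or i + 1 >= len(units):
        return None
    prev, cur, nxt = units[i - 1], units[i], units[i + 1]
    if cur.surface == "だ" and nxt.surface == "から" and prev.drop == 0:
        cur.drop = 1
        return "da_kara_after_heiban"
    return None


RULES = [rule_da_kara_after_heiban]  # a real engine would hold many more rules


def apply_rules(units):
    """Run every rule at every position and record which ones fired."""
    applied = []
    for i in range(len(units)):
        for rule in RULES:
            name = rule(units, i)
            if name:
                applied.append({"rule": name, "index": i})
    return applied


units = [Unit("禁止", 0), Unit("だ", 0), Unit("から", 0)]
applied = apply_rules(units)
```

Returning the rule name alongside the modified units is what would let a frontend show the learner which rule produced a given accent.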

04

Render the result

The parsed data is returned to the frontend as structured output and rendered as readable Japanese: furigana, pitch drop marks, devoicing marks, color-coded pattern classes, and clickable explanations for rules that were applied.
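The furigana part of that rendering can be sketched with standard HTML `<ruby>` markup. The helper `render_ruby` and the contents placed under each key are assumptions for this example. Only the four top-level key names (`html`, `json_data`, `pitch_accents`, `applied_rules`) and the readings are taken from the walkthrough on this page.

```python
import json
from html import escape


def render_ruby(surface, reading=None):
    """Wrap a segment in <ruby> markup when it has a furigana reading."""
    text = escape(surface)
    if reading is None:
        return text
    return f"<ruby>{text}<rt>{escape(reading)}</rt></ruby>"


# (surface, furigana) pairs; None means no furigana is needed.
segments = [("残念", "ざんねん"), ("ながら", None), ("、", None),
            ("この", None), ("街", "まち"), ("は", None)]

html = "".join(render_ruby(surface, reading) for surface, reading in segments)

# Hypothetical response body; what each key holds is an assumption.
response = {
    "html": html,
    "json_data": [{"surface": s, "reading": r} for s, r in segments],
    "pitch_accents": [],
    "applied_rules": [],
}
payload = json.dumps(response, ensure_ascii=False)
```

Pitch drop marks, devoicing marks, and color classes would need additional markup that this sketch leaves out.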

Input: 残念ながら、この街は戦闘行為禁止だから。

Tokenize: 残念ながらこの街は戦闘行為禁止だから

Combine: Lexical pitch-token lookup
morph tokens: 残念ながら
surface match lookup: total_combiners["残念ながら"], drop = 5
result: 残念ながら (combined pitch token)

Rules: Sentence-level pitch rules
suffix rule: 戦闘 + 行為 → 戦闘行為, drop = 5
non-combine rule: 禁止 does not compound; Heiban, drop = 0
previous word: 禁止
context: drop = 0
condition: previous word is Heiban (true)
result: だ + から → だ＼から; だ gets drop = 1

JSON: { html, json_data, pitch_accents, applied_rules }

Output: 残念(ざんねん)ながら、この街(まち)は戦闘(せんとう)行為(こうい)禁止(きんし)だから。

Accuracy

Measured on real prose, not toy examples.

Pitch accent parsing becomes difficult when the input is no longer a clean dictionary entry. That is why akusento is constantly stress-tested against long-form novel text, where compounds, names, kana spellings, particles, suffixes, and ambiguous readings appear naturally.

99.64% estimated content-token accuracy

179 logged corrections

118,861 characters reviewed

~19 error-free sentences per streak

How the benchmark was collected

The current production benchmark is actively audited against real, unfiltered prose from 村上春樹『ねじまき鳥クロニクル』第３部. The evaluation follows a deliberately strict standard: akusento's output is checked against professional audiobook narration at the mora level. Every apparent pitch, reading, chunking, or contextual deviation is paused, analyzed, and researched. Only deviations that reflect systematic parser issues are logged as engine errors.

Each logged error is assigned to a specific failure class, such as compounding rules, part of speech, boundary chunking, or contextual homophones. This breakdown drives the debugging loop: it separates lexicon lookup limitations from failures in contextual rule handling.

57 compounding issues
47 pitch issues
31 chunking issues
21 contextual homophones
16 reading issues
7 part-of-speech issues

To avoid inflating the accuracy score, the estimate converts the character gaps between errors into token counts using a ratio of 2.37 characters per token. This ratio is computed from the audited text itself after filtering out single-hiragana grammatical particles (は, が, に, etc.) and punctuation. As a result, the benchmark measures the engine's performance on content words only, such as complex compounds, conjugating verbs, and proper nouns.
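The arithmetic behind the headline figure can be checked with a short calculation. This is a simplified sketch that assumes each logged correction counts as one wrong content token and that the token total is the reviewed character count divided by the characters-per-token ratio. The function name is hypothetical, and the actual evaluation may aggregate the per-gap values differently.

```python
def estimate_content_token_accuracy(chars_reviewed, logged_errors, chars_per_token):
    """Estimate accuracy as 1 - errors / estimated content tokens."""
    estimated_tokens = chars_reviewed / chars_per_token
    return 1.0 - logged_errors / estimated_tokens


# Figures published on this page for the current benchmark.
accuracy = estimate_content_token_accuracy(
    chars_reviewed=118_861,
    logged_errors=179,
    chars_per_token=2.37,
)
print(f"{accuracy:.2%}")  # prints 99.64%
```

Dividing by a content-token estimate rather than by raw characters is what keeps the score from being inflated: the same 179 errors measured per character would give a much higher percentage.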

Under this strict evaluation, the parser averages an error-free streak of ~658 characters (roughly 19 consecutive literary sentences) between mistakes. Non-standard spellings chosen by the author are flagged and kept separate from these core metrics.

Download Latest Benchmark (June 2026)

Archived Datasets

May 2026: 遠まわりする雛 (Baseline), 99.52%, 87 errors, 44,998 chars

Sentence context has been removed from the public reports for copyright reasons.

What these numbers actually mean: the benchmark is an internal, manually audited development run on real literary text, not a universal claim that every possible input will be 99.64% correct. It is published because transparency matters: the errors are counted, categorized, and used to improve the parser.

Design principles

Readable output without hiding the complexity.

Context over lookup

Dictionaries are useful, but Japanese is not spoken as isolated entries. akusento is built around sentence context: what comes before, what comes after, and how grammar changes the accent shape.

Explainable rules

When the parser applies a sentence-level rule, the frontend exposes that decision. This makes the tool useful not only as a deterministic answer machine, but also as a learning surface.

Real-world edge cases

The rule system is shaped by actual failures: ambiguous readings, counter expressions, lexical pitch-token combination, suffix behavior, deaccenting chains, verb-noun ambiguity, and punctuation around grammar boundaries.

Standard Tokyo Japanese

akusento focuses on standard Tokyo-style pitch accent. Proper nouns, dialectal forms, rare literary expressions, and creative spellings can still be difficult, but each logged error becomes a concrete path for improvement.

Current status

Research Preview

akusento is currently a research preview in active development. The public site includes documentation, cached parser examples, and a static preview of the interface, while the live parser backend remains private during testing.

Development is focused on improving accent-rule coverage, parser accuracy, evaluation methods, and explainable output. Public access to the backend is planned when the parser is ready for broader use.

If you are learning Japanese, teaching pitch accent, working with Japanese-language tools, or interested in the technical side of the parser, feedback during this closed preview is especially welcome.

Interested in testing akusento or sharing feedback? Contact hello@akusento.com.

A parser built for the messy reality of Japanese.