DP21 - Multi-Modal Interactions & Experiences
| ID: | ML-Draft-026 |
| Title: | DP21 - Multi-Modal Interactions & Experiences |
| Status: | approved |
| Authors: | The Meta-Layer Initiative |
| Group: | N/A |
| Date: | 2026-05-04 |
| Revision: | 00 |
| Pages: | 10 |
| Words: | 4812 |
DP21 ensures the meta-layer works across modalities—text, voice, visual, spatial, and beyond—without fragmenting meaning, trust, or governance. It introduces a semantic-first architecture where meaning is preserved and consistently rendered across interfaces. The draft tackles risks like modality drift, voice oversimplification, immersive platform capture, and sensor-driven surveillance. It emphasizes accessibility, cross-modal consistency, and participant control over how interactions are experienced. The core principle: the interface can change—but the truth conditions shouldn’t. Otherwise, every device becomes its own version of reality.
This draft articulates Desirable Property 21 (DP21) as the condition under which the meta-layer can be experienced across multiple sensory modalities, devices, and interaction paradigms without fragmenting meaning, agency, trust, or governance.
DP21 defines how the Metaweb supports visual, auditory, textual, spatial, tactile, and emerging modalities so participants can engage in ways that are natural, accessible, and context-appropriate.
The meta-layer must not be confined to screens or text interfaces. It must function as a cross-modal civic interface that can appear in browsers, voice systems, AR/VR environments, ambient computing contexts, and future interaction substrates.
If DP21 is weak, predictable failures follow: the meta-layer becomes screen-bound; accessibility gaps widen; immersive systems become siloed; voice and AI interfaces lose context; meaning degrades across modalities; and participation becomes limited to technically privileged users.
DP21 connects directly to:
DP21 does not prescribe specific hardware, platforms, or interaction styles. It defines the minimum conditions under which multi-modal participation remains coherent, accessible, and governed.
The current web is primarily visual and text-based, with fragmented support for cross-modal continuity.
Meanwhile, interaction surfaces are rapidly diversifying:
These systems often operate in isolation, creating fractured user experiences and inconsistent trust surfaces.
Examples of failure:
DP21 reframes modality as a first-class design concern. The meta-layer must ensure that meaning, trust signals, governance rules, and participation pathways persist across modes.
A healthy DP21 implementation must answer:
Without DP21, the meta-layer becomes fragmented, exclusionary, and unreliable.
DP21 is not simply about supporting more devices. Multi-modal systems introduce new ways for meaning, trust, consent, and agency to fracture. A visual interface can show nuance through layout, color, hover states, and layered context. A voice interface may compress all of that into one sentence. An AR interface may place a signal in physical space, changing how the participant experiences authority, urgency, and social pressure.
For this reason, DP21 treats modality as a governance surface. Every new mode of interaction changes what participants can perceive, how they can respond, what data is collected, and how power is experienced.
Different interface modes may evolve as separate ecosystems, each with its own data model, permissions, affordances, and governance assumptions.
Example: A browser overlay supports community trust annotations, while an AR headset displays only platform-approved spatial labels. A participant moving between the two environments sees different versions of reality without knowing which guarantees changed.
Why this matters: If modalities become silos, the meta-layer loses its role as shared context. Trust becomes dependent on the device or vendor rather than the underlying civic substrate.
Multi-modal systems can appear inclusive while quietly degrading experiences for people who rely on assistive technologies.
Example: A visual overlay includes provenance badges, dispute status, and community annotations, but a screen reader receives only the page text. The participant hears the claim but not the trust context that other participants see.
Why this matters: Accessibility is not equivalent to basic access. DP21 requires that meaning, trust, and agency survive translation into accessible forms.
The same object, claim, participant, or governance action may be represented differently across modalities.
Example: A controversial claim has a detailed contextual overlay in text, a short warning in voice mode, and a red spatial marker in AR. Each representation carries a different emotional weight and interpretive frame.
Why this matters: Context fragmentation can change what participants believe the system is saying. If each modality tells a slightly different story, governance becomes unstable.
AR, VR, and spatial computing environments can become proprietary civic spaces controlled by headset vendors or platform operators.
Example: A city deploys public-service overlays in a vendor-controlled AR environment, but residents can only access civic annotations through approved devices and approved identity systems.
Why this matters: The meta-layer must not trade the flat-web platform problem for a spatial-platform problem. Immersive systems need open, governable, interoperable trust surfaces.
Voice interfaces are powerful because they feel natural, but they are also compressive. They often force complex information into linear, short-form summaries.
Example: A voice assistant says, “This source is considered unreliable,” without explaining by whom, according to what rubric, with what confidence level, and whether the rating is disputed.
Why this matters: Trust signals must not become authoritarian declarations when translated into voice. DP21 requires voice interfaces to preserve uncertainty, source, scope, and recourse.
Multi-modal systems often rely on cameras, microphones, location, motion sensors, biometric signals, gaze tracking, and environmental scanning.
Example: An immersive overlay requests room mapping, eye tracking, voice input, and location data to provide contextual interaction, then reuses those signals for personalization or reputation inference.
Why this matters: Multi-modal richness can become surveillance richness. DP21 must be bound tightly to DP4 data sovereignty and privacy.
AI agents may behave differently depending on whether they are speaking, writing, acting in AR, or mediating a tactile or ambient interface.
Example: A text agent carefully cites uncertainty, but the same agent in voice mode speaks with confidence and omits caveats to sound more natural.
Why this matters: Participants should not receive weaker governance simply because an interaction feels conversational or immersive. Agent behavior must remain policy-bound across modalities.
Richer devices can create richer participation opportunities. Participants with headsets, high-end phones, or always-on assistants may gain more visibility, faster response, or stronger coordination capacity.
Example: A governance meeting includes spatial collaboration tools that allow headset users to annotate a shared model while phone-only participants can only watch.
Why this matters: Multi-modal capability must not become a new class divide. DP21 requires parity of agency, even when interface richness differs.
Different modalities may represent the same signal with different emotional or social force.
Example: A low-confidence caution shown as a small visual icon becomes a loud audio interruption in voice mode or a red warning halo in AR.
Why this matters: Representation affects behavior. Systems must map severity, confidence, and urgency carefully across modalities.
Real-time multi-modal systems can desynchronize. A participant may see one state, hear another, and act on outdated information.
Example: During a live civic deliberation, a visual overlay updates vote status immediately, while the voice interface reads an earlier summary. Participants make decisions from inconsistent system states.
Why this matters: Trust depends on temporal coherence. Systems must signal freshness, delay, and synchronization status.
Multi-modal systems can overwhelm participants by presenting too many signals at once.
Example: A participant receives a visual warning, audio cue, haptic pulse, AI suggestion, social annotation, and governance prompt simultaneously.
Why this matters: More modalities do not automatically create more clarity. DP21 requires calm, layered, participant-directed interaction design.
Ambient interfaces can influence participants without being consciously noticed.
Example: A room-scale system subtly changes lighting, sound, or notification patterns to encourage agreement, urgency, purchase, or disengagement.
Why this matters: Ambient interaction can bypass reflective agency. DP21 must ensure that ambient signals remain legible, consented, and interruptible.
Spatial and tactile interfaces can affect movement, attention, and physical safety.
Example: An AR overlay appears while someone is crossing a street, or a haptic notification distracts a worker using dangerous equipment.
Why this matters: Multi-modal governance must include situational safety, not only information integrity.
The meta-layer must support multi-modal interactions such that meaning, trust, governance, and participation remain coherent and accessible across all forms of interaction.
Modality should expand agency, not fragment it.
A participant should be able to move from a browser to a phone, from text to voice, from a screen to an AR layer, or from an immersive environment back to a simple assistive interface without losing the essential structure of the interaction. The presentation may change, but the underlying meaning, permissions, trust signals, and rights should remain intact.
Example: A participant encounters a health claim online. In a visual browser overlay, they see provenance, community annotations, and expert citations. In voice mode, they hear a concise but scoped summary: who disputes the claim, what confidence level applies, and how to request more detail. In AR, the claim is marked with a contextual indicator that opens the same evidence graph. Each modality feels natural, but all point back to the same governed semantic object.
What this feels like: The system adapts to the participant and context without changing the truth conditions of the interaction.
Without this: Each interface becomes a separate reality.
DP21 requires a layer that separates meaning from presentation.
The system must first know what something means before deciding how to show, speak, vibrate, spatialize, or summarize it. A trust signal, governance rule, consent boundary, identity claim, or reputation state should exist as a structured object before it is rendered into any particular modality.
This layer includes:
The goal is not to make every modality identical. The goal is to make every modality faithful.
Failure mode: presentation-driven divergence, where the interface invents meaning rather than rendering governed meaning.
The meta-layer should support a broad range of devices, including:
Device support must be capability-aware. A phone, headset, screen reader, and voice-only device cannot provide identical experiences, but each should preserve core agency.
A low-capability device should still allow participants to:
Failure mode: device privilege, where only participants with high-end devices can fully participate.
The meta-layer should support interface modes such as:
Each interface mode has strengths and risks. Visual systems can carry density but may exclude non-visual participants. Voice systems feel natural but can flatten complexity. Haptics can signal urgency but can also become intrusive. Spatial overlays can make context embodied but may create coercive presence.
DP21 requires every interface mode to be evaluated not only for usability, but for its effect on agency, comprehension, privacy, and governance.
Failure mode: mode enthusiasm, where novelty drives adoption before governance catches up.
All modalities must derive from shared semantic structures.
A participant, claim, annotation, trust signal, governance rule, consent object, reputation state, or AI action should have a stable underlying reference independent of how it is presented.
For example, a provenance warning should not be three unrelated artifacts across visual, voice, and AR interfaces. It should be one semantic object with multiple renderings.
A semantic object SHOULD include:
Failure mode: duplicated interpretation, where each modality generates its own version of the truth.
Accessibility must be foundational.
DP21 requires that multi-modal interaction expand participation for people with visual, auditory, motor, cognitive, sensory, linguistic, or situational differences.
Accessibility support SHOULD include:
Accessibility is not satisfied when content is technically reachable. Participants must be able to understand, decide, respond, and appeal.
Failure mode: assistive afterthought, where accessibility tools receive the surface content but not the governance context.
State and context must persist across modalities.
If a participant starts an interaction in one mode and continues in another, the system should preserve:
Example: A participant begins reviewing a civic proposal on desktop, continues by voice while commuting, and later joins an AR town hall. The system should maintain continuity without leaking private context or assuming consent across settings.
Failure mode: context reset, where each modality treats the participant as starting over.
Multi-modal systems often require richer data inputs. DP21 therefore requires stricter governance around sensors and environmental awareness.
Sensitive inputs may include:
Systems MUST define:
Where possible, processing should occur locally or in privacy-preserving environments.
Failure mode: surveillance through richness, where better interaction becomes the justification for excessive capture.
AI can help translate, summarize, narrate, adapt, and synchronize across modalities. This makes AI useful for DP21, but also risky.
AI mediation SHOULD preserve:
Example: An AI assistant converting a dense trust overlay into voice should not simply say, “This is false.” It should explain the relevant confidence, source, dispute status, and option to hear more.
AI modality adapters should be auditable. Communities must be able to inspect whether AI summaries systematically omit minority perspectives, soften warnings, exaggerate certainty, or change policy meaning.
Failure mode: translation as governance, where AI adaptation quietly decides what participants are allowed to perceive.
AR and VR shift the meta-layer from a screen-adjacent experience into an embodied environment. This changes the stakes.
Spatial overlays can attach meaning to:
These overlays must respect governance zones, consent, safety, and contextual integrity.
Example: A public monument may have multiple community annotations, historical context layers, accessibility routes, and civic dialogue spaces. Participants should know which layer they are seeing, who governs it, and how to contest or add context.
Failure mode: unaccountable spatial authority, where whoever controls the overlay controls the perceived meaning of the physical world.
When full modality support is unavailable, systems must degrade gracefully.
Graceful degradation means participants are told what is missing and given a meaningful alternative.
Examples:
Failure mode: silent degradation, where participants assume they have the full context but receive only a partial version.
Important signals must have equivalent representations across modalities.
A high-risk warning may appear as:
Equivalence does not mean sameness. It means each rendering preserves the function of the signal.
A modality equivalence map SHOULD define:
Failure mode: missing signal class, where a participant in one modality never receives a critical warning.
Every multi-modal representation must remain anchored to shared identifiers.
Context anchors may include:
Anchoring allows different modalities to point to the same thing.
Example: A visual annotation on a paragraph, a voice summary of that paragraph, and an AR representation in a classroom discussion should all refer to the same underlying claim object.
Failure mode: reference drift, where participants believe they are discussing the same object but each modality has attached context to a different target.
Participants must be able to give feedback through multiple modalities.
Feedback may be:
Multi-modal feedback must still produce structured feedback objects under DP18. A spoken concern, a tactile accessibility report, or a spatial annotation should not disappear into unstructured logs.
Failure mode: feedback inequality, where only participants using dominant interfaces can meaningfully shape the system.
Participants should be able to define how they want the meta-layer to communicate with them.
Preference profiles MAY include:
These profiles must be portable where appropriate and governed by DP4 privacy constraints.
Failure mode: interface paternalism, where the system decides the best mode for the participant without meaningful control.
The appropriate modality may depend on context.
A participant walking, driving, working, resting, attending a meeting, or entering a sensitive civic space may need different interaction boundaries.
Systems SHOULD support situational safety modes such as:
Failure mode: context-insensitive interaction, where the system provides the right information at the wrong time or in the wrong sensory form.
DP21 supports the fusion of diverse data types, but only when fusion increases understanding without eroding governance.
Multi-modal intelligence may involve:
The purpose of multi-modal intelligence is not to collect everything. It is to combine relevant signals into a more coherent understanding of a situation.
Example: During a disaster response, the meta-layer could combine public reports, satellite images, local sensor data, official advisories, and community annotations into a shared situational map. Participants using phones, voice assistants, or public kiosks could access the same underlying context in different forms.
DP21 requires that data fusion preserve:
Failure mode: data chaos, where many signals are fused without lineage, confidence, or governance.
Failure mode: fusion authoritarianism, where the system produces a single authoritative view from contested inputs without preserving dissent or uncertainty.
Multi-modal systems are governance systems because they shape perception, attention, and action.
Communities must be able to define:
Governance artifacts SHOULD specify modality-specific rules.
Example: A school zone may allow text and visual overlays, restrict biometric sensing, require captions for all audio, prohibit persistent student tracking, and require teacher or community approval for AR layers.
Failure mode: ungoverned modality expansion, where new interface capabilities become active before the community has defined rules for them.
A DP21-aligned implementation should be evaluated against the following questions.
Implementation patterns translate DP21 from principle into practice. These are not rigid prescriptions, but recurring architectural and design approaches that help systems maintain semantic continuity, accessibility, and governance integrity across modalities. They are intended to guide builders in making consistent decisions when adapting the meta-layer to new devices, senses, and interaction environments.
Build semantic objects before rendering interfaces. Presentation layers should adapt governed meaning, not invent it.
Create adapters for visual, voice, haptic, spatial, and assistive interfaces that consume the same underlying semantic objects.
Every core trust, consent, governance, and reputation signal should be tested across assistive modalities.
State should travel across devices and modalities through privacy-preserving synchronization.
Multi-modal applications should publish clear manifests describing sensor access, purpose, retention, and sharing.
When a modality cannot show full context, the interface should say what is missing and offer alternatives.
Voice systems should use standard language for uncertainty, provenance, dispute status, and request-for-detail prompts.
AR and VR overlays should display or make available the governing zone, source, scope, and contestation path for spatial annotations.
Participants should be able to store preferences for stimulation level, modality, language, accessibility, and AI assistance.
Feedback submitted by voice, gesture, text, or spatial annotation should generate DP18-compatible receipts.
DP21 expands agency by letting participants choose how they interact. It violates DP2 when modality choices become coercive, inaccessible, or locked to specific vendors.
Multi-modal systems create richer data flows. DP21 depends on DP4 to prevent sensors, biometrics, and environmental context from becoming surveillance infrastructure.
DP21 is an interoperability problem across senses, devices, and environments. Meaning must move across modalities without collapsing.
Communities must govern which modalities are appropriate in each zone. A meditation group, classroom, emergency response zone, and public debate space require different modality norms.
Multi-modal onboarding can meet participants through voice, visuals, interactive demos, spatial practice, and assistive formats.
AI agents must remain governed across text, voice, spatial, embodied, and ambient interactions. Containment cannot disappear when the interface changes.
Transparency must be perceivable across modalities. A trust signal that only works visually is not transparent to everyone.
DP21 depends on DP17. Multi-modal presentation is only coherent when grounded in shared semantic structures.
Participants must be able to provide feedback and receive reputation signals across modalities without losing structure, provenance, or recourse.
Multi-modal experiences make the meta-layer more present, memorable, and culturally accessible, especially through education, public installations, youth engagement, and immersive storytelling.
DP21 is currently an ML-Draft and serves as exploratory scaffolding for how multi-modal interaction becomes part of meta-layer infrastructure.
Advancement toward ML-RFC status SHOULD require:
ML-RFC promotion SHOULD be contingent on:
Early ML-RFC candidates may focus on:
DP21 will likely mature through several component RFCs rather than one monolithic standard.
DP21 ensures the meta-layer is not bound to a single interface, device, or sense.
It allows people to experience trust, context, and participation in ways that match their abilities, environments, preferences, and tools. It makes the meta-layer usable by more people, in more places, under more conditions, without sacrificing coherence or governance.
A DP21-aligned meta-layer does not treat voice, AR, haptics, accessibility tools, and immersive systems as side channels. It treats them as legitimate civic interfaces.
Interaction becomes flexible.
Meaning becomes portable.
Trust becomes perceivable.
Presence becomes multi-dimensional.
This is how the meta-layer becomes ambient, inclusive, and truly interoperable.
Related documents would appear here in the real datatracker.