EARLY PROTOTYPE: Seeking Feedback
RobotLogs Lakehouse

Data lakehouse and query engine for robotics

Query by topic, time range, or tags across your fleet's sensor data. Stream results to your analysis and visualization tools.


The Challenge

Fleet-Scale Storage

Robot fleets generate terabytes of multimodal sensor data: cameras, LiDAR, IMU, joint states. Legged robots (humanoids and quadrupeds) are particularly data-intensive with 28-40+ degrees of freedom, streaming high-frequency data from joint encoders and force sensors. Each robot produces thousands of files with intermittent connectivity and varying file sizes.

Query Performance

Engineers need instant access to sensor data for efficient debugging and visualization. Load only the data you need, when you need it.

Our Approach

We're adapting data lakehouse principles for robotics workflows: open formats on object storage, metadata-driven architecture, columnar analytics. Arrow Flight API provides streaming access to time-series data. Your robot logs remain in standard files while gaining database-like queryability.

Built with Rust and Arrow for performance. Kubernetes operators handle orchestration. S3-compatible storage is the only dependency — runs anywhere from local development to cloud scale.

Flexible Data Ingestion

The system supports common robotics formats like MCAP files and system logs. Custom converters can be added to handle proprietary formats and vendor-specific file types.
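A pluggable converter layer might be structured as a simple registry mapping input formats to conversion functions. The sketch below is illustrative only; the function names (`register_converter`, `convert_to_rld`) and the RLD byte prefix are assumptions, not the actual RobotLogs API.

```python
# Hypothetical sketch of a pluggable converter registry.
# Names and byte formats here are illustrative, not the real API.

CONVERTERS = {}

def register_converter(extension, fn):
    """Map a file extension to a converter that produces RLD bytes."""
    CONVERTERS[extension.lower()] = fn

def convert_to_rld(path, data):
    """Dispatch an uploaded file to the converter for its format."""
    ext = path.rsplit(".", 1)[-1].lower()
    if ext not in CONVERTERS:
        raise ValueError(f"no converter registered for .{ext}")
    return CONVERTERS[ext](data)

# Built-in MCAP support; vendors register proprietary formats the same way.
register_converter("mcap", lambda data: b"RLD0" + data)
```

A vendor-specific format then only needs one `register_converter` call to participate in the same ingestion pipeline.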

Columnar Storage Engine

Internally, all data is stored as RLD (RobotLogs Data), our columnar format derived from MCAP (the default recording format for ROS 2). Similar to Parquet and ORC, RLD organizes data in columns for efficient analytics. The key difference is that RLD preserves opaque binary messages (Protobuf, ROS, and custom formats) exactly as recorded, without requiring structured schemas. RLD reorganizes MCAP's time-ordered chunks into Apache Arrow RecordBatches grouped by topic, enabling efficient columnar access through the Arrow IPC format while maintaining message-level compatibility.
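The reorganization step can be pictured as regrouping a time-ordered message stream into per-topic batches while leaving payload bytes untouched. This is a minimal sketch of that idea using plain Python containers; the real engine produces Arrow RecordBatches.

```python
# Illustrative sketch: regroup time-ordered messages by topic,
# preserving opaque payload bytes and per-topic time order.
from collections import defaultdict

def group_by_topic(messages):
    """messages: iterable of (log_time_ns, topic, payload_bytes)
    in recording order. Returns {topic: [(t, payload), ...]}."""
    batches = defaultdict(list)
    for t, topic, payload in messages:
        batches[topic].append((t, payload))
    return dict(batches)

msgs = [
    (1, "/imu/data", b"\x01"),
    (2, "/joint_states", b"\x02"),
    (3, "/imu/data", b"\x03"),
]
batches = group_by_topic(msgs)
# batches["/imu/data"] == [(1, b"\x01"), (3, b"\x03")]
```

Because payloads stay opaque, Protobuf, ROS, and custom encodings all pass through unmodified.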

When querying for /joint_states, only that column is read from storage. Each column includes time-range indices to support precise queries for specific time windows. Teams already using MCAP gain columnar query performance without changing their tools or workflows.
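A time-range index lookup of this kind could work as follows, assuming one (start, end, byte offset) entry per batch, sorted by start time; the entry layout is a hypothetical illustration, not the RLD index format.

```python
# Sketch of a per-column time-range index lookup. Assumed layout:
# one (start_ns, end_ns, byte_offset) entry per RecordBatch,
# sorted by start_ns. Only overlapping batches are read from storage.
from bisect import bisect_right

def batches_for_window(index, q_start, q_end):
    """Return offsets of batches overlapping [q_start, q_end]."""
    starts = [entry[0] for entry in index]
    # Batches starting after q_end can never overlap the window.
    hi = bisect_right(starts, q_end)
    return [off for s, e, off in index[:hi] if e >= q_start]

index = [(0, 99, 0), (100, 199, 4096), (200, 299, 8192)]
assert batches_for_window(index, 150, 250) == [4096, 8192]
```

The query engine then issues ranged reads against S3 for just those offsets instead of fetching whole files.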

RLD File Structure:

Section        Format           Content
-------        ------           -------
Header         Metadata         Format identification
Messages       Arrow Stream     All messages grouped by topic
Attachments    Arrow Stream     Binary attachments
Schemas        Arrow File       Message schemas
Channels       Arrow File       Topic definitions
Message Index  Arrow File       Batch locations & time ranges
Metadata       Arrow File       Key-value metadata
Footer         Fixed 28 bytes   Index offsets & checksums
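A fixed-size footer like this can be read with a single ranged request from the end of the file. The field layout below (magic, two index offsets, version, CRC32) is a hypothetical illustration that happens to total 28 bytes; it is not the actual RLD specification.

```python
# Hypothetical 28-byte footer layout: magic (4) + message-index
# offset (8) + metadata offset (8) + version (4) + CRC32 (4).
import struct
import zlib

FOOTER_FMT = "<4sQQII"  # little-endian, 28 bytes total
assert struct.calcsize(FOOTER_FMT) == 28

def pack_footer(index_off, meta_off, version, body):
    crc = zlib.crc32(body)
    return struct.pack(FOOTER_FMT, b"RLD0", index_off, meta_off, version, crc)

def unpack_footer(footer):
    magic, index_off, meta_off, version, crc = struct.unpack(FOOTER_FMT, footer)
    if magic != b"RLD0":
        raise ValueError("not an RLD footer")
    return index_off, meta_off, version, crc
```

Fixed size matters here: a reader can fetch exactly the last 28 bytes, then jump straight to the indices without scanning the file.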

Time-Based Organization

Each robot has a single continuous timeline. Data from different sources maps to timestamps on this timeline. Sources upload at their own schedules and the system correlates data by time.

Robot-42 Timeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sensor data:  │←─30min─→│       │←─30min─→│
System logs:  │←──────────── 24 hours ──────────→│
Diagnostics:            │←─15min─→│
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
              10:00    12:00    14:00    16:00

Time-based queries return data from multiple sources together. The system maintains a continuous timeline without segmenting data into fixed sessions.
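The correlation model above can be sketched as interval overlap against one shared timeline. The source names and float-hour encoding below are illustrative, taken from the Robot-42 diagram.

```python
# Sketch of time-based correlation: each source contributes coverage
# intervals on one robot timeline; a query window returns whatever
# overlaps, regardless of which source uploaded it or when.

def query_timeline(sources, q_start, q_end):
    """sources: {name: [(start, end), ...]} coverage intervals.
    Returns names of sources with data in [q_start, q_end]."""
    hits = []
    for name, intervals in sources.items():
        if any(s <= q_end and e >= q_start for s, e in intervals):
            hits.append(name)
    return sorted(hits)

# Coverage from the Robot-42 diagram above (hours of day as floats):
robot_42 = {
    "sensor_data": [(10.0, 10.5), (14.0, 14.5)],
    "system_logs": [(0.0, 24.0)],
    "diagnostics": [(12.0, 12.25)],
}
assert query_timeline(robot_42, 10.0, 10.5) == ["sensor_data", "system_logs"]
```

Because there are no fixed sessions, a window around an incident naturally picks up every source that was recording at that moment.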

System Components

Upload Service

REST API that generates pre-signed S3 URLs for direct robot uploads. Creates upload records with tags and metadata for tracking and organization.
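The pre-signed-URL mechanism can be illustrated in miniature: the service signs the upload target and expiry so the robot can PUT directly to storage without holding credentials. Real deployments use S3 Signature Version 4 (typically via an SDK); this HMAC sketch, with a made-up host, only shows the shape of the idea.

```python
# Simplified illustration of pre-signed uploads (NOT real S3 SigV4).
# The service signs method, bucket, key, and expiry; storage verifies
# the same signature before accepting the robot's direct PUT.
import hashlib
import hmac
from urllib.parse import urlencode

def presign_upload(bucket, key, expires_at, secret):
    to_sign = f"PUT\n{bucket}\n{key}\n{expires_at}".encode()
    sig = hmac.new(secret, to_sign, hashlib.sha256).hexdigest()
    qs = urlencode({"expires": expires_at, "signature": sig})
    return f"https://{bucket}.example-storage.local/{key}?{qs}"

url = presign_upload("fleet-logs", "robot-42/run-001.mcap",
                     1704100000, b"service-secret")
```

The upload record (tags, metadata, expected key) is written at signing time, which is what lets the operator pick the file up once it lands.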

Lakehouse Operator

Kubernetes controller that orchestrates the processing pipeline:

  • Watches for new uploads and validates metadata
  • Schedules pluggable converter jobs to transform various formats to RLD
  • Manages metadata extraction and LSM tree compaction
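The pipeline above can be sketched as a reconcile step over a small state machine. The state names and transitions are illustrative; the real controller schedules these steps as Kubernetes jobs.

```python
# Sketch of the operator's reconcile loop as a state machine.
# States and action names are illustrative, not the real controller.
NEXT_STEP = {
    "uploaded":  "validate_metadata",
    "validated": "run_converter",      # source format -> RLD
    "converted": "extract_metadata",
    "indexed":   "compact_lsm",        # LSM tree compaction
    "compacted": None,                 # pipeline complete
}

def reconcile(upload):
    """Given an upload record, return the next action to schedule."""
    state = upload["state"]
    if state not in NEXT_STEP:
        raise ValueError(f"unknown state: {state}")
    return NEXT_STEP[state]
```

Modeling each step as an idempotent transition is what makes the controller safe to restart mid-pipeline.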

Flight Service

Arrow Flight RPC server for high-performance data queries. Streams specific topics and time ranges directly from columnar storage without loading entire files.

Query API

Query robot logs using Arrow Flight RPC for high-performance streaming:

# Connect and query specific topics (using pyarrow.flight; the JSON
# ticket encoding is shown for illustration)
import json
from pyarrow import flight

client = flight.connect("grpc://flight.robotlogs.io:8815")

query = {
    "robot_id": "robot-42",
    "topics": ["/joint_states", "/imu/data"],
    "time_range": ["2024-01-01T10:00:00Z", "2024-01-01T10:03:00Z"],
}

# Stream data as Arrow RecordBatches
reader = client.do_get(flight.Ticket(json.dumps(query).encode()))
for chunk in reader:
    process(chunk.data)  # Your analysis code

Integration

All data and metadata live in S3 as standard files — no proprietary formats or lock-in. Build your own analytics pipelines, integrate with existing tools, or query directly with Arrow Flight. RobotLogs Lakehouse provides the foundation; you choose how to build on top.

Open Source

Planning to open source under MIT license. The community edition will include full platform functionality with community-driven support. Enterprise customers can access stable releases, professional support with SLAs, and compliance documentation.

Use Cases

Foundation for building debugging and analysis tools on top of robot logs

Incident Investigation

Retrieve sensor data from specific time windows around failures. Access only the topics relevant to the incident.

Fleet Performance Analysis

Query across your robot fleet to identify patterns and anomalies. Analyze historical behaviors from walking gaits of legged robots to manipulation sequences. Compare performance across software versions.

Visualization Tools

Foundation to build visualization tools or integrate with existing ones. Stream data on-demand for interactive debugging and analysis.

Long-Term Archive

Store years of robot data cost-effectively with columnar compression. Query historical data as easily as recent logs.

Compliance & Reporting

Generate operational reports and maintain audit trails. Export data for regulatory compliance or safety investigations.

Training Data Pipeline (Future)

Extract specific sensor streams for machine learning pipelines. The columnar format will enable efficient filtering by conditions and events.