Operational Guide

Overview

How Moveworks actually calls your gateway in production: the sync cadence, what gets re-fetched vs. cached, how to bound your load via rate-limit signals, and the file size cap. Read this before sizing your gateway’s hosting environment or wiring up rate-limit headers.


How do I throttle Moveworks’ calls to my gateway?

Moveworks honors two complementary rate-limit mechanisms. Use either or both:

  • Proactive: rate-limit headers. Return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response. Moveworks reads them and adjusts its call rate to fit your advertised capacity, slowing down before you have to fail any requests. Common header-name variants (X-Rate-Limit-*, and RateLimit-* per the IETF RateLimit header fields draft) are also recognized.
  • Reactive: 429 + Retry-After. When you’re at capacity, return 429 Too Many Requests with a Retry-After header. Moveworks honors the wait value and retries.
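
As a rough illustration of the two mechanisms above, here is a minimal sketch of a fixed-window limiter that a gateway could consult on each request. The class name, the fixed-window strategy, and the choice to express X-RateLimit-Reset as seconds until the window resets are all assumptions made for this example; they are not taken from the Moveworks Gateway spec, so check the spec for the expected value format.

```python
import math
from dataclasses import dataclass


@dataclass
class FixedWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per `window_seconds`."""
    limit: int
    window_seconds: int
    window_start: float = 0.0
    used: int = 0

    def check(self, now: float) -> tuple[int, dict[str, str]]:
        """Return (status_code, headers) for a request arriving at time `now`."""
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0

        # Assumption: Reset is reported as whole seconds until the window ends.
        reset_in = math.ceil(self.window_start + self.window_seconds - now)

        if self.used >= self.limit:
            # Reactive path: at capacity, so answer 429 with a wait value.
            return 429, {
                "X-RateLimit-Limit": str(self.limit),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(reset_in),
                "Retry-After": str(reset_in),
            }

        self.used += 1
        # Proactive path: advertise remaining capacity on every response.
        return 200, {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - self.used),
            "X-RateLimit-Reset": str(reset_in),
        }
```

A request handler would call `check(time.time())` before doing any work, attach the returned headers to the response, and return early when the status is 429.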

Returning fewer items per response than the requested $top (with @odata.nextLink for the rest) is also a clean way to bound per-request work, which is useful when your backend can’t sustain large response payloads.
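
A minimal sketch of that paging approach follows. Only $top and @odata.nextLink come from the text above; the `value` wrapper key, the use of $skip to track position, the `MAX_PAGE_SIZE` ceiling, and the example URL are assumptions for illustration, so confirm the exact response shape against the Gateway spec.

```python
from urllib.parse import urlencode

MAX_PAGE_SIZE = 100  # hypothetical per-request ceiling chosen by the gateway


def build_page(items: list[dict], base_url: str, top: int, skip: int = 0) -> dict:
    """Return one OData-style page holding at most MAX_PAGE_SIZE items."""
    page_size = max(1, min(top, MAX_PAGE_SIZE))
    page = items[skip : skip + page_size]
    body = {"value": page}  # assumption: items are wrapped in a "value" array

    next_skip = skip + len(page)
    if next_skip < len(items):
        # More items remain: point the caller at the next page.
        # safe="$" keeps the OData parameter names readable in the link.
        query = urlencode({"$top": top, "$skip": next_skip}, safe="$")
        body["@odata.nextLink"] = f"{base_url}?{query}"
    return body
```

With 250 items and a request for `$top=250`, the first response carries 100 items plus a next link, and the caller follows links until a page arrives without one.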

See Errors for the expected error response format.


What does the sync pattern look like, and how should I size my gateway?

Moveworks performs scheduled full sync runs against your gateway. Each run walks the complete inventory of files, file metadata, file permissions, groups, group memberships, and users. There isn’t an “incremental diff” mode that only fetches changes since the previous run.

The main cost savings between syncs come from the file binary cache. If you return an accurate last_modified_datetime on every file, Moveworks skips re-downloading binaries whose timestamp hasn’t changed since the previous sync. File metadata, permissions, group memberships, and users are re-walked each run, but the binary downloads (typically the largest payloads) are skipped for unchanged files.

For capacity planning, here is what gets called on every sync versus what gets skipped via the file binary cache:

Endpoint                                          Re-called every sync?
GET /files (list)                                 Yes
GET /files/{id} (metadata, including HTML body)   Yes
GET /files/{id}/download (binary)                 Skipped when last_modified_datetime matches the cached value
GET /files/{id}/permissions                       Yes
GET /files/permissions/metadata                   Yes
GET /groups                                       Yes
GET /groups/{groupId}/members                     Yes
GET /users                                        Yes

Concurrent scheduled syncs are skipped, not stacked: if a sync is still running when the next scheduled run would fire, that run is skipped until the current one completes. You won’t see overlapping load even if the first sync takes longer than your scheduled sync interval.

If your backend has limited capacity, use the rate-limit mechanisms above to bound the call rate. Total wall-clock time per sync extends to fit your advertised capacity; calls per second stay within what you allow.
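
For a back-of-the-envelope sizing exercise, the endpoint list above can be turned into a rough call count. This sketch rests on several simplifying assumptions that are not stated in the spec: one metadata call and one permissions call per file, a single call to /files/permissions/metadata, one call per group for its members, and list endpoints paged at a fixed `page_size`. Treat the result as an order-of-magnitude estimate only.

```python
import math
from dataclasses import dataclass


@dataclass
class Inventory:
    files: int
    changed_binary_files: int  # files whose last_modified_datetime changed
    groups: int
    users: int
    page_size: int  # items returned per list response (assumed constant)


def estimate_calls(inv: Inventory) -> int:
    """Rough number of gateway calls for one full sync."""

    def pages(n: int) -> int:
        return max(1, math.ceil(n / inv.page_size))

    return (
        pages(inv.files)            # GET /files (list)
        + inv.files                 # GET /files/{id}
        + inv.files                 # GET /files/{id}/permissions
        + inv.changed_binary_files  # GET /files/{id}/download (cache misses only)
        + 1                         # GET /files/permissions/metadata
        + pages(inv.groups)         # GET /groups
        + inv.groups                # GET /groups/{groupId}/members
        + pages(inv.users)          # GET /users
    )


def estimate_sync_seconds(inv: Inventory, calls_per_second: float) -> float:
    """Wall-clock estimate if calls are held to `calls_per_second`."""
    return estimate_calls(inv) / calls_per_second
```

For example, 10,000 files (500 of them changed), 200 groups, and 5,000 users at 100 items per page comes to roughly 20,853 calls, which is about 70 minutes if the gateway advertises 5 calls per second.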


Is there a maximum file size?

Yes. Moveworks caps individual file binary content at 25 MB. Files larger than this are downloaded by Moveworks, then rejected by the indexing pipeline with status FILE_SIZE_LIMIT_EXCEEDED. They are not indexed and will not appear in search results. The error is non-retryable: Moveworks does not retry an oversize file within the same sync run.

A few implications worth being deliberate about:

  • Files over 25 MB consume bandwidth on both sides every sync. They are downloaded in full before the size check happens. If you have a known set of oversize files, filter them out at the source (don’t include them in /files) rather than letting Moveworks re-download them each cycle.
  • Returning an accurate content.size field on file metadata is recommended. Moveworks doesn’t currently pre-check size before downloading, but a future optimization will, and accurate size lets that short-circuit work.
  • The cap applies to binary file content (/files/{id}/download payload). HTML body returned inline via content.body is not subject to the same numeric cap, though oversized HTML payloads will still hit response-time limits.
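
Filtering at the source, as suggested in the first point above, could look like the sketch below. It assumes file metadata is a dictionary with a nested `content` object holding `size` (in bytes) and `mime_type`, and it assumes "25 MB" means 25 × 1024 × 1024 bytes; neither detail is confirmed here, so verify both before relying on the threshold.

```python
# Assumption: "25 MB" is read as 25 MiB. Adjust if the spec defines it differently.
MAX_BINARY_BYTES = 25 * 1024 * 1024


def is_indexable(file_meta: dict) -> bool:
    """Return False only for binary files known to exceed the size cap."""
    content = file_meta.get("content", {})
    if content.get("mime_type") == "text/html":
        # Inline HTML bodies are not subject to the binary size cap.
        return True
    size = content.get("size")
    # Keep files with unknown size rather than silently dropping them.
    return size is None or size <= MAX_BINARY_BYTES


def filter_listing(files: list[dict]) -> list[dict]:
    """Drop known-oversize binaries before they are returned from /files."""
    return [f for f in files if is_indexable(f)]
```

Applying `filter_listing` to the list that backs /files keeps oversize binaries out of the inventory, so they are never requested in the first place.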

Why is last_modified_datetime important on file responses?

It’s the cache fingerprint Moveworks uses to decide whether to re-download a file’s binary content. When the timestamp on a file matches what was returned in a previous sync, the /files/{id}/download call is skipped and the cached binary is reused. This is the primary mechanism that makes subsequent syncs cheaper than the first sync.

Returning an accurate, monotonically updated timestamp on every file is the single most effective way to reduce ingestion load over time, especially for corpora with large attachments. Files with always-current or missing last_modified_datetime values will be re-downloaded every sync regardless of whether their content actually changed. Files with stale values have the opposite problem: because the timestamp still matches the cached one, the download is skipped even though the content has changed.

One caveat: the cache applies only to binary file content (PDF, DOCX, PPTX, plain text). HTML content (files where content.mime_type == "text/html" and the body is returned inline via content.body) is re-fetched every sync regardless of last_modified_datetime. If your corpus is mostly HTML, ongoing sync load will be roughly the same as first-sync load.
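
To make the skip rule concrete, here is a small model of the decision for binary files. It illustrates the behavior described above and is not Moveworks’ actual implementation; the function name, the `id` field, and the ISO 8601 string timestamps are assumptions for the example.

```python
def needs_binary_download(file_meta: dict, cached_timestamps: dict[str, str]) -> bool:
    """Model of the skip rule: download unless the timestamp matches the cache.

    `cached_timestamps` maps a file id to the last_modified_datetime value
    seen on the previous sync (hypothetical structure).
    """
    current = file_meta.get("last_modified_datetime")
    if not current:
        # Missing timestamp: nothing to compare, so the binary is fetched again.
        return True
    # A new file has no cache entry, and a changed file has a different value.
    return cached_timestamps.get(file_meta["id"]) != current
```

The model shows why a gateway that stamps every response with the current time defeats the cache: the value never matches the previous sync, so every binary is downloaded again.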


Can I build my own APIs that have custom endpoints and responses?

No. The Content Gateway approach relies on you to follow the Moveworks Gateway spec. We’ve designed this spec based on OData and API design best practices.


What if I have multiple backend systems?

You can create as many gateways as you want, and Moveworks will integrate with all of them. We recommend one gateway per backend system instance to avoid stability issues.


What if I already built a gateway for one system and now want to build another? What can I reuse?

Each content source system should typically be connected to a dedicated Gateway connector with its own URLs and authorization. Most of your previous gateway setup should be reusable, and you can duplicate it as a starting point. The only change is the content source: when you fetch content for the new gateway, make sure to retrieve it from the new source system.


What are legacy gateways?

These are older gateways built on the previous Moveworks search infrastructure. They may have performance issues, are harder to troubleshoot, and do not support permission ingestion. While they continue to be supported, we strongly recommend using Content Gateway when possible. Post in the Moveworks Community if you have additional requirements.