third-party-api-data-sync

/home/avalon/.hermes/skills/software-development/third-party-api-data-sync/SKILL.md · raw

sync data from external API
pull data from third-party system
integrate external data source
map external schema to our database
build ETL pipeline
port API client from another language

Third-Party API Data Sync Pipeline

When to Use

Integrating a third-party system (SaaS API, partner system) as ground truth for your data
Porting an existing sync pipeline from one codebase/language to another
Building incremental sync with UPSERT logic between two schemas
Building authorized file-acquisition/catalog pipelines from third-party websites, especially when download targets are hidden behind HTML forms, intermediate pages, or expiring CDN URLs

Related references: - references/file-acquisition-hidden-form-meta-refresh.md covers the hidden form → meta-refresh → binary URL pattern and focused verifier. - references/authorized-file-acquisition-ops.md covers authorized bulk file retrieval, quarantine/static scan ops, batch-resume, final-retry backoff pitfalls, orphan container cleanup, and progress reporting. - references/open-notebook-quartz-readonly-recon.md covers read-only reconnaissance and integration mapping for OpenNotebook sources/search, source-backed wiki graph imports, Quartz contentIndex.json graph extraction, and safe real-corpus fixtures. - references/multi-source-provider-registry-freshness.md covers provider-specific freshness, conflicting catalog/OpenAPI/docs/validator evidence, supplemental inventory, safe removal reconciliation, live quotes, chronology, and end-to-end verification for model/tool registries.

Phase 1: Deep Analysis (DO NOT skip)

1A. Audit the source system

Clone/access the reference codebase that already integrates with the API
Read ALL sync-related files exhaustively — don't summarize, extract exact field names
Document every API endpoint, auth method, report ID, and field name
Map: Source Field Name -> Variable Name -> DB Column -> Transform Logic
Note multi-key fallbacks (APIs often have inconsistent field naming across versions)

1B. Audit your own schema

Get EXACT column definitions for every table you'll sync into
Include: column_name, data_type, is_nullable, column_default, constraints, indexes
Get sample data (3-5 rows) to understand real-world values
Note ID generation patterns (sequences vs manual vs UUIDs)

1C. Cross-reference mapping (the critical step)

Create a field-by-field comparison table: Source Field -> Our Column -> Match Quality
Identify 3 categories:
✅ Direct matches (same concept, compatible types)
⚠️ Similar but divergent (different formats, enums, or naming)
❌ Gaps (fields they have that we don't, and vice versa)
Document value differences (e.g., they use "JBA", we use "Bishop Arts")

1D. Write analysis report

Save as a markdown doc in the project (e.g., docs/integration-analysis.md)
Do TWO passes — first pass discovers, second pass audits/corrects findings
This report is the contract for everything that follows

Phase 2: Schema Migration

Key principles

Use ALTER TABLE ... ADD COLUMN IF NOT EXISTS for safety (idempotent)
New columns should have sensible defaults (NULL or DEFAULT value)
Add synced_at TIMESTAMPTZ to every table that will receive sync data
Add a foreign reference column for the source system's ID (e.g., mt_reservation_id)
Create new aggregate/snapshot tables if the source has analytics the target doesn't
Create a sync_log table for audit trail
Save migration as a numbered SQL file (e.g., migrations/001_sync_schema.sql)
Run in a transaction

Pitfall: ID conflicts

Source system IDs may collide with your auto-increment sequences
NEVER force source IDs into your primary key column
Instead: add a separate column (e.g., external_id) with a UNIQUE index
Let your sequences handle id, use external_id for UPSERT conflict detection

Phase 3: API Client

Porting from another language

Read the ENTIRE source file before writing a single line
Match behavior exactly — same auth flow, same retry logic, same response parsing
Key patterns to preserve:
Multi-auth fallback (try API key → OAuth password → refresh token)
Sync vs async fetch (some APIs cap row counts, need async/pagination)
Response format handling (JSON:API, list-of-dicts, headers+rows, S3 download)
Case-insensitive field accessor helper (_get(row, ...keys))
Currency/number parser helper (safeFloat(val) — strips $, commas)

Structure

src/sync/
├── api-client.ts      # API client class (auth, fetch, response parsing)
├── constants.ts       # Mappings, rates, classification functions
├── sync-{entity}.ts   # One file per entity type
├── index.ts          # Orchestrator

Phase 4: Sync Modules

One module per entity type

Each sync module should: 1. Accept a DB pool + array of raw API rows 2. Transform rows using classification helpers from constants.ts 3. Match related records (e.g., customer by email → customer_id) 4. UPSERT using ON CONFLICT (external_id) DO UPDATE 5. Batch in groups of 50-100 for performance 6. Log errors per-row but don't stop the sync 7. Return { inserted, updated, errors } counts

UPSERT pattern (PostgreSQL)

INSERT INTO reservations (col1, col2, ..., synced_at)
VALUES ($1, $2, ..., NOW())
ON CONFLICT (external_id) DO UPDATE SET
  col1 = EXCLUDED.col1,
  col2 = EXCLUDED.col2,
  synced_at = NOW()

Pitfall: Unique constraint violations

Check MAX(id) for ALL unique columns, not just the primary key
Example: memberships had id (PK, max 2644) and instance_id (UNIQUE, max 9652)
Using MAX(id)+1 for instance_id caused duplicate key errors
Solution: query MAX separately for each unique column, or let sequences handle it

Daily metrics / aggregation module

If the source system has aggregate analytics: - Build an aggregation module that processes raw synced data - Use same ID format as source for compatibility - UPSERT into dedicated metrics tables

Phase 5: Admin API Endpoints

Standard endpoint set

POST /api/admin/sync                    # Full sync (all entities)
POST /api/admin/sync/{entity}           # Single entity sync
GET  /api/admin/sync/status             # Check if credentials configured
GET  /api/admin/sync/log                # Sync history from sync_log table
GET  /api/admin/metrics                 # Query aggregate metrics

Key design decisions

Use dynamic import for sync modules (they may fail if credentials aren't set)
Accept { days, start_date, end_date } for date-range syncs
Default to last 7 days for incremental sync
Log every sync attempt to sync_log (started_at, completed_at, rows_synced, errors)

Phase 6: Credential Setup & Testing

Don't hardcode credentials

Use environment variables (MT_API_KEY, etc.)
Add a /api/admin/sync/status endpoint that reports which auth method is available
Gracefully degrade if credentials aren't set (503, not 500)

Testing without credentials

All admin endpoints should work (returning empty results or "not configured")
Schema migration should be testable independently
Verify new columns exist, new tables created, indexes in place

Pitfalls & Lessons Learned

OAuth user tokens ≠ API integration keys — but test before assuming: A SaaS platform's OAuth password grant (/o/token/ with username+password) may return a USER-scoped token that works for end-user endpoints but NOT for admin/reporting APIs. However, don't ASSUME this is the case until you've confirmed. In Mariana Tek's case, the password grant tokens DO work for report endpoints — early test failures were caused by stale tokens, not insufficient scope. ALWAYS test the actual data-fetching endpoint with a FRESHLY obtained token, in a single atomic operation (get token → immediately use it), before concluding the auth method doesn't work.
Deferred revenue ≠ sale price: Accounting fields like deferred_revenue track revenue recognition, not what the customer paid. Cross-reference with order_line_items or transaction records for actual sale prices.
Field name inconsistency: APIs often return different field names depending on the report version, fetch method (sync vs async), or even the time of day. The _get(row, ...keys) pattern with 3-5 fallback keys per field is essential, not optional.
500-row API limit: Some APIs cap sync responses. Need async pipeline (start job → poll → download from S3/CDN). Build BOTH sync and async paths from the start.
Classification logic is business logic: Functions like classifyBookingSource() and classifyClassType() encode domain knowledge. Extract these into constants.ts, not inline in sync modules.
Two-pass analysis: First pass finds the data. Second pass audits the findings — catches misunderstandings about field semantics, pricing calculations, and ID generation patterns.
Ground truth principle: When the external system is ground truth, your sync should UPDATE existing records, not just INSERT new ones. Use UPSERT everywhere.
Unique constraint violations on non-PK columns: Check MAX() for ALL unique columns, not just the primary key. Example: memberships had id (PK, max 2644) and instance_id (UNIQUE, max 9652). Using MAX(id)+1 for both caused duplicate key errors. Let sequences handle auto-increment PKs; query MAX separately for other unique columns.
Test auth end-to-end before building: Don't assume "auth works" because the token endpoint returns 200. Test with the ACTUAL data endpoint you need. Different API scopes, different auth mechanisms, or separate admin keys are common in SaaS platforms (Mariana Tek, Stripe, Shopify all have this pattern).
Token freshness matters: OAuth tokens can be short-lived. A token that works for one curl call may fail 30 seconds later if you stored it and reused it across multiple test commands. When debugging "auth not provided" errors, always get a FRESH token immediately before each test request. In the Mariana Tek case, repeated tests with stale tokens looked like the API was rejecting the auth method entirely, when it was just token expiry.
Check the SPA/admin app's config for OAuth client credentials: SaaS platforms with Ember/React admin dashboards often embed OAuth client_id and client_secret in their HTML meta tags or JS config. curl the admin page and look for <meta> tags with auth config — these credentials are "public" (embedded in client-side code) and are needed to get properly-scoped tokens. A bare password grant without client_id/secret may return a limited-scope token.
CDN/CloudFront is usually NOT the problem: When seeing 401s on GET endpoints but 200s on POST endpoints behind CloudFront, the instinct is to blame header stripping. Check the CORS preflight first — if access-control-allow-headers includes authorization, CloudFront IS forwarding it. The real issue is almost always token scope or expiry, not infrastructure.
Reference codebase may have dead code paths: When an existing codebase supports multiple auth modes (API key, password, refresh token), the production deployment may only use ONE of them. Don't assume all paths work equally. Check the deployment's actual environment variables to know which auth mode is active. The "preferred" method in comments may be the only one that works for certain endpoints.
Existing reference app's cron can invalidate shared credentials: If a reference app runs a cron job that rotates refresh tokens, the token you got from the user/admin dashboard will stop working after the next cron run. Prefer password grant (which generates fresh tokens each time) over refresh tokens that can be rotated out from under you.
PM2 env var gotchas: When restarting PM2 processes with new environment variables, pm2 restart does NOT update env vars. You must pm2 delete then re-start, OR use --update-env. The safest pattern: save credentials to a .env.mt file, then env $(cat .env.mt | xargs) pm2 start .... Also check that DB connection env vars (like DB_PORT) survive process restarts — a dashboard showing "zero data" is often just a stale PM2 process with missing env vars.
Async report jobs can take 2-5+ minutes: When a SaaS API's sync endpoint caps at N rows and kicks off an async export job, that job runs on THEIR infrastructure and may take minutes. Don't make this a synchronous HTTP request to your own server — it'll timeout. Either: (a) return a job ID immediately and poll from the client, (b) use a background worker, or (c) set very long HTTP timeouts. Exxir's approach: 300-second timeout with 3-second polling intervals.
Don't debug auth with multiple cached curl calls: When testing OAuth-protected endpoints, NEVER reuse a token variable across multiple separate terminal commands. Each test should get its own fresh token in the same command chain. In the Mariana Tek investigation, 15+ minutes were spent debugging "auth not provided" errors that were simply caused by token expiry between separate curl invocations. The fix: TOKEN=$(curl -s -X POST .../o/token/ ... | jq -r .access_token) && curl -s .../api/endpoint -H "Authorization: Bearer $TOKEN"
Download pages may hide files behind normal HTML forms: For authorized file acquisition, do not stop at anchor discovery. Inspect detail pages for hidden forms containing filenames, submit first-party forms statically, then parse intermediate HTML meta-refresh targets for the expiring CDN file URL. Verify with a local fake server before running the real crawler. See references/file-acquisition-hidden-form-meta-refresh.md.
Start at a useful pace, back off on evidence: For authorized bulk retrieval, Alex prefers a practical initial pace rather than very slow fixed sleeps. Use modest single-threaded delay/jitter, honor Retry-After, and slow down only on 429/5xx or connection errors. Do not use proxies, CAPTCHA bypass, login bypass, or other evasion.
One-item real-source probes de-risk bulk runs: Before launching a long manifest, run a one-candidate manifest against the real source and confirm the log writes downloaded, bytes land in quarantine, and scanner/catalog steps can consume them. If the full run appears stuck, switch to a batch runner that processes one candidate at a time and appends to a shared log so failures do not block all progress.
Deduplicate signed CDN URL variants before fetching: Authorized file sites often expose the same binary path many times with different query parameters (md5, expires, tracking params). Hash-based duplicate detection after download is safe but can waste minutes per item. Canonicalize candidate file URLs by host + unquoted path before issuing GETs, then hash the one fetched file for integrity. See references/authorized-file-acquisition-ops.md.
Estimate and optimize long downloads from logs: For long authorized retrieval jobs, parse [item i/N] timestamps for items/hour, count unique downloaded source URLs, and separate downloaded from duplicate/unsupported statuses. If the run is slow, first remove duplicate fetches, recompute a remaining manifest, and use a single long-lived container; keep one-candidate loops as the fallback for fragile sources.