Senior Data Engineer
This is a hands-on building role: you turn raw, messy fabrication data into the clean, well-modeled, AI-ready datasets that our AI/ML and analytics workloads run on 🚀
🧑🏻‍💻 Responsibilities:
Build and operate ingestion, ELT/ETL, and orchestration pipelines that move data from our MongoDB Atlas operational store and other sources into our analytical and AI-serving layers
Implement layered (medallion-style) transformations with idempotent, backfillable, incrementally loaded jobs
Apply deduplication, normalization, and validation so downstream data is high-quality and trustworthy
Modernize legacy / homegrown data flows via incremental, strangler-fig migrations that keep production stable
Build embeddings and vector pipelines, and the feature/retrieval-ready datasets that RAG, semantic search, and agentic workloads depend on
Make production data AI-ready in practice: well-structured, lineage-tracked, and retrieval-friendly, in partnership with ML and application engineering
Implement real-time and change-data-capture flows from MongoDB (Change Streams / CDC) where workloads require fresh data
Implement the canonical data model, schemas, and data contracts defined by the Data Architect — enforced in-repo so other teams build against stable definitions
Exercise sound persistence judgment in execution: land data in the right store (document / NoSQL, vector, analytical) per the architectural direction
Contribute to build-vs-buy decisions by prototyping with proven, industry-standard tooling over custom development
Establish testing, data-quality, and lineage checks for the pipelines you own, with clear alerting and runbooks
Instrument pipeline observability (freshness, volume, schema-drift, cost) so failures are caught before consumers feel them
Use AI-assisted development tools (Claude Code, Copilot, Cursor) as a force multiplier for transformation logic, query tuning, and migration scripting
Partner with database engineering on extracting from and protecting the production store
Partner with the Data Architect on implementing target-state patterns and surfacing what's hard to build
Partner with ML, AI, and application engineers on the data they consume — shaping and governing it so it's safe and ready to build on
🤝 If you have:
5+ years of hands-on data engineering experience building and operating production data pipelines at scale
Strong programming and data skills: Python and SQL, with solid software-engineering fundamentals (version control, testing, CI) — shipping and maintaining production code, not just notebooks
Hands-on MongoDB at production scale (Atlas ideal): document modeling, aggregation framework, change streams / CDC, and extracting from a document store into analytical / AI-serving layers. Our stack is NoSQL / MongoDB, not relational; this is a core requirement, not an extra
Demonstrated experience with ELT/ETL pipeline design, transformation frameworks (dbt or equivalent), and orchestration (Airflow, Dagster, or Azure Data Factory)
Experience building on cloud-native data platforms and lake / lakehouse / warehouse architectures, with layered (medallion-style) modeling
Hands-on experience preparing data for AI/ML or analytical consumers — embeddings / vector pipelines, RAG-/feature-ready datasets, or equivalent — including deduplication, normalization, and validation
Familiarity with vector search and embeddings in production (MongoDB Atlas Vector Search or equivalent)
Demonstrated use of AI-assisted development tools (Claude Code, Copilot, Cursor) for data and pipeline work
Strong grasp of data quality, testing, lineage, and pipeline observability practices
Comfortable working in a complex, specialized domain. MEP / AEC / construction experience is a plus; appetite to learn the domain is required
🦾 It’s a plus:
Experience with the Azure data ecosystem (Data Factory, Synapse Analytics, Azure Functions, Event Grid)
Lakehouse platforms (Databricks, Snowflake) or open table formats (Iceberg, Delta, Hudi); feature stores (Feast or equivalent)
Streaming / event-driven data processing (Kafka, Event Hubs, Spark Structured Streaming)
CDC and cross-engine sync (MongoDB Change Streams, Debezium, or equivalent)
Experience with geometric / BIM / CAD data or other multi-modal, unstructured source data
Knowledge-graph, ontology, or semantic-layer exposure
Data governance for AI/agent access to production data: query-cost controls, read-path safety, lineage, audit
SOC 2 and data-classification awareness
This call is made within the framework of Law 19.691 on the Promotion of Employment for Persons with Disabilities, including individuals registered in the National Registry of Persons with Disabilities of the Ministry of Social Development
- Department: Technology
- Locations: Argentina, Colombia, Mexico, Peru, Uruguay, Chile, Ecuador, El Salvador
- Remote status: Fully Remote