When you load data files into Panoply using the S3 data source, you might do it once, or you might reload similar files into your account again and again. In the latter case, you may not want every record in those files inserted outright; instead, you want to add only the records that don't exist yet and update those that do.
For example, if you're loading your customer list, you'll want exactly one record per customer in Panoply, with no duplicates. It's precisely for cases like this that we designed our ingestion process around an upsert mechanism: it updates existing records instead of inserting them again, and adds only the records that don't exist yet.
This is where a primary key comes in. It's crucial to the upsert mechanism because it uniquely identifies a row or record, letting the mechanism determine whether to update an existing record or add a new one. Without it, the mechanism inserts every record, and you end up with duplicate data.
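To make the idea concrete, here is a minimal sketch of upsert logic keyed on a primary key. The data and the `upsert` helper are hypothetical illustrations; Panoply's actual implementation is internal to its ingestion process.

```python
def upsert(table, incoming, primary_key):
    """Update rows whose primary key already exists; insert the rest."""
    index = {row[primary_key]: row for row in table}
    for record in incoming:
        key = record[primary_key]
        if key in index:
            index[key].update(record)   # existing record: update in place
        else:
            table.append(record)        # new record: insert
    return table

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
new_batch = [{"id": 2, "name": "Grace H."}, {"id": 3, "name": "Linus"}]

upsert(customers, new_batch, "id")
print(len(customers))  # 3 rows: one updated, one inserted, no duplicates
```

Because the `"id"` field identifies each row, the overlapping record is updated rather than duplicated.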
Depending on the data source, a primary key is either:
- provided by default (as with API-based data sources);
- user-configured; or
- inferred by Panoply (when your source data has an "id" field or column).
For the S3 data source, unless your data file has an "id" field, the ingestion process can't identify the primary key automatically. Therefore, you need to define one manually when creating the data source. If you don't specify a primary key, the ingestion process will insert every record from the loaded file.
For example, suppose your Panoply table has two records and you ingest two more, one of which already exists in the table, but you have no primary key defined. The process inserts all the records instead of upserting, and you end up with four rows instead of three.
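The scenario above can be sketched the same way: with no primary key, there is nothing to match on, so every incoming record is appended and the overlapping one becomes a duplicate. The data here is hypothetical.

```python
table = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
incoming = [{"id": 2, "name": "Grace"}, {"id": 3, "name": "Linus"}]

# With no primary key defined, the ingestion has nothing to match on,
# so it simply inserts every incoming record.
table.extend(incoming)

print(len(table))  # 4 rows: the record with id 2 now appears twice
```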