When you're loading data into Panoply using the MySQL data source, you might do it once, or you might want to reload the same tables time and time again into your account. In these cases, you might not need all the data in those tables to be inserted every time, but rather, add only the ones that aren't existing yet and update those that are already existing.
For example, if you are loading your "customers" table, you will want to have only one record per customer in Panoply and prevent any duplicates from existing. Exactly for cases like this, is why we've designed our ingestion process to make use of the upsert mechanism. It updates existing data instead of inserting all the records again and only adds the ones that don't exist.
This is where a primary key comes in. It's crucial to our upsert mechanism because it identifies the uniqueness of a row or record, and helps the upsert mechanism to determine whether to update an existing record or add a new one instead. Without it, the mechanism will insert all records, and you'll end up with duplicate data.
Depending on the data source, a primary key is either:
- provided by default (like API-based data sources);
- user-configured; or
- inferred by Panoply (your source data has an "id" field/column).
For the MySQL data source, unless you have an "id" field in your table, the ingestion process can't automatically identify the primary key. Therefore, you need to define it manually when creating the data source. Without specifying the primary key, the ingestion process will insert all the records from the table.
For example, you have two records in your Panoply table, and you're trying to ingest two more (one of them already exists in your table), but you have no primary key defined. What happens is that the process will insert all records instead of doing an upsert. You'll end up with four rows instead of only having three.
Note that the selection of a primary key in a data source will affect all the selected tables in it.