OpenAI export POC: Create the ability to import memories from OpenAI. Use what you learn to implement this for other providers as well, but for now focus on ChatGPT.

Initial Brainstorming

Ideas to jot down:

  • use prompt engineering to ask ChatGPT the user's current memory capacity. That can be used when creating the UI
  • you can't currently export memories directly. You have some options:
      1. use prompt engineering to export them
      2. use web scraping to grab them

The UI (POC v0.1) will have to ask the user to copy/paste a prompt into ChatGPT, then copy/paste the result back into the UI.
Then you can ingest the data, create trends, etc.

Initial Plans

  1. Prompt-Engineered Memory Exporter (Clean and Deterministic)
    1. Send prompt to LLM and retrieve structured output in JSON. Low latency and can become internal representation schema that I create.
  2. HTML Scraper Cross-Reference (Scrape Memory Management HTML)
    1. Detect missing or dropped entries
    2. Catch OpenAI auto-summarized memories that were never written via prompt but are still part of OpenAI's memory model
    3. Build trust and accuracy in the export
  3. Snapshot Layer + Versioning
    1. Save each export as a snapshot. Then compare with previous snapshots
      1. Save prompt dump + save UI scrape.
      2. perform a diff between the current combined snapshot and the previous one
      3. Flag added, removed, edited, merged entries
      4. Bonus: Triangulation Fingerprint (for detecting subtle deltas)
        1. You can assign each memory a fingerprint_hash:
          • SHA256(content + title) → stable identifier across sessions
          • Enables you to detect edits vs. rewrites vs. removals
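
A minimal sketch of the fingerprint + snapshot diff idea, assuming each snapshot is a JSON list of {title, content} objects (the function names and file paths here are placeholders):

import hashlib
import json

def fingerprint(memory):
    # SHA256(content + title) -> stable identifier for one memory block
    return hashlib.sha256((memory["content"] + memory["title"]).encode("utf-8")).hexdigest()

def diff_snapshots(prev, curr):
    # Flag added/removed/unchanged entries by comparing the fingerprint sets of two snapshots
    prev_ids = {fingerprint(m) for m in prev}
    curr_ids = {fingerprint(m) for m in curr}
    return {
        "added": curr_ids - prev_ids,
        "removed": prev_ids - curr_ids,
        "unchanged": prev_ids & curr_ids,
    }

# Example usage with two hypothetical snapshot files:
# prev = json.load(open("memory_snapshot_01.json"))
# curr = json.load(open("memory_snapshot_02.json"))
# print(diff_snapshots(prev, curr))

Note that with a pure content hash, an edited memory just shows up as one removed plus one added fingerprint; distinguishing edits from genuine remove/add pairs would need fuzzy matching on top of this.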

What would the memory schema be for import/export? This is going to be hard to figure out.

For now, maybe I should make it the following:

  • title
  • content
  • fingerprint

NOTE: How Memories Work in OpenAI

The account management UI memories are automatically generated by OpenAI's memory system based on conversation history.
Since these were not manually entered using @bio, ChatGPT may or may not have access to them in full detail depending on how the memory system syncs with context.

  1. ChatGPT Memory (your OpenAI account memory)
    • When you see a notification like “A new memory was saved”, that’s referring to your ChatGPT memory tied to your OpenAI account.
    • These can be viewed and managed at Settings > Personalization > Manage Memory.
    • These memories are not the same as @bio, though they sometimes overlap in content.
  2. @bio Memory (custom assistant memory)
    • This is my own internal working memory for you, built for continuity in our conversations.
    • I only update it when you explicitly ask, or when I confirm with you first.
    • It’s not visible in your ChatGPT UI unless I show or summarize it for you directly.

You must scrape the UI summaries separately if you want 100% coverage of OpenAI’s memory system, because:

  • My export won’t include some summaries that are only visible in the UI.
  • OpenAI doesn’t expose those UI memories to prompt-access directly unless they were added explicitly via @bio.

For more info, see here: ChatGPT

You can also then tag the data appropriately. Maybe use prompt chaining and constraint-driven tagging:
create another prompt that runs on the generated snapshot and tells an LLM to create tags based on the provided tags.

Quick Export Test w/ OpenAI using prompt

First we need to see if prompt engineering is actually reliable at exporting everything correctly.
Let's try with our own account first.

NOTE:
Saved memories are persistent user data stored via @bio

Give me a complete, unfiltered, up-to-date export of all of my saved 'notepad' memories (aka persistent memory, bio entries). Do not include chat context. Format it as structured JSON with a unique `id`, `title`, and `content` for each memory block.

I’m going to create an EchoForge Test User first with 15 randomly generated memories that can be categorized into 5 tags.

I have injected the memories into the new test account. Now let’s run the prompt 5 times and compare the exports:

Despite using a precise and structured prompt, the outputs show non-deterministic variation in:

  1. Titles – The model dynamically reformulates titles even with the same underlying memory content.
  2. Tone/Voice Shifts – Minor shifts in phrasing of the same memory suggest a lack of strict reproducibility.
  3. Stylistic Formatting – Inconsistent use of title casing, punctuation, and hyphenation implies stylistic nondeterminism.

But the content is largely the same, even though the titles are different. It works just fine though.


Strategy/Plan

  1. Prompt-based export → for structured, explicit, LLM-usable memory.
  2. UI-scrape export → for OpenAI’s abstract, inferred, and possibly untracked memories.

First the prompt based export:

  1. Run this prompt with structured schema:
    Give me a complete, unfiltered, up-to-date export of all of my saved ‘notepad’ memories (aka persistent memory, bio entries). Do not include chat context. Format it as structured JSON with the fields: title and content.
  2. Save the results as memory_snapshot_01.json
  3. Run 4 more, name them accordingly
  4. Add fingerprints to each memory block using hashlib.
  5. Compare fingerprints (a rough sketch follows this list)
    1. Check for determinism; if there are differences,
    2. check for semantic equivalence; if there are big differences,
    3. devise a new plan
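
A rough sketch of steps 4-5, reusing the hypothetical fingerprint() helper from the snapshot-diff sketch above and assuming the exports are saved as memory_snapshot_01.json through memory_snapshot_05.json:

import json

# Load all five exports and collect their fingerprint sets
snapshots = {}
for i in range(1, 6):
    path = f"memory_snapshot_{i:02d}.json"
    with open(path) as f:
        snapshots[path] = {fingerprint(m) for m in json.load(f)}

# If every run produced identical fingerprints, the export is deterministic;
# otherwise inspect the differing entries for semantic equivalence.
baseline = snapshots["memory_snapshot_01.json"]
for path, ids in snapshots.items():
    status = "matches baseline" if ids == baseline else f"differs by {len(ids ^ baseline)} entries"
    print(path, status)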

Deduplication Strategy

Deduplication Strategy for the same memories with non-deterministic outputs:

The slight variations in the content field are very minor and can be normalized to remove trivial differences:

import re

def normalize_content(text):
    # Collapse whitespace, trim, and lowercase to remove trivial differences
    return re.sub(r'\s+', ' ', text.strip().lower())

Then you can generate a content-based ID by hashing it:

import hashlib

def content_hash(text):
    # Hash the normalized content to get a stable ID across exports
    normalized = normalize_content(text)
    return hashlib.sha256(normalized.encode()).hexdigest()

This gives you a stable ID across exports which you can then use to compare against new imports.
If you see the same hash after normalizing the import, then you know you can skip (or update metadata). Otherwise, insert a fresh memory.

This allows idempotent memory ingestion, meaning no duplication no matter how many times the user imports/exports.


Database Strategy

For now we can skip the deduplication strategy and just import the memories. Now comes the database. We had already created the beginnings of ContextStore when building the MultiQuery WebApp. Let’s see what the schema was for that. I was using MongoDB with a simple JSON schema for memories. Or maybe I was just storing prompts and responses, I can’t remember. If I wasn’t storing memories, that makes things easier since I can start fresh.

For now this is all I have, which makes things easy:

I guess I will need to update that database to be called “ChatManager” or something. I’m not sure, but it’s unrelated to memories right now until chat history and session stuff gets involved. So for now I’m just going to create another DB called ContextStore with a collection called EchoForge and another collection called Memories.

  • EchoForge stores all imports and raw integration data
  • Memories stores validated, ingested memories, tagged and prepped for ContextCore use
[External Export (e.g. OpenAI)] 
        ↓
   [EchoForge Collection]     ← store raw entries with source metadata
        ↓
 [Processing Pipeline]
        ↓
   [Memories Collection]      ← canonical, validated, tagged memory entries
        ↓
 [Used by ContextWeave, UI, etc.]

Memory Schema for EchoForge documents (for now)

{
  "source": "openai",
  "raw_title": "Journaling habit",
  "raw_content": "Started journaling again after a long break...",
  "imported_at": "2025-06-29T20:00:00Z",
  "processed": false,
}

I was going to go with this:

But now I’m rethinking. I don’t know how MongoDB design works. Should each collection be an “object”, meaning “echoforge” should be its own DB, or should I change the name to “imports” or something?


Development

Setting up Workspace

I’m going to try using Gemini CLI to get the backend started. Let’s see how this goes…

From Gemini CLI:

  Here is the plan:
   1. Create a backend directory to house our FastAPI application.
   2. Create a Python virtual environment inside it to manage our project's dependencies in an isolated
      way.
   3. Install the necessary base libraries: fastapi for the web framework, uvicorn to run the server,
      pydantic for data validation, and motor as the asynchronous driver for MongoDB.
   4. Create an initial main.py with a simple "Hello World" endpoint to confirm everything is working.
   5. Create a requirements.txt file to list our dependencies.

Wow holy shit it worked:

To start the development server, you can run the following command from the /Users/naderbaradar/development_workspace/echoforge/backend directory:
source venv/bin/activate && uvicorn main:app --reload
   
This will start the server, and you can access the "Hello World" endpoint at http://127.0.0.1:8000.

I’ll keep trying to use GeminiCLI while coding. I don’t fully trust it enough to just say “make this app”, but I’ll keep writing down dropdown notes when I use it for changes.


Getting DB Ready

  • 🗃️ Database Name: contextcore
  • 📁 Collections: memories, imports, users, logs, etc.
  • 🧱 ContextStore becomes a logical service or module, not the DB name itself.

NOTE: Memory Management Systems

Rethinking the mental model for memory management in ContextCore. Thinking it should look like this:
Subsystems:

  • EchoForge:
    • Handles external ingestion triggers
    • Pushes raw memory to imports collection
    • Kicks off ingestion pipeline
  • ContextStore:
    • MongoDB schema + access layer
    • No business logic
    • Used by both EchoForge and ContextWeave
  • ContextWeave:
    • Runtime memory retrieval
    • Optional search on embeddings or tag filters
  • MemoryIngestor: NEW SUBSYSTEM
    • Adds metadata
    • Generates embeddings
    • Tags from static taxonomy
    • Saves canonical memory to memories

Subsystem | Role
EchoForge | Responsible for all external ingestion triggers (user import, health app sync, LLM provider sync). It receives raw memory.
ContextStore | Stores final, structured memories + metadata. Acts as pure DB + schema definitions. Can expose simple data access APIs (CRUD, search). No business logic.
MemoryIngestor (new logical layer) | Handles tagging, metadata enrichment, fingerprinting, embedding. Can be a background worker or service. Triggered by EchoForge, or run on a schedule.

You can implement this MemoryIngestor logic within the backend app as a set of services/workers, without needing to expose it as a separately deployed system (unless needed later).

NOTE: Change the name later


DB Collections needed

Phase 1: at a minimum for this POC, I need these collections: imports, prompts, logs

Phase 2: Then when I add tagging to the memories I want to add a tags collection

Phase 3: Then finally after that I can create a users collection


Phase 1: Import Endpoint

Going to create an endpoint called “/import” that gets called when the user pastes their exported JSON file into the UI. They paste their raw JSON, press submit, then /import gets called from the frontend with the content of the textbox as the body, and the result is stored in the DB. Should be pretty simple to implement. I’ll create the frontend after first creating a REST endpoint that I can hit using the Bruno API client or some simple cURL commands.

If I’m to follow Domain Driven Design (DDD) then I should make /import the service endpoint name and then internally I will organize logic under the ingestion domain.

  UI[User Submits JSON] --> API[POST /import]
  API -->|Stores raw entries| ImportsDB[imports collection]
  ImportsDB -->|Triggers processing| IngestService[ingest_service.py]
 
  IngestService --> Normalizer
  IngestService --> Validator
  IngestService --> Hasher
  IngestService -->|Save canonical| MemoriesDB[memories collection]
  
  MemoriesDB --> Tagger[Auto-tagging]

normalizer.py would:

  • Strip whitespace
  • Lowercase (if needed for hashing)
  • Maybe remove punctuation
  • Flatten newlines
  • Normalizer = cleaner
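
A possible normalizer.py along those lines (a sketch; whether to strip punctuation is still an open choice, so it's a flag here):

# normalizer.py (sketch) - cleaner for raw memory text before hashing/validation
import re
import string

def normalize_content(text, strip_punctuation=False):
    cleaned = text.strip().lower()           # strip outer whitespace, lowercase for hashing
    cleaned = re.sub(r'\s+', ' ', cleaned)   # flatten newlines and collapse runs of whitespace
    if strip_punctuation:                    # the "maybe remove punctuation" option above
        cleaned = cleaned.translate(str.maketrans('', '', string.punctuation))
    return cleaned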

validator.py would:

  • Ensure both title and content are non-empty after normalization
  • Maybe check that content length is above a certain threshold
  • Verify provider matches allowed list
  • Validator = gatekeeper
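
And a matching validator.py sketch (the allowed-provider list and minimum length are placeholder assumptions):

# validator.py (sketch) - gatekeeper that decides whether a normalized memory gets ingested
ALLOWED_PROVIDERS = {"openai"}   # placeholder allow-list
MIN_CONTENT_LENGTH = 10          # placeholder threshold

def is_valid(title, content, provider):
    if not title.strip() or not content.strip():      # both must be non-empty after normalization
        return False
    if len(content.strip()) < MIN_CONTENT_LENGTH:     # reject trivially short content
        return False
    return provider in ALLOWED_PROVIDERS              # provider must be on the allow-list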

Design imports, prompts, logs DB Schemas

imports

{
  "_id": ObjectId,
  "user_id": ObjectId,
  "raw_content": [...],                  // the unprocessed JSON array of memories
  "source_provider": "openai",           // origin of export
  "import_type": "provider",             // provider | integration | manual_upload
  "import_trigger": "manual",            // manual | auto_response | scheduled
  "import_session": "sess_abc123",       // all imports from same UX flow share this
  "source_prompt_id": ObjectId,          // optional
  "num_memories": 15,                    // count of raw memory objects
 
  "status": "pending",                   // pending | processed | errored
  "processing_notes": null,             // useful for debugging errors
  "is_duplicate": false,                 // hash match of content (optional)
 
  "created_at": ISODate(),
  "processed_at": null
}

prompts

{
  "_id": ObjectId,
  "name": "Export OpenAI Memories",
  "content": "Give me a complete, unfiltered...",
  "description": "Prompt used by EchoForge...",
  "tags": ["echoforge", "system", "openai", "import"],
  "scope": "system",              // or "user", "akasha", "weave", etc.
  "owner_id": null,              // null for system prompts; ObjectId for user
  "persona_id": null,            // link to Akasha personality module
  "version": 1,
  "active": true,
  "created_at": ISODate,
  "updated_at": ISODate
}

logs

{
  "_id": ObjectId,
  "event_type": "import_processed",     // or 'error', 'memory_tagged'
  "user_id": ObjectId,                  // who initiated the action
  "import_id": ObjectId,                // optional, related import
  "memory_id": ObjectId,                // optional, related memory
  "prompt_id": ObjectId,                // optional, related prompt
 
  "details": {
    "message": "15 memories ingested, 2 duplicates skipped.",
    "tagging_mode": "auto",
    "trigger": "manual"
  },
 
  "level": "info",                      // info | warning | error
  "created_at": ISODate
}

Alright with some extensive planning and the help of my intern GeminiCLI, I’ve completed the /import endpoint.

We have now completed the full pipeline for the /import endpoint.

  Here's a quick summary of the data flow:
   1. A POST request with a list of memories hits the /api/v1/import endpoint in api/imports.py.
   2. The endpoint calls the IngestionService from services/ingest_service.py.
   3. The IngestionService orchestrates the import process. It currently has a placeholder for the user_id and
      calls the ImportService.
   4. The ImportService (from db/imports.py) creates a new Import document and saves it to the imports
      collection in your MongoDB database.

  You should now be able to run the application and test the endpoint.
  Before you do, make sure to:
   1. Install dependencies: Run pip install -r requirements.txt in your backend directory to make sure you have
      all the necessary libraries.
   2. Set your `MONGO_URI`: Make sure the .env file in your backend directory has a valid MongoDB connection
      string.
   3. Replace the placeholder `user_id`: In services/ingest_service.py, replace the hardcoded ObjectId with a
      real one from your users collection.

  Once you've done that, you can start the server with uvicorn main:app --reload from the backend directory.

  What would you like to do next?

I need to now go through the code and see if it even makes sense. Wow most of it made sense, I just had to make minor adjustments here and there.


Create users DB collection

Next let’s create a minimal schema for the users collection since we will need to save the user_id field in the imports documents we create.

{
  "_id": ObjectId,
  "email": "testuser@example.com",
  "name": "Test User",
  "created_at": ISODate()
}

Field | Purpose
_id | Primary key
email | Lookup identity
name | Display/debug
created_at | Traceability
Inserted with [[Mongosh Commands|mongosh]]:
db.users.insertOne({
  email: "testuser@example.com",
  name: "Test User",
  created_at: new Date()
});

Now I can replace hardcoded IDs in my code with the real test user ID: 6869ab95ba829ecd62853fe2


Redesign Pydantic models

I tried running this request to the service, but it didn’t work.

curl --request POST \
  --url http://127.0.0.1:8000/api/import \
  --header 'content-type: application/json' \
  --data '{
  "memories": [
    {
      "title": "Title",
      "content": "memory data"
    }
  ]
}'

I am getting issues with malformed data. Let’s go back to the drawing board for a second; I think I trust AI too much and it made a bunch of random Pydantic models we don’t need.

We are using Pydantic in this project. What are the different models we have?

1. models/db_models.py

  • User
  • Import
  • Prompt
  • Log
  • Memory

2. schemas/requests.py (a sketch of these models follows this list)

  • MemoryRequest (basic title + content; a single memory from within a user’s exported memories)
  • ImportRequest (e.g. { memories: [MemoryRequest] }; the exported memories that the user sends with the request, containing multiple MemoryRequests)

3. schemas/responses.py

  • ImportResponse (e.g., { import_id, status, message })
  • UserResponse (e.g., { name, email, _id })

4. Optional: schemas/internal.py

  • NormalizedMemory
  • HashedMemory
  • TaggedMemory
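
For reference, the request schemas could be as small as this (a sketch, not necessarily the exact code in schemas/requests.py):

# schemas/requests.py (sketch)
from typing import List
from pydantic import BaseModel

class MemoryRequest(BaseModel):
    # A single memory from a user's exported memories
    title: str
    content: str

class ImportRequest(BaseModel):
    # The body of POST /import: the full list of exported memories
    memories: List[MemoryRequest]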

Okay, with everything updated, the /import endpoint is working. It can take a raw JSON file and insert it into the DB. Next is actually processing the inserted data.

Phase 2: Implement ingestion with Tagging

Well actually the very first thing should be to parse the user’s JSON file of exported memories RIGHT BEFORE it gets entered into the DB to make sure it is structured correctly. If there’s something simple wrong, like it’s just missing the top-level “memories” field, I think we should go ahead and add it and then try again. Same for any other simple fixes we can think of, like if instead of “memories” there’s a different word like “memory” but the title/content structure remains the same.
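
Something like this could live in import_parser.py (a sketch; any accepted key variants beyond "memories"/"memory" are guesses):

# import_parser.py (sketch) - coerce slightly-off exports into the expected {"memories": [...]} shape
ALTERNATE_KEYS = ("memories", "memory", "saved_memories")  # "saved_memories" is a guessed variant

def coerce_import_payload(payload):
    # A bare list of {title, content} objects: wrap it in the expected top-level field
    if isinstance(payload, list):
        return {"memories": payload}
    # A dict using a near-miss key: rename it to "memories"
    if isinstance(payload, dict):
        for key in ALTERNATE_KEYS:
            if key in payload and isinstance(payload[key], list):
                return {"memories": payload[key]}
    raise ValueError("Export JSON is not in a recognizable shape")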

Alright that’s done now too.

So next, like I said, time to process the data. But how do we process the data for deduplication? We thought of the hashing method, and here's the flow:

  IngestService --> Normalizer
  IngestService --> Validator
  IngestService --> Hasher
  IngestService -->|Save canonical| MemoriesDB[memories collection]

But hashing will only work for idempotent ingestion and detecting exact duplicates (after normalization).

For detecting “almost identical” or semantically similar memories, you would need to introduce a semantic
deduplication layer.

  • Embedding Generation: Converting memory content into numerical vectors (embeddings) using an LLM or a
    dedicated embedding model.
  • Similarity Search: Using vector databases or approximate nearest neighbor (ANN) algorithms to find
    memories whose embeddings are “close” to each other, indicating semantic similarity.
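
If/when that layer gets built, the core check could be as simple as cosine similarity over embeddings (a sketch; the model producing the vectors and the threshold are left open):

# Semantic dedup sketch (deferred) - flag "almost identical" memories via embedding similarity
import numpy as np

SIMILARITY_THRESHOLD = 0.9  # placeholder; would need tuning

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_near_duplicate(new_vec, existing_vecs):
    # At scale a vector DB / ANN index would replace this brute-force loop
    return any(cosine_similarity(new_vec, v) >= SIMILARITY_THRESHOLD for v in existing_vecs)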

So for now let’s just implement the first layer: normalized deduplication and idempotency.

   1. Receive `ImportRequest`: The IngestionService gets the ImportRequest (which contains a list of
      MemoryRequest objects) after import_parser.py has done its job.
   2. Iterate through `MemoryRequest`s: For each MemoryRequest in the list:
      a.  Validate: Apply the "Validator" rules. If a memory fails, it's logged as a skipped/failed memory.
      b.  Hash: Generate the SHA256 fingerprint using the normalized title and content.
      c.  Check for Duplicates: Query your memories collection to see if any existing memory already has
  this exact fingerprint.
      d.  Conditional Save:
           * If a duplicate is found, you log it as such and skip saving this memory to the memories
             collection (or update its metadata if needed).
           * If no duplicate, save the memory (including its fingerprint) to the memories collection.
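
A sketch of what that loop could look like inside the IngestionService, reusing the hypothetical normalize_content()/is_valid() helpers sketched earlier (note it scopes the duplicate check per user, which is one possible answer to the question below):

# ingest_service.py (sketch) - idempotent ingestion against the memories collection via motor
import hashlib

async def ingest_memories(db, user_id, memories):
    saved, skipped = 0, 0
    for memory in memories:
        # a. Validate
        if not is_valid(memory.title, memory.content, provider="openai"):
            skipped += 1
            continue
        # b. Hash the normalized title + content
        normalized = normalize_content(memory.title) + normalize_content(memory.content)
        fp = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        # c. Check for duplicates by fingerprint (user-scoped here)
        if await db.memories.find_one({"user_id": user_id, "fingerprint": fp}):
            skipped += 1
            continue
        # d. Save the canonical memory, including its fingerprint
        await db.memories.insert_one({
            "user_id": user_id,
            "title": memory.title,
            "content": memory.content,
            "fingerprint": fp,
        })
        saved += 1
    return {"saved": saved, "skipped": skipped}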

Questions: Are fingerprints global or user-scoped?


Design elements DB schema

Oh I guess before I continue, I’ll need to now design the schema for the elements collection and create it in the DB.

memories = elements

I want to actually call memories something else within the larger scope of “ContextCore.” I am going to change the naming scheme to elements, because an element could be a memory, OR many other things. Think of this scenario: What if you wanted to repurpose ContextCore to ingest a codebase and create elements out of certain things like files or coding conventions? That unlocks a whole new layer of functionality. I know ContextCore can solve a lot of problems, but my issue is clear language and explanations around what it is capable of.

So something like this for now?

{
  "_id": ObjectId,
  "user_id": ObjectId,
  "title": "string",                 // Optional, fallback to first N characters of content
  "content": "string",
  "tags": ["string", ...],          // Generated at ingestion time
  "source": {
    "provider": "openai",           // or "manual", "notion", etc.
    "import_id": ObjectId
  },
  "type": "memory",                 // Could be "memory", "note", etc.
  "created_at": ISODate
}

So elements will have:

  • user TAGS that are user-generated
  • dict TAGS that are picked from a dict
  • llm TAGS that are created with an LLM using only existing tags, plus suggested new tags that the user has to accept
  • SHARDS that are created with an LLM (embeddings) and can be new or pre-existing. But we aren’t including those right now.

A summarized plan from ChatGPT based on some design convos:
🔹 Phase 1 (Now): Common Tags (Dict-based)

  • Simple keyword matching from a controlled vocab
  • No user input or LLMs needed
  • Enables basic filtering and early metadata population
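
A Phase 1 tagger could be a few lines of keyword matching (the vocab below is just an example, not the real taxonomy):

# Phase 1 dict-based tagging (sketch) - controlled vocab, no user input or LLM involved
TAG_VOCAB = {
    "health":  ["sleep", "workout", "gym", "diet"],
    "career":  ["job", "interview", "promotion", "work"],
    "hobbies": ["journaling", "guitar", "reading"],
}  # example vocabulary

def dict_tags(content):
    text = content.lower()
    return [tag for tag, keywords in TAG_VOCAB.items() if any(kw in text for kw in keywords)]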

🔹 Phase 2: User-Generated Tags

  • Users can add/edit/delete tags on an element
  • Tags can be freeform or chosen from a predefined tag set
  • Stored in the tags collection for reuse across elements
  • Enables personalized organization

🔹 Phase 3: LLM-Suggested Tags

  • LLM reviews element content
  • Recommends:
    • Existing tags (autocomplete-like UX)
    • New tags, which go into a pending state for user approval
  • Ensures high-quality semantic suggestions
  • Adds intelligent structure without overwhelming the user

🔹 Phase 4: Shards (LLM Embedding-Based)

  • Uses embedding vectors to:
    • Autogenerate fine-grained contextual tags
    • Link semantically similar elements
  • Powers search, memory compression, and similarity graphs
  • This is your deep memory indexing layer
  • Computationally expensive → worth deferring

Design tags DB schema

I think I’ll also create a tags collection

Tag Types:

Tag Type | Source | Mutable | Description
dict tags | System logic |  | Common categories detected via keyword match
user tags | User input |  | Personalized tags created or selected by user
llm tags | LLM-suggested |  | Proposed by LLM using full element text (user accepts)
shards | LLM+embeds | ❌/✅ | Ultra-specific, auto-clustered concepts