Solving Data Ingestion Problems with OpenAI

Adam Heunis - Fri Aug 23 2024

AI OpenAI Generative AI Data Ingestion CSV Files C#

As the adoption of Generative AI continues to rise, many businesses face a common challenge: getting a tangible return on investment (ROI). One area where this struggle is particularly evident is in data ingestion. For us, this manifested in a specific problem that seemed to have no end — user-uploaded CSV files.

The challenge: poorly formatted CSV files

We introduced a feature that allowed users to upload data in CSV format. While this seemed straightforward, we quickly ran into significant issues. Users were uploading CSV files with:

  • Badly named columns: The column names often didn’t match what our system expected.
  • Columns in the wrong order: Even when the column names were correct, they were frequently in the wrong sequence.

These issues caused endless problems, resulting in failed data ingestion and frustrating user experiences. Our team spent countless hours trying to resolve these issues manually, but it became clear that a more sophisticated solution was needed.

The solution: harnessing OpenAI with the C# SDK

To tackle this challenge, we turned to OpenAI and its newly released OpenAI SDK for C#. We developed an AI-driven solution that dynamically checks the column names in the uploaded CSV files and determines the best fit and correct column index.

How it works

  1. Dynamic column validation: The AI examines the column names in the CSV file, comparing them against a set of expected names. It then uses context and learned patterns to make intelligent guesses about which columns match our requirements.

  2. Intelligent column ordering: Once the columns are validated, the AI reorders them as necessary to fit our expected schema. This ensures that even if a user uploads a CSV file with columns in the wrong order, the data will still be ingested correctly.

The results: a 55% improvement in data ingestion

Since implementing this AI-driven solution, we’ve seen a 55% improvement in our data ingestion pipelines. This has not only reduced the time and resources spent on manual intervention but also significantly improved the user experience.

Final thoughts: ROI realized

Implementing AI solutions like this demonstrates the real-world benefits of Generative AI. By leveraging OpenAI and the C# SDK, we transformed a persistent problem into an opportunity for efficiency and growth. If you’re struggling to realize the ROI on your AI investments, consider how AI can be used to solve specific, high-impact problems within your organization.

Here’s the code snippet that made it all possible

internal static async Task<Dictionary<string, int>?> GetColumnIndices<
        TImporter>(List<string[]> data) where TImporter : CsvContact
    {
        const string instructions =
            """
            You will be given a JSON structure that represents a schema with 1) sample rows and 2) requested column definitions.
            The sample rows show example data entries, while the columns section outlines the requested data types and whether 
            each column is required or optional.

            Your job is to find the best-matched indices for the provided columns in the sample data.If any of the required columns 
            cannot be matched to the sample data, you must not continue and return an exception.
            If any of the non-required columns cannot be matched return a -1 for the index.

            Please respond with a JSON object only. Do not include any text outside of the JSON structure.
            The successful object must be a dictionary named "Fields" and the failed object must be named "exception" and 
            contain a property name "message" with the error message.
            """;

        var columns = BuildColumnListFromRecord<TImporter>();

        var inputObject = new MyClass(
            // Only take the first 5 rows for the sample data
            data.Take(5).ToList(), columns);

        // Serialize the input object to a JSON string
        var inputJson = JsonSerializer.Serialize(inputObject);

        // Get OpenAI API key from environment variable
        var openAiKey = Environment.GetEnvironmentVariable
            ("OPENAI_API_KEY") ?? throw new("OpenAI API key not found");

        ChatClient client = new(model: "gpt-4o", openAiKey);

        var options = new ChatCompletionOptions
        {
            ResponseFormat = ChatResponseFormat.JsonObject
        };
        ChatCompletion completion =
            await client.CompleteChatAsync([
                    new SystemChatMessage(instructions),
                    new UserChatMessage(inputJson)
                ], options
            );

        var response = new StringBuilder();
        foreach (var part in completion.Content)
        {
            response.Append(part.Text);
        }

        // Try to parse the response to a JSON object
        Success? jsonObject;
        try
        {
            jsonObject =
                JsonSerializer.Deserialize<Success>(
                    response.ToString());
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
            throw;
        }

        return jsonObject?.Fields ?? new Dictionary<string, int>();
    }