How can I properly extract text from a PDF file to store it in an Elasticsearch index?

I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search.

The issue: when the PDF was originally a PowerPoint presentation, the layout and paragraph structure are not preserved. For example, bullet points and headings are extracted line-by-line without any clear structure.

Is there a way to preserve the logical layout when extracting text from these PDFs?

I'm using PdfTextExtractor.GetTextFromPage() from iText.Kernel.Pdf.Canvas.Parser, but it just reads every line individually. This makes the text hard to search and read once indexed.

    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;

    try
    {
        using var reader = new PdfReader(file.Path);
        using var pdf = new PdfDocument(reader);

        // iText page numbers are 1-based.
        for (var i = 1; i <= pdf.GetNumberOfPages(); i++)
        {
            // Extract the raw text of the current page.
            var pageText = PdfTextExtractor.GetTextFromPage(pdf.GetPage(i));

            // Index one Elasticsearch document per PDF page.
            await elasticService.WriteAsync(new ElasticWriteModel
            {
                DocId = document.Id,
                Page = i,
                Text = pageText,
                AccessLevel = document.AccessLevel == AccessLevel.Private ? 0 : 1,
                Timestamp = DateTime.UtcNow
            }, cancellationToken);
        }
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Failed to process document with ID {DocumentId}", document.Id);
    }

The indexed text then comes out fragmented, as shown in the linked image.

I've also tried File.ReadAllLines() from System.IO, but that didn't work either.
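To illustrate the kind of regrouping I have in mind, here is a minimal sketch of a post-processing step that could run on `pageText` before indexing. It is only an assumption about one possible approach: `SlideTextNormalizer`, `MergeLines`, and the list of bullet glyphs are hypothetical names and values of my own, not part of iText. The heuristic treats a blank line or a leading bullet glyph as the start of a new block and joins every other line onto the current block.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not an iText API): regroups line-by-line extracted
// text into logical blocks using a simple bullet/blank-line heuristic.
public static class SlideTextNormalizer
{
    // Assumed bullet glyphs; adjust to whatever the PDFs actually contain.
    private static readonly char[] BulletChars = { '•', '▪', '–', '-', '*' };

    public static List<string> MergeLines(string pageText)
    {
        var blocks = new List<string>();
        var current = new StringBuilder();

        foreach (var rawLine in pageText.Split('\n'))
        {
            // Trim also removes a trailing '\r' from Windows line endings.
            var line = rawLine.Trim();

            // A blank line closes the current block.
            if (line.Length == 0)
            {
                Flush(blocks, current);
                continue;
            }

            // A bullet glyph followed by whitespace starts a new block.
            if (StartsWithBullet(line))
            {
                Flush(blocks, current);
                current.Append(line);
                continue;
            }

            // Anything else is treated as a continuation of the current block.
            if (current.Length > 0)
            {
                current.Append(' ');
            }
            current.Append(line);
        }

        Flush(blocks, current);
        return blocks;
    }

    private static bool StartsWithBullet(string line)
    {
        return Array.IndexOf(BulletChars, line[0]) >= 0
            && (line.Length == 1 || char.IsWhiteSpace(line[1]));
    }

    private static void Flush(List<string> blocks, StringBuilder current)
    {
        if (current.Length > 0)
        {
            blocks.Add(current.ToString());
            current.Clear();
        }
    }
}
```

Each returned block could then be indexed as its own field value or joined with blank lines before calling `WriteAsync`. A known limitation of this sketch is that a heading directly followed by a non-bulleted paragraph is merged into one block, because plain text alone carries no font or position information.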
