I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search.
The issue: when the PDF was originally a PowerPoint presentation, the layout and paragraph structure are not preserved. For example, bullet points and headings are extracted line-by-line without any clear structure.
Is there a way to preserve the logical layout when extracting text from these PDFs?
I'm using PdfTextExtractor.GetTextFromPage() from iText.Kernel.Pdf.Canvas.Parser, but it just reads every line individually. This makes the text hard to search and read once indexed.

    // Requires: using iText.Kernel.Pdf; using iText.Kernel.Pdf.Canvas.Parser;
    try
    {
        using var reader = new PdfReader(file.Path);
        using var pdf = new PdfDocument(reader);

        for (var i = 1; i <= pdf.GetNumberOfPages(); i++)
        {
            // Extract the text of one page and index it as its own entry.
            var pageText = PdfTextExtractor.GetTextFromPage(pdf.GetPage(i));

            await elasticService.WriteAsync(new ElasticWriteModel
            {
                DocId = document.Id,
                Page = i,
                Text = pageText,
                AccessLevel = document.AccessLevel == AccessLevel.Private ? 0 : 1,
                Timestamp = DateTime.UtcNow
            }, cancellationToken);
        }
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Failed to process document with ID {DocumentId}", document.Id);
    }

This produces a strange result like the one in the image:

[image]
I've also tried the File.ReadAllLines() method from System.IO, but that didn't work either.